We’ve all read the articles where someone with a massive data set unleashes the power of Big Data and discovers a magical insight about their customers that generates millions in new revenue. Vendors market the power of Hadoop, an important Big Data tool, to find that priceless information nugget buried within petabytes of social media, emails, reviews, clicks and chats. These stories fire the imaginations of business users and help vendors sell new products.
We like to call these Magical Big Data Unicorns—so frequently discussed and pursued, but so rarely seen! We’re not saying they don’t exist, but we would like to suggest another approach to getting solid ROI from a Big Data tool such as Hadoop.
Hadoop can help organizations save big IT dollars in far less dramatic ways. These scenarios don’t make for fascinating stories around a corporate campfire, but they are meaningful and proven techniques for reducing IT costs.
Here we focus on scenarios where Hadoop-based cost reductions could be implemented and measured. They include the ability to:
• Generate faster data transformations, allowing “crunch-time” processing to be offloaded to less expensive resources within acceptable timeframes
• Create and implement active archives for valuable but infrequently accessed data at a far lower price point
• Create and use real-world SQL Hadoop sandboxes that are less expensive than current sample-size testing environments.
Hadoop might not replace an organization’s entire enterprise data warehouse. However, more and more enterprise data warehouse architects now believe Hadoop could assume pieces of the data warehouse workload—and with that hybrid architecture can come significant cost savings.
As data volumes and types grow, SLAs for various projects and phases are put in jeopardy. IT staff constantly squeeze everywhere they can for more time and cycles to get everything done in tight 24-hour windows. Organizations with huge data volumes were the key drivers behind the open source Hadoop community in the first place: they needed a faster platform. Early adopters at Yahoo tell of transformations that once took eight hours shrinking to 15 minutes thanks to Hadoop's performance capabilities. Hadoop lets IT staff offload expensive manipulation and transformation tasks onto far cheaper resources.
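As a rough illustration of how such a transformation offload might look, here is a minimal map/reduce pair in the style Hadoop Streaming expects. The input format (tab-separated click-log lines) and the field layout are hypothetical assumptions, not anything prescribed by Hadoop itself:

```python
from itertools import groupby
from operator import itemgetter

def map_records(lines):
    """Map step: emit (user_id, 1) for each raw click-log line.
    Assumed (hypothetical) input format: "timestamp\tuser_id\turl"."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            yield fields[1], 1

def reduce_counts(pairs):
    """Reduce step: sum the counts for each key. Hadoop's shuffle phase
    delivers pairs grouped by key; sorting here stands in for that."""
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, sum(count for _, count in group)
```

In a real deployment, scripts shaped like these two functions would read stdin and write stdout, and Hadoop Streaming would run them as the mapper and reducer across the cluster; the point is that the transformation logic itself stays simple while Hadoop supplies the parallelism.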
With Hadoop, companies can take advantage of the across-the-board decline in storage costs. Hadoop can be a very cost-effective way to move older data out of the data warehouse while still keeping it online. These active archives could benefit the many organizations that want or need the ability to query multiyear data sets for patterns and outcomes. Commercial Business Intelligence (BI) tools are available that run SQL against Hadoop directly, and these tools will continue to evolve and improve. This will accelerate interest in, and use of, these inexpensive archives for data mining and other data science activities.
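One simple way to think about an active archive is as an age-based routing policy: recent rows stay in the warehouse, older rows move to Hadoop but remain queryable. The sketch below assumes a hypothetical two-year retention cutoff and made-up destination names:

```python
from datetime import date, timedelta

# Hypothetical policy: rows older than roughly two years move to the
# Hadoop-based active archive; newer rows stay in the warehouse.
RETENTION = timedelta(days=2 * 365)

def route(record_date, today):
    """Return the storage tier a record belongs in, based on its age.
    The destination names are illustrative placeholders."""
    if today - record_date > RETENTION:
        return "hdfs://archive/sales"
    return "warehouse.sales"
```

Because SQL-on-Hadoop tools can query the archived tier directly, a policy like this keeps multiyear history online for analysis without paying warehouse prices to store it.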
Data scientists and analysts have long relied on small data samples to test things such as predictive analytics. Those sandboxes were an ideal place to explore data, and as systems and needs changed, the sandboxes grew and predictions and other results improved over time.
Now, with the advent of inexpensive, large data stores, sandboxes can be used for real-world testing, analysis and simulation. Using the features of Hadoop, predictive analytics can be run against enormous quantities of data. If your organization has been using expensive platforms for these sandbox systems, large cost savings can be attained by moving them to Hadoop and its less expensive commodity storage.
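The sample-versus-full-data contrast can be made concrete with a toy estimator. The function below is a deliberately simplified sketch (the event data and metric are invented for illustration): passing a sample size mimics the old sample-based sandbox, while passing none uses every record, as a Hadoop-scale sandbox allows.

```python
import random

def estimate_rate(events, sample_size=None):
    """Estimate a conversion rate from binary events (1 = converted).
    With sample_size set, this mimics a small sample-based sandbox;
    with sample_size=None, it uses the full data set."""
    data = random.sample(events, sample_size) if sample_size else events
    return sum(data) / len(data)
```

On the full data set the estimate is exact; a small sample returns a noisier figure, which is precisely the gap a full-scale sandbox closes.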
Hadoop can significantly lower the costs of data management. It can help reduce contention for operational resources and provide cost-effective access to active-archive data for longer-term historical analysis. It also reduces the need to continually add costly new technology to the data warehouse simply to meet service level agreements. Start small when evaluating where Hadoop can play a role in reducing costs: find time and resource pain points, and see whether a solution can be built from that perspective.