Assuming you share the “classic” interpretation of Big Data, that the term describes the deliberate massing of data from structured (database) and unstructured (file) sources for the purpose of discerning relationships and insights (via Big Data analytics), then you will probably agree that the idea seems to be joined at the hip to another one: Hadoop. At about the same time that Big Data came into popular use in the vernacular of computing, and purely by coincidence it seems, a development effort was already under way to create a parallel programming model capable of supporting data-intensive, distributed applications on clustered x86 servers. That was Hadoop, and Big Data analytics seemed like a natural workload for Hadoop clusters.
Basically, Hadoop borrowed from Google with respect to its MapReduce algorithms and file system, blending them with other open source elements to create a parallel processing cluster architecture. Apache ran with the basic technology, combining it with its own core libraries (Hadoop Common), a lot of Java and a modified distributed file system to productize Hadoop. The project has focused on performance, of course, winning a Terabyte Sort Benchmark in 2008 (sorting 1 TB of data in 209 seconds on a cluster), and it emphasizes the resiliency of its architecture via inter-nodal failover, which has managed to convince a lot of firms that Hadoop is “enterprise ready.”
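For readers who have not seen the MapReduce model in action, the idea can be sketched in a few lines. What follows is a single-process Python illustration of the map, shuffle and reduce steps applied to a word count. It is not Hadoop's Java API, and the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example. On a real cluster, the map and reduce calls would be spread across many servers.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence inside one process."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big clusters", "data at scale"])
```

Because each map call looks at only one document and each reduce call looks at only one key, the work divides naturally across machines, which is what makes the model a fit for clusters.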
So, Hadoop has popped up in most of the Fortune 500 companies that are pursuing Big Data analytics like a Holy Grail of opportunity—mainly, from what I can glean, to gather insights that will enable better targeting of product marketing campaigns à la Google and Facebook. Some of these efforts make the NSA’s phone record metadata collection efforts look like small potatoes.
Some folks are concerned about privacy in Big Data projects. What’s needed, according to IBM chief scientist Jeff Jonas, a leading expert in Big Data analytics, is a “one-way hash”—a way to provide data that still has research value, but that can’t be “reverse-engineered” to discover certain detailed information best kept private. Says Jonas, “A one-way hash is like giving the user some sausage and a grinder, with the certainty he won’t be able to use them to re-create the pig.”
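To make the sausage-grinder idea concrete, here is a minimal sketch of a one-way hash applied to customer identifiers, using Python's standard hashlib and hmac modules. It is an illustration only, not a description of Jonas's own technique. The key, the pseudonymize function and the sample records are all hypothetical. A keyed hash (HMAC) is used rather than a bare SHA-256 because identifiers such as e-mail addresses are easy to guess, and an unkeyed hash of a guessable value can be matched by simply hashing every candidate.

```python
import hashlib
import hmac

# Hypothetical secret held only by the data owner (an assumption for this sketch).
SECRET_KEY = b"example-key-kept-by-the-data-owner"


def pseudonymize(identifier, key=SECRET_KEY):
    """Return a keyed SHA-256 digest that stands in for the identifier."""
    normalized = identifier.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


# Made-up sample records; the first two refer to the same customer.
records = [
    {"customer": "Alice@example.com", "purchase": 40},
    {"customer": "alice@example.com ", "purchase": 25},
    {"customer": "bob@example.com", "purchase": 10},
]

# The shared data set keeps the research value (who bought what, grouped by
# customer) but carries only the digest, never the original identifier.
shared = [
    {"customer": pseudonymize(r["customer"]), "purchase": r["purchase"]}
    for r in records
]
```

The same customer always maps to the same digest, so an analyst can still link records and total purchases per customer. Going from the digest back to the e-mail address is not practical without the key.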
Privacy is a general concern in large-scale data collection and analysis in a number of fields, especially finance and healthcare. There’s a big fear in some corners that we’re forgetting Thoreau’s warning not to become “the tools of our tools.” However, among the broader community of Internet users, privacy concerns barely register.
What does worry me, however, is the actual resiliency of Hadoop. Everyone understands failover clustering, which works fine if failover logic is well-defined, status monitoring is perfect and equipment, software and data are kept perfectly synchronized. Those conditions assume, however, that the clustering software itself is free of single points of failure. Not so with Hadoop. Documented single points of failure include the NameNode, which stores the file system metadata for the cluster, and the JobTracker, which manages MapReduce tasks and assigns them to the servers in closest proximity to the stored data. Neither of these software components is distributed or replicated at present.
Another glaring disaster potential in Hadoop is linked to the challenge of managing an ever-growing complex of servers. Almost everyone operating Hadoop clusters complains that maintaining performance requires adding more servers to clusters, more direct-attached flash storage and more third-party software tools. Even with Apache’s ZooKeeper management tools, the more stuff you have, the more difficult it is to manage.
Finally, we have a related issue of sustainability: keeping management interested in Big Data projects once they are fielded on a Hadoop architecture. Truth be told, Hadoop aims at hosting “data at scale” for analysis, not at delivering the ad hoc, real-time, quick and dirty insights that management craves. All the money spent on infrastructure will certainly grate if the bean counters don’t get the results they want.
In the meantime, I worry that too little is being done to make the Hadoop platform capable of avoiding preventable disasters (through effective data protection and infrastructure management) and recovering from disasters that can’t be avoided (such as the fire that takes out an equipment room). Until someone demonstrates that kind of resiliency, I’m not sold on the enterprise readiness of Hadoop or anyone else’s failover clustering.