When evaluating Big Data applications in enterprise computing, one question is often asked: how does Big Data compare to the Enterprise Data Warehouse (EDW)? What does Big Data bring to the organization that the EDW can’t handle?
This article presents a technical and business discussion on the Big Data question. The technical discussion assumes familiarity with data architecture. The business discussion draws some conclusions about the actual application of Big Data in the enterprise.
Big Data vs. the EDW
Big Data hardware is quite similar to the EDW’s Massively Parallel Processing (MPP) SQL-based database servers. EDW vendors include Teradata, Oracle Exadata, IBM Netezza and Microsoft SQL Server PDW. Both Big Data and EDW SQL database servers are composed of large racks of Intel servers (each server called a node), and both distribute data across the nodes. Each node stores data on its own local hard drives rather than on a centralized storage system such as a Storage Area Network (SAN), which prevents I/O contention.
The first major technology difference is that Big Data’s most common software platform, known as Hadoop, is free, open source and runs on commodity (non-proprietary) hardware. Most EDW vendors use proprietary hardware with additional hardware accelerators to help the servers work better as a SQL-style relational database. These hardware costs, combined with the EDW vendor’s proprietary software, typically reach levels many times higher per terabyte than a Hadoop Big Data platform. Hadoop has more limitations than a SQL relational database but is far more scalable at a much lower price.
While Hadoop and EDW databases both break large data sets apart for massively parallel processing, the actual implementations differ substantially. EDW databases parallelize the data across smaller, logical SQL databases that exist on each node. Data is imported via a loading process that divides the data among the logical databases on a row-by-row basis, based on a data key column. This Extract, Transform and Load (ETL) process typically also performs data cleansing and homogenization to match the incoming data with existing data in the EDW.
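The row-by-row division by key column can be sketched as a simple hash-partitioning step. This is an illustrative sketch only; the node count, row layout and function names are assumptions, not any EDW vendor’s actual API.

```python
import zlib

# Hypothetical sketch of an MPP load step: each incoming row is assigned
# to a node (logical database) by hashing a key column. crc32 is used
# here only because it is deterministic across runs.
NUM_NODES = 4

def node_for_row(row, key_column="customer_id"):
    """Pick the node that will store this row, based on a hash of its key."""
    return zlib.crc32(str(row[key_column]).encode()) % NUM_NODES

rows = [
    {"customer_id": 101, "amount": 25.0},
    {"customer_id": 102, "amount": 40.0},
    {"customer_id": 101, "amount": 13.5},
]

nodes = {n: [] for n in range(NUM_NODES)}
for row in rows:
    nodes[node_for_row(row)].append(row)

# Every row sharing a key lands on the same node, so a per-key query
# touches only one logical database.
```

Because the hash is a pure function of the key, all rows for customer 101 always land on the same node, which is what lets each logical database answer key-based queries locally.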
In contrast, Hadoop distributes source files across the Hadoop Distributed File System (HDFS). Conceptually, each file represents a segment or partition of a table, and files are simply replicated across three nodes for redundancy. Hadoop data processing then accesses the files directly. In other words, source files can be copied “as is” into HDFS, which adds metadata to each file to aid efficient data retrieval. This loading process is simpler and faster than a typical EDW ETL load.
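The three-way replication can be pictured with a toy placement function. This is not HDFS’s actual placement policy (which is rack-aware and more sophisticated); the node names and ring-style assignment below are assumptions for illustration.

```python
import zlib

# Toy sketch of file replication: each file gets copies on three distinct
# nodes, chosen deterministically from the file name. Real HDFS placement
# is rack-aware; this ring walk is a simplification.
NODES = ["node1", "node2", "node3", "node4", "node5"]
REPLICATION = 3

def place_file(filename, nodes=NODES, replication=REPLICATION):
    """Choose `replication` distinct nodes to hold copies of the file."""
    start = zlib.crc32(filename.encode()) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]

replicas = place_file("sales_part_0001.csv")
# Losing any single node still leaves two live copies of the file.
```

The point of the sketch is the redundancy property: with three distinct replicas, any single node failure leaves the data readable, which is why no SAN-style shared storage is needed.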
The result: Hadoop can process large volumes of data, provided the source data files are in a readable, ready state. Hadoop also needs the data to be self-contained, with the analyst not looking to “join” the data to other data tables as is typically done in a relational database. You can still join data together, particularly using Hive extensions, but you wouldn’t use the normalized data model that’s considered a best practice in SQL database modeling. An EDW database may have hundreds or even thousands of tables; Hadoop requires massive de-normalization and may have as few as one or two.
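What “massive de-normalization” means in practice can be shown with a small sketch: the normalized design keeps customers and orders in separate tables, while the Hadoop-friendly design repeats the customer attributes on every order row. The table and column names are illustrative assumptions.

```python
# Normalized design (EDW style): two tables, joined at query time.
customers = {101: {"name": "Acme", "region": "East"}}
orders = [
    {"order_id": 1, "customer_id": 101, "amount": 25.0},
    {"order_id": 2, "customer_id": 101, "amount": 40.0},
]

# De-normalized design (Hadoop style): one wide, self-contained table.
# Customer attributes are repeated on every order row, so no join is
# needed when the file is processed.
flat = [{**order, **customers[order["customer_id"]]} for order in orders]
```

The trade-off is storage and update cost (customer attributes are duplicated per order) in exchange for queries that never need to reach across tables.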
Hadoop doesn’t excel at merging data across nodes at the detail level; it merges only at the reduce stage, which is analogous to merging summarized data. For detail-level joins, Hadoop requires what’s called co-location of the data on the same node. In other words, if data in file A joins with data in file B, then files A and B must exist on the same node. EDW databases don’t have this limitation. While they perform better with a data model and indexing that co-locate data, in a normalized data model it’s usually impossible to co-locate all tables. As discussed earlier, EDW vendors have considerable technology to allow for flexible data models.
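The reduce-stage merge described above can be sketched as a single-process toy: mappers tag each record with its source and a join key, and the reduce step is the only place where records from different files are brought together. File contents and key names are invented for illustration; a real cluster adds shuffling, serialization and network transfer.

```python
import itertools
from collections import defaultdict

# Two "files" of (key, record) pairs, e.g. customers and their orders.
file_a = [("k1", "a-row-1"), ("k2", "a-row-2")]
file_b = [("k1", "b-row-1"), ("k1", "b-row-2")]

def map_phase(records, source):
    # Tag every record with its source file so the reducer can tell them apart.
    for key, value in records:
        yield key, (source, value)

# Shuffle: group all tagged records by key (done by the framework in Hadoop).
grouped = defaultdict(list)
for key, tagged in itertools.chain(map_phase(file_a, "A"), map_phase(file_b, "B")):
    grouped[key].append(tagged)

def reduce_phase(key, tagged_values):
    # The reduce step pairs up A-records and B-records sharing the key.
    a_vals = [v for src, v in tagged_values if src == "A"]
    b_vals = [v for src, v in tagged_values if src == "B"]
    return [(key, a, b) for a in a_vals for b in b_vals]

joined = [t for k, vs in grouped.items() for t in reduce_phase(k, vs)]
# k2 has no match in file B, so only k1 pairs appear in the output.
```

The shuffle between map and reduce is exactly the cross-node data movement the article says Hadoop handles only at this summarizing stage, which is why detail-level joins want the files co-located instead.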
Also of interest is the fact that Big Data has a much shorter development time to load data—but a longer development time to query it. EDWs have long lead times to add data sources to the database, but the time to write SQL queries is relatively short. Hadoop’s map-reduce jobs are written in Java (with other scripting options), and Hive and Pig provide SQL-like extensions that can aid development. These still require higher skills and more time-consuming effort than simply writing SQL. Business intelligence front ends are also limited with Big Data as compared with the EDW. This weakness should improve as Hadoop matures as a product.
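The effort gap can be made concrete with a count-by-group: in SQL it is one line (`SELECT word, COUNT(*) FROM words GROUP BY word`), while map-reduce requires writing the map and reduce phases by hand. The toy below runs in a single process; a real Hadoop job adds job configuration, serialization and cluster setup on top.

```python
from collections import Counter

# Input "file": each string stands in for a line of a data file.
lines = ["big data big", "data warehouse"]

# Map phase: emit a (word, 1) pair for every word in every line.
pairs = [(word, 1) for line in lines for word in line.split()]

# Reduce phase: sum the emitted counts for each word.
counts = Counter()
for word, one in pairs:
    counts[word] += one
# counts == {"big": 2, "data": 2, "warehouse": 1}
```

Even in this friendliest possible case, the developer writes both phases explicitly, which is the extra skill and effort the article contrasts with simply writing SQL.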
In summary, Hadoop’s data-modeling limitations prevent its use as a general data warehousing database. Hadoop’s methodology is, in many ways, one of brute force and simplicity. But it’s that simplicity that makes it applicable for certain purposes. Many large data sets in the enterprise actually fit within its limitations, and where that environment exists, Big Data can bring analytics and Business Intelligence (BI) at warp speed.