The Myth of Big Data

White papers and blog posts tout Big Data as a critical concern for data warehouse applications. But those two words oversimplify an incredibly varied set of implementations. Let's consider the real-world meanings of the phrase:

Big datastore. Consider a data warehouse containing 100TB of data. Sounds pretty big, doesn't it? Yet the size of the warehouse has little meaning without some measure of data movement and data retrieval. Sheer size may make table space and index reorganizations impractical, which pushes the database design toward bulk loads of data pre-sorted into clustering sequence. Full image copies may also strain storage media. Smart table and column naming conventions are a must, as are standard ETL processes. A metadata catalog is a necessity, as is a data warehouse data model, along with software tools to support both.
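The pre-sorting idea can be sketched in a few lines: order extracted rows by the target table's clustering key columns before handing them to the load utility, so the data arrives already in clustering sequence and a reorg is unnecessary. The column names here are purely illustrative.

```python
# Hypothetical sketch: pre-sort extracted rows into the target table's
# clustering sequence before a bulk load. Column names are invented.

def presort_for_load(rows, clustering_keys):
    """Sort extract rows by the table's clustering key columns."""
    return sorted(rows, key=lambda r: tuple(r[k] for k in clustering_keys))

extract = [
    {"sale_date": "2024-03-02", "store_id": 7, "amount": 19.99},
    {"sale_date": "2024-03-01", "store_id": 2, "amount": 5.00},
    {"sale_date": "2024-03-01", "store_id": 1, "amount": 42.50},
]

# Rows now arrive in (sale_date, store_id) order, matching the
# clustering index, so the loaded table is born clustered.
loaded = presort_for_load(extract, clustering_keys=["sale_date", "store_id"])
```

In practice the sort would happen in the extract step or via a sort utility, not in application code; the point is simply that the load input must already match the clustering sequence.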

Big data movement. The rate of data retrieval from external systems may depend on the ability of batch jobs to extract data, on file transmission methods, and on external network feeds. Transforming the received data may involve multiple steps. Another Big Data issue is archiving and purging; the data warehouse and its processes must be designed so that both are low-cost. One common method is partitioning fact tables by date or date range, which allows purging a partition by the simple expedient of running a load utility with no input data. A variation is to use rotating partitions.
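A minimal sketch of the rotating-partition variation, with invented names and an in-memory stand-in for the table: the oldest date-range partition is emptied (the load-with-no-data purge) and its slot is reused for the newest range.

```python
from collections import OrderedDict

# Illustrative model of rotating date-range partitions. Each key is a
# month range; rotating purges the oldest partition and reuses its slot
# for the newest month. Structures and names are hypothetical.

class RotatingPartitions:
    def __init__(self, months):
        self.partitions = OrderedDict((m, []) for m in months)

    def load(self, month, rows):
        self.partitions[month].extend(rows)

    def rotate(self, new_month):
        # Purge the oldest partition (cheap: no row-by-row deletes)
        # and reuse its slot for the incoming date range.
        oldest, _ = self.partitions.popitem(last=False)
        self.partitions[new_month] = []
        return oldest  # the month whose data was purged

parts = RotatingPartitions(["2024-01", "2024-02", "2024-03"])
parts.load("2024-01", [{"amount": 10}])
purged = parts.rotate("2024-04")  # purges 2024-01, opens 2024-04
```

The design choice mirrors why partition-level purges are attractive: dropping or emptying a whole partition is a metadata-level operation, far cheaper than deleting millions of rows.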

Big analytics. Given the large size of your data warehouse implementation, data warehouse process management and BI analytics management must coordinate closely. One option is a variation of the shadow table concept: a frequently used warehouse table is designed as two partitions, with queries running against one while the other is loaded or refreshed; the two then swap roles. For tables already partitioned by date range, simply double the number of partitions, reserving two for each range. Another option is special-purpose hardware such as the IBM Smart Analytics Optimizer.
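The paired-partition shadow idea can be sketched as follows, again with invented names: queries read the "active" copy while ETL loads the "shadow" copy, and a swap makes the freshly loaded copy live without blocking readers.

```python
# Sketch of the shadow-table concept using a pair of copies per table.
# Queries read the active copy; ETL loads the shadow; swap() switches
# roles. All names here are hypothetical.

class ShadowPair:
    def __init__(self):
        self.copies = {"A": [], "B": []}
        self.active = "A"  # the copy queries currently read

    @property
    def shadow(self):
        return "B" if self.active == "A" else "A"

    def load_shadow(self, rows):
        # ETL writes only to the shadow; readers are undisturbed.
        self.copies[self.shadow] = list(rows)

    def swap(self):
        # Role switch: the freshly loaded copy becomes queryable.
        self.active = self.shadow

pair = ShadowPair()
pair.load_shadow([{"key": 1}])
pair.swap()  # queries now read the new data
```

In a real warehouse the "swap" would be a partition exchange or view switch rather than a flag, but the coordination pattern between load processes and query processes is the same.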

Big performance. Here the emphasis is entirely on query performance, and database administration takes the lead, analyzing queries with basic EXPLAINs or software tools such as IBM's Data Studio. The goal is resource balancing. Database administrators use rich catalog statistics, index design, and access path analysis to steer queries toward the most efficient access paths and database design options.
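As a rough illustration of access path triage, the sketch below scans simplified plan records and flags queries that scan large tables without an index. The record layout is invented for illustration; real EXPLAIN output (for example, DB2's PLAN_TABLE) has a different and much richer format.

```python
# Hedged sketch: triaging simplified EXPLAIN-style plan records to flag
# expensive access paths. The dictionary layout is invented; it only
# mimics the kind of information an EXPLAIN provides.

plan_records = [
    {"query": "Q1", "table": "SALES_FACT", "access": "TABLESPACE SCAN", "rows": 900_000_000},
    {"query": "Q2", "table": "DATE_DIM",   "access": "INDEX",           "rows": 3650},
    {"query": "Q3", "table": "SALES_FACT", "access": "INDEX",           "rows": 120_000},
]

def flag_scans(records, row_threshold=1_000_000):
    """Return queries that scan large tables without using an index."""
    return [r["query"] for r in records
            if r["access"] == "TABLESPACE SCAN" and r["rows"] >= row_threshold]

suspects = flag_scans(plan_records)  # → ["Q1"]
```

A DBA would follow up on each flagged query with statistics refreshes, index design changes, or query rewrites, which is the resource-balancing work the paragraph describes.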

You may experience some of these Big Data issues, but having a lot of data, by itself, isn’t a major concern.