For many organizations, the question of how to handle Big Data has become much more urgent over the last year. Big Data is of unprecedented size, is typically more unstructured, and is more likely to reside initially on public cloud server farms. It isn’t just a “hot topic”; it’s integral to the effective use of new analytics capabilities that are a high corporate priority.
These organizations typically don’t have monolithic platforms. Their data handling occurs on multiple platforms, from mainframes to grids and farms of PC servers. Big Data solutions must layer on top of data handling solutions that integrate these platforms to varying degrees.
IBM DB2 10 offers a useful test case for handling Big Data access in a multi-platform environment. IBM DB2 runs in an integrated fashion on all major platform types, from mainframes to Linux and Windows small-server networks. IBM DB2 has recently undergone a major upgrade in version 10, and research shows that the new version achieves impressive improvements in performance and scalability across major use cases, including Big Data. So how should we approach maximizing IBM DB2 10 for multi-platform Big Data usage?
Tuning IBM DB2 10 for Big Data
Among the new IBM DB2 10 capabilities with potentially significant effects on the performance and scaling of Big Data analytics and other applications are:
• Index, buffer, and workload management improvements
• Continuous data ingest
• Multi-temperature data management
• Adaptive compression and other compression features
• Querying improvements
Index, buffer, and workload management. These improvements include “jump scan,” which manages indexes in buffers more intelligently and reduces the number of indexes needed, a capability several early adopters cited as a major performance booster. Another new feature is “smart” pre-fetching of indexes and data, which reduces the need for index reorganizations. Also of particular interest is workload management’s ability to set limits on the CPU and other resources that DB2 can consume. In the real world, this often allows more effective balancing of resource consumption between DB2 apps and apps in other virtual machines, leading to improved performance for both.
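As an illustration of the workload-management capability, DB2 10’s workload manager lets a DBA cap the CPU available to a service class and route sessions into it. The sketch below is only a shape, not a definitive recipe: the class name, workload name, user group, and 40% limit are all hypothetical, and the exact clauses should be checked against the DB2 10 documentation for your platform.

```sql
-- Hypothetical sketch: cap the CPU available to lower-priority
-- analytics work so it cannot starve transactional applications.
-- The name ANALYTICS_WORK and the 40 percent figure are examples only.
CREATE SERVICE CLASS analytics_work CPU LIMIT 40;

-- Route Big Data analytics sessions into that class.
-- The SESSION_USER GROUP criterion ('ANALYSTS') is an assumed example.
CREATE WORKLOAD bigdata_wl
  SESSION_USER GROUP ('ANALYSTS')
  SERVICE CLASS analytics_work;
```

In practice, the limit would be arrived at by experiment, watching both DB2 throughput and the performance of co-resident applications as the cap is adjusted.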
Big Data tends to be “index light” (much of the original Web data is stored as semi-flat files), so tuning these features specifically for Big Data probably won’t add much. However, tuning workload management before the system goes into production should deliver real Big Data performance gains on platforms that aren’t dedicated to a data warehouse or to a specific operational data-handling task.
Continuous data ingest. This “push the envelope” capability, which extends the data warehouse’s ability to accept updated data in near-real-time, has two major effects: data is typically more timely, and the need for data warehouse downtime to “mass download” huge batches of updates is reduced.
Before deployment, data warehousing IT should “tune” IBM DB2 10 continuous data ingest for Big Data like a thermostat, running experiments to find the optimal balance between performance on current tasks and near-real-time delivery of actionable Big Data into the data warehouse. A key tuning consideration is the Extract, Transform, Load (ETL) engine, because Big Data is less “pre-cleansed” than previous data warehousing data.
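For a concrete sense of what continuous ingest looks like, DB2 10’s INGEST utility streams delimited data into a table while the warehouse stays online. The file and table names below are hypothetical, and the clause set shown is a minimal sketch rather than the full command syntax:

```sql
-- Minimal sketch of the DB2 INGEST utility, assuming a delimited
-- feed file and a target fact table (both names are hypothetical).
-- INGEST inserts rows continuously without taking the table offline.
INGEST FROM FILE sales_feed.del
  FORMAT DELIMITED
  INSERT INTO sales_fact;
```

In a tuning experiment of the kind described above, IT would vary the feed rate and ingest options while measuring the impact on concurrent query response times.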
Adaptive compression. Most users understand the storage cost savings from compression; fewer understand the significant performance benefits. These benefits come not just because you can load more data into main memory at once, but also because operations such as online backup and online reorg run much faster, reducing their performance overhead on regular transaction processing. (By the way, in IBM DB2, the data need not be decompressed during processing in main memory, which speeds things up further.)
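Turning adaptive compression on is a small DDL change. The table name below is hypothetical, and this is a sketch of the general pattern rather than a complete procedure; consult the DB2 10 documentation before applying it to production tables:

```sql
-- Illustrative sketch: enable adaptive compression (table-level
-- dictionary plus page-level compression) on an existing table.
-- The table name sales_fact is a placeholder.
ALTER TABLE sales_fact COMPRESS YES ADAPTIVE;

-- Existing rows are typically recompressed by a reorg; new rows
-- are compressed as they arrive.
REORG TABLE sales_fact;
```

For Big Data workloads, the payoff to evaluate is twofold: reduced storage for large, lightly structured data sets, and the performance gains from fitting more of that data into the buffer pools at once.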