Forgotten in the initial plunge and excitement is the future potential of applications. While access to the Big Data store is limited at first, in the future, hardware will be faster, data retrieval quicker and DBMS software more flexible in managing high-performance query access.
Applications of the Future
IT architects and application designers must anticipate the business's requirements for access to the new, highly valuable data store. Relational database access will still use either a form of SQL or, in the case of NoSQL databases, key-value access, while the MapReduce framework will be used to process parallelizable problems across huge data sets on large numbers of CPUs or server nodes. Current and future application designs must be adapted to this new environment.
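To make the MapReduce model concrete, here is a minimal single-process sketch of the pattern: each "node" maps its shard of records to (key, value) pairs, then a reduce phase aggregates by key. Real frameworks such as Hadoop distribute these phases across many servers; the function names here are illustrative, not any framework's API.

```python
from collections import defaultdict
from itertools import chain

def map_phase(shard):
    # Emit a (word, 1) pair for every word in the shard's records.
    for record in shard:
        for word in record.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Group the emitted values by key and sum them.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Two "shards" standing in for data held on two server nodes.
shards = [["big data big"], ["data store"]]
pairs = chain.from_iterable(map_phase(s) for s in shards)
print(reduce_phase(pairs))  # {'big': 2, 'data': 2, 'store': 1}
```

Because each map call touches only its own shard, the map phase parallelizes trivially; only the reduce phase needs the shuffled pairs brought together by key.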
The initial implementation of a Big Data hardware and software suite may only take into account ad hoc querying against the new data store. Once the proof of concept and pilot project phases are complete, what are the next steps? The most critical element for the IT architect is the realization that future applications will require access to Big Data.
Most Big Data implementations are characterized by extremely fast query times, sometimes reducing queries that run for days using traditional methods down to a few seconds. However, as access proliferates across multiple applications and multiple lines of business, the architect’s job becomes much more difficult. Performance may suffer, and resource capacity planning suddenly becomes an issue. This will drive up costs, as the enterprise struggles to address application needs by upgrading data store sizes and hardware capabilities.
Application Considerations Sans an Appliance
With a higher volume of data being stored in DB2 tables, current best practices need to be reviewed. The most important things to get right in an environment without a Big Data appliance include:
Data archival. It’s essential that the DBA and IT architect develop an efficient database design and purge/archive process that makes data archival possible. For example, in a data warehouse environment, large volumes of data are usually stored in partitioned tables that are physically partitioned by date. In this case, a simple purge/archival mechanism can back up the “oldest” partition, followed by physical or logical partition rotation.
In a Big Data environment, even more aggressive archival may be necessary. Current data (today, this month, this year) can remain in the primary table. Older data should be moved to a secondary storage area that can be queried if necessary. Another possibility is vertical partitioning. Data that is queried more often is kept in one set of tables, data queried rarely in another set. Yet another option involves the temporal data management technology available in DB2 10 (see www.ibm.com/developerworks/data/library/techarticle/dm-1204db2temporaldata/).
These designs allow the current data to eventually be loaded into an appliance with minimal database changes.
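The purge/archive rotation described above can be sketched in a few lines. This is a hypothetical illustration of the idea, not DB2 utility syntax: partitions are keyed by month, the "oldest" partition is backed up to secondary storage, then rotated out of the primary table.

```python
from datetime import date

def rotate_oldest(partitions, archive):
    """Back up the oldest partition to the archive, then drop it from primary.

    partitions: dict mapping partition month -> rows (the primary table)
    archive:    dict serving as secondary, queryable-if-needed storage
    """
    oldest = min(partitions)                   # oldest month key
    archive[oldest] = partitions.pop(oldest)   # back up, then rotate out
    return oldest

primary = {date(2023, 1, 1): ["row1"], date(2023, 2, 1): ["row2"]}
cold = {}
rotate_oldest(primary, cold)
# primary now holds only the 2023-02 partition; cold holds the 2023-01 rows
```

The point of the sketch is the ordering: the backup completes before the partition leaves the primary table, so older data remains queryable from secondary storage throughout.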
Partitioning schemes. It’s time to take another look at current tablespace and index partitioning schemes. Big Data changes the way the application looks at the data; plus, some appliances will require that data being loaded into the appliance be static. A common method of accomplishing this is with an active/inactive partition scheme. While the active partition is being updated by an application, the inactive partition can be loaded into the appliance. This idea can be extended to current date-partitioned tables by re-defining partitions in active/inactive pairs.
Another consideration is index physical partitioning. Much was made of the table-based partitioning introduced in DB2 Version 8. Additional options now exist, including support for universal tablespaces and for unique indexes with included columns. Finally, the LASTUSED parameter for indexes makes it practical to revisit why each index exists, especially those created solely to enhance performance.
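The active/inactive pairing can be sketched as follows. This is a hedged, illustrative model (the class and method names are invented, not DB2 constructs): application writes go only to the active member of each pair, the static inactive member is safe to load into the appliance, and a swap makes the freshly loaded side active.

```python
class PartitionPair:
    """Toy model of an active/inactive partition pair."""

    def __init__(self):
        self.active, self.inactive = [], []

    def write(self, row):
        # Applications update only the active side of the pair.
        self.active.append(row)

    def load_and_swap(self, appliance):
        # The inactive side is static while loading, so the copy is consistent.
        appliance.extend(self.inactive)
        self.active, self.inactive = self.inactive, self.active

pair = PartitionPair()
pair.write("r1")
appliance = []
pair.load_and_swap(appliance)   # loads the (still empty) inactive side
pair.write("r2")                # new writes now land on the other side
pair.load_and_swap(appliance)   # this pass loads the "r1" partition
```

The design choice the sketch captures is that the appliance never reads a partition that is concurrently being updated; staleness is bounded by how often the pair is swapped.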