Jan 16 ’14
DB2 Application Design for Big Data
Current use of the term Big Data is a result of the confluence of several industry trends over the last decade. These trends include the accumulation of large amounts of data in databases, the advent of large-scale enterprise data warehouses, multiple means of high-speed data transfer and an explosion in the design and use of sensors for gathering data.
Vendors and software developers responded with many options, including special-purpose hardware and software for data storage and retrieval and complex and powerful analytical software. The NoSQL open source database management systems (DBMSes) emerged as a part of these trends (see https://en.wikipedia.org/wiki/NoSQL).
Many IT enterprises already have in place a Big Data implementation. Whether it’s a pilot project or a mission-critical application, large data stores and analytics packages are now a common occurrence.
There are some warning signs, however, that Big Data may be at the peak of inflated expectations. According to Svetlana Sicular, research director at IT research and advisory company Gartner, Inc., Big Data is entering the “trough of disillusionment” phase of the Gartner Hype Cycle (see http://blogs.gartner.com/svetlana-sicular/big-data-is-falling-into-the-trough-of-disillusionment/). According to Gartner:
“Interest wanes as experiments and implementations fail to deliver. Producers of the technology shake out or fail. Investments continue only if the surviving providers improve their products to the satisfaction of early adopters” (see www.gartner.com/technology/research/methodologies/hype-cycle.jsp).
How should IT management act in order to avert failure? More specifically, what should database administrators (DBAs) and application designers do to prepare the enterprise for success?
Advent of the Appliance
When Big Data first appeared, it was characterized by the three V’s: volume, velocity and variety; large volumes of multi-structured and unstructured data arriving at the server in a continuous flood. IT architects were faced with the complex problem of scaling up data stores while also developing methods to store large objects (LOBs), self-describing data (XML) and multi-structured data (images, audio, video and click-streams) for analysis.
One common approach was to store the largest data tables in a special-purpose Big Data appliance, such as the IBM DB2 Analytics Accelerator (IDAA), while keeping other tables on the main server. The DB2 Optimizer then analyzes SQL queries and chooses access paths to either DB2 tables or tables stored on the appliance based on access path cost.
Another approach was to store the data in DB2 tables, using the capabilities of the DBMS (for example, the ability of DB2 to store native XML data). In the case of DB2 LUW 10.5, some high-performance options are already available. According to IBM:
“IBM DB2 with BLU Acceleration speeds analytics and reporting using dynamic in-memory columnar technologies. In-memory columnar technologies provide an extremely efficient way to scan and find relevant data. Coupled with innovations such as parallel vector processing and actionable compression ... broader SQL support, I/O and CPU efficiencies, and integration with the DB2 SQL compiler, query optimizer, and storage layer” (see http://www-01.ibm.com/software/data/db2/linux-unix-windows/db2-blu-acceleration/).
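As a sketch of how BLU Acceleration surfaces to the database designer, DB2 10.5 for LUW lets a table be created as column-organized. The table and column names below are illustrative, not from any particular schema:

```sql
-- Illustrative analytics table. ORGANIZE BY COLUMN requests BLU
-- column-organized storage in DB2 10.5 for LUW; row organization
-- remains the alternative for OLTP-style tables.
CREATE TABLE sales_fact (
  sale_date   DATE          NOT NULL,
  store_id    INTEGER       NOT NULL,
  product_id  INTEGER       NOT NULL,
  amount      DECIMAL(11,2)
) ORGANIZE BY COLUMN;
```

Queries against such a table need no changes; the optimizer and storage layer exploit the columnar format, compression and vector processing transparently.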
The classic goal of an enterprise’s first Big Data implementation is a restricted program of descriptive analytics: ad hoc querying and reporting on the analytical data to determine trends and patterns. Use is typically restricted for multiple reasons, including high costs, a lack of personnel experienced with the complex hardware and software environment, and business understanding of the data that is limited to a few analysts.
Forgotten in the initial plunge and excitement is the future potential of applications. While access to the Big Data store is limited at first, in the future, hardware will be faster, data retrieval quicker and DBMS software more flexible in managing high-performance query access.
Applications of the Future
IT architects and application designers must anticipate the business’s requirements for access to the new, highly valuable data store. Relational database access will still use some form of SQL (or key-value access in the case of NoSQL databases), while the MapReduce framework will be used to process parallelizable problems across huge data sets using large numbers of CPUs or server nodes. Current and future application designs must be adapted to this new environment.
The initial implementation of a Big Data hardware and software suite may only take into account ad hoc querying against the new data store. Once the proof of concept and pilot project phases are complete, what are the next steps? The most critical element for the IT architect is the realization that future applications will require access to Big Data.
Most Big Data implementations are characterized by extremely fast query times, sometimes reducing queries that run for days using traditional methods down to a few seconds. However, as access proliferates across multiple applications and multiple lines of business, the architect’s job becomes much more difficult. Performance may suffer, and resource capacity planning suddenly becomes an issue. This will drive up costs, as the enterprise struggles to address application needs by upgrading data store sizes and hardware capabilities.
Application Considerations Sans an Appliance
With a higher volume of data being stored in DB2 tables, current best practices need to be reviewed. The most important things to get right in an environment without a Big Data appliance include:
Data archival. It’s essential that the DBA and IT architect develop an efficient database design and purge/archive process that makes data archival possible. For example, in a data warehouse environment, large volumes of data are usually stored in partitioned tables that are physically partitioned by date. In this case, a simple purge/archival mechanism can back up the “oldest” partition, followed by physical or logical partition rotation.
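In DB2 for z/OS, this rotation pattern can be expressed directly in DDL. The table name and limit key below are illustrative; the oldest partition would typically be unloaded for the archive before the rotation:

```sql
-- Assumed month-partitioned history table; names and dates are
-- illustrative. After unloading the oldest partition (e.g., with
-- the UNLOAD utility), reuse it for the newest period:
ALTER TABLE sales_history
  ROTATE PARTITION FIRST TO LAST
  ENDING AT ('2014-01-31')
  RESET;
```

RESET empties the rotated partition, so the unload step beforehand is what makes this a purge/archive process rather than just a purge.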
In a Big Data environment, even more aggressive archival may be necessary. Current data (today, this month, this year) can remain in the primary table. Older data should be moved to a secondary storage area that can be queried if necessary. Another possibility is vertical partitioning. Data that is queried more often is kept in one set of tables, data queried rarely in another set. Yet another option involves the temporal data management technology available in DB2 10 (see www.ibm.com/developerworks/data/library/techarticle/dm-1204db2temporaldata/).
These designs allow the current data to eventually be loaded into an appliance with minimal database changes.
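The temporal option mentioned above can be sketched with DB2 10 system-period versioning; all names here are illustrative. DB2 then maintains the history table automatically, keeping older row versions out of the primary table while leaving them queryable:

```sql
-- System-period temporal table (DB2 10); names are illustrative.
CREATE TABLE policy (
  policy_id  INTEGER        NOT NULL,
  premium    DECIMAL(11,2),
  sys_start  TIMESTAMP(12)  NOT NULL GENERATED ALWAYS AS ROW BEGIN,
  sys_end    TIMESTAMP(12)  NOT NULL GENERATED ALWAYS AS ROW END,
  trans_id   TIMESTAMP(12)  GENERATED ALWAYS AS TRANSACTION START ID,
  PERIOD SYSTEM_TIME (sys_start, sys_end)
);

CREATE TABLE policy_hist LIKE policy;
ALTER TABLE policy ADD VERSIONING USE HISTORY TABLE policy_hist;

-- Historical queries reach the older data transparently:
SELECT * FROM policy
  FOR SYSTEM_TIME AS OF TIMESTAMP '2013-06-30 00:00:00';
```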
Partitioning schemes. It’s time to take another look at current tablespace and index partitioning schemes. Big Data changes the way the application looks at the data; plus, some appliances will require that data being loaded into the appliance be static. A common method of accomplishing this is with an active/inactive partition scheme. While the active partition is being updated by an application, the inactive partition can be loaded into the appliance. This idea can be extended to current date-partitioned tables by re-defining partitions in active/inactive pairs.
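One simple way to sketch the active/inactive idea, without committing to a particular partitioning product feature, is a pair of identically structured tables behind a view; everything named here is illustrative:

```sql
-- Illustrative active/inactive pairing: two identical tables, with
-- a view re-pointed at whichever copy is currently "active".
CREATE TABLE txn_a LIKE txn_template;
CREATE TABLE txn_b LIKE txn_template;

CREATE VIEW txn_current AS SELECT * FROM txn_a;

-- While applications update txn_a through the view, txn_b is static
-- and can be safely loaded into the appliance. To swap roles:
-- CREATE OR REPLACE VIEW txn_current AS SELECT * FROM txn_b;
```

The same effect can be achieved with partition pairs inside one table, as described above; the view-swap variant simply makes the static copy explicit.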
Another consideration is index physical partitioning. Much was made of the table-based partitioning introduced in DB2 Version 8. Additional options now exist, including universal tablespaces and unique indexes with included columns. Finally, the LASTUSED parameter for indexes makes it time to revisit why each index exists, especially those created purely to enhance performance.
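As a sketch of that index review, DB2 10 for z/OS records index use in the real-time statistics catalog; the 180-day threshold below is an arbitrary illustration:

```sql
-- Find indexes that have not been used for access in roughly six
-- months; candidates for review and possible removal.
SELECT creator, name, lastused
FROM   SYSIBM.SYSINDEXSPACESTATS
WHERE  lastused < CURRENT DATE - 180 DAYS;
```

An unused index still costs INSERT/DELETE overhead, disk space and utility time, so a periodic query like this belongs in the DBA’s housekeeping routine.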
The Appliance Cometh
Appliances such as IBM’s IDAA can store large tables in a proprietary format and access the data at high speed. The DB2 Optimizer has been upgraded to consider access paths to tables in the appliance. Application considerations include:
Data load timing. A typical configuration involves a DB2 table that physically exists in two places: on common disk storage and in an appliance. Depending on the query, the optimizer chooses between the two versions of the table based on the cost of the access paths. However, there’s the issue of timing. The table in the appliance needs to be populated and there are different methods of doing this, including a bulk load or real-time update based on committed transactions on the DB2 log. Each choice has performance and data availability implications.
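For the bulk-load choice, IDAA exposes stored procedures for refreshing the accelerator copy of a table. Parameter details vary by accelerator version, so treat the call below as a hedged sketch; the accelerator name and table specification are placeholders:

```sql
-- Illustrative bulk refresh of accelerator-resident tables.
-- Exact parameters depend on the IDAA version installed.
CALL SYSPROC.ACCEL_LOAD_TABLES(
  'MYACCEL',                                   -- accelerator name
  2,                                           -- lock mode (illustrative)
  '<tableSpecifications>...</tableSpecifications>',  -- XML table list
  NULL);                                       -- message output
```

The alternative, incremental update from committed transactions on the DB2 log, trades load-window simplicity for fresher data; the right choice depends on how current the queries need the appliance copy to be.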
Sandbox, development, test. How will newly developed applications be tested? Will there be multiple appliances (or appliance instances) for multiple environments? For current ad hoc analysis, there’s probably only a production appliance. As your implementation matures and lines of business develop requirements for application access, this will change. Consider a Big Data store of customer transactions. If analysis of the data indicates that customers would be better served (and profits would be higher) by implementing a new application for customer service to query the data in real-time, performance of the application is now a high priority. To test the application, a test environment with an appliance and full production data image may be necessary.
Which tables to store. While appliances are designed to store huge amounts of data, the intent isn’t to store all enterprise DB2 tables in them. On the other hand, if performance is a concern for queries that join multiple tables, these tables should be instantiated in the appliance. This probably means storing a set of “core” tables in the appliance, along with selected Big Data tables.
Backup and recovery. Business analytics against Big Data implies querying static data. While the portions of the Big Data tables being queried are probably static, other business data may not be. This data is subject to backup and recovery. If a rogue application or a hardware or software malfunction requires recovery of a DB2 table, what happens in the Big Data store? If the table exists there, how will it be recovered and how long will it take? This may cause issues with your disaster recovery planning.
With Big Data up and running in your shop, what does the future hold? Lines of business will desire access to this data, and application developers must be prepared to develop and follow new best practices in this environment.
To achieve their return on investment, lines of business may push to implement important or mission-critical business analytics (BA) application solutions that query the Big Data store. Initial implementation success may be problematic. In two recent IDC surveys (the IDC Vertical IT & Communications Survey and the DW and IDC Business Intelligence & Analytics Survey), only 28 percent of respondents were very satisfied or satisfied with the performance of their BA solution. Further, 46 percent of respondents noted there would be an “immediate material negative impact on business operations” if the BA solution was out of service for up to six hours.
Of course, different application types will encounter different concerns. Some major application categories are:
Single query. Most Big Data implementations include the desire to issue ad hoc queries against the new data store. Indeed, there are multiple business intelligence and analytics software packages available from vendors. The next logical step is the creation of internally driven business applications that construct and issue such queries based on specified criteria. As the number of applications increases, the load on the Big Data appliance grows. Data access scheduling and performance now become critical elements of an overall data governance plan.
Reporting. Another category of application is the report, either one-time or regularly scheduled. Many of the initial ad hoc queries will provide valuable insight into the data; hence, business areas will desire to execute these queries regularly, perhaps expanding them to include multiple time periods and multiple geographic areas with subtotals and grand totals. Simple, one-time queries may evolve into daily, weekly and monthly reports. These applications will figure prominently in performance and capacity planning considerations.
Complex SQL, multi-table access. Some applications will require access to data in multiple locations, including online data stores, data warehouse tables and tables in the appliance. Sometimes this may not prove feasible. Some appliances require that an SQL statement executed in the appliance must access tables stored within the appliance. In these cases, IT architects may have no choice but to plan a migration of tables to the appliance, again with performance and capacity planning considerations.
IT architects, DBAs and management must look beyond the current hype of Big Data to the future. An expanding number of applications will require Big Data access. In addition, as internal departments become more familiar with the business value of the Big Data, the number of ad hoc queries and regular reports will expand.
IT will need to address application performance, system performance and resource capacity concerns. These include Big Data appliance hardware and software upgrades, software tools for performance monitoring, acquisition of larger and larger amounts of data storage, and backup and recovery processes.
All of these concerns are part of data governance best practices. Development, documentation, adherence to and review of these practices will be critical elements in determining whether or not Big Data remains an important part of your IT infrastructure.