Feb 1 ’06
Minimizing Change in Your DB2 for z/OS Database
Changing data in your DB2 for z/OS database can be an expensive process, especially at high volumes. While DB2 does provide some efficiencies for updating data, there are also several things you can do in your database and process design to minimize the effect of change on the database and improve application performance. Minimizing change is a great way to reduce the need for database REORGs and backups, and running fewer of these utilities improves database availability and reduces overall application CPU consumption.
To achieve high availability and high performance, we can target several areas such as minimizing logging, minimizing updates, and designing for ascending key inserts. The challenging part is knowing the volume of data and designing the tables and applications to support minimal change. Trying to do this after a standard implementation is nearly impossible.
Minimizing changes can dramatically reduce logging overhead, improve application throughput, and reduce both the number of locks taken and the contention between application processes. Several database settings directly affect the amount of logging that occurs during updates.
Use of DATA CAPTURE
For updates, DB2 records undo/redo records in the log. A CREATE or ALTER TABLE definition includes an option called DATA CAPTURE, with two settings: DATA CAPTURE CHANGES or DATA CAPTURE NONE, the default being NONE. For inserts and deletes, DB2 always records a full image of the affected row, but for updates, DB2 can minimize the data recorded in the log. DATA CAPTURE CHANGES tells DB2 you're planning to use the log information for some sort of data replication or log analysis. There are several data replication tools that can read log records, or use a DB2 log exit, to replicate database changes to other databases or processes. For the replication to work properly for updates, it needs the entire before and after image of each changed row. So, DATA CAPTURE CHANGES will cause DB2 to log the entire before and after images of rows changed via update statements in the undo/redo records in the log.
DB2 will minimize the information logged during updates if DATA CAPTURE NONE is set for a table. A good performance practice is to specify DATA CAPTURE NONE for a table unless you’re specifically planning on replicating the table via a product that reads log records.
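As a sketch of the DDL involved (the table and column names here are hypothetical, not from the article):

```sql
-- DATA CAPTURE NONE (the default) lets DB2 minimize what it logs
-- for updates
CREATE TABLE ACCT_BAL
   (ACCT_ID INTEGER        NOT NULL,
    BALANCE DECIMAL(15, 2) NOT NULL)
   DATA CAPTURE NONE;

-- Switch to full before/after row images only if a replication or
-- log-analysis product actually needs to read the log records
ALTER TABLE ACCT_BAL DATA CAPTURE CHANGES;
```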
Logging for Index Changes
When defining indexes, you can specify COPY NO (default) or COPY YES. COPY YES tells DB2 you’ll be using the COPY and RECOVER utilities on the index. This can be helpful for large, infrequently updated indexes, but it increases the logging for these indexes. Before setting COPY YES for indexes, make sure that recovering the index will be more efficient than rebuilding it. If rebuilding the index is faster (given available CPU), then setting COPY NO will reduce logging overhead for index changes.
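A sketch of the two choices (index and table names hypothetical):

```sql
-- COPY NO (the default) keeps index logging to a minimum; after a
-- failure the index is rebuilt from the table data
CREATE INDEX XACCT1 ON ACCT_BAL (ACCT_ID) COPY NO;

-- COPY YES enables the COPY and RECOVER utilities on the index,
-- at the cost of additional logging for index changes
ALTER INDEX XACCT1 COPY YES;
```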
Table Design for Minimal Logging
For fixed-length rows, and for rows with VARCHARs where an update doesn't change the row length, DB2 logs from the first byte changed to the last byte changed. If the row has a VARCHAR and an update could change the row length, you can minimize logging by placing the VARCHAR before the regularly updated columns. For fixed-length rows, logging can be minimized by placing the regularly updated columns together. Either way, you minimize the amount logged. For a transaction that runs a million times a day doing updates, this design can prove quite beneficial.
Figure 1 shows how the table was originally designed, and Figure 2 shows the table designed for minimal logging. In the original design, the frequently updated columns are mixed in with columns that never change; the high-volume update transaction that updates the account balance will log nearly all the data in the row, because the timestamp and last-update user ID columns also change, and DB2 has to log from the first byte changed to the last.
In Figure 2, we've moved together the three fields that are always changing so that, for the high-volume update transaction, we're logging the minimal amount of data.
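The figures aren't reproduced here, but the idea can be sketched with hypothetical column names:

```sql
-- Original layout: the frequently updated columns (BALANCE,
-- LAST_UPD_TS, LAST_UPD_USER) are scattered through the row, so a
-- balance update logs nearly the whole row
CREATE TABLE ACCOUNT_ORIG
   (ACCT_ID       INTEGER        NOT NULL,
    LAST_UPD_TS   TIMESTAMP      NOT NULL,
    ACCT_NAME     CHAR(40)       NOT NULL,
    BALANCE       DECIMAL(15, 2) NOT NULL,
    OPEN_DATE     DATE           NOT NULL,
    LAST_UPD_USER CHAR(8)        NOT NULL);

-- Redesigned layout: the three always-changing columns sit together,
-- so the same update logs only that short range of bytes
CREATE TABLE ACCOUNT_MINLOG
   (ACCT_ID       INTEGER        NOT NULL,
    ACCT_NAME     CHAR(40)       NOT NULL,
    OPEN_DATE     DATE           NOT NULL,
    BALANCE       DECIMAL(15, 2) NOT NULL,
    LAST_UPD_TS   TIMESTAMP      NOT NULL,
    LAST_UPD_USER CHAR(8)        NOT NULL);
```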
Be careful with compression, because DB2 will treat compressed rows as varying length, with length change on update. Logging then occurs from the changed column to the end of the row, so you may want to consider placing highly updated columns toward the end of the row.
Splitting High-Update Tables
Another way to minimize change and reduce logging and maintenance costs is to separate tables that contain both static columns and heavily updated columns into two separate tables. This type of design works for high-volume tables and requires enough analysis to determine which columns are updated and how often. This works best if there are columns that are always static and columns that always change. Moving the columns that always change to their own table can help reduce logging as well as the amount of data that needs to be REORGed.
Figure 3 shows how analysis of changes to a high-volume table resulted in splitting the table into a high-update table and a low-update table. The high-update table could then be given its own free space settings and REORG schedule, while the low-update table received less aggressive free space and REORG schedule settings.
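A sketch of such a split, with hypothetical table and column names:

```sql
-- Low-update table: columns that rarely or never change, with
-- modest free space and an infrequent REORG schedule
CREATE TABLE ORDER_BASE
   (ORDER_ID   INTEGER NOT NULL PRIMARY KEY,
    CUST_ID    INTEGER NOT NULL,
    ORDER_DATE DATE    NOT NULL);

-- High-update table: the always-changing columns, related 1:1 by
-- key, with its own free space settings and REORG schedule
CREATE TABLE ORDER_ACTIVITY
   (ORDER_ID  INTEGER   NOT NULL PRIMARY KEY,
    STATUS_CD CHAR(2)   NOT NULL,
    LAST_UPD  TIMESTAMP NOT NULL);
```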
How DB2 Avoids Change, and How to Take Advantage of It
DB2 knows when you're actually changing data. It compares the values of the columns being SET in an update statement to the current values of the row being updated. If an update isn't actually changing the value of a column, DB2 won't physically update that column and, where possible, won't log it either. We can look at the projected update frequencies of the columns in our tables and adjust our column ordering to take advantage of this efficiency. That is, we take the column ordering discussed earlier to a more granular level by making sure the less frequently updated columns come before the more frequently updated columns. Figure 4 shows an efficient column design that minimizes logging by taking the update frequency of each column into account.
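For example (hypothetical table and host variables), an update that SETs a column to the value it already holds costs less than it appears to:

```sql
-- If :acctName matches the stored ACCT_NAME, DB2 won't physically
-- update that column and, where possible, won't log it; with the
-- truly volatile columns placed last, the logged byte range stays
-- short
UPDATE ACCOUNT
   SET ACCT_NAME   = :acctName,
       BALANCE     = :newBalance,
       LAST_UPD_TS = CURRENT TIMESTAMP
 WHERE ACCT_ID = :acctId;
```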
Dealing With Blind Updates
There are many situations in which an application must deal with what we call “blind updates,” which occur when an application receives a set of data to be updated in the database, but doesn’t know which of the elements included in the set of data are actually true updates. To be specific, a record can be received as input to the application, but most of the data in the record is already recorded in the database. In some situations, the difference between fresh and stale elements can be extreme (such as a status code change from “order entered” to “order filled”). While DB2 does a good job of detecting when columns are actually changing, it may still be prudent to reduce the number of update statements sent to the database, especially for high-volume applications. It takes CPU and I/O to run these statements, and if we can avoid them for performance reasons, we should.
Being able to detect change in these situations can result in significant CPU and elapsed time savings, especially when the ratio between columns and rows that actually change is low compared to the columns and rows that don’t change. In one particular example, the application was to receive tens of millions of blind updates daily. The data was analyzed to determine which data always changed and which data rarely changed. It was determined that 80 percent of the time, only a handful of columns actually regularly changed values. The application was designed based on this concept to always blindly apply the column updates that were expected to change, and to use a change detection process to apply the changes to the other columns only when a change was detected.
In this case, we employed a Cyclic Redundancy Check (CRC) algorithm to quickly determine when any of the rarely updated columns had changed. The rarely changing columns were hashed via the CRC algorithm and compared to a stored CRC value on the database. This allowed the application to avoid updates to hundreds of columns that rarely changed unless the CRC detected at least one mismatch. This was significant for the millions of blind updates received daily, since applying only the regular changes cost 1.8ms elapsed and 500ms CPU, while updating all the columns cost 400ms elapsed and 25ms CPU. Figure 5 demonstrates this process.
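One way to express this pattern in SQL (a sketch under assumptions, not the application from the article; the names are hypothetical, and the CRC over the rarely changing columns is computed by the application before the statements run):

```sql
-- First statement, always executed: the handful of columns that
-- regularly change
UPDATE CUST_PROFILE
   SET BALANCE   = :newBalance,
       STATUS_CD = :newStatus,
       LAST_UPD  = CURRENT TIMESTAMP
 WHERE CUST_ID = :custId;

-- Second statement, executed only when the application-computed CRC
-- differs from the stored one; the predicate guards against
-- redundant updates of the rarely changing columns
UPDATE CUST_PROFILE
   SET ADDR_LINE1 = :addrLine1,
       ADDR_CITY  = :addrCity,
       ROW_CRC    = :newCrc
 WHERE CUST_ID = :custId
   AND ROW_CRC <> :newCrc;
```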
In another situation, we used triggers both to replicate changes to data and to avoid blind updates for that replication. Here, the application requirement was to immediately audit changes to data. Audit tables were established for each base table in the application to record the before images of changed data whenever a change occurred. This was accomplished using after update triggers that simply inserted the before image when an update occurred. It worked well, except that the application performed blind updates, which resulted in many update statements that updated all columns, including the update timestamp column, when nothing in the given row had actually changed.
The after update triggers would still replicate before images to the audit tables even when no real change occurred, and these unwanted audit rows had to be avoided. This was accomplished by adding before update triggers. These triggers compare all the columns of the new row to the columns of the old row, with the exception of the key and the update timestamp. If there are no actual changes to the data, an SQLSTATE is signaled back to the application and the update is rejected. Figure 6 shows the before and after update triggers.
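The triggers in Figure 6 aren't reproduced here; a minimal sketch of the pair, assuming a hypothetical ORDER_TBL with a key, one status column, a quantity, and an update timestamp, might look like this:

```sql
-- Before update: compare old and new values (excluding the key and
-- the update timestamp) and reject the update if nothing changed
CREATE TRIGGER ORD_NOCHG
   NO CASCADE BEFORE UPDATE ON ORDER_TBL
   REFERENCING OLD AS O NEW AS N
   FOR EACH ROW MODE DB2SQL
   WHEN (N.STATUS_CD = O.STATUS_CD AND N.QTY = O.QTY)
      SIGNAL SQLSTATE '75001' ('NO ACTUAL CHANGE - UPDATE REJECTED');

-- After update: record the before image in the audit table
CREATE TRIGGER ORD_AUDIT
   AFTER UPDATE ON ORDER_TBL
   REFERENCING OLD AS O
   FOR EACH ROW MODE DB2SQL
   INSERT INTO ORDER_TBL_AUD
      VALUES (O.ORDER_ID, O.STATUS_CD, O.QTY, O.LAST_UPD);
```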
Inserts and Clustering
There’s another less talked about, but significantly important, advantage to minimizing change. If we can minimize changes to our tables, then we can minimize the table maintenance required. If we have tables that handle high volumes of inserts, we can minimize the impact of those inserts with an “insert at the end” strategy. If we construct our table with an ever ascending key, or by using the MEMBER CLUSTER option with FREEPAGE 0 and PCTFREE 0, we can guarantee new inserts will go to the end of our tablespace data sets. By focusing the inserts at the end of the tablespace, or tablespace partition, we can minimize the data that needs to be REORGed or backed up. Remember that REORGs and copies consume system resources. These costs are sometimes not estimated as application cost, but should be.
Figure 7 shows a tablespace design that accepts inserts at the end. This partitioned tablespace grows from the end, so only the most recent partition needs tablespace maintenance (REORGs and copies). This minimization of change results in dramatically fewer REORGs and copies than an equivalent design clustered by a non-ascending key or built without the MEMBER CLUSTER option.
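A sketch of the DDL for such a design (database, tablespace name, and partition count are hypothetical):

```sql
-- MEMBER CLUSTER with no free space directs new inserts to the end
-- of the data set rather than near their clustering position
CREATE TABLESPACE ORDERTS IN ORDERDB
   NUMPARTS 12
   MEMBER CLUSTER
   FREEPAGE 0
   PCTFREE 0
   LOCKSIZE ANY;
```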
Everything operating in DB2 costs CPU resources. Without planning, we often don't have a clean way to minimize the amount of change we make and the amount of maintenance we'll have to perform because of that change. In today's environment, we must take the amount of change into account when we're designing our tables, because that's where it has to begin. It's also impossible to achieve 24x7 availability if we don't minimize changes and maintenance from the start. With some understanding of what's going to be changing and how to minimize the impact of those changes, we can achieve higher throughput and availability.