The keys to good performance are to issue as few I/Os as possible, reduce internal data movement, and minimize table processing. One methodology doesn’t suit all circumstances, so ATPCO used numerous methods:
- Since thread creation is expensive, we took steps to avoid job initiation. When events that warrant a job were infrequent, we made job initiation event-triggered rather than polling on a time interval. When these job-triggering events were too frequent and numerous, we stacked them up and released them at regular intervals.
- Usually, external I/O to and from data files and networks is the slowest component. Our first priority was to tune outside the application by optimizing buffers or cache controllers or accelerating I/O with DFSORT.
- We reduced slack and wait for resources by eliminating resource contention and optimized our batch window by increasing parallelism.
- Avoiding date format conversion reduced internal data movement. Date is a big part of our data attributes; it’s received in different formats from numerous sources. We try to ensure that it’s converted only once to the internal format, and, in most cases, we’ve standardized the date element for minimal conversion.
- We reduced table processing and searches. We learned that in-memory processing is faster than DB2 temp tables, and binary searches are faster than serial searches. We conducted binary search on the one-to-one relationships and serial search only on no-match of the one-to-many relationships. We split tables larger than the maximum allowable sizes and used pointers to map the tables rather than moving data around. We developed a way of pre-calculating the size of the searches so we can use these techniques only when necessary.
- We avoided wide loops by placing unnecessary functions outside the loop and positioning the busiest loop inside while nesting.
- We learned the hard way that run units can be timed out, buffers can be paged out, and inactive files can be quiesced if we don’t optimize the life of threads. We streamlined and minimized external data access by storing data in memory for the duration of the process. We also reduced the size of DBRM and program to avoid EDM pool paging. We identified and converted edits that different modules performed repeatedly.
DB2 tuning efforts that gave us the most performance gains involved:
- Targeting packages that showed high getpage and taking steps to reduce it
- Tracking deadlocks and lock escalations and re-evaluating commit frequency
- Deferring updates until the commit point to minimize the duration of lock
- Evaluating actual usage of indexes and making adjustments accordingly
- Reviewing nested loop joins to confirm they were looping on the smaller table and achieving up to a 90 percent savings by forcing the SQL to loop on the smaller table
- Identifying single object constraints and increasing parallelism by partitioning into smaller objects
- Using the multi-row insert capability introduced in DB2 Version 8 for a 20 percent savings. More savings could have been achieved with the atomic option, but the non-atomic option gave us a unique time stamp (40 to 50 microseconds apart) on a column defined as timestamp with default. Using the multi-row fetch operation yielded an almost 50 percent improvement, but the overall gain was only 25 percent because the multi-row result set had to be adapted to a framework designed to handle one row at a time.
Here are some additional improvements we made and lessons learned in the process:
- A DB2 table indexed on an ever-ascending timestamp column was defined with no free space and no free page. Everything was fine as long as there was only one thread inserting data. When multiple concurrent threads started, the first one put an exclusive lock on the last page for insert. The second thread tried unsuccessfully to get a lock on the target page. Since the target page is the last page, it started browsing from the beginning. We observed the high getpage and corrected the problem by adding some free pages.
- A singleton fetch-only thread bound with cursor stability was providing great performance, but five concurrent threads going after keys that are far apart (no two SQLs targeting the same page) resulted in poor performance. Concurrent threads showed a tremendous increase in synchronous read. We tested on an isolated DB2 system to further research the scenario. At execution of concurrent threads, the buffer pool and other resources were in contention and the optimizer may have decided to do synchronous reads. There must be a maximum limit on the percentage of the buffer pool that can be monopolized by a single object. We corrected the situation by partitioning the table and rendering it into multiple objects.
- We noticed an increased amount of Systems Management Facility (SMF) log offloading when a particular CICS task was running. The CICS-DB2 thread with RELEASE (COMMIT) attribute would cut a thread termination record, SMF-101, at commit time. This particular task was looking for an unused five-digit number. Once the largest five-digit number is used, it looks for unused holes, with every single lookup resulting in a commit to release lock. The application was modified to correct the situation.
- One application that keeps a large set of locks between commits was affected when a monitoring facility capturing IFCIDs, which uses Extended Common Service Area (ECSA) heavily, was turned on. The coincidence was identified and the problem was corrected.
- We observed some SQLs doing table space scans instead of list prefetch. We identified the root cause as a RID LIST exceeded condition. Doubling the RID pool size corrected most of the issues and the remaining problems were handled on a case-by-case basis by tuning SQL and avoiding a large RID LIST.
- A column with possible Y/N value defaulted to cardinality 25 because of insufficient statistics, resulting in poor performance. We corrected this with enhanced distribution statistics.
- An SQL used match columns that weren’t in the index, causing data pages to be searched in the join. Index columns alone would have sufficed.
- A table with a single identity column and no index was used as a next number generator. An application was created to insert a row, get the next number, delete the row, and proceed. The unit of work couldn’t be committed until much later. We noticed that the delete action was taking a table lock rather than a page lock. Table lock for deleting a row with index is justified because it can’t release the index page until commit, but we didn’t understand the reason for the table lock when there’s no index. We modified the application so it didn’t delete the row and added an offline process to delete rows behind the scenes.
With an IT agenda that was dominated by cost cutting, ATPCO management had the foresight to invest in efficiencies. A CPU upgrade was deferred in 2007, major applications were installed with no adverse performance impact in 2008, and the momentum continued in 2009. In January 2010, CTO Steve Daniels explained that adjustments to ATPCO’s mainframe performance improved the entire IT services division and “literally saved the company millions of dollars.”
Figure 1 shows how the 2009 average work units per workday increased by 20 percent, yet mainframe usage declined by nearly 5 percent. All solutions haven’t yet been installed because they’re being completed in stages based on resource availability and business needs.