May 8 ’12

When Performance & Capacity Planning Processing Becomes a Problem

by Chuck Hopf in z/Journal

Years ago, when a million CICS transactions a day was a lot, the data went on tape because it was so “huge.” Today, 500 million DB2 transactions may be considered small and, in many installations, the jobs that post-process System Management Facility (SMF) data have become among the largest applications. With the rising cost of software to support the ever-burgeoning volume of data, it’s important to decide what to keep and process daily. The tests run and examples provided here were all conducted with MXG, but the same techniques could be applied to other software performing the same functions.

The Problem

Consider the daily volume of SMF data for a relatively small shop where not all possible SMF records are written. Figure 1 shows 17GB of data, with nearly 90 percent of the volume coming from DB2 and CICS. Post-processing of the SMF data is done with MXG and is broken into the following three jobs (not counting some weekly jobs; there's no monthly processing):

• The BASE PDB, excluding CICS and DB2
• The CICS/DB2 PDB
• A job that puts some data from the first two together for reporting.

Figure 2 shows the required processing time of this data in minutes. It doesn’t seem like a lot, but the DB2/CICS job is always one of the top-10 resource-consuming jobs. On a small 2098-T04, 45 minutes of CPU time is a huge chunk of capacity to have tied up for more than an hour every morning, when the system is usually busy.

It’s time to decide if all this is a necessary, useful consumption of resources. If not, how do we fix it? To start the analysis, let’s break data into three categories:

• Tactical data is needed to solve problems. It’s most likely only needed for a matter of days or weeks to resolve any outstanding issues.
• Strategic data is needed for long-range planning. It’s typically highly summarized with retention of several years, but with the rapid evolution of technology, data older than perhaps five years is only an interesting historical artifact; it’s not usually useful for future planning.
• Accounting and security data are used for chargeback and tracking security violations. There can be legal requirements for long-term detail storage. Let the auditors decide.

Clearly, some data crosses boundaries. Job-level data could easily fall into all three categories. In those cases, the category with the longest retention wins.

Figure 1 showed that DB2 and CICS account for nearly 90 percent of the data volume, and Figure 2 shows that processing them consumed roughly 90 percent of the total CPU time as well. A series of MXG benchmark tests was run to determine the major CPU consumer:

1. Run BUILDPDB with normal MXG defaults
2. Run BUILDPDB but suppress processing of CICS records
3. Run BUILDPDB but suppress processing of DB2 records
4. Run BUILDPDB but suppress processing of TYPE 74
5. Run BUILDPDB but suppress processing of DB2 and CICS
6. Run BUILDPDB but suppress processing of DB2, CICS, and TYPE 74
7. Run BUILDPDB but suppress DB2ACCT
8. Run BUILDPDB but suppress CICSTRAN and DB2ACCT
9. Extract one hour of CICSTRAN and DB2 data (a sample extraction job appears after this list).
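
As an illustration of test 9, a one-hour slice of the DB2 and CICS data can be carved out of the daily SMF dump with IFASMFDP before MXG ever reads it. The sketch below is skeletal rather than production JCL: the dataset names are placeholders, the date and times are examples, and the subtype filter TYPE(110(1)) assumes a z/OS level that supports subtype selection (if yours doesn't, select all of type 110):

   //EXTRACT  EXEC PGM=IFASMFDP
   //SMFIN    DD DISP=SHR,DSN=YOUR.DAILY.SMF.DUMP
   //HOUROUT  DD DSN=YOUR.SMF.ONEHOUR,DISP=(NEW,CATLG),
   //            UNIT=SYSDA,SPACE=(CYL,(500,100),RLSE)
   //SYSPRINT DD SYSOUT=*
   //SYSIN    DD *
     INDD(SMFIN,OPTIONS(DUMP))
     OUTDD(HOUROUT,TYPE(101,110(1)))
     DATE(2012128,2012128)
     START(0900)
     END(1000)
   /*

The DATE, START, and END controls limit the copy to the window of interest (here, 9 to 10 a.m. on Julian date 2012.128), so the downstream MXG step sees only the hour it needs.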

Test 9 will come into play later. The results of the testing in Figure 3 showed that:

• The bulk of the CPU time was being consumed by DB2ACCT (type 101 records).
• Processing of CICS transaction data wasn’t a major factor.
• Type 74 data wasn't an issue—though in a larger environment with more than a single large Redundant Array of Independent Disks (RAID) box, processing type 74 data can become onerous.

For this shop, processing type 101 and type 110 subtype 1 records in separate jobs would yield the best throughput and earliest completion of the daily jobs. But is that the best answer? Do we really need to process all that data daily or do we have options?

Thirty years ago, there was often a time in the wee hours of the morning when none of this would have mattered. The system was effectively idle between the end of the batch cycle and when the online activity began in the morning. However, in today’s world of non-stop availability with transactions coming in from around the globe, that window of opportunity has usually vanished. Reducing the system resources required for SMF processing may be critical. In the eyes of some management, it’s pure overhead.

Consider the three data types again:

• Tactical. When there are problems in DB2 or CICS, the data may be critical in finding and resolving those issues. But if a problem lasted for 15 minutes, do we need the data for the entire 24 hours or do we just need the time before, during, and after the problem? Test 9 showed that we can quickly extract the data for a one-hour period when necessary.
• Strategic. Certainly we need to keep track of DB2 and CICS transaction volumes, response time, and consumption for long-term trending. But do we need the detail data to do that or can we find another way?
• Accounting. If we’re doing detailed chargeback, it may be required that all data be processed and retained for some period. There can also be legal requirements if there’s outsourcing; only your DP auditors know for sure and there’s a good chance that if you ask, the answer will be the ever-popular but impractical “forever.” Changes in technology and the volatility of storage devices may preclude forever, but five to seven years isn’t uncommon.

The two most problematic transaction types are Distributed Data Facility (DDF) and CICS. If we can find a way to satisfy most of the reporting and accounting needs without processing the millions of detail records, a substantial amount of system time and resources can be saved. With that processing eliminated, the post-processing of SMF data goes back to being noise in the system.

So, what’s possible?

For DB2, there are multiple types of transactions: CICS, BATCH, DDF, and so on. In many shops, the dominant workload is rapidly becoming DDF transactions. If we know enough about the transactions, Workload Manager (WLM) will let us classify the workload based on information contained in the headers of the accounting records. Then, using a report class, we can gather the information needed to satisfy reporting needs from the Resource Measurement Facility (RMF) type 72 data.

There are several fields available in the Application Program Interfaces (APIs), whether Java or DB2 Connect, that can be used to identify transactions for DB2 classification. Some—such as SYSTEM, DB2 SUBSYSTEM, PACKAGE NAME, and AUTHORIZATION ID—are fundamentally useless for our purposes. The fields that can be used are accounting information and application name. They're available in the API to the programmer at the distributed end; it's a matter of getting programmers to actually use them. This is somewhat akin to trying to pull the teeth of a tiger without benefit of anesthesia. It's resisted as being too complicated (it isn't), too expensive (it isn't), or too late in the development cycle to make a change now (and that's probably the closest to the truth). It likely requires management pressure and changes in standards to be effective. But if there's a standard that says the fields must be completed in a standardized format, there can be no argument.

We can modify the WLM policy to assign a report class to as many or as few of these groupings as may be needed to track usage for accounting and the volume statistics for reporting and planning. In the type 72 records for DDF, we get a transaction count, average response time, and CPU consumption, which will generally be all we need to satisfy short- and long-term reporting requirements.
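
To make that concrete, here's a sketch of what DDF classification rules keyed on the accounting-information (AI) qualifier might look like. The application names, masks, and class names are hypothetical, and in practice the rules are entered through the WLM ISPF application:

   Subsystem Type: DDF
   Qualifier  Qualifier       Service   Report
   Type       Name            Class     Class
   ---------  --------------  --------  --------
   AI         ORDERS*         DDFMED    RDDFORD
   AI         PAYROLL*        DDFMED    RDDFPAY
   (default)                  DDFLOW    RDDFOTH

Once the report classes are in place, the RMF workload activity data (SMF type 72 subtype 3, the TYPE72GO dataset in MXG) carries the transaction counts, average response times, and CPU times per report class, with no need to touch the type 101 detail.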

What about the rest of the DB2 world? Most of the remainder consists of batch jobs, whose consumption is recorded in the type 30 SMF records (with no way to separate the DB2 portion from the non-DB2 portion), or CICS transactions.

What do we do about CICS? We can define report classes for CICS transactions. That gives us a count of transactions and response times, but not the CPU and other resources the transactions consume. Some of that data could be gleaned from the CICS statistics records, and that might be enough to satisfy the ongoing reporting needs without processing detailed CICS transactions. But since CICS volumes are dwarfed by DB2 volumes, continuing to process the CICS records may not be the worst that could happen.

Would sampling of the data be adequate for capacity planning? If a one-hour sample were taken daily, then, given transaction counts from the RMF report classes, you could project CPU and resource consumption for CICS as well as DB2 (if the sampled hour shows a given CPU cost per transaction, multiply it by the day's transaction count from the report class). That, along with problem resolution, was the purpose of test 9. For this site, extracting a single hour of DB2 and CICS data ran in under 15 minutes of both elapsed and CPU time. This demonstrates that it's possible to extract data for problem solving or for ongoing sampling relatively quickly and painlessly.

But what if you’re stuck with detailed accounting and a requirement to process all the DB2 and CICS data? It can still be done, but the smart way to do it is to divide and conquer. The rest of this article will demonstrate how to do it in MXG.

MXG 29.04 provides examples that break processing of SMF data into these pieces:

• CICS transactions – type 110 subtype 1 records
• DB2 accounting records – types 101 and 102
• MQ records – types 115 and 116
• I/O-related record types 14, 15, 42, 61, 64, 65, 66, and 74, plus the HSM records
• All other SMF data.

Job Control Language (JCL) is provided to split the data using IFASMFDP.
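
A minimal sketch of such a split job follows. It's illustrative rather than a copy of the MXG-supplied members: the dataset names are placeholders, space parameters are abbreviated, the TYPE/NOTYPE filters and subtype selection are as I'd code them on a current z/OS level, and the HSM records are written under an installation-chosen user record type, so add that type number to the IO file for your site. Each OUTDD is filtered independently, so a single pass over the daily dump produces all five files:

   //SPLIT    EXEC PGM=IFASMFDP
   //DUMPIN   DD DISP=SHR,DSN=YOUR.DAILY.SMF.DUMP
   //CICS     DD DSN=YOUR.SMF.CICS,DISP=(NEW,CATLG),
   //            UNIT=SYSDA,SPACE=(CYL,(900,300),RLSE)
   //DB2      DD DSN=YOUR.SMF.DB2,DISP=(NEW,CATLG),
   //            UNIT=SYSDA,SPACE=(CYL,(900,300),RLSE)
   //MQ       DD DSN=YOUR.SMF.MQ,DISP=(NEW,CATLG),
   //            UNIT=SYSDA,SPACE=(CYL,(300,100),RLSE)
   //IO       DD DSN=YOUR.SMF.IO,DISP=(NEW,CATLG),
   //            UNIT=SYSDA,SPACE=(CYL,(300,100),RLSE)
   //OTHER    DD DSN=YOUR.SMF.OTHER,DISP=(NEW,CATLG),
   //            UNIT=SYSDA,SPACE=(CYL,(300,100),RLSE)
   //SYSPRINT DD SYSOUT=*
   //SYSIN    DD *
     INDD(DUMPIN,OPTIONS(DUMP))
     OUTDD(CICS,TYPE(110(1)))
     OUTDD(DB2,TYPE(101,102))
     OUTDD(MQ,TYPE(115,116))
     OUTDD(IO,TYPE(14,15,42,61,64:66,74))
     OUTDD(OTHER,NOTYPE(14,15,42,61,64:66,74,101,102,110(1),115,116))
   /*

The five files can then feed five independent MXG jobs running in parallel, which is where the elapsed-time savings in Figure 4 come from.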

Figure 4 shows the results of using a 3GB sample of data and the provided JCL* members for z/OS and BLD* members for ASCII execution. For these tests, BLDSIMPL and JCLSIMPL are jobs that read and process all the SMF data in a single pass. The BLD and JCL versions of SPSMA, SPSMB, SPSMC, and SPSME read the portions of the SMF data that have been split off. BLDSPUOW and JCLSPUOW combine the DB2 accounting data and CICS transaction data at the Unit of Work (UOW) level. The elapsed-time comparisons are between the SIMPL jobs and the sum of the DB2 and UOW jobs, since the parallel jobs' elapsed time is bounded by the longest of them (DB2) plus the UOW job that follows; the CPU-time comparisons are between the SIMPL job and the sum of all the parallel jobs plus the UOW job.

There were a total of 2,353,851 SMF records with 917,607 DB2 accounting records and 255,522 CICS transactions for these tests. Clearly, regardless of the operating system, some significant reductions in the elapsed time to process the data and the resources consumed in processing can be achieved.

Conclusions

Should we process all the data every day? As usual, the answer is, "It depends!" If detailed accounting is a requirement, it may not be possible to avoid processing all the DB2 and CICS transaction data. If it isn't, then judicious use of RMF report classes and some sampling of CICS and DB2 data may let you avoid that burdensome and expensive processing altogether. And if it can't be avoided, mechanisms are available to divide the processing into more bite-sized pieces or to offload it to an ASCII platform for greater efficiency.

How long should data be retained (see Figure 5)? Some choices are obvious. Type 99 records serve no real purpose other than problem determination and then only if IBM asks for them. Should you record them? Absolutely! The cost of recording them and keeping them for a few days is minimal. If you do have a problem that might be related to WLM, IBM will request type 99 records. If you don’t have them, you will have to go back and try to re-create the problem. If you have them, then the level 1 and 2 help desk folks get to do some work.

What about accounting data? Why ask auditing? If you don't receive advice from auditing, it's likely your management will have to answer for an audit "finding" at some point; if you're following auditing guidance, you're on solid ground. If they insist you keep all the detailed DB2 and CICS accounting data, do a cost analysis. Figure out how many new tapes it will take per year to retain the data; high-density tape cartridges are expensive. Be sure to include the resources needed to make the copies and the Mean Time to Failure (MTTF) of the media. Even CDs have an MTTF. You may not have the money in your budget.
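
As a rough worked example using this article's own numbers: at 17GB of SMF data per day, a year of raw detail is about 6.2TB before compression, and a seven-year retention is roughly 43TB—before counting the duplicate copies needed for redundancy and the periodic re-copying as media ages. Put a cartridge price, a mount cost, and a re-copy cycle against those numbers and the ever-popular "forever" acquires a visible budget line.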

With the rapid pace of change, how valuable is five- or 10-year-old RMF data? There have likely been one or more architectural changes in the interim and almost certainly many application changes. While it sometimes makes for interesting archaeology, it may not be useful for planning purposes and certainly isn’t useful for problem-solving.

We may tend to be pack rats, but the storage and processing of all this data can become expensive. We should manage our own applications—just as we demand that others manage theirs.