Sep 1 ’08

Want to Know What DB2 Is Doing? Take a Closer Look at DB2’s TRACE

by Willie Favero in z/Journal

If asked what the three biggest challenges are with DB2 for z/OS, the answer you might hear most often is, “Performance, performance, and performance.” The most jammed sessions at conferences cover performance topics. The most popular articles are on performance, as are most questions on lists and forums.

Is the DB2 for z/OS community obsessed with performance? Are there more performance issues with a mainframe database? Well, no more or less than any other platform. Performance is just something that’s always been a priority. ... The mainframe, and DB2 for z/OS, both offer what can be a blessing and, at times, a curse: great metrics. There are records describing just about everything. How do you obtain this information? One way is with DB2 traces.

Often, when someone has a DB2 for z/OS performance question, one of the best first responses is, “What traces were you running?” After all, how can you determine if something isn’t doing what you expect, within the parameters you expect, if you have no idea what it’s doing? That can be a real problem. You don’t always have the luxury of running an application a second time in an attempt to obtain the same performance results just so you have the opportunity to turn on the appropriate traces.

Should you constantly run all the possible traces you might need? No. At some point, such a “checkup” can cause more damage than the disease. You should be aware of what traces you’re running and their cost. Be careful of others that recommend running certain traces. You must always consider how much harm a trace might potentially cause.

Tracing your DB2 subsystem is important and some monitoring should almost always be performed, but what’s the cost of running one or more of DB2’s traces? The average possible cost of the most popular traces is documented in the DB2 manuals; you can refer to:

But the story doesn’t end there. If you believe what’s published in the manuals, you had better make sure you pick up the correct manual for the DB2 you’re running. The numbers differ, depending on the version of DB2. Differences between the V7 and V8 manuals are minimal, and are more substantial when moving from V8 to DB2 9.

Some of the changes are subtle or affect traces you shouldn’t run anyway. Consider DB2’s global trace. Using it can be extremely expensive and we’ve been incessantly warned against running it. DB2 V6 and V7 state the overhead can be from 20 to 100 percent. The DB2 9 book has completely different values: 10 to 150 percent. These values seem credible. Don’t turn on this trace unless you do so at Level 2 support’s suggestion when trying to track down a service problem.

The more commonly used traces and what you should know about them is discussed in the following sections.

Statistics Trace

Leave this on. It’s “on” by default in your DSNZPARM member and you would have to do something adverse to turn it off. Using the default, you get classes 1, 3, 4, 5, and 6. Class 6 is a fairly recent addition to the default list. This was the class that gave you Instrumentation Facility Component Identifier (IFCID) 225, your DBM1 address space storage map. When problems with the DBM1 storage started back in V7, IFCID 225 records were needed to help determine where the storage problem existed. Initially, it was tough to get people to turn on the extra class 6 recording. Even today, some still don’t see the importance of having this information available, so eventually the IFCID 225 record was just added to stats class 1. Now everyone continuously gets it.

What about the other statistic classes?

For all practical purposes, you can assume that stats records are written only at the end of each statistical interval. That’s why the overhead is negligible when turning on this trace. In fact, no percentages are even quoted in the product manuals. Most of the statistical counters are maintained in memory and are just written out in the SMF data record at the DB2 statistics interval. The overhead is more like noise when it comes to CPU and SMF data volume with SMF writing out only the 100 and 102 records. This is one DSNZPARM you’re better off just leaving alone. Use the defaults to ensure that subsystem-level statistics are always available.

If you’re running a z9, z10, or z990, your stats interval should be quite low; it should be set to five minutes or less. It’s amazing how much can happen on a z10 in just a wall clock minute. You can always lump together multiple intervals, but your stats interval is the most granularity you’ll ever achieve. With a bigger number (15, 30, or more), you have huge intervals with practically no granularity. This can make it quite difficult to identify where a system problem might exist. Even at an interval as low as one minute, you’ll record only a maximum of 1,440 stat intervals per day. For these reasons, I strongly recommend setting the interval to one minute, regardless of processor model. The cost of a statistics trace is still little more than noise. A smaller degree of granularity is needed to study your system’s growth leading up to a system slowdown. It also will enable better virtual storage planning. If the statistics interval is too large, then high points and critical events become more difficult to identify.

Audit Trace

The audit trace has gained popularity with all the compliance and regulatory issues prevalent today. Audit traces aren’t that expensive to run. Up through V8, the overhead was cited as about 5 percent. In the DB2 9 manual, that number has been bumped up to less than 10 percent, which is still a fairly low number if the trace helps you sleep at night. That’s mostly for OLTP with all classes active. For utilities, batch, and some queries, the overhead would be negligible. Remember, the amount of overhead is directly dependent on the number of tables you’re auditing or the frequency of the event being audited. Starting an audit trace with no classes specified will get you only class 1, tracing all access attempts denied due to inadequate authorization (IFCID 140). For many, this is adequate and costs the least to run. Indicating you want class 2 will track all explicit GRANTs and REVOKEs (IFCID 141). Additional cost would be proportional to the number of times a GRANT or REVOKE was used. Invoking class 3 will trace all CREATE, ALTER, and DROP operations against audited tables or a table with multi-level security with row-level granularity (IFCID 142). Classes 4 and 5 will record the first attempted change to an audited object in a unit of work (IFCID 143) and the first attempted access to an audited object within a unit of work (IFCID 144), respectively. An audited object is a table that has been created or altered to include the AUDIT keyword. Specifying the audit class 6 trace will write a record for each invocation of the bind process for both static and dynamic SQL statements that involve audited objects (IFCID 145). When activating an audit class 7 trace, you will record:

The last class available for an audit is class 8. It records the start (IFCID 23), phase change (IFICD 24), and end (IFCID 25) of a utility.

Accounting Trace

There are many ways to affect the cost of an accounting trace. Which trace classes you choose to run will have an impact on the cost of running that trace.

The DB2 9 manual provides time estimates for classes 1, 3, 7, and 8. Class 1 is usually less than 5 percent unless the query is fetch-intensive and multirow fetch isn’t used. Class 3 is less than 1 percent, and classes 7 and 8 are less than 5 percent. It’s the class 2 times that can get tricky. CPU overhead for class 2 trace can range from 1 to 10 percent, leaning toward the high end for fetchintensive programs. Class 2 overhead is on a per SQL request basis (entry/exit to DB2). The simpler the base SQL cost, the bigger the overhead in percentage terms. Class 3 overhead is based on a per wait event basis.

Previous versions of DB2 give no numbers for classes 1, 3, 7, and 8. They talk only about class 2. They quote online activity at 2.5 percent and batch as high as 10 percent.

Not using class 2 will not work. Class 1 is the portion of the transaction time that DB2 is aware of while class 2 is the portion of the class 1 time spent in DB2 (what’s affectionately referred to as “in SQL” time). Class 3 is the portion of the class 2 time spent waiting (lock/latch contention, sync I/O, update commit, etc.). Class 7 is your “in DB2” time and class 8 is your wait time for packages, similar to plan’s class 1 and 2 times. You need them all if you’re trying to troubleshoot an application performance problem. V8 also introduced an additional class for packages. Class 10 contains package detail (IFCID 239). The detailed information for class 10 is available only when classes 7 and 8 also are active. If there are no classes 7 and 8, then there’s no class 10 output. Be cautious when specifying class 10 because it can be expensive.

Here are breakdowns of IFCIDs associated with the most commonly used accounting classes:

Note the IFCID similarities between classes 2 and 7, and classes 3 and 8.

Performance Trace

If you need to run a performance trace, you need to run it and you aren’t going to worry about the cost; you’re probably trying to solve a more pressing problem. Performance trace classes 1 (background events), 2 (subsystem events), and 3 (SQL events) can be from 5 to 100 percent overhead, depending on your DB2 subsystem activity. Fetch-intensive activity will drive the performance class 3 trace toward 100 percent, so be careful. You’ll usually turn on this trace when trying to solve a specific problem. In all, there are 22 classes you can choose from when running a performance trace in addition to three classes available for your own use. Those last three classes can come in handy.

When using a performance trace, specify only the classes, plans, authorization IDs, and IFCIDs you really need to solve your problem. This will minimize the amount of trace data being collected. DB2 9 enhances the START TRACE by allowing greater qualification granularity using the EXCLUDE and INCLUDE keywords. You also should use the Generalized Trace Facility (GTF) as the trace destination. GTF is the default destination for a reason. Trace records are immediately available at the conclusion of the trace and there’s no need to deal with SMF to see your trace results. You do need to be careful if the GTF trace data set is on disk; the trace could wrap. You also want to make sure the trace runs for as little time as possible. Start the trace just before the event you’re trying to trap and stop it as soon as the event concludes.  This also will help minimize the amount of trace data gathered.

All the standard traces are interesting, but running them is of no use if you don’t examine and take advantage of the reports they can yield. Make sure you have the proper reporting tools in place to interrogate and interpret your trace results. If you’re going to run a trace, regardless of the cost, they’re all expensive if you don’t examine trace outputs.

Again, for most DB2 traces, the cost isn’t all that great, and sometimes it’s just something you must absorb. Knowing the cost upfront may make it easier to plan when to run a trace. Running some DB2 traces can (and usually will) be critical to the success of your DB2 for z/OS database.

Now that you have decided to run a few of the DB2 traces, you might benefit from the following hints, which can enhance your DB2 tracing experience.

Starting a Trace

START TRACE is a DB2 command. As such, almost everything you’ll ever need to know about running a trace can be found in the DB2 Command Reference manual (V8 is SC18-7416 and DB2 9 is SC18-9844). The manual tells you how to start and stop a trace. However, starting a trace could be quite intriguing. There can be more to starting a trace than simply issuing the start command.

An appealing feature of the TRACE command is the IFCID keyword. What if you want to trace the mini bind and SQL statement? You can start a performance class 3 trace to trace SQL events. If you do, you’ll be writing out of 25 different IFCIDs when you really want only IFCIDs 22 and 63. Instead, you should start a performance trace specifying class 30, which is a class DB2 doesn’t use, followed by the IFCID keyword listing the specific IFCIDs you want to trace. Now, rather than writing out more than 25 IFCIDs, you’re recording only two. That’s a much better deal. Add in the plan ID and authid on the – START TRACE command and you’ll have a nice, crisp trace output. Every trace type has a couple of IFCIDs reserved for your use. DB2 refers to them as “available for local use.”

For all your traces, there are numerous tracing keywords you can use to reduce the amount of trace data you collect. If you’re concerned about the cost of tracing, this can help you keep things under control. There are keywords you can use to limit your trace collection to specific plans, packages, collections, authids, locations, and many more; all are listed and explained in the DB2 Command Reference manual. You should take time to review what’s possi ble with START TRACE besides just starting a trace.

DSNZPARMs

You also should take advantage of DSNZPARM keywords. The couple that affect how tracing works in DB2 are fascinating. How you should use them, and what you should specify as values for these keywords, seem to constantly change. Consider, for example, STATIME on the DSN6SYSP macro. This is the interval in minutes that DB2 uses to write out statistics records. Both V8 and DB2 V9 have lowered the default for STATIME to five minutes, but even five minutes may be too high for some. Many customers have lowered STATIME to its lowest value—one minute. Statistics are written only at a statistical interval (STATIME). At one minute, that’s only 1,440 intervals each day. I find it very handy to copy the SMF 100 and 102 records to separate Generation Data Group (GDG). If the SMF data is needed at a later time to produce additional reports, having these records in a separate data set will facilitate faster post processing. If you do decide to lower this value, let your z/OS systems folks know. Although it’s not a tremendous amount of additional trace records, it is more. You don’t want to be responsible for filling up one of the SMF MAN data sets in the middle of the day.

There’s another DSNZPARM keyword to consider. The value the “UNICODE IFCIDS” entry on installation panel DSNTIPN supplies (or the UIFCIDS keyword on the DSNZPARM macro DSN6SYSP) tells DB2 whether selected character fields in the IFCID record should be written out in Unicode or left encoded in the same way as previous releases. The descriptive portion of any field in the IFCID that will be encoded in Unicode will be preceded with the string %U.

Another handy DSNZPARM keyword is SYNCVAL on the DSN6SYSP macro. If you have multiple DB2s, as in a data sharing environment, you want to ensure all the statistics records are simultaneously written out, regardless of which member they’re created on. SYNCVAL tells DB2 how many minutes after the hour to start the statistics interval. If you use five on all members, they’ll all write out their statistics records at five minutes after the hour at whatever interval you specified on the STATIME. This will help you coordinate stat records between the multiple DB2s or any other process you’re attempting to sync up with. SYNCVAL also has definite value when trying to sync up with RMF reporting intervals.

There also are installation fields and DSNZPARMs that control whether a monitor trace should automatically start at DB2 start-up (MON keyword on the DSN6SYSP macro) and the size of buffer for storing the records sent by the monitor trace (MONSIZE keyword on the DSN6SYSP macro).

Managing Trace Data

Trace records are usually written out to System Management Facility (SMF) data sets. You need to tell your z/OS systems folks ahead of time which SMF records you’ll be collecting for DB2.

DB2 uses a couple of SMF records when gathering trace data. Accounting goes to SMF 101, Audit ends up in SMF 102, and a bunch of statistic IFCIDs (0001, 0002, 0202, and 0230) go to SMF 100 records. The rest of the statistics IFCIDs end up in SMF 102 records. A performance trace defaults to GTF. However, you can route it to SMF if you choose. GTF is a much better choice because you often want to get to the trace data as soon as possible.

If you’re seeing elongated and unexplained shutdown times for your DB2 subsystem, you may want to check how SMF recording is set up. In the PARMLIB member SMFPRMxx, make sure the DDCONS(NO) is set. Specifying NO avoids the cost of consolidating DDs at shutdown. Other parameters can affect SMF’s performance, such as maintaining three or more SMF data sets, making the SMF data sets as large as possible, using large buffer sizes when dumping data sets, and using the largest blocksize available for the output data set SMF dump uses.

Another consideration about IFCIDs is that they’re all defined (every single field) in a PDS member called DSNWMSGS. If you’re still on V7 (now out of service), you’ll find DSNWMSGS as a member of the PDS hls.SDSNSAMP. For those running V8 and DB2 9, the PDS containing the DSNWMSGS member is called hlq.SDSNIVPD.  DSNWMSGS has a description of each field in an IFCID and contains a detailed description of what the IFCID is used for and a list of trace classes that use the IFCID. An IFCID can be used by multiple different trace classes. To make it easier to access the information in the member DSNWMSGS, DSNWMSGS contains the DDL to create a table (and tablespace) where you can load all the information included in the DSNWMSGS member, the appropriate LOAD utility control cards, and a couple of sample queries for retrieving the IFCID information in a variety of orders.

DSNWMSGS also includes a list of every trace class, by trace type and which IFCIDs are included in that trace class, a complete list of the definitions of all Resource Manager Identifiers (RMIDs) used in tracing, and a list of which mapping macros (DSNDWQxx) map to which IFCID.

Conclusion

To wrap things up, consider a few general statements about using the TRACE command: