Root cause analysis is essential to developing a long-term solution. If you can’t determine the underlying cause, you will only mask the symptoms with a short-term fix.
In the mainframe world, once a performance problem is acknowledged, we can use a variety of SMF records and RMF reports to identify the specific performance problem, drill down deeper to determine a root cause, and make needed corrections.
Let’s consider how to do this. Our focus will be on the RMF records/reports in your toolbox and how to use them, as well as mainframe DASD I/O. For a more detailed treatment of these topics, see the IBM Redbook Effective zSeries Performance Monitoring Using Resource Measurement Facility (SG24-6645-00).
RMF is an IBM performance management tool that measures selected areas of system activity, including various types of channels, I/O devices, FICON directors, and the links providing the connectivity. The data RMF collects is then presented in the form of SMF records. This is the essential data for any FICON-related performance monitoring and management.
Performance Monitoring Basics
Our systems exist to meet the business needs of the user. Users often have a highly subjective view of system performance, and that view can become quite emotional. The concept of the Service Level Agreement (SLA) was introduced to apply measurable criteria that reflect business needs, rather than subjective perceptions, to performance assessment. The SLA is a contract that defines, describes, and enforces measurable specifics such as systems availability. In the performance arena, an SLA will typically address average transaction response time (I/O, CPU, network, or total). A related concept, often used interchangeably with the SLA, is the Service Level Objective (SLO). Before you pursue performance analysis, be sure to set clear performance objectives in the form of SLAs or SLOs.
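As a minimal sketch of how a measurable objective replaces a subjective impression, the following Python compares an average transaction response time against an SLO threshold. The function name, the sample values, and the 1.0-second objective are hypothetical and chosen only for illustration; they don’t come from RMF or from any particular SLA.

```python
from statistics import mean

def meets_slo(response_times_sec, objective_sec):
    """Return (average, ok) for a list of transaction response times.

    objective_sec is a hypothetical SLO: the maximum acceptable
    average response time, in seconds.
    """
    if not response_times_sec:
        raise ValueError("no response times supplied")
    average = mean(response_times_sec)
    return average, average <= objective_sec

# Illustrative samples only, not measured data.
samples = [0.5, 0.75, 1.25, 0.5]
average, ok = meets_slo(samples, objective_sec=1.0)
print(f"average={average:.2f}s meets_slo={ok}")
```

A real objective would more likely be evaluated per measurement interval and per workload, but the principle is the same: the comparison is against an agreed number, not a perception.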
Performance analysis refers to the techniques and tools used to verify that your IT systems meet your SLA or SLO. The goal is to maximize efficient use of your current resources to meet these objectives without excessive tuning efforts. RMF provides an interface to a System z environment and facilitates reporting and detailed measurements of critical resources. RMF issues reports about performance problems as they occur, so you can act before problems become critical. RMF components include three monitors, a post-processor, data servers, reporters, and a Lightweight Directory Access Protocol (LDAP) back-end. The components work together to perform the data gathering and reporting necessary for performance analysis.
You can use RMF to:
• Determine that your system is running smoothly
• Detect system bottlenecks caused by contention for resources
• Identify any workload delayed and the reason for the delay
• Monitor system failures, stalls, and failures of selected applications
• Evaluate the service your installation provides to different groups of users.
RMF comes with three monitors. Monitor III, with its ability to determine the “cause of delay,” offers a wide spectrum of reports for answering performance-related questions. Monitor III provides short-term data collection and online reports for continuous monitoring of system status and solving performance problems. Monitor III is a good place to begin system tuning. It lets the system tuner distinguish between delays for important jobs and delays for jobs that aren’t as important to overall system performance.
RMF Monitor I provides long-term data collection for system workload and resource utilization. The Monitor I session is continuous and measures various areas of system activity over a long period. It produces interval reports created at the end of a defined measurement interval such as 30 minutes. You can get Monitor I reports directly as real-time reports for each completed interval (single-system reports only), or you can run the Postprocessor to create the reports either as single-system or as sysplex reports. Many installations produce daily reports of RMF data for ongoing performance management.
Monitor II provides online measurements on demand for use in solving immediate problems. A Monitor II session can be seen as a snapshot. Unlike the continuous Monitor I session, a Monitor II session generates a requested report from a single data sample.
How do you know where to look and what to look for? Figure 1 represents a high-level view of a single zEnterprise 196, attached via multiple (non-cascaded) FICON directors to an enterprise class DASD array. The DASD array is a generic illustration intended to represent the control unit, devices, and adapters connecting the DASD array to the FICON directors. You can see the RMF reports used with the various components of this environment. These are commonly used for identification, root cause analysis, and resolution (tuning) of I/O-related performance problems in a modern mainframe environment. Figure 1 shows which components of your environment can be analyzed with which report. The reports are:
• SMF 78, RMF I/O Queuing Activity (IOQ) provides information on your installation’s I/O configuration and activity rate, queue lengths, and percentages when one or more I/O components, grouped by a Logical Control Unit (LCU), were busy.
• SMF 73, RMF Channel Path Activity (CHAN) provides basic information about channel path use. It identifies each channel path by Channel Path Identifier (CHPID) and channel path type. It also reports the total channel utilization by the entire mainframe, and channel utilization by individual Logical Partition (LPAR).
• SMF 74-7, RMF FICON Director Activity (FCD) provides useful capacity planning and troubleshooting information for identifying potential bottlenecks and switch latency at the individual port level. The measurements provided for a port in the FCD report include I/O for the system on which the report is taken and all I/O that’s directed through this port, regardless of which LPAR requests the I/O.
• SMF 74-8, RMF ESS Link Statistics (ESS) provides information and statistics on the use and performance of individual adapters on the DASD array.
• SMF 74-5, RMF Cache Activity (CACHE) provides cache statistics on a subsystem and device-level basis.
• SMF 74-1, RMF Device Activity (DEVICE) provides information for all devices in one or more standard device classes or for other devices you specify in the DEVICE option.
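The report list above can be captured as a small lookup table, which is handy when a script needs to label extracted records. This is a sketch in Python, not an RMF interface: the dictionary name, function name, and tuple layout are hypothetical, while the record identifiers and report names are copied from the list above.

```python
# SMF record identifier -> (RMF report name, short name), as listed in the text.
RMF_REPORTS = {
    "78":   ("I/O Queuing Activity", "IOQ"),
    "73":   ("Channel Path Activity", "CHAN"),
    "74-7": ("FICON Director Activity", "FCD"),
    "74-8": ("ESS Link Statistics", "ESS"),
    "74-5": ("Cache Activity", "CACHE"),
    "74-1": ("Device Activity", "DEVICE"),
}

def report_for(smf_record):
    """Return 'RMF <name> (<short>)' for an SMF record id, or None if not listed."""
    entry = RMF_REPORTS.get(smf_record)
    if entry is None:
        return None
    name, short = entry
    return f"RMF {name} ({short})"

print(report_for("74-1"))
```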
Conclusion
The device activity report is the one you’re likely to use first and most in a performance analysis and troubleshooting situation because it contains response time information. From there, you would narrow down what you’re looking at to find the root cause of the problem. You can become proficient with these reports, which means the performance information gap paradox, while real, isn’t insurmountable. Future articles will discuss SMF records in more detail and connect them to new technologies such as MIDAW, zHPF, and DCM.