May 8 ’13

Extending Mainframe Qualities Into the Distributed World

by Glenn Anderson in Enterprise Tech Journal

The Intersection of Workload Manager, the Resource Measurement Facility and zManager Platform Performance Management

The IBM zEnterprise System extends the mainframe's governance and qualities of service across heterogeneous, cross-platform applications. It consists of a zEnterprise Central Processor Complex (CPC)—the z196, z114 or zEC12—along with an attached zEnterprise BladeCenter Extension (zBX), both of which the Unified Resource Manager (zManager) manages as a single, logical, virtualized system. This collection of virtualized systems, including the zEnterprise CPC along with System x blades and POWER7 blades, is called an ensemble. The zManager—comprising management areas for virtual server lifecycles, hypervisors, network, operations, energy and platform performance management—ties the pieces together; it's firmware that executes on the Hardware Management Console (HMC) and Support Element (SE).

This article examines how the new zManager Platform Performance Management (PPM) component uses a policy to manage the distributed side of an application while Workload Manager (WLM) and Resource Measurement Facility (RMF) continue to manage and monitor the z/OS side. As more and more mainframe shops install a zBX and begin moving applications into the hybrid configuration, proper setup and use of the PPM policy is key to maximizing the benefits of the zManager and the zEnterprise System. Let's explore performance management in this exciting intersection of old and new.

The PPM Component and the Ensemble

The zManager PPM component is responsible for goal-oriented resource monitoring, management and reporting across the zEnterprise ensemble. The concept is to extend the goal-oriented approach of WLM to additional platform-managed resources. The monitoring and management are organized around the hypervisors (see Figure 1). Four different hypervisors can be part of a zEnterprise ensemble:

• The PowerVM hypervisor running on the POWER7 blade
• The KVM-based hypervisor running on the System x blade (referred to as the xHYP)
• The z/VM hypervisor running in a System z Logical Partition (LPAR)
• The PR/SM hypervisor across a System z CPC

Any number of virtual servers can run under each hypervisor as guests or virtual machines. These virtual servers could be Windows or Linux under xHYP, AIX under PowerVM, Linux on System z under z/VM and z/OS running in PR/SM LPARs. The zManager communicates across the ensemble with the hypervisors via the Intra Node Management Network (INMN). The zBX can also contain DataPower appliance blades, which don't participate in PPM management.
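To picture the structure, here's a minimal sketch in Python (purely illustrative; the hypervisor kinds come from the list above, but the server names are invented) of how an ensemble groups hypervisors and the virtual servers beneath them:

from dataclasses import dataclass, field

@dataclass
class VirtualServer:
    name: str          # e.g., an AIX guest or a z/OS LPAR image (names invented)
    os: str            # "AIX", "Linux", "Windows" or "z/OS"

@dataclass
class Hypervisor:
    kind: str          # "PowerVM", "xHYP", "z/VM" or "PR/SM"
    virtual_servers: list = field(default_factory=list)

# Illustrative ensemble: one virtual server under each hypervisor type
ensemble = [
    Hypervisor("PowerVM", [VirtualServer("AIXWAS1", "AIX")]),
    Hypervisor("xHYP",    [VirtualServer("LNXHTTP1", "Linux")]),
    Hypervisor("z/VM",    [VirtualServer("LNXDB1", "Linux")]),
    Hypervisor("PR/SM",   [VirtualServer("ZOSPROD", "z/OS")]),
]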

Has z/OS been lumped in with Linux, AIX and Windows as just another virtual server in an ensemble? This just doesn’t seem right! But remember that the goal here is end-to-end management and monitoring of cross-platform applications. z/OS is certainly a major player. So, z/OS is viewed in the ensemble as a virtual server running under the PR/SM hypervisor. This will allow the PPM to include z/OS in its end-to-end monitoring view. However, the new ensemble workload policy doesn’t manage z/OS. WLM is the sole manager of z/OS workloads, and Intelligent Resource Director (IRD) is still the only way to dynamically influence the PR/SM hypervisor.

Applying a PPM Policy

The PPM will collect virtual server statistics from all the hypervisors. You will create a new PPM policy to set goals for the virtual servers running on the System x blades and POWER7 blades, and for Linux on System z. Don't worry; your existing WLM policy will still set the goals for all work running on z/OS. A technique to link the PPM policy to WLM service-class goals provides an end-to-end performance management view of a cross-platform application. The HMC serves as the user interface both for defining this new policy and for viewing the reported data.

The PPM policy is organized around a structure called a platform workload. Not to be confused with a workload in a WLM service policy, this new platform workload groups virtual servers supporting a business application into a management view. In it, platform resources are presented, reported, monitored and managed. Each platform workload has a performance policy associated with it. This new platform workload performance policy looks similar to a WLM service policy.

Classification rules map virtual servers to service classes, where the goals and importance for the virtual servers are set. This should sound familiar to an experienced WLM administrator. Velocity goals (how much delay you're willing to let work experience) are assigned at the virtual server level. In a WLM policy, velocity is a value from 1 to 100 percent. The higher the velocity, the less delay the work experiences. Velocity in the PPM policy measures how much CPU delay the virtual server experiences running under the hypervisor. The hypervisor adjusts the distribution of CPU resources under PowerVM or z/VM to meet the goals set for the virtual servers. In August 2012, IBM issued a statement of direction regarding its intent to deliver this same support for System x blades. So, in the future, the xHYP will also participate in the CPU resource management function.
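To make the relationship concrete, here's a rough sketch of a platform workload performance policy as classification rules that map virtual server names to service classes carrying a velocity goal and an importance. The names and values are invented, and this is not the actual zManager policy format, just the shape of the idea:

import fnmatch

# Purely illustrative model of a PPM performance policy;
# not the actual zManager policy format.
policy = {
    "workload": "WebStore",                  # platform workload name (invented)
    "service_classes": {
        "ONLINE": {"velocity": "fast",     "importance": "high"},
        "BATCHY": {"velocity": "moderate", "importance": "medium"},
    },
    "classification_rules": [
        # (virtual server name pattern, service class)
        ("LNXHTTP*", "ONLINE"),
        ("AIXWAS*",  "ONLINE"),
        ("LNXUTIL*", "BATCHY"),
    ],
}

def classify(server_name):
    """Return the service class assigned by the first matching rule."""
    for pattern, sc in policy["classification_rules"]:
        if fnmatch.fnmatch(server_name, pattern):
            return sc
    return None  # unmatched servers fall through to a default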

Simpler Metrics

The PPM policy metrics differ slightly from a WLM policy. To make your life easier, the always compassionate IBM developers reduced the burden of so many numbers (1 to 100 percent) and now offer just five choices: fastest, fast, moderate, slow and slowest. They did, however, keep the same underlying calculation based on using samples and delay samples. Also, rather than setting importance as 1, 2, 3, 4 or 5, with 1 as most important, the PPM policy uses highest, high, medium, low and lowest.
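The underlying arithmetic is the familiar velocity formula; only the presentation changes. Here's a sketch, with bucket boundaries that are my own illustration, not IBM's documented thresholds:

def velocity(using_samples, delay_samples):
    """Classic velocity: share of samples where work was running vs. delayed."""
    total = using_samples + delay_samples
    if total == 0:
        return 100  # no samples observed: treat as no delay (illustrative choice)
    return 100 * using_samples / total

def velocity_level(pct):
    """Map a 1-100 velocity to the five PPM labels (boundaries are illustrative)."""
    if pct >= 80: return "fastest"
    if pct >= 60: return "fast"
    if pct >= 40: return "moderate"
    if pct >= 20: return "slow"
    return "slowest"

# Example: 300 CPU-using samples and 200 CPU-delay samples -> velocity 60 -> "fast"
print(velocity_level(velocity(300, 200)))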

The PPM will direct the hypervisor to make sure the highest importance virtual servers are meeting their velocity goals, stealing CPU resources from lower importance virtual servers, if necessary. This decision process is similar to WLM's process for deciding to make a resource adjustment on z/OS. First, monitoring detects a virtual server that isn't meeting its goal. PPM then projects the overall impact of moving CPU resources from one virtual server to another. If the trade-off looks worthwhile based on the policy, the CPU resource adjustment occurs. The arrow in Figure 2 shows this adjustment.
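In spirit, the decision loop looks something like this sketch. The names and data shapes are invented, and the real projection logic inside PPM is certainly more sophisticated than a simple donor/receiver pick:

IMPORTANCE = {"highest": 1, "high": 2, "medium": 3, "low": 4, "lowest": 5}

def pick_adjustment(servers):
    """servers: list of dicts with name, importance, goal and actual velocity.
    Return a (receiver, donor) pair of names, or None if no sensible trade exists."""
    missing = [s for s in servers if s["actual"] < s["goal"]]
    if not missing:
        return None
    # Help the most important virtual server that is missing its goal first.
    receiver = min(missing, key=lambda s: IMPORTANCE[s["importance"]])
    # Donate from a less important virtual server that is still meeting its goal.
    donors = [s for s in servers
              if IMPORTANCE[s["importance"]] > IMPORTANCE[receiver["importance"]]
              and s["actual"] >= s["goal"]]
    if not donors:
        return None
    donor = max(donors, key=lambda s: IMPORTANCE[s["importance"]])
    return receiver["name"], donor["name"]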

Resource Adjustment

CPU resource adjustment happens under a hypervisor as the CPU resources of the blade or the z/VM LPAR are distributed among the virtual servers. So, just as we mix higher and lower importance work in a z/OS LPAR for WLM to manage, this new hybrid environment works best with higher and lower importance virtual servers under the same hypervisor. You need receivers and donors, right? These resource adjustments are reported in a resource adjustment report that's part of PPM monitoring. You can view this report through the HMC as part of the zManager functions. You can also retrieve the data via an Application Programming Interface (API); 36 hours of history is kept in the SE of the CPC.

So your PPM policy consists of classification rules that assign virtual servers to service classes, typically using the name of the virtual server as the qualifier. Then the service classes assign a velocity goal and importance to the virtual server. A discretionary goal is also available if the virtual server really has no goal. Remember, there’s a policy for each platform workload you’ve defined. Each platform workload also has an importance—to position it in relation to all the other platform workloads in the ensemble.

Workload Balancing

The second management function the PPM provides is ensemble workload balancing. This is similar in concept to WLM’s sysplex routing services. The objective is to use the goal achievement data PPM gathers to influence workload balancing decisions across the ensemble. A protocol called Server Application State Protocol (SASP) provides routing recommendations to workload balancers—IP switches or routers that do load balancing. These routers are outside the zEnterprise ensemble. The HMC hosts the SASP function (see Figure 3).

The recommendations are based on the PPM’s understanding of the current performance of the members of a load balancing group. Metrics such as overall utilization and the delays being experienced by virtual servers are used. The scope of the recommendation is limited to non-z/OS virtual servers. This is because z/OS already has the Load Balancing Advisor (LBA) that provides SASP recommendations. The same SASP client code on a router can interact with both the LBA and the HMC SASP to provide complete coverage across the zEnterprise ensemble. System x and POWER7 blades and Linux on System z can all participate in ensemble workload balancing.
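Conceptually, the recommendation boils down to a weight per member of the load balancing group. Here's a hedged sketch of the idea; SASP itself is a binary protocol, and this scoring formula is invented for illustration, not taken from the PPM:

def routing_weights(members):
    """members: {name: {"cpu_util": 0-100, "delay_pct": 0-100}}.
    Higher weight means send more new work there. Purely illustrative scoring."""
    weights = {}
    for name, m in members.items():
        headroom = max(0, 100 - m["cpu_util"])   # unused capacity
        penalty = 1 + m["delay_pct"] / 100.0     # discount servers seeing delay
        weights[name] = round(headroom / penalty, 1)
    return weights

# Example: a lightly loaded virtual server gets a larger share of new connections
print(routing_weights({
    "LNXHTTP1": {"cpu_util": 35, "delay_pct": 5},
    "LNXHTTP2": {"cpu_util": 85, "delay_pct": 40},
}))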

Optional Monitoring Agent

An optional monitoring agent, the Guest Platform Management Provider (GPMP), can run on the virtual servers. This agent is based on an industry standard called Application Response Measurement (ARM). GPMP is a link between the operating system and the zManager, collecting data for the work running on a virtual server. Working together with ARM-instrumented middleware such as an HTTP server, a WebSphere Application Server (WAS) or DB2, GPMP provides metrics for detailed transaction topology as transactions “hop” through virtual servers. The data is tied together by an ARM correlator that's passed from server to server with each transaction, as defined by the ARM standard. These GPMP agents provide the necessary data for ensemble workload balancing. They're also required to create the link between the PPM policy and a WLM service policy.
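The correlator idea can be sketched as a small token that each tier receives, annotates and forwards. This is illustrative only; the real ARM correlator is an opaque token defined by the ARM standard, not a Python dictionary:

import time
import uuid

def start_transaction(service_class):
    """First hop (e.g., the HTTP server plug-in) creates the correlator."""
    return {"txn_id": str(uuid.uuid4()),
            "service_class": service_class,   # set from the PPM classification
            "hops": []}

def record_hop(correlator, server, middleware):
    """Each ARM-instrumented tier reports its hop before passing the token on."""
    correlator["hops"].append(
        {"server": server, "middleware": middleware, "ts": time.time()})
    return correlator

c = start_transaction("ONLINE")
record_hop(c, "LNXHTTP1", "HTTP server")
record_hop(c, "AIXWAS1", "WebSphere")
record_hop(c, "ZOSPROD", "DB2 DDF")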

A GPMP running on z/OS serves as the interface between the zManager and WLM. GPMP passes to WLM information about the platform-wide performance goals of workloads in which z/OS participates. It sends data WLM provides—including transaction response time and resource data for ARM-instrumented applications—to the HMC for PPM monitoring. WLM manages the GPMP address space on z/OS and displays GPMP status information using the D WLM,AM,ALL console command. We wouldn't want WLM to have nothing to do for a nanosecond, would we?

A WLM service policy can be configured to recognize specific work that arrives on z/OS from another virtual server in the ensemble. If this work has already been assigned a service class via a PPM policy, that service class name can be used as a qualifier to assign the WLM service class on z/OS. For example, consider a three-tiered application that begins with an HTTP server running on a Linux virtual server on a System x blade. The HTTP server receives a Web transaction and, using a plug-in, passes it to the WebSphere Application Server running on AIX on a POWER7 blade. The Java transaction running in WebSphere then makes a Java Database Connectivity (JDBC) call to DB2 on z/OS. DB2's Distributed Data Facility (DDF) would handle the JDBC transaction. DDF would call WLM to create an enclave and assign it to a service class, normally based on DDF classification rules. However, in the zEnterprise hybrid environment, the new EWLM classification rules now available in your WLM service policy could handle this classification.

Here’s how it would work. As a result of the PPM policy’s classification rules, the Linux virtual server on the x blade where the HTTP server is running would be assigned to a service class. This service class would be inserted into the ARM correlator as part of the ARM API call made to GPMP by the ARM-enabled HTTP server plug-in. So, the service class name then flows with the transaction in the correlator, first to WebSphere, then on to DDF as part of the JDBC call to z/OS. Because of the presence of the service class name in the correlator, WLM would then look to the EWLM classification rules to assign the service class and goal to the enclave created for the remote JDBC query. So, the WLM goal for the query is set using the PPM service class name as the qualifier. The PPM policy doesn’t set the goal used on z/OS. The PPM policy service class name is a qualifier.
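A sketch of that last step: the rule table below is invented, but it mirrors the idea of EWLM classification rules in the WLM service definition using the PPM service class name carried in the correlator as a qualifier:

# Illustrative EWLM-style rules: PPM service class name -> WLM service class.
EWLM_RULES = {
    "ONLINE": "DDFHI",   # invented WLM service class names with their own goals
    "BATCHY": "DDFLO",
}
DEFAULT_DDF_CLASS = "DDFDEF"

def classify_enclave(correlator):
    """WLM-style classification of the DDF enclave for the incoming JDBC work.
    The PPM service class name is only a qualifier; the WLM policy still owns
    the goal attached to whichever service class is chosen."""
    ppm_sc = correlator.get("service_class") if correlator else None
    return EWLM_RULES.get(ppm_sc, DEFAULT_DDF_CLASS)

print(classify_enclave({"service_class": "ONLINE"}))   # -> DDFHI
print(classify_enclave(None))                          # -> DDFDEF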

The WLM policy still sets the goal for the work running on z/OS. This PPM-WLM connection requires the GPMP agent to be implemented on the virtual servers. All this allows for a synergy between the PPM policy and the WLM policy for transactions that begin elsewhere in the ensemble and end up on z/OS. Now even we mainframers have to admit that's pretty cool!

Performance Monitoring

Resource management is only half of what the PPM does for us. The other half is performance monitoring and reporting. The objective is to provide a reporting capability that shows usage of platform resources in a workload context within a zEnterprise ensemble scope. Of course, mainframers can never have too much data to examine and study. PPM monitoring looks across virtual servers supporting a workload to provide goal vs. actual reporting that’s limited to platform-level resources.

PPM monitoring isn’t trying to replicate what individual operating system tools do at a more detailed application level. PPM gathers data from the hypervisors and virtual servers across the ensemble via the INMN, storing 36 hours of history in the zEnterprise SE. You can view this data through the HMC using user-selected intervals. The reports begin at the workload level and drill down to service classes, hypervisors and virtual servers. The reports contain a variety of goal achievement data, resource consumption data and performance data. Figure 4 shows an example report. There are specific reports for resource adjustments and load balancing. If GPMP agents and ARM-enabled middleware are implemented, a set of detailed transaction-level reports will be available. These reports follow the topology of transactions from server to server, providing performance data from each ARM hop along the way.

If you want the PPM data to stick around longer than 36 hours in the SE, a set of APIs is available to extract the data, organized as workload resource groups. IBM Tivoli Service Management provides support to integrate the zEnterprise hybrid monitoring into the IBM Tivoli Monitoring infrastructure, using the Enterprise Common Collector to display data in the Tivoli Enterprise Portal and record it in the Tivoli Data Warehouse.
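If you wanted to roll your own archiving, the extraction loop might look something like this sketch. The URL, path and data shape are placeholders, not the documented zManager APIs, and authentication is omitted entirely:

import json
import time
import urllib.request

# Placeholder endpoint; not the actual zManager (HMC) API path.
HMC_URL = "https://hmc.example.com/api/workload-resource-groups"

def archive_ppm_data(outfile, interval_secs=900):
    """Poll the (hypothetical) endpoint and append each sample to a local file,
    so history survives beyond the 36 hours kept in the Support Element."""
    while True:
        with urllib.request.urlopen(HMC_URL) as resp:   # auth omitted in sketch
            sample = json.load(resp)
        with open(outfile, "a") as f:
            f.write(json.dumps({"ts": time.time(), "data": sample}) + "\n")
        time.sleep(interval_secs)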

RMF Enhanced

In addition to all the useful new data PPM collects across the ensemble, RMF monitoring has also been enhanced with RMF Cross Platform (RMF XP). This new function brings together the Common Information Model (CIM) instrumentation found in all operating systems today and RMF's Distributed Data Server (DDS) capability. RMF XP currently supports the CIM data collectors in AIX, Linux on System x and Linux on System z. The recently previewed z/OS V2.1 plans to add RMF XP support for Windows Server 2008 running on zBX blades, along with new SMF 104 records for the AIX, Linux on System x, Linux on System z and Windows Server 2008 operating systems that RMF XP monitors.

RMF XP provides a second DDS address space on z/OS called GPM4CIM, which acts as a CIM client to retrieve data from the CIM data collectors on AIX and Linux. RMF XP then makes this data available for viewing just like the current DDS function does for RMF Monitor III—in a browser, either through the RMF data portal or through the resource monitoring task of the z/OS Management Facility (z/OSMF). This isn’t as complicated as it sounds (see Figure 5).
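To get a feel for what GPM4CIM does as a CIM client, here's a minimal query using the open-source pywbem package. The host, credentials and the particular class enumerated are illustrative; the metric classes RMF XP actually pulls are documented in the RMF publications:

import pywbem

# Connect to a CIM server on an AIX or Linux endpoint (host and creds invented).
conn = pywbem.WBEMConnection("https://aixwas1.example.com:5989",
                             ("cimuser", "cimpassword"),
                             default_namespace="root/cimv2")

# Enumerate a standard DMTF class; RMF XP retrieves similar metric classes.
for inst in conn.EnumerateInstances("CIM_OperatingSystem"):
    print(inst["Name"], inst["NumberOfProcesses"])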

Conclusion

So, this is the world of performance management and monitoring in the zEnterprise System hybrid environment. At the heart is the PPM, providing WLM-like, goal-based management for a collection of virtual servers running on blades and as z/VM guests. A wealth of new cross-platform data enables the traditional System z performance and capacity team to monitor the heterogeneous platforms supporting multi-tiered applications. The qualities of the mainframe extend into the distributed world, expanding the skills of mainframe performance analysts. This is the exciting intersection of WLM, RMF and zEnterprise PPM.