Sep 1 ’07
Performance Management: Intelligent Resource Director and the SRM Constant
The Intelligent Resource Director (IRD) component of Workload Manager (WLM) involves CPU management, dynamic Channel Path Identifier (CHPID) management, and channel subsystem priority queueing. This article explores the effects of IRD on the CPU management portion. The CPU management part of IRD is further divided into two functions: Logical Partition (LPAR) CPU management and weight management. For a more extensive examination of these functions, refer to the Redbook z/OS Intelligent Resource Director (SG24-5952).
In IRD CPU management, two actions may occur. The number of logical processors defined to a particular LPAR may be changed based on the weight assigned. The objective is to have as few Logical Processors (LPs) in use as the weight will allow and to increase the number of LPs only to accommodate the weight requirements of the partition.
Weights also may be changed based on the performance objectives of service classes in the partition. The effect is that, if a more important service class is missing its goals, CPU resource (LPAR weight) may be dynamically taken away from the less important LPAR and assigned to the more important. This weight change may require changing the number of LPs; so the two actions are related in this aspect of IRD CPU management.
The SRM Constant
The SRM constant normalizes the service charged to a unit of work so that it’s independent of the speed of the processor on which the work runs. Service policy definitions and performance decisions shouldn’t have to be reevaluated every time the processor configuration changes.
The CPU service time (in seconds) is multiplied by the SRM constant (service units per second), which should yield a value that’s relatively uniform across different processor models and configurations.
For example, on a 2084-302, the SRM constant is about 20,752 service units (SU)/sec, while on the 2084-323 it’s about 14,171 SU/sec. A unit of work that consumes one second of CPU time would therefore be charged 20,752 SUs on the first machine and 14,171 SUs on the second.
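The arithmetic can be sketched in a few lines of Python. This is a minimal illustration that uses only the two constants quoted above; the table and function names are hypothetical, and the sketch ignores any service definition coefficients an installation may also apply.

```python
# SRM constants quoted in the article, in service units (SU) per CPU second.
SRM_CONSTANT = {
    "2084-302": 20_752,
    "2084-323": 14_171,
}


def cpu_service_units(cpu_seconds: float, model: str) -> float:
    """Return the CPU service charged: CPU seconds times the model's SRM constant."""
    return cpu_seconds * SRM_CONSTANT[model]


# One second of CPU time on each model.
for model in SRM_CONSTANT:
    print(model, cpu_service_units(1.0, model))
```

The same second of CPU time yields a different SU count on each model because the constant, not the elapsed CPU time, carries the model-specific scaling.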
IRD and the SRM Constant
Ordinarily, when an LP is added to an LPAR, the SRM constant is recalculated to reflect the new configuration. However, this recalculation does not occur when IRD adds or removes an LP.
Since the SRM constant is used to normalize CPU time based on the processor model, the effect of IRD actions is that of dynamically changing the CPU model based on service class goals. This can have a significant impact on calculations and measurements if it’s not taken into account.
The SRM constant in use will be the value determined by the LPAR definition. So if the partition has six LPs defined, the SRM constant used will reflect that state. Using the example of the 2084-3xx, the six LPs will be treated as a 2084-306 with an SRM constant of 18,626.3097. Since this value determines how much service is charged for the CPU time consumed, it matters whether the constant matches the number of LPs actually online.
For example, if IRD removes two LPs, the SRM constant will remain at 18,626, although the processor itself will behave as a 2084-304. This will add more CPU time, although it will be charged at the lower rate. The effect is to understate the amount of service actually consumed. If this value were used without adjustment in a capture ratio calculation, the error could be on the order of 50 to 60 percent.
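A rough sketch of the size of the service understatement follows. The 2084-306 constant is the one quoted above; the 2084-304 constant is an assumption (approximately 19,512 SU/sec, a value not given in this article), and the function name is hypothetical. The sketch covers only the per-second charging rate, not the capture ratio calculation.

```python
# Constant fixed by the six-LP definition (2084-306 value from the article).
DEFINED_CONSTANT = 18_626.3097
# ASSUMPTION: approximate 2084-304 constant; this value is not stated in the article.
ASSUMED_304_CONSTANT = 19_512.2


def service_understatement(cpu_seconds, constant_in_use, constant_for_online_lps):
    """Compare the service charged with what the online configuration's rate would charge."""
    charged = cpu_seconds * constant_in_use
    expected = cpu_seconds * constant_for_online_lps
    shortfall = (expected - charged) / expected  # fraction of service not charged
    return charged, expected, shortfall


charged, expected, shortfall = service_understatement(
    10.0, DEFINED_CONSTANT, ASSUMED_304_CONSTANT
)
print(f"charged={charged:.0f} SU, expected={expected:.0f} SU, shortfall={shortfall:.1%}")
```

Under these assumed values, the shortfall in service charged per CPU second is roughly 4.5 percent.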
Issues to Consider
A significant consideration is how frequently LP adjustment occurs and how long the partition remains in a state other than the one in which it was defined. Whether that state is above or below the defined model will determine the effect the SRM constant has on workload behavior.
This effect is evident in the DUR parameter used to define service class periods. The duration is specified in SU to indicate how long a unit of work should remain in a particular performance period before it transitions into the next period.
If we use the example previously mentioned, the IRD adjustment of the 2084-306 to a 2084-304 would allow transactions to remain in a period a bit longer than originally defined.
For example, if 500 SUs were defined for a performance period and IRD reduced the processor to the 2084-304, the effect on the duration would be to run the workload as if it were defined at 524 SU. While this isn’t a large deviation, it can certainly be more profound for longer durations. Moreover, it could result in more transactions remaining longer in the original period, which could skew goal averages. Similarly, other policy definitions defined by SUs (such as resource groups) can experience behavior variations compared to their original specifications.
Another consideration is the velocity goal. Because the velocity goal depends on the number of processing engines available, IRD actions are indirectly influenced by the goal specification. If a velocity goal is aggressive, IRD will tend to add engines; if the goal is too lax, engines can be readily removed. While these actions will always be tempered by the importance of the service class, you should avoid having high-importance service classes engaged in a self-fulfilling prophecy based on their goal definitions.
Figure 1 shows an example of how these variations occur. Consider the ratio of CPU service (SU/sec) to CPU busy percentage. These values don’t appear to have any pattern until they’re overlaid with the predicted values for the four-engine and six-engine configurations. Clearly, the service being charged to workloads depends on engines being brought online or taken offline. A similar mechanism can be used to recalibrate capture ratios if they’re found to be unreasonably high or low.
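One way to perform this kind of overlay is sketched below. It rests on assumptions that the article doesn't spell out: that CPU busy percentage is reported relative to the engines online, that the constant stays fixed at the 2084-306 value, and that the capture ratio is close to 1. The function names and the sample observations are hypothetical, not measured data.

```python
DEFINED_CONSTANT = 18_626.3097  # SU/sec per engine, 2084-306 value from the article


def predicted_ratio(engines_online, capture_ratio=1.0):
    """Predicted SU/sec per 1% CPU busy, ASSUMING busy% is relative to online engines."""
    return engines_online * DEFINED_CONSTANT * capture_ratio / 100.0


def nearest_engine_count(su_per_sec, busy_pct, candidates=(4, 6)):
    """Return the candidate engine count whose predicted ratio is closest to the observed one."""
    observed = su_per_sec / busy_pct
    return min(candidates, key=lambda n: abs(observed - predicted_ratio(n)))


# Synthetic observations for illustration only: (SU/sec, CPU busy %).
samples = [(52_000, 70.0), (78_000, 70.0)]
for su_per_sec, busy_pct in samples:
    print(su_per_sec, busy_pct, nearest_engine_count(su_per_sec, busy_pct))
```

Intervals that cluster around one predicted line rather than the other suggest how many engines were online during the interval, which is the information needed before a capture ratio is recalculated.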
The effect of IRD on SU calculations can be quite significant, depending on the range of actions that may be involved. This can be important if these values are being used in ancillary functions, such as chargeback, since work may be undercharged or overcharged based on IRD decisions.
Since the choice of SRM constant occurs when the partition is activated, it becomes increasingly important that the base definition reflect as closely as possible the most common state the partition is expected to be in. Variations from that state must then be accounted for wherever the SRM constant is used. While it’s possible to move processors over a sizable range, the best use is when the LPAR cluster consists of systems that are comparable, with minimal logical processor movement between partitions.
From a processor perspective, there are no negative consequences of specifying the maximum number of logical processors and having WLM IRD vary off those that aren’t needed for the current demand. However, this can wreak havoc with the uses of the SRM constant and its subsequent effect on WLM resource management. Unless the number of LPs online is explicitly understood when these numbers are evaluated, the conclusions drawn will be wrong.