Feb 1 ’06

Workload Manager Metrics: Percentile Response Time

by Editor in z/Journal

When setting goals in the Workload Manager (WLM) policy, the consensus is that response time goals are the best to use because they are actual measured values and represent the real behavior of the workload. However, we should consider what a response time goal, and more specifically a percentile definition, actually means. First, let’s consider the response time definition.

Response Time

Response time is a measured result that clearly indicates the specific behavior of a workload and is highly accurate in reflecting that behavior. However, some points need to be remembered. A response time measurement does the first transaction no good. Since the response time can’t be known until the unit of work completes, the premise of a response time metric is to use the information gained to determine resource requirements for subsequent work units. So, it’s an absolute requirement of the response time definition that a sufficiently steady flow of work exists in the service class for those learned lessons to be retained by WLM. Failure to maintain a sufficiently high level of work units will result in each new startup beginning the process over from scratch, or with samples dominated by history data.

Several factors mitigate this problem. Usually, a response time objective is paired with a duration (DUR) definition. The duration also plays a role in response time, since it prevents a transaction from remaining in a particular period indefinitely. With this limit, longer-running transactions are ultimately aged out of the period, which also tends to make the goal more reasonable. This is one of the mitigating factors against the learned-lessons problem: it prevents too much divergence from previous behaviors, so the historical data retained by WLM will be of more practical value in managing future work.

Often, a response time goal makes good sense when the work is extremely short running and voluminous. In this case, the work is too short for any sampling technique, and the high volume ensures a large sample set from which to develop a reliable picture of the transaction’s requirements. Include a realistic DUR parameter, and the work within a particular service class period can be highly predictable and controllable in meeting its objectives. This presumes the work is sufficiently important and isn’t being unduly pre-empted by other service classes; if available resources are insufficient, then no set of definitions will provide good performance.

The situation is less clear when we get into longer response times and fewer transactions. Each completed transaction contributes to establishing the requirements of the service class, so the shorter the transactions and the more of them that exist, the better the samples returned to WLM and the more accurately the requirements can be assessed. If we have transactions with response times on the order of hours, and only two or three of them running during any interval, the information available to WLM for effectively managing the response time of this workload is severely compromised. Response time, unlike velocity, is a singular event; it occurs only once in the life of a transaction.

When a transaction ends, it can contribute a value to WLM. Until then, WLM is completely unaware of how well or poorly the work is performing. While WLM will have information gained from previously completed transactions, its ability to intervene in these “in-flight” transactions is severely restricted, since the outcome is unknown. An example of three jobs defined with a response time goal of two hours shows the problem more clearly. Let’s also assume these jobs are staggered so they are expected to start and end at different times. Only after two hours can WLM determine whether the goal is being met. If the first job isn’t complete by then, the goal has been missed, but WLM can’t yet begin to assign resources to improve the response time because it isn’t yet a known quantity. However, even when the job ends, we’ll have exactly one sample to evaluate. In addition, the performance index can’t be calculated while the work is still running, so no policy adjustment can be made until the response time interval completes. When the other jobs end, we’ll have consumed several hours of elapsed time and will end up with just three samples from which to evaluate the response time policy.
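To see how sparse that feedback really is, here is a minimal sketch in Python (not anything WLM actually runs) that simply counts how many completed-transaction samples exist at various points in the day. The start times are hypothetical; only the two-hour elapsed time comes from the example above.

```python
# Illustrative only: three staggered batch jobs, each running about two hours.
# Start times are hypothetical, expressed in minutes from midnight.
starts = [8 * 60, 9 * 60 + 30, 11 * 60]   # 08:00, 09:30, 11:00
elapsed = 120                              # two-hour response time per job

completions = sorted(start + elapsed for start in starts)

# How many completed response time samples would exist at each hourly check?
for check in range(8 * 60, 14 * 60 + 1, 60):
    samples = sum(1 for c in completions if c <= check)
    print(f"{check // 60:02d}:{check % 60:02d} -> {samples} completed sample(s)")
```

After six hours of elapsed time there are exactly three samples, which is far too little feedback for WLM to evaluate, let alone correct, the goal.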

While it can certainly be done, the practical outcome is that WLM’s ability to respond, especially if the service class can’t be kept continuously busy, suffers from an extremely long reaction time. In fact, even given a steady flow of work, it may take days before WLM finds the precise resource mix needed to succeed. In short, response time goals are most useful when the transactions are short enough to allow quick reaction times and voluminous enough to keep the level of feedback to WLM high.

Percentile Response Times

Even with the caveats described previously, precise control of transactions isn’t always possible. In some cases, there may still be legitimate outliers, where the behavior of a transaction is sufficiently out of line that its existence skews the average. In these cases, it may be desirable to eliminate them from the calculation, so the use of a percentile goal is warranted. However, let’s be clear: an AVERAGE response time presumes some transactions will run longer than desired while others run shorter, but that they will operate within reasonable range of each other so the average stays within the defined goal. The concept of the outlier presumes the exceptional transactions are never offset and must be eliminated because their influence distorts goal management.

Bearing in mind the conditions outlined previously, where we have a properly defined duration and a reasonable average, these outlier transactions represent anomalies that must still be accommodated. The problem arises when the percentile chosen is arbitrary and is used to avoid properly setting the other values.

When values such as 70 or 80 percent are used, the indication is that these aren’t outliers so much as they are simply longer transactions, still consuming service, that are being erroneously included in a period they should have long since migrated out of. The most important consideration for percentile-based goals is that the percentage being excluded isn’t being managed at all, although those transactions benefit or suffer based on the behavior of the transactions within the percentile! I’m not suggesting that WLM is selective about which transactions it manages, but rather that when a goal is being met, the service class isn’t examined for any additional requirements. In effect, transactions outside the percentile will gain nothing beyond what is already available to the service class. In particular, consider that if a percentile goal is set low enough, then any combination of transactions can be considered successful.

Consider the case of a goal defined as 50 percent complete within 60 seconds. In reality, this goal means that, for every two transactions in this service class, we’d like one to complete within 60 seconds while the other can run as long as it likes. Using this example, if one transaction runs in 60 seconds and the other runs in 10 hours, the performance index would still be 1.0, indicating that the goal has been perfectly met.
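To make that arithmetic concrete, here is a minimal Python sketch of how a percentile goal can be judged from a list of completion times. It is not WLM’s actual algorithm (WLM works from a bucketed response time distribution it maintains internally); the function name and the simple index-based percentile are purely illustrative.

```python
# Illustrative only: judge a percentile response time goal from raw completion times.
def percentile_goal(response_times, goal_seconds, goal_percent):
    """Return (percent completed within goal, approximate performance index)."""
    times = sorted(response_times)
    within = sum(1 for t in times if t <= goal_seconds)
    achieved_percent = 100.0 * within / len(times)

    # Response time at which the requested percentile is actually reached.
    idx = max(0, int(len(times) * goal_percent / 100.0) - 1)
    return achieved_percent, times[idx] / goal_seconds

# The example from the text: one transaction in 60 seconds, one in 10 hours,
# against a goal of 50 percent complete within 60 seconds.
print(percentile_goal([60.0, 36000.0], goal_seconds=60.0, goal_percent=50))
# -> (50.0, 1.0): the goal is "met" even though one transaction ran 10 hours.
```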

As a more extreme example, consider what would happen if we combined all the service classes shown in Figure 1 into one service class with a percentile goal.

If we set the goal for this universal service class to 80 percent complete in 0.500 seconds, we’d get a rather interesting result. Combining all the service classes yields 1344.2 ended transactions per RMF interval, of which 1235.1 are TSO first-period transactions. (Even though the data covers three and a half days’ worth of samples, the ended-transaction counts are the averages for each reported interval.)

Even with some variation within the TSO first-period transactions, the result of this combination is that the performance index would be 1.0 or less, indicating the goal has been met. The reason is simple: TSO first-period transactions, the vast majority of which complete well within 0.500 seconds, account for roughly 92 percent of the ended transactions, so that one workload alone can satisfy the 80 percent goal no matter how long everything else runs. We can easily see this result is absurd and isn’t indicative of actual system activity, yet this is what happens when a percentile goal is used without the workload being clearly understood.
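The same back-of-the-envelope check can be written down. In the sketch below, the ended-transaction counts are the ones quoted above from Figure 1, while the fractions completing within 0.500 seconds are hypothetical; the point is only that one dominant, fast workload can carry an 80 percent goal for the entire combined class.

```python
# Illustrative only: a combined percentile goal across merged service classes.
classes = {
    "TSO period 1":    (1235.1, 0.95),  # ended per interval, fraction within 0.500 s
    "everything else": (109.1, 0.00),   # assume none of these make the goal at all
}

total = sum(count for count, _ in classes.values())
within = sum(count * fraction for count, fraction in classes.values())

print(f"combined percentile achieved: {100 * within / total:.1f}%")
# -> roughly 87%, comfortably above the 80% goal even though every
#    non-TSO transaction missed it.
```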

That result means the transactions outside the percentile goal aren’t being adjusted, nor do they contribute to the service class objectives. In fact, they’re simply running and hoping for the best. The response time goal may be sub-second, but transactions could easily run for hours and WLM wouldn’t take any action to deal with that situation.

So, it’s important that the percentile ranking be used to eliminate legitimate outliers, and not simply as a convenient way to avoid understanding the transactions that are running.

 

Selecting Values

One way to ensure values are being used properly is to assess the distribution of response times and to evaluate how much service is typically consumed by an ended transaction. Together, this information can be used to ensure a response time goal is defined precisely enough that we can be reasonably confident the work in that service class is consistent enough to be managed in this fashion. Looking at the report in Figure 2, a few points are immediately apparent.

The duration (DUR) is defined as 300 service units, although the service consumed per ended transaction is only 28.7102 service units. In effect, we’re allowing transactions to remain in this period for roughly 10 times the service that most transactions actually require. This lets more longer-running transactions remain in the period, which will skew any averages that might exist.

We also can see from the response time distribution that there are two spikes in the graph: one at 0.250 seconds and one at 2.00 seconds. The second spike indicates we may have transactions in this period that should have transitioned into second period, since they are clearly grouped together as longer running. By reducing the duration (DUR) to 30, we’d bring this definition more in line with what we expect our average transactions to do. In addition, a duration of 30 would be short enough that no active transaction can remain in this period much beyond the norm before being shifted to second period.
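A quick sanity check of this kind is easy to script. In the sketch below (illustrative Python), only the DUR of 300 and the 28.7102 service units per ended transaction come from the report; the response time distribution is hypothetical but shaped like the two spikes just described.

```python
# Illustrative only: compare a period's DUR with observed consumption.
dur = 300                   # service units, from the service class definition
avg_su_per_tran = 28.7102   # service units per ended transaction (from the report)

print(f"DUR is {dur / avg_su_per_tran:.1f}x the typical transaction")
# -> roughly 10x: transactions can linger in period 1 far longer than they need to.

# Hypothetical response time distribution (bucket midpoint in seconds -> ended count).
distribution = {0.25: 820, 0.50: 95, 1.00: 40, 2.00: 310, 4.00: 15}
total = sum(distribution.values())

for rt, count in sorted(distribution.items()):
    print(f"{rt:5.2f}s  {count:5d}  {100 * count / total:5.1f}%")
# The cluster at 2.00 s is the second spike: work that has likely outstayed its
# welcome in period 1.  A DUR near 30 service units (about one typical
# transaction's worth) would age it into second period instead.
```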

Once this occurs, we could still see some comparable transaction distributions, but the outliers would then represent long-running transactions that are largely idle or waiting. Since they can’t be consuming much service and still remain in this period, these long-running but small service consumers give us a starting point for investigation. When we understand the work in this period, we can more realistically assess what constitutes a legitimate outlier that should be excluded through the percentile definition vs. a poorly performing transaction that we would want WLM to react to by assigning more resources to the service class.

Summary

Overall, response time goals and percentiles are extremely useful definitions to classify specific types of work and to provide the flexibility for managing responsiveness and those exceptions that can skew averages. However, as always, it’s important that the analyst understand the work running in the service class to ensure the work being managed is meeting the true objectives of the installation.