Operating Systems

Workload Manager Metrics: Percentile Response Time

2 Pages

Given the conditions outlined previously (a properly defined duration and a reasonable average), these outlier transactions represent anomalies that must still be accommodated. The problem arises when the percentile is chosen arbitrarily and is used to avoid setting other values properly.

When values such as 70 or 80 percent are used, the indication is that the excluded transactions aren’t outliers so much as longer transactions, consuming more service, that are erroneously included in a period they should long since have migrated out of. The most important consideration for percentile-based transaction goals is that the excluded percentage isn’t being managed at all, although those transactions benefit or suffer based on the behavior of the transactions within the percentile. I’m not suggesting that WLM is selective about which transactions it manages, but rather that when a goal is being met, the service class isn’t examined for any additional requirements. In effect, transactions outside the percentile gain nothing beyond what is already available to the service class. In particular, if a percentile goal is set low enough, any combination of transactions can be considered successful.

Consider the case of a goal defined as 50 percent complete within 60 seconds. In reality, this goal means that, for every two transactions in this service class, we’d like one to complete within 60 seconds while the other can run as long as it likes. Using this example, if one transaction runs in 60 seconds and the other runs in 10 hours, the performance index would still be 1.0, indicating that the goal has been perfectly met.
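The arithmetic behind this example can be sketched in a few lines of Python. This is a simplified illustration, not WLM's implementation: it assumes the performance index for a percentile goal is the response time by which the goal percentage of transactions has ended, divided by the goal time, and it uses exact response times where WLM works from a bucketed response time distribution. The function name and the nearest-rank percentile method are assumptions made for the illustration.

```python
def percentile_performance_index(response_times, percent, goal_seconds):
    """Simplified PI for a percentile goal (illustrative, not WLM's algorithm).

    Returns the response time by which `percent` of the transactions
    ended, divided by the goal response time.
    """
    if not response_times:
        raise ValueError("need at least one ended transaction")
    ordered = sorted(response_times)
    # Nearest-rank percentile: the smallest rank that covers `percent`
    # of the transactions (integer ceiling of percent * n / 100).
    rank = max(1, -(-percent * len(ordered) // 100))
    return ordered[rank - 1] / goal_seconds


# Goal: 50 percent complete within 60 seconds.
# One transaction takes 60 seconds, the other 10 hours.
two_transactions = [60.0, 10 * 3600.0]
print(percentile_performance_index(two_transactions, 50, 60.0))  # 1.0

# Making the slow transaction ten times slower changes nothing.
print(percentile_performance_index([60.0, 100 * 3600.0], 50, 60.0))  # 1.0
```

Under these assumptions, the transaction outside the percentile has no influence on the index at all; only the transactions inside the percentile determine whether the goal is reported as met.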

As a more extreme example, consider what would happen if we combined all the service classes into one percentile service class from the data shown in Figure 1.

If we set the goal for this universal service class to 80 percent complete in 0.500 seconds, we’d get a rather interesting result. Combining all the service classes results in 1344.2 ended transactions, of which 1235.1 (about 92 percent) are TSO first period transactions. The ended transactions are reported per RMF interval. Even though the data consists of three and a half days’ worth of samples, the figures shown are averages for each reported interval, which is why they aren’t whole numbers.

Even with some variation within the TSO first period transactions, the result of this combination is that the performance index would be 1.0 or less, indicating the goal has been met. In fact, we can easily see this result is absurd and isn’t indicative of actual system activity, yet this is the result of using the percentile goal when the workload isn’t clearly understood.
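To see how much variation the TSO first period work can absorb, the two counts quoted above can be run through a short calculation. This is only a sketch: it takes the worst case in which no transaction outside TSO first period ends within 0.500 seconds, and the variable names are invented for the illustration.

```python
# Figures quoted in the text (averages per RMF interval).
TOTAL_ENDED = 1344.2        # ended transactions, all service classes combined
TSO_FIRST_PERIOD = 1235.1   # of which TSO first period
GOAL_PERCENT = 80           # goal: 80 percent complete in 0.500 seconds

# Share of the combined workload that is TSO first period.
tso_share = TSO_FIRST_PERIOD / TOTAL_ENDED

# Transactions that must end within the goal time to satisfy the percentile.
needed_within_goal = GOAL_PERCENT / 100 * TOTAL_ENDED

# Worst-case assumption: only TSO first period transactions meet the goal time.
required_tso_fraction = needed_within_goal / TSO_FIRST_PERIOD

print(f"TSO first period share of ended transactions: {tso_share:.1%}")
print(f"TSO first period fraction that must meet the goal: {required_tso_fraction:.1%}")
```

Even under that worst-case assumption, roughly 13 percent of the TSO first period transactions could miss the 0.500 second target and the combined service class would still report its goal as met, with every other workload ignored entirely.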

That result means the transactions outside the percentile goal aren’t being adjusted, nor do they contribute to the service class objectives. In fact, they’re simply running and hoping for the best. The response time goal may be sub-second, but transactions could easily run for hours and WLM wouldn’t take any action to deal with that situation.

So, it’s important that the percentile ranking be used to eliminate legitimate outliers and not simply as a convenient way to avoid understanding the transactions running.

 

Selecting Values

One way to ensure values are being used properly is to assess the distribution of response times and to evaluate how much service is typically consumed by an ended transaction. This combination of information can be used to ensure a response time goal is defined with enough precision to be reasonably confident the work in that service class is consistent enough to be managed in this fashion. Looking at the report in Figure 2, we can immediately see a few points.

The duration (DUR) is defined as 300 service units, although the service consumed per ended transaction is only 28.7102 service units. In effect, we’re allowing transactions to remain in this period for roughly 10 times the service units that most transactions actually require. As a result, more long-running transactions remain in this period, which skews the averages.

We can also see that the response time distribution has two spikes: one at 0.250 seconds and one at 2.00 seconds. The second spike indicates we may have transactions in this period that should have transitioned into second period, since they are clearly grouped together as longer running. By reducing the duration (DUR) to 30, we’d bring this definition more in line with what we expect our average transactions to do. In addition, a duration of 30 would be short enough that no actively consuming transaction could remain in this period beyond the norm before being shifted to second period.
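The relationship between the duration and the service actually consumed can be checked with a quick calculation. This is a minimal sketch using only the values quoted from the report; the function name is invented for the illustration, and the ratio is a rough sizing aid rather than anything WLM itself reports.

```python
SERVICE_PER_ENDED_TRANSACTION = 28.7102  # service units, from the report
CURRENT_DURATION = 300                   # DUR as currently defined
PROPOSED_DURATION = 30                   # DUR suggested in the text


def duration_headroom(duration, service_per_transaction):
    """How many typical transactions' worth of service the period allows."""
    return duration / service_per_transaction


current = duration_headroom(CURRENT_DURATION, SERVICE_PER_ENDED_TRANSACTION)
proposed = duration_headroom(PROPOSED_DURATION, SERVICE_PER_ENDED_TRANSACTION)

print(f"Current DUR allows {current:.1f}x the typical service per transaction")
print(f"Proposed DUR allows {proposed:.2f}x the typical service per transaction")
```

A ratio near 1 means a transaction that consumes noticeably more than the typical amount of service leaves first period promptly, while a ratio above 10 lets it stay and distort the period's response time distribution.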

Once this occurs, we could still see a comparable response time distribution, but the outliers would then represent long-running transactions that are largely idle or waiting, because a transaction can’t remain in this period while consuming much service. That gives us a starting point for investigating these long-running but small service consumers. When we understand the work in this period, we can more realistically distinguish a legitimate outlier, which should be excluded through the percentile definition, from a poorly performing transaction that we would want WLM to react to by assigning more resources to the service class.

Summary

Overall, response time goals and percentiles are extremely useful for classifying specific types of work, and they provide the flexibility to manage responsiveness while accommodating the exceptions that can skew averages. However, as always, it’s important that the analyst understand the work running in the service class to ensure the work being managed is meeting the true objectives of the installation.
