Operating Systems

Workload Manager Metrics: Percentile Response Time

When setting goals in the Workload Manager (WLM) policy, the consensus is that response time definitions are the best to use because they are actually measured values and represent the real behavior of the workload. However, we should consider what a response time goal, and more specifically, using the percentile definition, actually means. First, let’s consider the response time definition.

Response Time

Response time is a measured result that directly reflects the actual behavior of a workload, and it does so with high accuracy. However, some points need to be remembered. A response time measurement does the first transaction no good. Since the response time can’t be known until the unit of work completes, the premise of a response time metric is to use the information gained to determine resource requirements for subsequent work units. So, it’s an absolute requirement of the response time definition that a sufficiently steady flow of work exists in the service class for those learned lessons to be retained by WLM. Failure to maintain a sufficiently high volume of work will result in each new startup beginning the process over from scratch, or with samples dominated by history data.

Several factors can mitigate this problem. Usually, a response time objective is also paired with a duration (DUR) definition. This also plays a role in the response time, since it prevents a transaction from remaining in a particular period indefinitely. With this limit in place, longer running transactions are ultimately aged out of the period, which also tends to make the goals more reasonable. This is one of the mitigating factors against the initial problem of learned lessons because it prevents too much divergence from previous behaviors, so the historical data retained by WLM will be of more practical value in managing future work.
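As a rough illustration of that aging, consider the following sketch. It is not the actual WLM algorithm; the period layout and DUR values (expressed here in service units) are hypothetical, but it shows how accumulating service pushes a long-running transaction out of the early periods.

```python
# Simplified sketch of duration-based period aging (not the real WLM code).
# A transaction moves to the next service-class period once the service it
# has consumed exceeds the DUR limit of the period it currently occupies.

# Hypothetical period definitions: (DUR in service units, description).
PERIODS = [
    (500,  "period 1: percentile response time goal"),
    (5000, "period 2: percentile response time goal"),
    (None, "period 3: velocity goal, no DUR"),   # last period is unbounded
]

def current_period(consumed_service: int) -> str:
    """Return the period a transaction occupies for a given amount of service."""
    threshold = 0
    for dur, description in PERIODS:
        if dur is None:                # last period: transaction stays here
            return description
        threshold += dur               # DUR limits are cumulative
        if consumed_service <= threshold:
            return description
    return PERIODS[-1][1]

for service in (100, 2500, 50000):
    print(f"{service:>6} service units -> {current_period(service)}")
```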

Often, the use of response time makes good sense when the work is extremely short running and voluminous. In this case, the work is too short for any sampling techniques, and the high volume ensures a large sample set from which to develop a reliable picture of the transaction’s requirements. Include a realistic DUR parameter, and the work within a particular service class period can be highly predictable and controllable in meeting its objectives. The presumption of meeting objectives assumes the work is sufficiently important and isn’t being unduly preempted by other service classes. If available resources are insufficient, then no set of definitions will provide good performance.

The situation is less clear when we get into longer response times and fewer transactions. The first completed transaction establishes the requirements of the service class, so the shorter the transactions and the more of them that exist, the better the samples returned to WLM and the more accurately the requirements can be assessed. If we have transactions with response times on the order of hours, and only two or three of them running during any interval, the information available to WLM to effectively manage the response time of this workload is severely compromised. Response time, unlike velocity, is a singular event; it occurs only once in the life of a transaction.
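The contrast can be sketched quickly. A velocity goal is evaluated from using and delay samples collected while work is in flight, whereas a response time sample simply does not exist until a transaction ends. The numbers below are made up for illustration.

```python
# Sketch of the difference between the two goal types (illustrative numbers).

def execution_velocity(using_samples: int, delay_samples: int) -> float:
    """Velocity can be evaluated at any time from in-flight samples."""
    return 100.0 * using_samples / (using_samples + delay_samples)

# Response time, by contrast, is a single value that exists only once a
# transaction completes; in-flight work contributes nothing to the goal.
completed_response_times: list[float] = []        # nothing has ended yet
in_flight = ["JOBA", "JOBB", "JOBC"]              # running, outcome unknown

print(execution_velocity(using_samples=300, delay_samples=700))  # 30.0
print(len(completed_response_times))              # 0 samples to evaluate
```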

When a transaction ends, it can contribute a value to WLM. Until then, WLM is completely unaware of how well or poorly the work is performing. While WLM will have information gained from previously completed transactions, its ability to intervene in these “in-flight” transactions is severely restricted, since the outcome is unknown. The problem is easier to see with an example of three jobs defined with a response time goal of two hours. Let’s also assume these jobs are staggered so they are expected to start and end at different times. Only after two hours can WLM determine whether the goal is being met. If the first job hasn’t completed by then, the goal has been missed, but WLM can’t yet begin to assign resources to improve the response time because it isn’t yet a known quantity. Even when that job ends, the problem in this case is that we’ll have exactly one sample to evaluate. In addition, the performance index can’t be calculated while the work is still running, so no policy adjustment can be made until the response time interval completes. When the other jobs end, we’ll have consumed several hours of elapsed time and will end up with just three samples from which to evaluate a response time policy.
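To make the timeline concrete, here is a small sketch of those three staggered jobs. The start times and actual run times are assumed purely for illustration, and the performance index is computed the way it is commonly described for response time goals: achieved response time divided by the goal.

```python
# Hypothetical timeline for three staggered jobs with a two-hour goal.
# Each job yields exactly one response time sample, and only when it ends.
GOAL_HOURS = 2.0
start_hours  = {"JOBA": 0.0, "JOBB": 1.0, "JOBC": 2.0}   # assumed starts
actual_hours = {"JOBA": 2.5, "JOBB": 2.2, "JOBC": 1.9}   # assumed run times

samples = []
for name, start in start_hours.items():
    end = start + actual_hours[name]
    samples.append((end, actual_hours[name]))
    print(f"{name}: response time sample available only at hour {end:.1f}")

# Performance index for a response time goal, commonly described as achieved
# response time divided by the goal (PI > 1 means the goal is being missed).
first_end, first_rt = min(samples)
print(f"first usable sample after {first_end:.1f} hours, PI = {first_rt / GOAL_HOURS:.2f}")
```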

While it can certainly be done, the practical outcome, especially if the service class can’t be kept continuously busy, is an extremely long reaction time. In fact, even with a steady flow of work, it may take days before WLM finds the precise resource mix needed to succeed. In short, response time goals are most useful when the transactions are short enough to allow quick reaction and voluminous enough to keep the feedback to WLM high.

Percentile Response Times

Even with the caveats described previously, precise control of transactions isn’t always possible. In some cases, there may still be legitimate outliers whose behavior is sufficiently out of line that their existence skews the average. In these cases, it may be desirable to eliminate them from the calculation, and the use of a percentile goal is warranted. However, let’s be clear that an AVERAGE response time presumes some transactions will run longer than desired while others run shorter, but that they will operate within a reasonable range of each other so the average stays within the defined goal. The concept of the outlier presumes the exceptional transactions are never offset by correspondingly short ones and must be eliminated because their influence on goal management is misleading.
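A brief numerical sketch, with made-up response times, shows the difference. One outlier drags the average past the goal even though nine out of ten transactions finish well within it; a percentile goal stated as “90% complete within one second” is still met.

```python
# Illustration of average vs. percentile goals (times in seconds, values
# hypothetical). One 30-second outlier ruins the average but not the
# 90th-percentile view of the same workload.
from statistics import mean

response_times = [0.4, 0.5, 0.5, 0.6, 0.6, 0.7, 0.7, 0.8, 0.9, 30.0]
goal_seconds = 1.0

print(f"average response time = {mean(response_times):.2f}s")   # 3.57s, goal missed

within_goal = sum(1 for rt in response_times if rt <= goal_seconds)
percent_within = 100 * within_goal / len(response_times)
print(f"{percent_within:.0f}% of transactions completed within {goal_seconds}s")  # 90%
```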
