Sep 1 ’03
Workload Manager: Revisiting Goals Over Time
The MVS Workload Manager (WLM) was first introduced with MVS/ESA 5.1. Since then, it has evolved to become one of the most integral parts of the z/OS operating system and of IBM’s overall strategy for self-managing, self-healing, and self-tuning systems. I remember that when WLM was first introduced, many performance analysts were concerned their jobs would be eliminated or that they would lose control over system performance. Well, I can confidently say that neither has happened.
Over the years, I’ve worked with dozens of installations that were either migrating to goal mode or were already in goal mode and needed to fine-tune their WLM Service Definition or goals. I’ve worked with installations whose systems were having problems meeting WLM-defined goals, as well as with installations that were having problems managing their goal mode systems. My WLM classes are filled with students who want to get that sense of confidence in the z/OS WLM by understanding why it treats particular workloads certain ways.
Although no two situations or questions are ever the same, I have come to conclude that most, if not all, the phone calls, assignments, inquiries, and student questions fall into a set of scenarios and situations.
This article presents these scenarios from a high-level approach. If you recognize your installation, your z/systems, or yourself in any of these scenarios, then I recommend you put “Investigate possible changes to WLM Service Definition” on your project to-do list.
This article will not delve into methodologies for conducting goal mode evaluations or discuss how to determine what changes need to be made in a particular situation. Rather, this article will provide insight into WLM and help you structure your WLM responsibilities.
SCENARIO 1: IMPROPERLY SET GOALS OR WLM CONTROLS
Your migration to WLM goal mode was successful and compatibility mode is now a thing of the past. However, some goals have never regularly been met, or some work is not being managed as expected by WLM. Why is this? Could it be due to improperly set goals or incorrect service definition setup? The answer to this question is absolutely yes.
Installations had a variety of methodologies to choose from to help them migrate to goal mode. These included developing a service definition from scratch, using a sample service definition as a guideline, attempting to translate pre-existing IPS/ICS controls into a WLM Service Definition, or some combination of all three. Any methodology, mixed with a slight misunderstanding of WLM, made it possible to set improper goals and/or WLM definitions.
Although I have seen many different cases of improperly set goals and WLM definitions, some of the more common and easily explainable reasons include:
- WLM goals set to unrealistic expectations
- An overly easy response time or velocity goal
- The use of average response time goals when percentile goals were more appropriate
- The use of a response time goal when a velocity goal was more appropriate (or vice versa)
- Work being assigned an improper importance relative to the other work in the system
- Work being assigned a discretionary goal when it truly is not discretionary
- Too much work in SYSSTC or wrong work in SYSSTC
- Improper importance settings
- Improper use of resource group minimums and maximums
- Improper use of the storage or CPU critical controls
- Incorrect setup of service classes, periods, period durations, etc.
Rather than going into detail about each of these reasons, let me give you just one example. One of the most common incorrect settings I see is an overly aggressive velocity goal. Without going into too much detail about velocity, what it is, what it means, etc., let me just remind you that velocity is a function of the using and delay samples collected by the WLM sampler. The CPU using and delay samples have a heavy influence on velocity, so high delays could cause low velocities. Low velocities are not necessarily bad, and, in fact, may be perfectly acceptable and expected. Let’s take a high-level look at the example shown in Figure 1.
If a service class period running on a five-way processor has 30 units of work, and each unit wants to use the CPU concurrently, then during any given sampling interval, at most five dispatchable units could be found using the CPU, and the remaining 25 units could be found delayed for the CPU. This does not necessarily mean the work is performing poorly. It could mean that delay is inherent in the workload. Given the characteristics of the workload and the physical processor resource, the high delay for CPU is expected. In this example, because delay is inherent in the workload, a velocity of 60 may be too aggressive, and a velocity of 10 may be more realistic.
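To make the arithmetic concrete, here is a minimal sketch of the velocity calculation for this example. It assumes the standard definition of execution velocity (using samples divided by using plus delay samples, times 100) and counts only CPU using and CPU delay samples. The function and variable names are illustrative and are not part of any WLM interface. Real workloads also accumulate other using and delay samples, so treat the result as a rough ceiling rather than a prediction.

```python
def execution_velocity(using_samples: int, delay_samples: int) -> float:
    """Velocity as a percentage: using / (using + delay) * 100."""
    total = using_samples + delay_samples
    if total == 0:
        return 0.0  # no samples collected, so there is nothing to report
    return 100.0 * using_samples / total


# Figure 1 example: 30 units of work on a five-way processor,
# with every unit wanting the CPU in every sampling interval (an assumption).
cpus = 5
units_of_work = 30

using = min(cpus, units_of_work)   # at most one unit per CPU can be "using"
delayed = units_of_work - using    # everything else is found delayed for CPU

velocity = execution_velocity(using, delayed)
print(f"CPU-only velocity ceiling: {velocity:.1f}")
```

Under these assumptions the ceiling is well below 60, which is why a goal of 60 could never be met for this work no matter how WLM adjusts its controls.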
SCENARIO 2: LATELY, GOALS ARE REGULARLY BEING MISSED
Your migration to WLM goal mode was successful, and you finally overcame your improperly set goals and service definition controls. Everything has been running well, but now some time has passed. Lately, WLM seems to be managing the system and the workloads a bit differently than it did previously. Why is it that some goals that used to be met regularly are now being missed? Why is it that over time WLM manages the work differently, or the results of WLM management are different? Why do service definitions and goals need to be revisited regularly?
The answer to this is simple: Over time, the system, workloads, applications, and even the users change. Given a finite amount of resources, if a workload grows or changes, then it could result in WLM controls that are no longer appropriate for the current environment.
As with the previous scenario, this scenario has many possible causes:
- Growth in an existing workload or application
- Growth in SYSSTC, system address spaces, and/or monitors
- Changes in the capacity or configuration of the hardware
- Server/image consolidation
- Changes in software product levels or applications
- Growth in system address spaces
- A reduction in a workload
- Introduction of a new workload.
Allow me to elaborate on one simple cause: growth in an existing workload, which may cause that workload to consume more system resources. On systems where resources are plentiful, growth of a workload may not impact the performance of the other workloads.
On systems with a shortage of resources, WLM tries to ensure that the resources are allocated to the highest importance workloads as needed. If a workload that has grown requires more resources, and is assigned a higher importance level than other workloads, then WLM may decide to take the required resources away from work at lower importance to give to the grown workload. In the past, these lower importance workloads may have had no problem achieving their objectives, but with this new distribution of resources, those same, unchanged, lower importance workloads may now miss their goals.
The usual indicator of this scenario is a rise in the Performance Indexes of the lower importance workloads that coincides with an increase in transaction volume or resource consumption by the higher importance work.
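As a rough illustration of that indicator, the sketch below computes a Performance Index (PI) for a lower importance service class over several reporting intervals. It assumes the usual definitions: for a velocity goal, PI is the goal velocity divided by the achieved velocity; for an average response time goal, PI is the achieved response time divided by the goal. A PI above 1 means the goal is being missed. The interval data, the goal value, and all names are made up for illustration and do not come from a real system.

```python
def pi_velocity(goal_velocity: float, achieved_velocity: float) -> float:
    """PI for a velocity goal: goal / achieved (above 1.0 means the goal is missed)."""
    if achieved_velocity <= 0:
        raise ValueError("achieved velocity must be positive")
    return goal_velocity / achieved_velocity


def pi_response_time(goal_seconds: float, achieved_seconds: float) -> float:
    """PI for an average response time goal: achieved / goal."""
    if goal_seconds <= 0:
        raise ValueError("goal must be positive")
    return achieved_seconds / goal_seconds


# Hypothetical data: (interval, transaction rate of the higher importance
# work, achieved velocity of the lower importance work).
GOAL_VELOCITY = 30.0
intervals = [
    ("week 1", 100, 35.0),
    ("week 2", 140, 28.0),
    ("week 3", 180, 20.0),
]

pis = [pi_velocity(GOAL_VELOCITY, achieved) for _, _, achieved in intervals]

for (label, rate, achieved), pi in zip(intervals, pis):
    status = "missed" if pi > 1.0 else "met"
    print(f"{label}: high-importance rate={rate}, "
          f"low-importance velocity={achieved}, PI={pi:.2f} ({status})")
```

A PI for the lower importance work that climbs while the higher importance transaction rate grows is the pattern this scenario describes.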
The lesson here is to remember to regularly revisit and re-evaluate your goals over time.
SCENARIO 3: CHANGES TO THE SYSTEM ENVIRONMENT ARE PLANNED
Your systems are running great in goal mode and everyone is happy. However, your site is planning some environmental changes.
When planning a change to the workload’s environment, do the WLM Service Definition and assigned goals need to be considered? Once again, the answer is yes.
You do not need to micro-manage your goals or WLM Service Definition for every planned change to capacity, software level, or new software functionality. What I am saying is that if you are planning an environmental change that may affect the expected performance of the workloads, you need to at least consider the possibility that the change could affect the way WLM views and manages the workloads and resources, or that the environmental change may cause additional contention for resources.
Some of the many environmental changes that could affect the way WLM manages the workloads and system resources include:
- Certain changes in processor capacity
- Changes to LPAR definitions or configurations
- Changes in Sysplex configuration or technology
- Workloads introduced into an asymmetrical Sysplex
- Consolidation of servers or workloads
- Merging of two data centers or companies
- Exploitation of some of the newest DASD I/O subsystem technologies.
There are some additional areas, but this list should give you a good feel for the types of changes you should consider. Again, I am not saying to micro-tune your WLM Service Definition for all changes, but make sure you consider the possible effects on WLM management of workloads and system resources if you are considering any of these environmental changes.
As an example, let’s refer back to Figure 1, which shows a service class period that had 30 units of work that all want to run concurrently. As I mentioned, velocity goals are entirely dependent on the using and delay state samples that are collected by WLM to assess the progress of work assigned the velocity goals. An increase or decrease in the number of processes could cause more or fewer CPU using samples and more or fewer CPU delay samples. This, in turn, will affect the velocity achieved by some workloads, but may not affect others. Some goals may become too easy, and others may become too aggressive.
Now let’s say that you decide to install a processor with the same number of CPUs, each running at a faster speed. In this case, since the work would be processed faster, you would expect to see fewer delay samples, which would result in an increase in expected velocity. The same would be true if you kept the speed of the processors the same, but increased the number of processors. Conversely, if you took this same workload and ran it on slower processors, you would expect the velocity to decrease. Velocity is a goal that is very sensitive to both the speed and number of CPUs processing the workload.
This scenario becomes especially interesting when a workload runs on multiple z/OS images in an asymmetrical Sysplex (see Figure 2). If velocity goals are sensitive to the speed and number of CPUs, and since goals are Sysplex-wide, then what sort of goal is appropriate for a workload running in a Sysplex with both a small/slow machine and a big/fast machine? Should you assign an aggressive velocity goal to cater to the workload running on the big/fast machine? If you do this, you risk having a goal that is too aggressive for the workload when it is running on the small/slow machine. If you assign the workload a lower velocity to cater to the workload running on a small/slow machine, then the goal may be too easy for WLM on the big/fast machine. Each of these cases has implications for the way WLM manages the workloads and their performance.
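The dilemma can be sketched with some simple arithmetic. The example below takes one velocity goal and compares it against a CPU-only velocity ceiling on two machines with different numbers of CPUs. It assumes velocity is using samples divided by using plus delay samples, that every unit of work wants the CPU in every sample, and that the Performance Index for a velocity goal is the goal divided by the achieved velocity. The machine sizes, the goal of 40, and all names are hypothetical. The model also ignores CPU speed, which would widen or narrow the gap further.

```python
def cpu_velocity_ceiling(cpus: int, units_of_work: int) -> float:
    """Best-case velocity when every unit of work always wants the CPU."""
    if units_of_work <= 0:
        raise ValueError("need at least one unit of work")
    using = min(cpus, units_of_work)  # one unit per CPU can be found "using"
    return 100.0 * using / units_of_work


GOAL_VELOCITY = 40.0   # one goal for the whole Sysplex (hypothetical value)
UNITS_OF_WORK = 30     # same workload shape on both images

machines = {"small": 5, "big": 16}  # hypothetical CPU counts

results = {}
for name, cpus in machines.items():
    ceiling = cpu_velocity_ceiling(cpus, UNITS_OF_WORK)
    pi = GOAL_VELOCITY / ceiling  # above 1.0 means the goal cannot be met
    results[name] = (ceiling, pi)
    print(f"{name}: velocity ceiling={ceiling:.1f}, best-case PI={pi:.2f}")
```

With these made-up numbers, the same goal is unreachable on the small machine and comfortably easy on the big one, which is exactly the trade-off described above.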
SCENARIO 4: DESIRE TO EXPLOIT ADDITIONAL WLM FUNCTIONS
Your systems are running great in goal mode, but now you want to take advantage of some WLM functions that you have not yet exploited. Perhaps these are functions introduced with a new release of z/OS, or maybe they are existing functions that you just have not yet had the opportunity to try out. Will the exploitation of various WLM functions that are not already being exploited have an influence on the assigned goals or WLM Service Definition? The answer is a simple yes. Since WLM was first announced, there has been a steady stream of enhancements. Enabling or exploiting most of these new functions will change the way WLM manages the workloads and resources. Because of the power of these enhancements, the WLM design team has made most of them optional. Naturally, it follows that if you want to exploit any of these new functions, you have to change your WLM Service Definition.
I caution you that enabling any of these functions does require investigation, thought, and a before-and-after assessment of the effect any change has on goals, workloads, and WLM management. Figure 3 shows a list of WLM enhancements and the categories in which they fall. Although few are required, all deserve consideration, and all either benefit or hurt your workloads. There are other enhancements, but for brevity, I listed the ones that I find most interesting.
SCENARIO 5: CHANGES TO WLM, SYSTEM PROBLEMS, IMPROPER TUNING
Your performance monitors indicate that some response time goals are being missed due to high response times, and work with velocity goals is showing abnormal velocities. Some users are calling and complaining, or maybe the systems seem “sluggish.” However, your performance monitors indicate that WLM is doing the best it can! The dispatch priorities seem to be correct; there is no paging, and WLM delay samples are nearly non-existent. What is going on?
Why is it that WLM appears to be doing its best, but work is still not performing well, or goals are being missed? Are changes required to the WLM controls?
The key to this scenario is to remember that WLM may not be able to meet goals on systems that are out of capacity or for systems that are not tuned. WLM is only going to help in one aspect of workload management. You still have to tune your systems, Sysplexes, subsystems, and workloads.
No matter how well-intentioned a WLM goal is, it is only an objective that WLM pursues by alleviating the delays it knows about, using the resources it controls. In other words, workloads could be missing their goals due to reasons beyond WLM’s control. For example, if a workload is missing its response time goal because it is spending too much time in the coupling facility, WLM is currently not capable of tuning the coupling facility or structures to help this work meet its objectives. Therefore, no matter how hard WLM tries, it may not be able to help this workload if its real problem is an improperly tuned coupling facility or structure.
Common delay reasons beyond WLM’s scope of control include:
- Improperly tuned subsystems or workloads
- Improperly tuned Sysplex facilities such as coupling facility, structures, Cross System Coupling Facility (XCF), etc.
- Poorly tuned databases, applications, or database calls
- Poorly tuned z/OS systems or improper system setup
- Insufficient hardware capacity or poor I/O configuration.
The lesson here is to make sure that your systems and Sysplexes are tuned, and that your WLM controls take into account the time the workloads spend in states that are beyond WLM’s control and management. WLM goals and controls should always reflect the reality of the system, and it is still up to you to provide the workloads the best reality possible.
SCENARIO 6: BUSINESS PRIORITIES AND OBJECTIVES ARE CHANGING
In this scenario, all is running great in goal mode, but you’ve found out the business priorities and objectives are going to change. Perhaps workloads that used to be very important will no longer be as important. Maybe you want the work to achieve a different objective. Maybe an entirely brand-new key workload is being introduced.
When the business priorities change, do the WLM Service Definition and assigned goals need to be considered? The answer is yes. WLM goals and importance levels play a key role in the way WLM is going to manage the system resources and workloads. It is important for you and WLM to prioritize workloads relative to each other. WLM attempts to meet higher importance goals before trying to meet lower importance goals. Naturally, if the business priorities of the workload change, then importance levels and goals need to be revisited.
The most common examples I see for this scenario include:
- The business objectives change
- Merging of two companies/data centers
- Data center consolidation
- Consolidating workloads from multiple images to few images
- Introduction of new workload
- Server consolidation.
If your installation is planning any of these changes, then you need to consider revisiting and re-evaluating your WLM Service Definition, goals, and importance levels.
I recommend that you maintain an easily viewable version of your WLM Service Definition. In fact, you may want to convert your WLM Service Definition to HTML format and post this HTML page to some department Website so everyone on the performance team is familiar with the service definition.
If you want a simple way to convert your WLM Service Definition to HTML, feel free to visit my Website at www.epstrategies.com, select the button titled “WLM to HTML,” and you will be instructed on how to do this conversion. It is simple, easy, and the result will be very useful. If nothing else, you will see it is a much easier way of reading your WLM Service Definition than maneuvering through the WLM ISPF panels.
SCENARIO 7: INACCURACY OF REPORTED MEASUREMENTS
In this scenario, all seems to be running great in goal mode. The users are happy, your manager is happy, and life is good. However, when you look at your performance monitors, it sometimes appears goals are not being met. Or, maybe the measurements don’t even make sense to you. Could the measurements be “wrong”?
Why is it that sometimes the measurements don’t reflect reality? Could (and should) changes be made to the WLM Service Definition to make the measurements more accurate?
First, please let me clarify the title I’ve given to this scenario. When I refer to “inaccuracy” of measurements, I don’t mean to imply that the measurements are intentionally wrong and are just not being fixed by the vendors of your performance monitors. What I am referring to is that the measurements reported by monitors don’t always reflect reality. This, in turn, can lead to misunderstanding and misinterpretation of the measurements. Without understanding the internals of WLM, this is a difficult scenario to comprehend. Let it suffice to say that WLM’s management of certain workloads causes it to manage workloads outside the defined WLM controls that you’ve set. This, in turn, sometimes results in the reports being misleading or difficult to understand.
The most common cases I see for this scenario include:
- WLM management of CICS or IMS workloads toward transaction goals
- Improper setup of CICS and IMS transaction goals
- WLM management of exploiters of enclaves (such as WebSphere, Stored Procedures, DDF)
- Mixing of unlike work in the same period
- What I term as “participant address spaces” being classified to SYSSTC
- Short, response time workloads mixed into the same periods as long-running address spaces or enclaves.
SCENARIO 8: PLANS TO EXPLOIT NEW NON-WLM FUNCTIONS THAT WILL AFFECT PERFORMANCE
All is running great in goal mode, but now your installation is planning to take advantage of some new non-WLM functions that may affect the performance of the workloads. Not only that, but some of these changes may affect the software bills.
How do we manage what may be conflicting objectives? Do the new non-WLM functions lead us to consider making changes to our WLM definitions? Is there a connection between pricing and WLM?
Examples of some such facilities include:
- Intelligent Resource Director (IRD)
- On/Off Capacity on Demand (COD)
- Workload License Charges (WLC).
Al Sherkow, of I/S Management Strategies, Ltd., has previously written for z/Journal on the subjects of IRD and WLC. Visit the z/Journal Website at www.zjournal.com to view these articles, or visit Al’s Website at www.sherkow.com for some great presentations and papers on these subjects.
As Al has pointed out, each of the aforementioned items could affect system capacity. As I mentioned previously, if the capacity of the processors changes, then it is entirely possible for the workloads to achieve different velocities or transaction response times. With IRD and WLC, we now have facilities that may change the system capacity multiple times, dynamically, throughout the day.
How will these affect the achievement of goals and will goals need to be modified? If a partition has been capped due to IRD or WLC, we would expect to see the impact first on the lower importance workloads since they are the first ones sacrificed when resources become scarce.
Al Sherkow has developed a new and interesting concept of “expendable MSUs.” Al realized that when a partition is capped due to an LPAR’s defined capacity being exceeded, additional capacity is taken first from the CPU of the lower importance workloads. Al suggests that if you want to realize further savings in your software bill by pushing the defined capacity value even lower, you can do so by first understanding, and then accepting, the impact to the lower importance workloads. Only you know which workloads are expendable, and you will need to make sure your WLM importance levels are set appropriately.
It is still premature to garner the full implications of WLC, IRD, and COD on WLM goals, since much still needs to be learned about their interactions.
SCENARIO 9: OCCASIONALLY “SOMETHING STRANGE HAPPENS” OR “DOESN’T HAPPEN”
In this scenario, all is running great in goal mode, but sometimes the system just acts abnormally. It is hard to articulate, but it appears that WLM “burps,” or does not manage the workloads as expected. You have read the WLM books and talked to the WLM experts, but still you cannot understand why WLM is managing the workload as it is. There is usually no clear explanation.
What could be happening? Is there a problem or “hole” in WLM? Should changes be made to the WLM Service Definition to avoid these anomalies?
The typical cases for this scenario vary greatly, but usually happen during the following times:
- When trying out a new facility
- When implementing a new or changed workload
- During a system ABEND or dump
- When the system is under a lot of stress.
Examples of cases I’ve seen include:
- Unexpected dispatching priorities
- Resource group minimums or maximums are not honored
- High WLM overhead
- Goals are not met but no other scenario applies
- Importance levels seem to be ignored.
I’ve studied many of these cases, and no two are the same. However, you will know you are experiencing such a scenario when it occurs. When this occurs, don’t despair. My suggestion is to post a question to a z/OS performance-oriented list server, contact IBM service, or even contact me. I enjoy looking at these cases, since they teach me more about WLM. Remember, WLM is software, and it is designed not to do anything “stupid.” Having worked closely with the designers of WLM, I can tell you they are top-tier, and they’ve put a great deal of thought into many different cases and scenarios. However, this does not mean that WLM is perfect.
But it is pretty neat ... isn’t it?
WLM is one of the most interesting areas of the z/OS operating system in which you can become involved. So, good luck, and have fun.