Aug 2 ’10
Storage Performance Management: More Balance to Improve Throughput
It pays to give your storage configuration careful attention. Evenly distributing data and workloads yields better response times, more resilient operations, and higher throughput from your storage hardware. This article reveals the hidden influence of balance, which resources it affects, and which balancing techniques you can use to improve performance and throughput without upgrading your hardware. We’ll discuss how storage tuning differs fundamentally from processor tuning, and we’ll show that significant throughput and response time improvements may be possible with just a few well-chosen optimizations.
An unbalanced storage system affects both performance and cost. When hardware resources aren’t evenly loaded during peak periods, delays occur even though the resources are more than sufficient to handle the workloads. The consequence can be that hardware is replaced or upgraded unnecessarily, which is a tremendous waste of financial and other resources. Unfortunately, this happens often because the most important metrics for the internal storage system components have low visibility. If you look only at the z/OS side of I/O, these imbalances are hard to find, resolve, and prevent.
The mainframe performance perspective has always been that Workload Manager (WLM) optimizes the throughput in the z/OS environment by prioritizing work and assigning resources. This load balancing works well for identical processors in a complex. However, for storage, it’s a different story. The kind of optimization WLM performs simply isn’t possible for I/O since the location of the data is fixed. WLM can only manage the components that are shared, such as the channels and Parallel Access Volume (PAV) aliases. The internal disk storage system resources are mostly out of WLM’s control, and utilization levels of the internal components of the storage system hardware are unknown to z/OS and WLM, so work can’t be directed to optimize balance.
Let’s review how the level of balance on the major internal components of a disk storage controller influences the performance and throughput and how to create the necessary visibility to detect imbalances.
Front-End Balance
In a z/OS environment, front-end balance relates to the FICON channels and host adapter cards. Most installations maintain a good balance between the FICON channels: z/OS nicely balances the load between the channels in one path group and, with multiple path groups, most installations have ways to ensure each path group does about the same amount of work.
The less visible components here are the host adapter boards. Multiple FICON ports attach to one host adapter board, and the boards share logic, processor, and bandwidth resources between their ports. It’s therefore important to carefully design the port-to-host adapter board layout. Link statistics provide a good way to track imbalance. Figure 1 shows an example: the load on each of the FICON channels is the same, but because the links aren’t evenly distributed over the host adapter cards, the differences in load negatively influence the response times for the links on the busiest cards (see Figure 1).
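As a hypothetical illustration (the channel names and card-to-port mapping below are invented, not from any real configuration), summing per-link statistics up to the card level is enough to expose this kind of imbalance:

```python
# Hypothetical sketch: roll per-link throughput statistics up to the host
# adapter card level to reveal card-level imbalance that per-channel views hide.

from collections import defaultdict

def card_loads(link_mbps, link_to_card):
    """Sum per-link throughput up to the host adapter card each link uses."""
    loads = defaultdict(float)
    for link, mbps in link_mbps.items():
        loads[link_to_card[link]] += mbps
    return dict(loads)

# Eight equally loaded links (100 MB/s each), unevenly spread over two cards:
links = {f"CH{n}": 100.0 for n in range(8)}
mapping = {f"CH{n}": ("card0" if n < 6 else "card1") for n in range(8)}
print(card_loads(links, mapping))  # {'card0': 600.0, 'card1': 200.0}
```

Even though every channel carries the same load, card0 does three times the work of card1, which is exactly the situation Figure 1 depicts.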
RAID Parity Groups
Redundant Array of Inexpensive Disks (RAID) parity groups contain the actual data the applications want to access. The throughput of a storage system largely depends on the throughput of the RAID parity groups. A common misconception is that a disk storage system with a large amount of cache hardly uses its disks because it does most of its I/O operations from cache or to cache. Although it’s true that under normal circumstances virtually all operations occur via cache, many of those operations do cause disk activity in the background. The only operations that don’t cause a disk access are random read hits; all others access the disks at some point. Sequential reads, for instance, are mostly cache hits from the host’s perspective, but only because the data is prefetched from disk into cache first. As for writes, all writes go to cache, but they must be written to disk sooner or later, too. Moreover, for many current RAID schemes, a single write on the front-end causes more than one disk I/O on the back-end. For RAID 1 or RAID 10, a write takes two disk operations since all data is mirrored. For RAID 5, a random write takes four operations (read old data, read old parity, write new data, write new parity); for RAID 6, it takes six, because two parities must be updated. Sequential writes are much more efficient on RAID 5 and RAID 6 than random writes, but they still generate more than one back-end I/O per front-end I/O.
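These rules can be turned into a rough back-end estimate. The sketch below is a hedged simplification for random workloads, not a vendor formula; the multipliers simply encode the penalties just described:

```python
# Hypothetical sketch: estimate back-end disk I/Os generated by a random
# front-end workload. Per the rules in the text: random read hits cost 0 disk
# I/Os, read misses cost 1, and a random write costs 2 (RAID 1/10),
# 4 (RAID 5), or 6 (RAID 6) back-end operations.

RANDOM_WRITE_PENALTY = {"RAID1": 2, "RAID10": 2, "RAID5": 4, "RAID6": 6}

def backend_io_rate(read_rate, read_hit_ratio, write_rate, raid="RAID5"):
    """Estimated back-end I/Os per second for a random workload."""
    read_misses = read_rate * (1.0 - read_hit_ratio)  # only misses hit disk
    writes = write_rate * RANDOM_WRITE_PENALTY[raid]  # RAID write penalty
    return read_misses + writes

# 1,000 reads/sec at a 75% hit ratio plus 200 random writes/sec:
print(backend_io_rate(1000, 0.75, 200, "RAID10"))  # 250 + 400 = 650.0
print(backend_io_rate(1000, 0.75, 200, "RAID5"))   # 250 + 800 = 1050.0
```

Note how the same 1,200 front-end I/Os per second translate into anywhere from 650 to well over 1,000 back-end I/Os depending on the RAID scheme.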
The key observation is that the back-end I/O rate matters and can’t easily be derived from the front-end I/O rate. Back-end peaks will likely occur at a totally different time from the front-end peaks, and they may not be much lower in terms of I/O counts. Actual workloads differ significantly between installations; Figures 2 and 3 show some examples of back-end vs. front-end I/Os.
How does this relate to balance and performance potential? Each back-end operation is done to a particular RAID parity group. If the active volumes are placed together on a single RAID parity group while other parity groups contain only inactive volumes, that busiest group may run out of steam before any of the others do. As soon as it reaches its maximum throughput, it starts responding slowly, and all work to the other volumes on that RAID group suffers, too. Likewise, an application that accesses one volume on an overloaded RAID group can encounter major performance issues even though most of the volumes it accesses are still fine. Even a single highly busy RAID group may therefore cause degraded application response times or longer batch periods. Ultimately, that may affect only a few batch jobs or, for example, it could cause a bank’s Automated Teller Machines (ATMs) to time out.
The overall throughput potential of a disk storage system therefore greatly depends on the balance you can achieve between the parity groups (see Figure 4). Both charts represent the same workload on the same hardware: the left-hand chart shows the current situation, while the right-hand chart shows the situation that would result if the volumes had been placed for the best balance possible. The balanced layout on the right peaks at 540 back-end I/Os instead of the 900 I/Os for the busiest RAID array on the left, which means the box could achieve 66 percent higher throughput if everything were balanced evenly.
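The arithmetic behind that 66 percent figure can be sketched as follows, under the optimistic assumption that the aggregate load could be spread perfectly evenly across arrays (the sample peak values are illustrative):

```python
# Hypothetical sketch of the headroom calculation behind Figure 4: if the
# busiest array peaks at 900 back-end I/Os while a perfectly balanced layout
# would peak at the all-array average of 540, the system could absorb
# 900/540 - 1 = ~66% more work before the first array saturates again.

def balance_headroom(array_peaks):
    """Fractional throughput gain available by balancing the arrays.

    Assumes the load could be spread evenly, so the balanced peak equals
    the average of the per-array rates at the same interval.
    """
    busiest = max(array_peaks)
    balanced = sum(array_peaks) / len(array_peaks)
    return busiest / balanced - 1.0

# Four arrays whose busiest peaks at 900 while the average is 540:
print(round(balance_headroom([900, 480, 420, 360]), 2))  # 0.67
```

In practice the gain will be somewhat lower, since peaks on different arrays rarely line up perfectly, but the calculation shows why one hot array caps the whole box.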
A heat map is a useful tool for viewing the workload at the parity group level (see Figure 5). You will need a software package that determines the activity for each parity group over a prolonged period; with this, you can plot the activity over time. In a heat map, a hotter color (orange to red) indicates an overloaded parity group during a particular interval.
Cache
It may not be intuitively clear how imbalance affects cache usage, so let’s consider how.
Storage systems are equipped with large amounts of cache memory to achieve a high number of read hits. However, cache isn’t used just for reads. Writes also go to cache, and they even take priority over reads. If writes can’t be de-staged quickly enough, they tend to fill up the cache, causing a lower read hit ratio than the configured cache size would suggest. So, despite large cache sizes, the cache memory available for reads can be significantly reduced when bottlenecks in the storage configuration delay the de-staging of writes from cache to disk. Ultimately, a fast write (FW) bypass condition may occur, where a write operation is forced to wait for de-staging before it’s acknowledged as complete to the host.
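A small simulation shows the mechanism; all rates and sizes below are illustrative, not taken from any real system:

```python
# Hypothetical sketch: write-pending data accumulates in cache whenever the
# front end writes faster than the back end can de-stage. Once the pending
# data reaches the cache limit, new writes must wait for de-staging
# (the FW bypass situation described above). All figures are illustrative.

def write_pending_over_time(write_mb_s, destage_mb_s, cache_limit_mb, seconds):
    """Return the MB of write-pending data in cache at each second."""
    pending, history = 0.0, []
    for _ in range(seconds):
        pending += write_mb_s - destage_mb_s            # net inflow per second
        pending = min(max(pending, 0.0), cache_limit_mb)  # clamp to cache size
        history.append(pending)
    return history

# 120 MB/s of incoming writes against a back end that de-stages only 100 MB/s:
h = write_pending_over_time(120, 100, 1000, 60)
print(h[-1])  # 1000.0 -> the cache fills with writes after about 50 seconds
```

The absolute rates don’t matter much; as long as the net inflow is positive, even a modest shortfall in de-staging capacity eventually fills any cache.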
Since a FICON channel can send random write data much faster than a spinning disk can accept it, it’s quite possible to create a workload that fills the cache with writes. In practice, these problems are most likely to occur in combination with FlashCopy or ShadowImage technologies, which require additional back-end operations for each new write.
While you may interpret decreasing read hit ratios and increasing FW bypass rates as a sign that there’s no longer enough cache, the real cause is that one or more back-end arrays can’t handle the de-staging load. Usually, only a small number of arrays are in trouble, so the easiest, cheapest, and most effective solution is simply to make sure random write activity is well-spread across arrays.
For replicated environments, you must take the back-end of the secondary storage system into account, too. Any write done on the primary system must also be done on the secondary. The secondary system therefore needs to de-stage the requests in time to prevent its cache from filling up with writes. If the secondary can’t keep up, new writes from the primary will be delayed and will start to fill the cache on the primary side as well. This is why you must be particularly careful when deciding whether to use a more economical disk type on the secondary system than on the primary.
Techniques to Optimize Throughput
There are several techniques to achieve a better balanced system with more throughput. Let’s review the major ones:
- Configuration of the storage system hardware: Spread logical volumes across more physical disks in one or more RAID parity groups. The larger the group, the more evenly the work tends to be spread. That’s why a RAID 10 configuration with eight disks in a parity group performs better than a RAID 1 configuration, why a 28D+4P layout balances better than 7D+P, and why storage pool striping works well.
- Design of the SMS configuration: Use a storage configuration with “horizontal storage pools” across both parity groups and Library Control Units (LCUs). This way, z/OS and Data Facility System Managed Storage (DFSMS) load balancing tends to spread work across all parity groups.
- DFSMS features: Use software striping for highly active data sets so the work is spread over multiple logical volumes in a storage group, and most likely over multiple physical disks. With just four stripes, you already have four times as many physical drives working on the I/Os, and the peaks will be much lower. Note that striping can be just as effective for randomly accessed data sets as for sequential ones.
- Tuning: Actively tune the configuration by moving volumes away from “hot” RAID parity groups. Most installations do this with a manual review process, but this is a difficult task because of the many factors that must be considered. Existing software can recommend which volume moves are the best ones if you want to achieve and maintain a balanced configuration.
- Smart layout: When moving to new hardware, it’s important to make the layout as balanced as possible. For instance, distribute all FICON links and remote copy links as evenly as possible over all the host adapter cards, and spread the volumes over the RAID parity groups in a way that optimizes the workload balance. Again, it’s a tedious, difficult task to do this manually, but software can be used to find the optimal mapping for volumes over RAID parity groups.
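To illustrate the tuning idea, here is a deliberately simplified greedy sketch of what such recommendation software might do. Real products weigh many more factors (capacity, copy relationships, device types, time-varying load); this only balances a single I/O-rate figure per volume, and all names are invented:

```python
# Hypothetical sketch: recommend volume moves away from "hot" parity groups.
# Greedy pass: repeatedly move the hottest volume from the busiest group to
# the least busy group, as long as the move lowers the busiest group's load.

def recommend_moves(groups, max_moves=50):
    """groups: {parity_group: {volume: io_rate}} -> list of (vol, src, dst)."""
    moves = []
    for _ in range(max_moves):
        loads = {g: sum(vols.values()) for g, vols in groups.items()}
        src = max(loads, key=loads.get)   # busiest parity group
        dst = min(loads, key=loads.get)   # least busy parity group
        if not groups[src]:
            break
        vol = max(groups[src], key=groups[src].get)  # hottest volume on src
        rate = groups[src][vol]
        if loads[dst] + rate >= loads[src]:
            break  # no further move improves the busiest group
        groups[dst][vol] = groups[src].pop(vol)
        moves.append((vol, src, dst))
    return moves

groups = {"PG01": {"VOLA": 600, "VOLB": 300, "VOLC": 100},
          "PG02": {"VOLD": 100}}
print(recommend_moves(groups))  # [('VOLA', 'PG01', 'PG02')]
```

Even this crude pass cuts the busiest group's load from 1,000 to 400 I/Os in the example; the point is that the search is mechanical and well-suited to software rather than manual review.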
Using a combination of these techniques, you will be able to create a well-balanced system and get more throughput and performance from your system without much effort. You may even be able to use higher-density disks or move from RAID 10 to RAID 5 without a performance penalty.
The way a storage configuration is balanced greatly influences its throughput and responsiveness. If there’s an imbalance between the components, delays can occur even though the hardware itself would be capable of handling the workloads. Using smart storage performance management techniques to achieve a well-balanced system can yield impressive results in both throughput and response times. With the right balancing efforts and software tools, storage hardware purchases may be postponed, saving a lot of money. If you manage storage performance wisely, it will directly translate into increased user satisfaction and lower hardware costs.