Knowing the basics of processor architecture is useful for understanding IBM z/OS HiperDispatch Management; it helps you see what is being improved, why it is important, and which configurations are less likely to benefit from z/OS HiperDispatch Management.
The Hardware Overview
Each Processing Unit (PU) in a configuration consists of several functional areas on the chip, each performing a specific part of instruction execution. Instructions must be fetched from central storage and decoded, their operands retrieved, the operations performed, and the results stored. The IBM z10 is a superscalar processor, which means it manages more than one instruction pipeline at a time, so multiple instructions are processed simultaneously. For a more detailed explanation, refer to the IBM System z10 Enterprise Class Technical Guide (SG24-7516) and “IBM System z10 Performance Improvements with Software and Hardware Synergy,” K.M. Jackson, et al., IBM Journal of Research and Development, Volume 53, Number 1, July 2008.
This approach allows multiple instructions to be at various stages of preparation for execution; it minimizes the delays that would occur if each step were performed only when the need for it was detected. Since a central storage access takes about 600 machine cycles, if we assume one machine cycle per instruction, we’d need a queue of more than 600 entries just to ensure we didn’t delay instruction execution by waiting for storage fetches to complete.
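The arithmetic behind that queue size can be sketched in a few lines of Python. This is only an illustration of the simplified model used here (a fixed fetch latency and a fixed number of cycles per instruction); the function name and parameters are hypothetical and don't describe how the z10 actually sizes its queues.

```python
def queue_entries_to_cover(latency_cycles: int, cycles_per_instruction: int = 1) -> int:
    """Number of instructions that execute while one fetch is outstanding.

    Under the simplified model, the queue must hold at least this many
    prepared instructions to keep the pipeline busy during the fetch.
    """
    if latency_cycles < 0 or cycles_per_instruction <= 0:
        raise ValueError("latency must be >= 0 and cycles per instruction > 0")
    # Ceiling division: a partially elapsed instruction slot still needs an entry.
    return -(-latency_cycles // cycles_per_instruction)


# Figure taken from the text: roughly 600 cycles for a central storage access.
central_storage_entries = queue_entries_to_cover(600)
print(central_storage_entries)
```

With one cycle per instruction, the result simply equals the latency, which is why a 600-cycle access implies a queue on the order of 600 entries.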
Higher-speed memory was therefore essential, which prompted the use of Level 1 (L1), Level 1.5 (L1.5), and Level 2 (L2) caches. These caches hold the instructions and data retrieved from central storage so that subsequent references occur at the higher speeds. Each level of cache improves access for the components it must communicate with (see Figure 1).
So, for example, if an L1 cache retrieval takes one cycle and an L2 cache retrieval takes 10 cycles, we can minimize the size of the instruction queue needed. Data that isn’t located in the L1 cache can ideally be retrieved from the L2 cache at a cost of 10 cycles. If the instruction queue holds at least that many entries, we can be confident the instruction pipeline won’t stall as we wait for instruction preparation to complete. This is a radically simplified view of what occurs, but it illustrates the issues that must be addressed. Many elements of processor design are aimed at maintaining high cache hit ratios, but those considerations are outside the scope of this article.
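To see why hit ratios matter so much, here is a minimal sketch that computes the average cost of a storage reference from the cycle counts used above (1 for L1, 10 for L2, and about 600 for central storage). The hit ratios are made-up inputs for illustration, the L1.5 cache is left out for simplicity, and the function name is hypothetical.

```python
def average_access_cycles(l1_hit_ratio: float, l2_hit_ratio: float,
                          l1_cycles: float = 1, l2_cycles: float = 10,
                          storage_cycles: float = 600) -> float:
    """Expected cycles per reference in a simplified two-level cache model.

    l1_hit_ratio: fraction of all references satisfied by L1.
    l2_hit_ratio: fraction of L1 misses satisfied by L2.
    Anything that misses both levels goes to central storage.
    """
    l1_miss = 1.0 - l1_hit_ratio
    l2_miss = 1.0 - l2_hit_ratio
    return (l1_hit_ratio * l1_cycles
            + l1_miss * l2_hit_ratio * l2_cycles
            + l1_miss * l2_miss * storage_cycles)


# Hypothetical hit ratios, chosen only to show the sensitivity.
for l1, l2 in [(0.99, 0.90), (0.95, 0.90), (0.90, 0.90)]:
    print(f"L1 {l1:.0%}, L2 {l2:.0%}: {average_access_cycles(l1, l2):.2f} cycles")
```

Even in this crude model, the small fraction of references that reach central storage dominates the average, because each one costs hundreds of cycles.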
System throughput depends on keeping referenced data in the highest-speed caches, which minimizes delays.
Cache Coherency Problem
The caches aren’t arbitrarily large, so the reference pattern of the data determines what stays in the cache and what gets discarded. When a unit of work gives up control and is later re-dispatched, the greatest benefit is obtained when the instructions and data it previously referenced remain in the L1 cache. This avoids the need to retrieve anything again, so performance is maximized.
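The effect of re-dispatching can be illustrated with a toy simulation. The sketch below assumes a tiny, fully associative cache with least-recently-used replacement, which is a simplification chosen for illustration and not a description of the z10 cache design; the class, the cache line names, and the working sets are all hypothetical.

```python
from collections import OrderedDict


class LruCache:
    """Toy fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()  # insertion order tracks recency of use

    def access(self, line: str) -> bool:
        """Return True on a hit; on a miss, load the line, evicting the LRU line if full."""
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # discard least recently used line
        self.lines[line] = None
        return False


def count_hits(cache: LruCache, working_set: list) -> int:
    """Reference every line in the working set once and count cache hits."""
    return sum(cache.access(line) for line in working_set)


work_a = ["A0", "A1", "A2"]        # lines touched by unit of work A
work_b = ["B0", "B1", "B2", "B3"]  # lines touched by unit of work B

# Case 1: A is re-dispatched on a cache nothing else has used in between.
cache = LruCache(capacity=4)
count_hits(cache, work_a)                  # first dispatch loads A's lines
hits_undisturbed = count_hits(cache, work_a)

# Case 2: B runs on the same cache before A is re-dispatched.
cache = LruCache(capacity=4)
count_hits(cache, work_a)
count_hits(cache, work_b)                  # B's lines push A's lines out
hits_after_other_work = count_hits(cache, work_a)

print(hits_undisturbed, hits_after_other_work)
```

In the first case every reference on re-dispatch is a hit; in the second, the intervening work has displaced all of A's lines, so A must retrieve everything again.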