In the past, when looking at Linux on the mainframe, we considered Linux on S/390 or zSeries as simply another architecture. The S/390 platform is a 32-bit big-endian processor, and therefore Linux on zSeries is conceptually no different from Linux on Alpha, SPARC, or PowerPC. The only areas in which there are significant differences are within the kernel and device drivers, and very few users (or even system administrators) need to be concerned with architectural features.
If you’re running z/Linux in Basic mode or in an LPAR, this is very close to the whole story. However, if you’re running Linux under VM’s control (as most sites running z/Linux are), then treating Linux as just another architecture means you’re not making full use of the platform’s capabilities. This article will examine how you can leverage z/VM to get the maximum utility out of your Linux guests. The tips that I’ll present are useful for penguin colonies of roughly 10 machines or more; for sites with fewer than 10 machines, there is no real incentive to tune the guests for optimum use of shared resources, because the potential for sharing is small.
Virtual and Real Resources
The first thing to remember is that VM knows about the actual physical resources available to the system, and an individual Linux guest does not. A guest has no idea that it is potentially sharing a finite set of resources with many other guests. This leads us to one of our basic guiding principles: Whenever possible, we should let VM do the heavy lifting, particularly in areas such as moving processes or guests in and out of storage. Linux evolved in a CPU-rich, I/O-poor environment, which is exactly the opposite of most zSeries environments.
With that in mind, let’s examine how Linux/390 behaves when it is one of many virtual machines, some of which may also be running Linux. The constrained resources on the box are almost certainly CPU cycles and total physical memory. Anything we can do to reduce the consumption of those resources by a set of Linux guests will be a victory. Another area where we can realize great economies of scale is with file systems: although Linux guests typically have large file systems, only a small part of each file system actually has to be writeable. Since most guests are pretty much the same, sharing a single read-only file system between them will vastly decrease our overall DASD usage.
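The DASD argument can be made concrete with a little arithmetic. The sketch below compares giving every guest a full private file system against one shared read-only copy plus a small private writeable area per guest. The guest count and sizes are hypothetical figures chosen purely for illustration, and the function names are not taken from any tool.

```python
def dasd_private(guests: int, fs_mb: int) -> int:
    """Total DASD (MB) when every guest holds a full private copy."""
    return guests * fs_mb


def dasd_shared(guests: int, fs_mb: int, writable_mb: int) -> int:
    """Total DASD (MB) with one shared read-only copy of the common part
    plus a private writeable area for each guest."""
    read_only_mb = fs_mb - writable_mb
    return read_only_mb + guests * writable_mb


# Hypothetical figures (assumptions, not measurements).
GUESTS = 50
FS_MB = 2000
WRITABLE_MB = 200

private_total = dasd_private(GUESTS, FS_MB)
shared_total = dasd_shared(GUESTS, FS_MB, WRITABLE_MB)

print(f"private copies: {private_total:,} MB")
print(f"shared layout:  {shared_total:,} MB")
print(f"saved:          {private_total - shared_total:,} MB")
```

The saving grows with the number of guests and with the fraction of the file system that can be kept read-only, which is why the technique pays off mainly for larger colonies.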
Removing the Timer Pop
The standard Linux kernel generates a timer interrupt at 100Hz; this is simply the kernel waking up and incrementing a counter. This is fine in a situation where Linux owns all the hardware, because it is such a tiny amount of activity that it doesn’t matter in terms of overall workload. It is a much larger issue when you have many Linux guests running under z/VM. First, a tiny workload multiplied by a large number of guests can turn into a substantial workload. For Test Plan Charlie, in which 41,400 Linux guests were run on a single S/390, Sine Nomine president David Boyes worked around this by setting HZ=10 in the kernel; this clobbered interactive performance, but the reduction in timer interrupts cut system activity by an order of magnitude. Second, since each guest is doing something every 100th of a second, none of the guests ever gets moved out of the active queue under VM. Therefore, the Linux guests effectively conspire to starve other guests at the same priority on the system.
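To see how quickly the tiny per-guest cost adds up, here is a rough back-of-the-envelope calculation. It assumes exactly one timer interrupt per guest per tick and ignores every other source of interrupts, so treat the results as an illustration of scale rather than a measurement.

```python
def timer_interrupts_per_second(guests: int, hz: int) -> int:
    """Aggregate timer interrupts per second across all guests,
    assuming one interrupt per guest per tick."""
    return guests * hz


GUESTS = 41_400  # guest count cited for Test Plan Charlie

default_rate = timer_interrupts_per_second(GUESTS, 100)  # standard HZ=100
reduced_rate = timer_interrupts_per_second(GUESTS, 10)   # the HZ=10 workaround

print(f"HZ=100: {default_rate:,} timer interrupts per second")
print(f"HZ=10:  {reduced_rate:,} timer interrupts per second")
print(f"reduction factor: {default_rate // reduced_rate}x")
```

Under these assumptions, lowering HZ from 100 to 10 cuts the aggregate interrupt rate by a factor of 10, matching the order-of-magnitude reduction described above.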
However, since early in the 2.4 kernel series, IBM has supplied a patch that will turn off the timer; a guest will fake its timer tick information based on the system’s ToD clock, and guests do not interrupt the VM Control Program (CP) every 100th of a second to announce themselves. Several distributions include this patch as an option, and if you are building your own kernel (as you will have to do to exploit Named Save Segment [NSS] support, described in the following section), you certainly should set this option. In the kernel .config file, this consists of setting:
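A minimal sketch of the relevant fragment follows. It assumes the symbol name CONFIG_NO_IDLE_HZ, which is my recollection of the name used by the s390 on-demand timer support and should be verified against the configuration help in your own kernel tree. The companion option that controls whether the idle HZ timer is on by default is deliberately not shown, because its symbol name and sense may differ between kernel versions.

```
# Assumed symbol name; confirm it in your kernel tree before relying on it.
# Corresponds to the "No HZ timer tick in idle" menu entry.
CONFIG_NO_IDLE_HZ=y
```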
If you use “make menuconfig” or “make xconfig” to configure your kernel, make sure that “No HZ timer tick in idle” is selected and that “Idle HZ timer on by default” is deselected in the “General Setup” section of the kernel parameters.