Linux machines tend to fill all the memory they’re given. Memory not actually in use by the operating system or applications is used to cache pending data writes (“lazy writes”) and previously read data pages. As memory demand grows, pending writes are gradually flushed and the oldest cached read pages are discarded, and the freed memory is used to satisfy operating system or application requests. Eventually, if all the memory is in use, Linux will swap. On a dedicated Intel platform, it’s always good for a Linux machine to have more memory:
• It can cache more data in normal times, avoiding I/O.
• It can run under higher load (flushing cache and probably doing a bit more I/O to reread data pages) before it needs to do slow, expensive swap I/O.
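To see this behavior on a running guest, look at /proc/meminfo. The following is a minimal sketch in Python (field names are those reported by current kernels) that summarizes how much memory is truly free versus held as read cache, buffers, or pending writes:

    #!/usr/bin/env python3
    # Summarize a Linux guest's memory: how much is truly free versus
    # held as read cache, buffers, or dirty (pending-write) data.
    def meminfo():
        values = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, rest = line.split(":", 1)
                values[key] = int(rest.split()[0])  # kernel reports KiB
        return values

    m = meminfo()
    for field in ("MemTotal", "MemFree", "Buffers", "Cached", "Dirty"):
        pct = 100.0 * m[field] / m["MemTotal"]
        print(f"{field:9} {m[field]:>12} KiB  {pct:5.1f}%")

On a guest that’s been running for a while, MemFree is typically small and Cached large; that’s the caching behavior just described, not a memory leak.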
In a shared virtual environment, Linux memory usage for cache leads to more z/VM paging, since the hypervisor must move one guest’s pages out of real memory to let another guest run. While the z/VM paging subsystem is fast (tens of thousands of pages per second is perfectly achievable), this paging carries overhead, and with Linux virtual memory sizes frequently reaching multiple gigabytes, it’s easy to wind up thrashing (spending more time paging than doing useful work).
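Back-of-the-envelope arithmetic shows why large guests make this painful. The rate and sizes in the sketch below are illustrative assumptions, not measurements:

    # Time for z/VM to move a guest's resident pages once, at an
    # assumed (illustrative) paging rate.
    PAGE_SIZE = 4096        # bytes per page
    RATE = 50_000           # pages/second: "tens of thousands"
    for guest_gb in (0.5, 2.0, 4.0):
        pages = guest_gb * 2**30 / PAGE_SIZE
        print(f"{guest_gb:3.1f} GB guest: {pages / RATE:5.1f} s to move once")

At roughly 20 seconds just to move a single 4 GB guest once at that rate, a handful of such guests competing for real memory quickly crosses into thrashing territory.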
The obvious question is, “Why would it ever be better to swap than to page?” If it were a simple trade-off of one for the other, the answer would be, “It wouldn’t.” The key to understanding the issue is the previous discussion of how Linux uses memory for cache.
Like Linux, z/VM takes the oldest (least recently referenced) pages it can find when it needs to steal a page from one user to satisfy a requirement for another. Since Linux cache pages are, by definition, among the least recently referenced pages in the guest, a page Linux takes from cache is almost guaranteed to have already been paged out by z/VM. The Linux attempt to avoid I/O by caching data thus often saves nothing: z/VM must drive an I/O to page the data back in. In addition, between z/VM minidisk cache, controller cache, and Redundant Array of Inexpensive Disks (RAID) drive cache, data that’s truly frequently reused is likely to be cached somewhere anyway, so no real read from a spinning platter is required even without Linux read cache.
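A toy model makes the interaction concrete. The sketch below is illustrative only (real replacement policies are more sophisticated than strict LRU, and z/VM’s approximate view of guest references is modeled here as noise): the guest’s cache pages are its oldest pages, and the hypervisor steals by age as well.

    import random

    # Toy model: 1,000 guest pages with random last-reference times.
    N = 1000
    last_ref = [random.random() for _ in range(N)]   # higher = more recent
    by_age = sorted(range(N), key=lambda p: last_ref[p])

    # Linux holds its oldest 40% of pages as reclaimable cache.
    linux_cache = set(by_age[: N * 40 // 100])

    # z/VM also steals old pages, but sees reference history only
    # approximately (modeled as the true age plus noise).
    vm_view = sorted(range(N), key=lambda p: last_ref[p] + random.gauss(0, 0.05))
    paged_out = set(vm_view[: N * 40 // 100])

    overlap = len(linux_cache & paged_out) / len(linux_cache)
    print(f"{overlap:.0%} of the guest's cache pages are already on z/VM paging space")

Because both levels age out the same pages, nearly everything Linux holds as cache has already been written to z/VM paging space, so touching “cached” data still costs a paging I/O.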
If, however, Linux has a smaller amount of memory, it won’t cache as much data, so its total memory footprint (virtual memory actually touched plus swap space actually used) will usually be significantly smaller. Smaller guests cause less paging: Linux will run out of available memory sooner and be forced to swap, but as z/VM switches between Linux guests, it has much less paging to do, improving response time for all guests. So the total isn’t the sum of the parts; because a smaller virtual memory size leaves little room for Linux caching, a small guest’s footprint is “lighter weight” than a large guest’s.
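Illustrative numbers (assumptions, not measurements) make the footprint comparison concrete:

    # Footprint = virtual memory actually touched + swap actually used.
    # A large guest fills its address space with cache; a small guest
    # caches little and swaps a little. Figures are assumptions.
    guests = {
        "large": {"virtual_mb": 2048, "swap_used_mb": 0},
        "small": {"virtual_mb": 512, "swap_used_mb": 128},
    }
    for name, g in guests.items():
        print(f"{name} guest footprint: {g['virtual_mb'] + g['swap_used_mb']} MB")

Even counting the swap space it actually uses, the small guest presents less than a third of the large guest’s paging load.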
So smaller virtual memory sizes improve overall response time at the expense of some swapping in the guests. Swapping to z/VM minidisk is horrifically slow compared to memory access, even memory access with paging. However, Linux can use Virtual Disk (VDISK) in memory for its swap space. VDISKs are z/VM software constructs that appear as disks to guests but are backed by pages managed in the z/VM paging subsystem. When z/Linux swaps to VDISK, the swap I/O therefore translates into z/VM paging I/O. This is more expensive for the guest than simply referencing memory, but for every swap I/O that occurs, dozens or hundreds of paging I/Os are saved, so it’s a “win” overall.
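Setting this up takes one directory statement and a few Linux commands. The sketch below is illustrative: the virtual device number, size, and Linux device names are placeholders, and details vary by distribution. Because a VDISK comes up empty at each logon, mkswap must run at every boot, typically from a startup script:

    * In the z/VM user directory: a 256 MB VDISK at virtual address
    * 0111 (FB-512 blocks are 512 bytes, so 524288 blocks):
    MDISK 0111 FB-512 V-DISK 524288 MR

    # On the Linux side, at each boot: bring the device online,
    # initialize it as swap, and enable it ahead of any disk swap.
    chccwdev -e 0.0.0111
    mkswap /dev/dasdb1
    swapon -p 10 /dev/dasdb1

The higher swap priority (-p 10) makes Linux use the VDISK before any slower minidisk swap device defined as a backstop.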
The result with smaller Linux virtual memory sizes and swap on VDISK is that instead of constantly paging large guests in and out, z/VM pages smaller guests in and out less frequently. Those guests’ performance is affected little by the additional I/O they must perform for pages they might otherwise have cached, and that cost is more than offset by the overall savings in paging.
Measured results support the benefits of this approach. Most z/Linux guests can swap to VDISK with little impact, and, when real system memory is overcommitted, with far better performance than larger guests achieve. Exceptions exist: one is the Oracle System Global Area (SGA), which must fit in Linux memory for reasonable performance.