
Member State Tracking and Management

Several of the preceding examples refer to exchanging information among members to coordinate access rights to system resources. To accomplish this reliably, the system must ensure that all running members are included in the conversation, and must therefore serialize such interaction with members joining or leaving. The SSI Manager component of the z/VM Control Program tracks member states and orchestrates these transitions. Since some of the problematic cases involve loss of communication among the members, message passing alone won’t suffice. The SSI Manager also uses a reserved area on a shared disk, called the Persistent Data Record (PDR), to track member states and attributes and to maintain “heartbeats” of running members. Because members use atomic channel programs to lock, read, and update the PDR, it must reside on an Extended Count Key Data (ECKD) rather than a SCSI device.
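As a rough illustration of the role the PDR plays, the sketch below models it as a table of per-member slots guarded by a lock. The class and field names are invented for this example, and a mutex stands in for the atomic channel programs that serialize access to the real on-disk record.

import threading
import time
from dataclasses import dataclass

@dataclass
class MemberSlot:
    """One member's slot in the PDR: its state plus a heartbeat timestamp."""
    state: str = "down"            # e.g. down, joining, joined, leaving
    last_heartbeat: float = 0.0    # time of the member's most recent heartbeat write

class PersistentDataRecord:
    """Toy stand-in for the on-disk PDR.

    The real PDR lives on a shared ECKD volume and is locked, read, and
    updated with atomic channel programs; here a mutex plays that role.
    """
    def __init__(self, member_names):
        self._lock = threading.Lock()
        self._slots = {name: MemberSlot() for name in member_names}

    def update_state(self, member, new_state):
        # Lock-read-modify-write, analogous to the channel-program sequence.
        with self._lock:
            self._slots[member].state = new_state

    def heartbeat(self, member):
        with self._lock:
            self._slots[member].last_heartbeat = time.time()

    def stale_members(self, timeout):
        """Joined members whose heartbeat is older than `timeout` seconds."""
        now = time.time()
        with self._lock:
            return [m for m, slot in self._slots.items()
                    if slot.state == "joined" and now - slot.last_heartbeat > timeout]

# Example: one member marks itself joined and writes a heartbeat.
pdr = PersistentDataRecord(["VMSYS01", "VMSYS02"])
pdr.update_state("VMSYS01", "joined")
pdr.heartbeat("VMSYS01")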

At Initial Program Load (IPL) of an SSI member system, the member marks its state as “joining” in the PDR. (If another member is already joining or leaving, this member will wait until that operation completes. This allows members to deal with such changes one at a time.) The joining member establishes communication with each joined member and signals its intent to join. All members then mark the cluster mode as “in-flux.” Virtual servers continue to run as usual, but requests to allocate new resources (e.g., to log on a virtual server, link to a minidisk, or obtain a virtual MAC address) are held pending during this period.
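The join serialization and the in-flux handling of new requests might be sketched as follows; the state names, mode names, and functions here only illustrate the behavior described above and are not z/VM's actual implementation.

from enum import Enum

class MemberState(Enum):
    DOWN = "down"
    JOINING = "joining"
    JOINED = "joined"
    LEAVING = "leaving"

class ClusterMode(Enum):
    STABLE = "stable"
    IN_FLUX = "in-flux"

pending_requests = []   # requests held while the cluster is in flux

def begin_join(pdr_states, member, cluster):
    """Start the join for `member`, serializing with other joins and leaves.

    `pdr_states` maps member name -> MemberState, standing in for the PDR.
    """
    # Only one member may be joining or leaving at a time; others wait.
    if any(state in (MemberState.JOINING, MemberState.LEAVING)
           for name, state in pdr_states.items() if name != member):
        return False                       # caller retries once the other change completes

    pdr_states[member] = MemberState.JOINING
    cluster["mode"] = ClusterMode.IN_FLUX  # new resource allocations are now held
    return True

def request_resource(cluster, request):
    """LOGON, LINK, MAC assignment, etc.: held while in flux, granted when stable."""
    if cluster["mode"] is ClusterMode.IN_FLUX:
        pending_requests.append(request)
        return "pending"
    return "granted"

# Example: member VMSYS02 IPLs and starts joining a one-member cluster.
cluster = {"mode": ClusterMode.STABLE}
pdr_states = {"VMSYS01": MemberState.JOINED, "VMSYS02": MemberState.DOWN}
begin_join(pdr_states, "VMSYS02", cluster)     # True: no other join/leave in progress
request_resource(cluster, "LOGON LINUX09")     # "pending" until the cluster is stable again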

Next, the various SSI-enabled services within z/VM exchange messages to verify configuration consistency and reach a common view of resource ownership and states. For example, members exchange lists of logged-on guests and of virtual MAC address allocations to prevent duplicates. They also send lists of minidisks to which they hold write links, so recipients can disable minidisk cache as needed. The instances of the SSI Manager component coordinate this joining process and allow it to complete only after all the SSI services have signaled successful state synchronization. At that point, the new member enters “joined” state and the cluster returns to “stable” mode. Now that all members agree on the membership list and on the baseline resource state, any pending resource allocation requests (logons, links, etc.) can proceed.
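Conceptually, these consistency checks amount to merging each member's resource lists into one cluster-wide view and refusing any allocation that would create a duplicate, roughly as in this sketch (the data structures and the sample guest and MAC values are hypothetical):

def merge_cluster_view(local, remote_views):
    """Combine local and remote resource lists into one cluster-wide view.

    Each view is a dict with 'guests', 'macs', and 'write_links' sets,
    standing in for the lists the SSI services exchange while joining.
    """
    cluster = {key: set(local[key]) for key in ("guests", "macs", "write_links")}
    for view in remote_views:
        for key in cluster:
            cluster[key] |= set(view[key])
    return cluster

def can_log_on(cluster_view, user_id):
    # A guest already logged on somewhere in the cluster must not be duplicated.
    return user_id not in cluster_view["guests"]

def can_assign_mac(cluster_view, mac):
    # Likewise a virtual MAC address may be allocated only once cluster-wide.
    return mac not in cluster_view["macs"]

# Example views from two members (invented guest names and MAC values).
local = {"guests": {"LINUX01"}, "macs": {"02-00-00-00-00-01"}, "write_links": {"MAINT.191"}}
remote = [{"guests": {"LINUX02"}, "macs": {"02-00-00-00-00-02"}, "write_links": set()}]
view = merge_cluster_view(local, remote)
print(can_log_on(view, "LINUX02"))   # False: already logged on on another member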

A similar process occurs during member shutdown. The member signals its intent to leave the cluster; its fellows acknowledge this and delete records of resource allocations to that member. While the member is in “leaving” state, the cluster is again in flux, and resource allocation is suspended until updates are complete and the member is marked “down.” If a member terminates abnormally (abends), it marks itself down in the PDR. Resource allocations on the remaining members may hang briefly, but shortly the other systems will recognize the down state in the PDR and recover. As a last resort, if the failed member stops suddenly and can’t even update the PDR, an operator command can be used on a surviving member to declare the victim down, so normal operations can resume.
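The recovery path on the surviving members can be pictured as watching the PDR and discarding the failed member's allocation records once it shows as down, with the operator command as a last resort. The sketch below is illustrative only, with invented names and a plain dictionary standing in for the PDR.

def detect_and_recover(pdr, allocations, member):
    """Recovery on a surviving member once `member` shows as down in the PDR.

    `allocations` maps member name -> resources recorded as held by it;
    deleting those records lets hung requests on the survivors complete.
    """
    if pdr.get(member) != "down":
        return False              # still joined as far as the PDR shows; keep waiting
    allocations.pop(member, None)
    return True

def operator_declare_down(pdr, member):
    # Last resort: the failed system stopped before it could update the PDR,
    # so an operator on a surviving member marks it down on its behalf.
    pdr[member] = "down"

# Example: VMSYS02 stops suddenly without updating the PDR itself.
pdr = {"VMSYS01": "joined", "VMSYS02": "joined"}
allocations = {"VMSYS02": {"LINUX05", "02-00-00-00-00-07"}}
detect_and_recover(pdr, allocations, "VMSYS02")   # False: PDR not yet updated
operator_declare_down(pdr, "VMSYS02")
detect_and_recover(pdr, allocations, "VMSYS02")   # True: records cleaned up, requests resume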

Figure 2 depicts the member state and cluster mode transitions previously described. It also shows some exception states reached if errors occur. If an attempt to join at IPL time fails, the member enters “isolated” state; it’s running, but has no access to shared resources. This allows correction of the configuration error that led to the isolation. If communication among the members is lost, for example, due to a hardware failure or to the sudden system stoppage as previously described, then each member enters “suspended” state and the cluster enters “safe” mode. Unlike in-flux mode, safe mode is a condition that won’t resolve itself. Human intervention is needed either to restore connectivity or to shut down the inaccessible member. While in safe mode, resource allocations aren’t merely delayed, but rejected. When communication is restored throughout the cluster, members go through a rejoining process to resynchronize state (e.g., to adjust for virtual machines that have logged off or disk write links that have been detached in the interim), and then the cluster returns to stable mode.
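A rough encoding of these states and modes, inferred from the description above rather than copied from Figure 2, might look like this:

# Member states and cluster modes as described above; the transition sets are
# approximations inferred from the text, not a copy of Figure 2.
MEMBER_TRANSITIONS = {
    "down":      {"joining"},
    "joining":   {"joined", "isolated"},   # a failed join leaves the member isolated
    "joined":    {"leaving", "suspended", "down"},
    "suspended": {"joining", "down"},      # rejoin when connectivity returns, or shut down
    "leaving":   {"down"},
    "isolated":  {"down"},                 # correct the configuration, then re-IPL
}

def allocation_disposition(mode):
    """How a new resource allocation request is treated in each cluster mode."""
    if mode == "stable":
        return "granted"
    if mode == "in-flux":
        return "held"        # delayed until the join or leave completes
    if mode == "safe":
        return "rejected"    # safe mode needs human intervention to clear
    raise ValueError(mode)

def valid_transition(current, new):
    return new in MEMBER_TRANSITIONS.get(current, set())

print(valid_transition("suspended", "joining"))   # True: rejoin after communication is restored
print(allocation_disposition("safe"))             # 'rejected'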

In all cases, the overarching design point is to preserve system integrity by ensuring that all members agree on who has access rights to each shared resource.

SSI System Administration Advantages

z/VM's SSI feature offers system administrators many opportunities to decrease costs and eliminate downtime for system maintenance. It allows up to four systems to be administered together, sharing resources and software repositories.

Disk Layout

To facilitate these administration savings, the way data are laid out on system volumes has changed dramatically for z/VM 6.2.0 (see Figure 3). The disks are now split between clusterwide (also called shared or common) and system-specific (also called member-specific or non-shared) volumes. The shared volumes hold common administration files, such as the SYSTEM CONFIG and USER DIRECT, and the release-specific disks containing all the z/VM 6.2.0 information. Member-specific volumes encompass the system residence (RES) pack, with the checkpoint areas, system nucleus (CPLOAD) file, and production CMS minidisks such as MAINT 190, as well as the paging, t-disk, and spool packs. The division of disks among the VMCOM1, 620RL, and RES packs applies even to non-SSI installs. This supports easy conversion of a non-SSI system to SSI and migration to future releases.
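The split can be summarized informally as follows; the minidisk details and the paging, t-disk, and spool labels are illustrative examples rather than an exhaustive inventory.

# Illustrative summary of the z/VM 6.2.0 volume split; contents shown are
# examples only, not a complete list of what resides on each pack.
VOLUME_LAYOUT = {
    "clusterwide (shared)": {
        "VMCOM1": ["SYSTEM CONFIG", "USER DIRECT", "other common administration files"],
        "620RL":  ["release-specific disks containing the z/VM 6.2.0 information"],
    },
    "member-specific (non-shared)": {
        "RES":   ["checkpoint areas", "CPLOAD nucleus file", "production CMS minidisks such as MAINT 190"],
        "PAGE":  ["paging space"],
        "TDSK":  ["temporary (t-disk) space"],
        "SPOOL": ["spool space"],
    },
}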

Installation

Maintenance savings begin at install time. The updated install process lets you specify an SSI cluster of one to four members. An SSI cluster must be installed to 3390 DASD, although non-SSI installs may still be made to SCSI. SCSI systems may be easily converted to SSI after install through a process documented in Chapter 28 of CP Planning and Administration. Naturally, installation of so many systems requires some extra planning. The install process requires that the member names, shared and non-shared DASD, and CTC addresses be specified for all systems. For ease of use, IBM recommends using the same device numbers to refer to the same devices on all members.
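A planning worksheet for a two-member cluster might capture this information along the following lines; the cluster name, member names, and device numbers below are invented for illustration only.

# Hypothetical planning worksheet for a two-member SSI install.
ssi_plan = {
    "cluster_name": "CLUSTERA",
    "members": ["VMSYS01", "VMSYS02"],
    "shared_dasd": {"VMCOM1": "CF10", "620RL": "CF11"},
    "member_dasd": {
        "VMSYS01": {"RES": "CF20", "PAGE": "CF21", "SPOOL": "CF22"},
        "VMSYS02": {"RES": "CF30", "PAGE": "CF31", "SPOOL": "CF32"},
    },
    "ctc_addresses": {
        # CTC devices connecting each pair of members.
        ("VMSYS01", "VMSYS02"): ["4000", "4001"],
    },
}

Because the same device numbers refer to the same devices on every member, as recommended, a single worksheet like this describes the view from any member of the cluster.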
