May 8 ’12

Flexible Horizontal Scalability With z/VM Single System Image

by Emily Hugenbruch, Damian Osisek in z/Journal

In late 2011, IBM's premier virtualization product for System z debuted a new feature, z/VM Single System Image (SSI), which dramatically improves the horizontal scalability of z/VM workloads. Available in z/VM Release 6.2, SSI opens the door to easier, more flexible horizontal growth of large-scale virtual server environments by clustering up to four z/VM systems—each capable of running hundreds of virtual servers—in a rich, robust, shared-resource environment. Coordination of access to shared devices and network connections between members, as well as a common repository of virtual server definitions, allows workload to be spread seamlessly across the cluster. Capabilities such as multiple-system installation, a shared software service repository, and single-point-of-control administration and automation reduce systems management cost and effort. Capping all these features is Live Guest Relocation (LGR), the ability to move running Linux virtual servers (guests) from one member to another without disruption, to redistribute workload and provide continuous operation through planned system outages.

SSI Configuration

An SSI cluster comprises up to four z/VM member systems, running on the same or different System z machines, interconnected through shared networks and through access to shared disk storage (see Figure 1). Two types of connectivity are employed. A common Ethernet LAN segment supports traffic among the virtual servers and to the outside world. Channel-To-Channel (CTC) adapters carry system control information among the z/VM member hypervisors, driven by a significantly enhanced Inter-System Facility for Communication (ISFC) component in z/VM. The improved ISFC can group up to 16 CTCs into a logical link, providing high-bandwidth, reliable transport of system data for functions such as coordination of member states, negotiation of resource access, and LGR. Most disk storage is shared throughout the cluster to ensure that both virtual server data and system metadata are accessible from all member systems. Optionally, a common Fibre Channel Protocol (FCP) Storage Area Network (SAN) allows access to Small Computer System Interface (SCSI) devices from all hosts and guests throughout the cluster.
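
To make this concrete, here is a rough sketch of how such a cluster might be described in the shared system configuration file. The cluster name, member names, volume label, and device numbers are invented for illustration, and the operands are approximate; the exact syntax is documented in CP Planning and Administration.

   /* Hypothetical SSI definition: cluster CLUSTERA with two members */
   /* and the shared volume holding its clusterwide metadata         */
   SSI CLUSTERA PDR_VOLUME VMCOM1,
       SLOT 1 MEMBER1,
       SLOT 2 MEMBER2

   /* ISFC logical link: each member activates the CTC devices that  */
   /* connect it to its partner (device numbers are made up)         */
   MEMBER1: ACTIVATE ISLINK 0910 0911
   MEMBER2: ACTIVATE ISLINK 0920 0921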

Safe, Controlled Resource Sharing

The ability to distribute workload as desired across the SSI cluster depends on uniform access to numerous system resources:

• A common system configuration (SYSTEM CONFIG) file defining the SSI membership, CTC connections, system volumes, virtual network switches, and other attributes for each member in the cluster. Most of these attributes are specified once to apply to all cluster members. The syntax allows specification of system-specific characteristics where needed; for example, to define each member's paging and spool volumes.
• A common user directory defining all the virtual machines, their CPU and memory configurations, privileges, and the real and virtual I/O devices to which they have access. Changes to this directory are propagated throughout the cluster by a directory manager product such as the IBM Directory Maintenance Facility (DirMaint).

By default, virtual servers (guests) defined in the directory can be instantiated (logged on or autologged) on any one of the member systems; they will automatically gain access to the resources defined for them in the directory. To prevent a duplicate presence in the network or conflicting access to a guest's resources, the instantiating member confers with other members to ensure the guest isn’t already running elsewhere. Management and service virtual machines that need to operate on each member of the cluster can be exempt from this restriction, as we will explain.

• Disk volumes defined for z/VM system use (CP-owned volumes) are tagged with the name of the SSI cluster and, if appropriate, the owning member. This ensures that configuration errors, such as naming the same paging volume for multiple systems, won’t result in corruption of one member's data by another.

Volumes containing virtual server data (full packs or minidisks) are generally accessible across the cluster. The minidisk “link mode” semantics that z/VM uses to govern concurrent read or read-write access to these virtual disks are now enforced throughout the cluster. In addition, z/VM's Minidisk Cache (MDC) function no longer needs to be disabled for shared volumes; rather, each member disables and re-enables MDC automatically on each minidisk as write access on another member is established and removed. For shared minidisks that are seldom updated, this allows all members to benefit from caching safely when the contents aren’t changing.

• Spool files created on any member are accessible on all other members. Each member owns a separate set of spool volumes on which it allocates files. The replication of spool file metadata among members allows a guest on one member to access all files in its queues, even files residing on volumes owned by another member.
• Virtual network switches (VSWITCHes) can be defined in a single place—the system configuration file—and these definitions (name, backing physical network device, and attributes such as network type and VLAN options) apply across all the members. Traffic is routed transparently among the members to give the appearance of a single VSWITCH across the cluster. This allows guests connected to the same VSWITCH to interact seamlessly regardless of the members on which they’re running or to which they’re relocated.

Media Access Control (MAC) addresses assigned to virtual network devices are managed across the cluster. The system configuration defines a separate range of addresses from which each member will allocate. Guests carry their addresses with them when they’re relocated. Since the member that assigned the address may have been re-IPLed and “forgotten” prior assignments, an address to be allocated is first broadcast to the remaining members to ensure there’s no conflict.
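
As a hedged illustration of how this is typically configured (member names and prefix values are invented), each member can be given its own locally administered MAC prefix in the shared system configuration file, so the addresses it hands out cannot overlap with another member's range:

   MEMBER1: VMLAN MACPREFIX 021111
   MEMBER2: VMLAN MACPREFIX 022222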

Member State Tracking and Management

Several of the preceding examples refer to exchanging information among members to coordinate access rights to system resources. To accomplish this reliably, the system must ensure that all running members are included in the conversation, and must therefore serialize such interaction with members joining or leaving. The SSI Manager component of the z/VM Control Program tracks member states and orchestrates these transitions. Since some of the problematic cases involve loss of communication among the members, message passing alone won’t suffice. The SSI Manager also uses a reserved area on a shared disk, called the Persistent Data Record (PDR), to track member states and attributes and to maintain “heartbeats” of running members. Because members use atomic channel programs to lock, read, and update the PDR, it must reside on an Extended Count Key Data (ECKD) rather than a SCSI device.

At Initial Program Load (IPL) of an SSI member system, the member marks its state as “joining” in the PDR. (If another member is already joining or leaving, this member will wait until that operation completes. This allows members to deal with such changes one at a time.) The joining member establishes communication with each joined member and signals its intent to join. All members then mark the cluster mode as “in-flux.” Virtual servers continue to run as usual, but requests to allocate new resources (e.g., to log on a virtual server, link to a minidisk, or obtain a virtual MAC address) are held pending during this period.

Next, the various SSI-enabled services within z/VM exchange messages to verify configuration consistency and reach a common view of resource ownership and states. For example, members exchange lists of logged-on guests and of virtual MAC address allocations to prevent duplication. They also send lists of minidisks to which they hold write links, so recipients can disable minidisk cache as needed. The instances of the SSI Manager component coordinate this joining process and allow it to complete only after all the SSI services have signaled successful state synchronization. At that point, the new member enters “joined” state and the cluster returns to “stable” mode. Now that all members agree on the membership list and on the baseline resource state, any pending resource allocation requests (logons, links, etc.) can proceed.

A similar process occurs during member shutdown. The member signals its intent to leave the cluster; its fellows acknowledge this and delete records of resource allocations to that member. While the member is in “leaving” state, the cluster is again in flux, and resource allocation is suspended until updates are complete and the member is marked “down.” If a member terminates
abnormally (abends), it marks itself down in the PDR. Resource allocations on the remaining members may hang briefly, but shortly the other systems will recognize the down state in the PDR and recover. As a last resort, if the failed member stops suddenly and can’t even update the PDR, an operator command can be used on a surviving member to declare the victim down, so normal operations can resume.

Figure 2 depicts the member state and cluster mode transitions previously described. It also shows some exception states reached if errors occur. If an attempt to join at IPL time fails, the member enters “isolated” state; it’s running, but has no access to shared resources. This allows correction of the configuration error that led to the isolation. If communication among the members is lost, for example, due to a hardware failure or to the sudden system stoppage as previously described, then each member enters “suspended” state and the cluster enters “safe” mode. Unlike in-flux mode, safe mode is a condition that won’t resolve itself. Human intervention is needed either to restore connectivity or to shut down the inaccessible member. While in safe mode, resource allocations aren’t merely delayed, but rejected. When communication is restored throughout the cluster, members go through a rejoining process to resynchronize state (e.g., to adjust for virtual machines that have logged off or disk write links that have been detached in the interim), and then the cluster returns to stable mode.
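
Operators can inspect these states directly with the QUERY SSI command, which reports the cluster name, the current cluster mode, and each member's state as recorded in the PDR. The output below is only a schematic of the kind of information returned, not the exact message format:

   query ssi
     SSI Name: CLUSTERA        (illustrative shape only)
     SSI Mode: Stable
     SLOT  SYSTEMID   STATE
       1   MEMBER1    Joined
       2   MEMBER2    Joined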

In all cases, the overarching design point is to preserve system integrity by ensuring that all members agree on who has access rights to each shared resource.

SSI System Administration Advantages

z/VM's SSI feature offers system administrators many opportunities to decrease costs and eliminate downtime for system maintenance. It allows up to four systems to be administered together, sharing resources and software repositories.

Disk Layout

To facilitate these administration savings, the way data are laid out on system volumes has changed dramatically for z/VM 6.2.0 (see Figure 3). The disks are now split between clusterwide (also called shared or common) and system-specific (also called member-specific or non-shared). On the shared volumes reside common files for administration, such as the SYSTEM CONFIG and USER DIRECT, and the release-specific disks containing all the z/VM 6.2.0 information. Member-specific volumes encompass the system residence (RES) pack—with the checkpoint areas, system nucleus (CPLOAD) file, and production CMS minidisks such as MAINT 190—as well as paging, t-disk, and spool packs. The division of disks between the VMCOM1, 620RL, and RES packs applies even to non-SSI installs. This supports easy conversion of a non-SSI system to SSI and migration to future releases.
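
As a hypothetical orientation for a two-member cluster (the labels follow the common default naming, but your volume names and counts will differ):

   Clusterwide volumes
     VMCOM1    SYSTEM CONFIG, USER DIRECT, PDR, and clusterwide parm/utility disks
     620RL1    z/VM 6.2.0 release and service disks shared by all members
   Member-specific volumes (one set per member)
     M01RES    member 1 system residence: checkpoint areas, CPLOAD, MAINT 190, CF1
     M01P01    member 1 paging
     M01S01    member 1 spool
     M02RES, M02P01, M02S01    the corresponding volumes for member 2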

Installation

Maintenance savings begin at install time. The updated install process lets you specify an SSI cluster of one to four members. An SSI cluster must be installed to 3390 DASD, although non-SSI installs may still be made to SCSI. SCSI systems may be easily converted to SSI after install through a process documented in Chapter 28 of CP Planning and Administration. Naturally, installation of so many systems requires some extra planning. The install process requires that the member names, shared and non-shared DASD, and CTC addresses be specified for all systems. For ease of use, IBM recommends using the same device numbers to refer to the same devices on all members.

Figures 4 and 5 show some of the new install panels. In Figure 4, the member names are specified, as is the choice of first- or second-level install. Second-level installs are particularly useful for current z/VM users who wish to try out SSI; they don’t require physical CTC connections or shared DASD. Specifying the LPAR name (or first-level userid) at install time is new for 6.2. The System_Identifier statement in the SYSTEM CONFIG was updated to allow the member name to be mapped to the LPAR name, eliminating the need to know model numbers and CPU IDs.


Next, the DASD for all four members is identified. Shared and non-shared DASD must be specified here. If different device numbers are used to refer to the common disks, this will be addressed in the next screen.

Figure 5 shows the next install screen for first-level installs. Up to two CTC devices between each member may be specified at install time; others may be added to the SYSTEM CONFIG later. With the second-level install option, virtual CTCs are used and install will create a file with the first-level userids' directory entries and PROFILE EXECs.

For users coming from existing z/VM systems, IBM recommends performing a non-SSI installation to upgrade to 6.2 and then converting to SSI. CP Planning and Administration offers several “use-case scenarios” in Chapters 28 to 33 that detail how to convert various types of systems to SSI clusters.

Service

Once systems are converted to SSI clusters, the savings continue. In an SSI cluster, a single set of minidisks holds the service repository for a release, so applying service is a snap. The Recommended Service Upgrades (RSUs) are applied to the shared 620 disks. Then PUT2PROD may be issued on each member at the administrator's convenience to place the new service level into production. As service and new releases come out, they will be backward-compatible. Every member in an SSI cluster could be running a different service level, but SSI communications and LGR will still function smoothly. This gives system administrators great flexibility in scheduling when the service is put into production on each member. They can wait for a convenient downtime, or use LGR to move vital servers before the maintenance is applied.
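
In practice, the flow looks something like the sketch below (the RSU envelope name is hypothetical). SERVICE is run once against the shared repository; PUT2PROD is then issued separately on each member whenever it suits that member's schedule:

   On one member, logged on to the service userid (for example, MAINT620):
       service all 6202rsu          apply the RSU to the shared 6.2.0 service disks
   Later, on each member in turn:
       put2prod                     place the new service level into production there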

System Configuration File and User Directory

As mentioned previously, some administration files are now clusterwide, such as SYSTEM CONFIG and USER DIRECT. These reside on the VMCOM1 shared DASD. The System_Identifier statement begins the SYSTEM CONFIG. From there, each statement is either member-specific or shared. Statements such as System_Residence have member qualifiers around them, while others, such as the PRODUCT statements and perhaps some RDEV and VSWITCH statements, may be common to all members. The CP_OWNED and USER_VOLUME_LIST statements are split between common and member-specific, depending on the type of DASD. Chapter 25 of CP Planning and Administration provides many additional examples of how these statements would look.
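
A trimmed, hypothetical excerpt shows the pattern; names and volume labels are invented and operands are abbreviated, so treat this as a sketch rather than working syntax:

   /* Shared statements: seen identically by every member            */
   System_Identifier LPAR LPARA  MEMBER1
   System_Identifier LPAR LPARB  MEMBER2
   User_Volume_List  USRVL1 USRVL2
   CP_Owned  Slot 1  VMCOM1

   /* Member-specific statements, wrapped in member qualifiers       */
   MEMBER1: BEGIN
      System_Residence ...         (residence and checkpoint operands omitted)
      CP_Owned  Slot 2  M01RES
      CP_Owned  Slot 3  M01P01
   MEMBER1: END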

The USER DIRECT has many updates for SSI, too. Guests are now divided into two categories: multi- and single-configuration. Single-configuration virtual machines may only be logged on to one member at any time and are identified by the USER keyword. Multiconfiguration virtual machines may be logged on to multiple members simultaneously. Their directory definitions are in two
sections. One part, under the IDENTITY keyword, contains statements common to the guest across the cluster, including the userid, password, and all privileges and authorizations. The other part, under the SUBCONFIG keyword, contains statements that only apply when the guest is logged on to a particular member, such as distinct read-write minidisks for each instance of the multiconfiguration virtual machine. These two parts are linked together via a BUILD statement (see Figure 6).
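
The shape of such an entry, in a deliberately simplified and hypothetical form (userid, password, and disk details are invented), is roughly:

   IDENTITY SRVADM SRVPASS 64M 128M G
      BUILD ON MEMBER1 USING SUBCONFIG SRVADM-1
      BUILD ON MEMBER2 USING SUBCONFIG SRVADM-2
      IPL CMS
      CONSOLE 0009 3215
      LINK MAINT 0190 0190 RR
   SUBCONFIG SRVADM-1
      MDISK 0191 3390 1000 020 MNTVL1 MR
   SUBCONFIG SRVADM-2
      MDISK 0191 3390 2000 020 MNTVL2 MR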

Many of the familiar guests in the IBM-supplied directory are now multiconfiguration virtual machines, such as MAINT, OPERATOR, and OPERATNS. These are for member-specific maintenance tasks. There are also some guests that are single-configuration virtual machines, such as PMAINT and MAINT620, which are for clusterwide maintenance. The old duties of MAINT, for example, have been split among MAINT, PMAINT, and MAINT620. PMAINT now owns the CF0 parm disk, where the SYSTEM CONFIG resides, as well as the 2CC disk that holds the USER DIRECT and a new disk, 551, that holds clusterwide utilities such as CPFMTXA and
DIRECTXA. MAINT620 owns all the Release 6.2.0 service disks. MAINT still has disks such as the 190 disk for CMS and the CF1 parm disk for CPLOAD so CMS and CP service can be put into production independently on each member.

Guest Administration and Automation

Administrators can have a view of cluster operations and monitor users on all four members at once. The new AT command allows privileged commands to be issued on any cluster member (e.g., AT MEMBERA CMD XAUTOLOG LINUX01), making it easy to write automation. Commands such as MESSAGE, WARNING, and SMSG as well as Inter-User Communication Vehicle (IUCV) connections can transparently reach a target single-configuration virtual machine anywhere in the cluster. These interfaces also support an AT member operand or equivalent to target a specific instance of a multiconfiguration virtual machine explicitly. Some commands also accept an AT ALL operand, allowing them to broadcast or collect information across the cluster. QUERY NAMES, in particular, can now show various views of the guests across the cluster. A simple QUERY NAMES command shows all guests on the member on which it’s issued and the single-configuration virtual machines clusterwide.

Figure 7 shows sample QUERY NAMES output. In contrast, QUERY NAMES AT ALL (Figure 8) shows all guests on all members, with each line prefaced by the member name.
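
A few hedged examples of the commands involved (member and guest names are invented) give the flavor:

   at memberb cmd xautolog linux07      start a guest on another member
   at memberb cmd query names           list the guests logged on there
   query names at all                   one view of every guest on every member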

Cross-system use of the Single Console Image Facility (SCIF) is also supported in an SSI cluster. Single-configuration virtual machines may use the SET SECUSER, SET OBSERVER, and SEND commands to communicate with other single-configuration virtual machines no matter where they are. Multiconfiguration virtual machines may only “see” other users on the same member through SCIF. There’s a new AT operand on the SEND command to let you send commands and console input to multiconfiguration virtual machines on other members. SCIF assignments also persist across relocations, so if a Linux server has OPERATOR as its observer, the Linux guest's console will automatically be transferred to the instance of OPERATOR on the destination member after relocation. This cross-system SCIF capability can be useful for automation and console
consolidation.

Guest Relocation

The infrastructure of comprehensive sharing that SSI provides makes moving workload around the cluster a snap. The industry uses the term “virtual server mobility” for this feature, and distinguishes two kinds: static and dynamic. These capabilities can be used to rebalance workload across the cluster, or to evacuate work from a member that must be shut down for service.

Static mobility means shutting down a guest on one hypervisor and bringing it up on another. In z/VM terms, this is simply a logoff and logon. (For Linux guests, the SIGNAL SHUTDOWN or FORCE command can be used rather than LOGOFF to trigger Linux to quiesce and checkpoint its filesystems.) Since both virtual machine definitions and the virtual machine's devices and data are accessible throughout the SSI cluster, moving the guest in this way is a trivial matter.
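
A minimal sketch of such a static move for a Linux guest called LINUX01, from MEMBERA to MEMBERB (names are invented and the SIGNAL SHUTDOWN operands are approximate):

   at membera cmd signal shutdown linux01 within 300    ask Linux to quiesce and log off
   (wait for the guest to log off)
   at memberb cmd xautolog linux01                      bring it up on the other member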

Dynamic mobility, or LGR, provides a greater technical challenge, but again, SSI is up to the task. The LGR support in z/VM 6.2 allows a running Linux guest to be moved from one SSI member to another. z/VM delivers relocation functionality with the same high level of quality expected from System z. CP does extensive checking to ensure that all the guest's resources are available on the target member, and these checks are repeated several times throughout the process to ensure no loss of function after relocation.
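
The relocations themselves are driven with the VMRELOCATE command. As a hedged sketch (guest and member names invented, operands approximate), the eligibility checks can be run ahead of time and the move issued when ready:

   vmrelocate test linux01 to memberb    check eligibility without moving anything
   vmrelocate move linux01 to memberb    perform the live relocation
   vmrelocate status                     display relocations in progress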

Additionally, z/VM has introduced the new concept of domains, a group of member systems among which a guest may relocate freely without fear of losing facilities or features. This capability lets SSI clusters stretch across different levels of hardware or software and even allows for non-disruptive upgrades of guests.

Further discussion of LGR and relocation domains will appear in a future issue of z/Journal.

Conclusion

For 40 years, customers have turned to mainframe virtualization to handle ever larger and more diverse workloads flexibly and efficiently. The z/VM SSI feature continues this tradition, bringing to bear a host of capabilities to spread work across multiple z/VM member systems. Building on
System z's "share everything" model, SSI allows uniform access to virtual server resources throughout the cluster, under controls that ensure the data integrity System z customers expect. Multi-system installation, a shared service repository, and tools for single point of control and automation can substantially reduce the administrator's burden in maintaining the cluster. Static and dynamic guest mobility let you move work at will to optimize performance or to preserve
application availability through planned system outages. With SSI, horizontal growth becomes a simpler, easier option for z/VM customers.