Feb 1 ’05

Key Management Techniques: Taming the Mainframe Storage Monster

by Editor in z/Journal

With all the press coverage about the growth of open systems, IT management may forget that mainframe-based DASD remains the backbone of most large enterprises because the majority of mission-critical data and applications still depend on the mainframe. Mainframe systems are an attractive, valuable resource because they’re:

- Relatively inexpensive in terms of storage costs

- Architecturally streamlined

- Effective in running proven software

- Stable and mature enough that best practices and policies are established

- Robust in processing strength

- Precise enough to control the rate of work from single host systems to huge Parallel Sysplex configurations.

Despite these factors, a crisis is brewing in the mainframe space. Growth slowed, but didn’t stop, during the recession. Head count was trimmed, but deadlines remained the same or were accelerated. Problems that hampered the open systems community happened on a smaller scale in the mainframe world.

Storage-related issues for mainframes are largely similar to those of open systems. Disk arrays are physically identical, with the differences buried deep in the arrays’ microcode. Storage Area Networks (SANs) are little more than the open systems version of the ESCON and FICON channels that have been prevalent in mainframe systems for years. Also, application changes to support industry initiatives, such as compliance demands, are putting more pressure on mainframe storage. For example, companies are required to keep more data for longer periods and still be able to recall that data on a moment’s notice. The mainframe is an appealing choice for this because data stored on the mainframe is less expensive and more secure than data stored on open systems platforms.

Mainframe storage administrators face several challenges:

- Obtaining or maintaining a thorough knowledge of their environment

- Managing performance and problems

- Ensuring space availability and recoverability

- Securing authorization.

Training is a significant issue. Most mainframe storage administrators receive on-the-job training, so learning the proper storage management philosophy is sometimes hit-or-miss. This article provides guidance to help storage administrators overcome the challenges of managing performance, availability, recoverability, authorization, and devices.

Storage management is unforgiving. The margin for error is smaller than ever, yet storage administrators often have only one chance to get things right. They usually don’t have the time or resources to fix things. For example, if a job abends because of a space problem, the storage administrator now has two problems:

- Determining the cause of the abend

- Correcting and rerunning the job.

Beyond providing root-cause analysis, the storage administrator must also implement a solution to ensure the problem doesn’t recur.

A thorough knowledge of what to do in certain circumstances is essential for the mainframe storage administrator. Automated processes can keep mainframe storage environments running smoothly. Intelligent automation (tools that learn from historical events) creates even more efficiencies by automating common, simple tasks and some complex ones. Armed with tools that detect, identify, diagnose, and resolve problems automatically, the storage administrator can focus on bigger issues. Storage Resource Management (SRM) tools free the time otherwise spent on mundane, time-consuming, or labor-intensive tasks so the storage administrator can spend it on critical issues.

Performance

Consider the mix of workloads that run on mainframes. Sure, there are monitoring tools, but where’s the problem? How do you pin down the issue? DASD is one of the few mechanical devices left in the data center, and it doesn’t take much for DASD issues to drastically affect system performance. Conversely, adding cache memory to avoid I/O wait can improve performance.

Unfortunately, solving storage-related performance issues is rarely as simple as adding cache to an array. The symptoms are usually non-specific workload slowdowns, so merely identifying the issue can be a struggle. Through reporting and automation capabilities, storage administrators can monitor the storage configuration for warning signs of impending problems from various sources:

- System Managed Storage (SMS)

- Hierarchical Storage Management (HSM)

- Storage devices

- Other resources.

Ideally, the storage administrator has a tool that consolidates disparate information into one easy-to-use application, eliminating the need to choose among the many available tools.

A storage administrator’s job frequently includes interactions with performance analysts. About 20 percent of a storage administrator’s time is spent on performance-related issues, and the vast majority of that time concerns transaction processing and databases. Tools can help with this process by providing the ability to drill down from user data sets, through emulated mainframe volumes, to vendor disk subsystems to diagnose and correct performance and availability issues. The storage administrator must have a clear view of what’s happening from a performance standpoint and be able to communicate with performance analysts who have access to baseline data.

Availability

Availability problems usually center on access to data and space availability. Intelligent SRM tools can direct allocations, regardless of whether the allocation uses SMS, to places where the data will be available, helping to ease both concerns.

In terms of access to data, a data set might be created on tape in a manual tape library, where a human must find and mount the tape (a real challenge when the tape is located offsite). At other sites, a robotic tape library might be used, but availability then depends on free cartridge slots and on how busy the robot or tape drives are.

Space availability is the space that can be used for new allocations in a storage group. HSM keeps storage groups at desired occupancy levels by migrating, deleting, or consolidating data sets based on their SMS attributes. Sometimes, storage groups can fill to the point that new allocations fail; sometimes, the space simply gets so fragmented that it’s impossible to allocate the space needed without exceeding allocation rules. To avoid these problems, storage administrators can rely on monitoring solutions that understand SMS and HSM activities and translate them into easily understood reports showing the effectiveness of their storage operations.
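The two failure modes described above (a storage group that is simply too full, and one whose free space is too fragmented to use) can be illustrated with a minimal Python sketch. It isn’t how SMS or HSM works internally; the `Volume` model, the function names, the track units, and the `max_extents` limit are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Volume:
    """One volume in a storage group (hypothetical model, sizes in tracks)."""
    name: str
    capacity: int
    free_extents: List[int]  # sizes of the contiguous free areas on the volume

    @property
    def free(self) -> int:
        return sum(self.free_extents)


def occupancy_pct(volumes: List[Volume]) -> float:
    """Percentage of the group's total capacity that is in use."""
    capacity = sum(v.capacity for v in volumes)
    free = sum(v.free for v in volumes)
    return 100.0 * (capacity - free) / capacity


def can_allocate(volumes: List[Volume], request: int, max_extents: int) -> bool:
    """True if some single volume can hold the request in max_extents pieces.

    The largest free extents are tried first because that needs the fewest pieces.
    """
    for v in volumes:
        largest = sorted(v.free_extents, reverse=True)[:max_extents]
        if sum(largest) >= request:
            return True
    return False


def check_group(volumes: List[Volume], high_threshold: float,
                request: int, max_extents: int) -> List[str]:
    """Return warning messages for the group (an empty list means no warnings)."""
    warnings = []
    pct = occupancy_pct(volumes)
    if pct >= high_threshold:
        warnings.append(f"occupancy {pct:.1f}% is at or above {high_threshold}%")
    if not can_allocate(volumes, request, max_extents):
        warnings.append(
            f"no volume can hold {request} tracks in {max_extents} extents or fewer"
        )
    return warnings


# Illustrative data: 350 tracks are free in total, but they are scattered.
group = [
    Volume("VOL001", 1000, [50, 40, 30, 20, 10]),
    Volume("VOL002", 1000, [100, 100]),
]
for message in check_group(group, high_threshold=85, request=250, max_extents=3):
    print(message)
```

In this made-up group, occupancy is below the threshold and 350 tracks are free, yet a 250-track request still fails the extent limit on every volume. That is the fragmentation case: the space exists but can’t be allocated within the rules.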

Space availability remains a major issue to resolve. Through monitoring, reporting, and automation capabilities, storage administrators can easily prevent runaway storage allocations and monitor pools and storage groups for exceptions.
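As a rough illustration of what catching a runaway allocation can mean, the Python sketch below flags any data set whose allocated space grows by more than a set amount between two consecutive samples. The sampling approach, the `find_runaways` name, the data set names, and the limit are hypothetical; a real SRM tool would take its figures from the system and apply its own rules.

```python
from typing import Dict, List


def find_runaways(samples: Dict[str, List[int]], max_growth: int) -> Dict[str, int]:
    """Flag data sets whose space grew too much between consecutive samples.

    samples maps a data set name to its allocated size at each sampling
    interval, oldest first. The result maps each offending name to its
    largest single-interval growth.
    """
    flagged = {}
    for name, sizes in samples.items():
        # Growth between each pair of consecutive samples.
        growth = [later - earlier for earlier, later in zip(sizes, sizes[1:])]
        worst = max(growth, default=0)
        if worst > max_growth:
            flagged[name] = worst
    return flagged


# Illustrative data only: sizes in tracks at three sampling intervals.
history = {
    "PROD.PAYROLL.LOG": [100, 110, 120],
    "TEST.SORT.WORK": [100, 400, 1600],
}
print(find_runaways(history, max_growth=200))
```

A check like this is only the detection half; what happens next (alerting someone, capping the allocation, or redirecting it to another pool) is a policy decision for the site.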

Installing new microcode can also be a difficult task. This may require a series of regression tests or studying the documentation to ensure that applicable tests occur. Few problems are harder to troubleshoot than microcode errors.

Recoverability

Application and data set recovery is never fun. Recovery is always a pressure situation and one that can cause small problems to snowball into major issues. Often, the storage administrator is responsible for recovering data. The usual exceptions are databases and (sometimes) the operating system itself. Databases have specialized tools for backup, but HSM, Data Set Services (DSS), or another vendor product handles most data sets.

Intelligent SRM tools report on backups in HSM, integrating the database information into the storage environment. Automated recovery tools make it easy to back up and recover all types of data, even up to the point at which the failure occurred. One reason mainframe storage is inexpensive to manage and needs fewer people is that mainframe storage administrators implemented tiered storage, including migration, several years ago. This means not as much data needs to be backed up because some of it may have already been migrated.

Authorization

Storage administrators are responsible for the corporate crown jewels. Yet, often, the practice of securing authorization to these resources is left to staff or individuals unfamiliar with the specific data protection mechanisms built into the system. In the case of storage management, certain authorization profiles must be in place to prevent the system or users from inadvertently causing damage. For example, SMS introduced the requirement that every managed data set must have a catalog entry. Even today, however, it’s not widely known that the security profiles needed to protect those catalog entries aren’t created by default.

In large, multi-system environments, Global Resource Serialization (GRS) or its equivalent is required to prevent systems from damaging one another. Components such as SMS and HSM depend on multi-system serialization functions to keep vital system structures safe. The most frequent reason HSM control data sets get corrupted is that serialization specifications are set incorrectly. The same philosophy applies to catalogs; they must be locked to ensure that users don’t inadvertently damage them. Even where tape data sets are concerned, the storage administrator gets involved because today’s environments require catalog integrity. As part of implementing automated storage management, companies can establish catalog rules that ensure storage environment security.

Problem Management

There’s never a good time for a problem, and problems often come in bunches. Much of a storage administrator’s time is spent managing them. Given the storage administrator’s responsibilities, shirking this area would only lead to bigger problems. Types of problems vary, but they generally fall into one or more of the major areas we’ve discussed. The tasks most likely to consume a storage administrator’s time are adding DASD, moving files, and finding data sets users have lost. Intelligent SRM automation tools, when combined with tools that prevent out-of-space abends, can drastically reduce or eliminate these problems. SRM reporting tools can help administrators quickly resolve the problems that do occur.

Conclusion

In today’s fast-paced, unforgiving storage environment, learning on the job is a trial by fire. SRM solutions give the storage administrator the information needed to solve problems on the first try by getting to the root cause immediately.