Feb 1 ’05
ICF Catalog Failure: The Local Disaster You’re Not Planning For!
Virtually all z/OS mainframe data centers have a classic Disaster Recovery (DR) plan of one kind or another to help survive an outage of the overall data center. The DR plan typically has several staff assigned and the plan is tested with elaborate scenarios at least annually. Some installations test portions of their DR plan as often as every other week. For many sites, the costs for “hot site” support can run into the millions of dollars.
The odds are low that anyone will ever declare an actual disaster. Nevertheless, staff goes through the motions, albeit with a lot of grumbling, and success or failure is declared for each test.
With today’s emphasis on cost cutting, why is so much effort, time, and money devoted to a process that everyone hopes will never be activated? Senior management, often at the CIO level, know they have a fiduciary responsibility to maintain a DR plan, and that they’re accountable if a disaster occurs without one.
It’s no accident that the term disaster recovery has changed in recent years to business continuity, because that’s what it provides—the continuity of your business when an outage of your entire data center occurs.
ICF Catalog Risk Factors
A serious data outage is waiting in the wings at many installations, yet if anyone is even aware of the danger of this outage, it’s simply ignored. This potentially devastating outage is the failure of an ICF catalog.
Typically, the technical staff says that “catalogs never break; we haven’t had a catalog failure in five years,” or some other excuse that keeps them covered. When they do admit to a catalog failure, the attitude is that this type of failure is rare, and that a recovery time of several hours is perfectly acceptable. Because this type of outage isn’t likely to gain outside notoriety, it’s often swept under the carpet as “one of those pesky computer glitches.” Nevertheless, when an ICF catalog outage occurs, it can result in millions of dollars of lost revenue and missed customer service level agreements.
When technical staff are asked if they have recovery procedures in place and thoroughly tested, they often roll their eyes, flash smiles, and admit that, no, they have never really tested for recovery of a catalog. If they have tested, it typically involves restoring their catalogs during the initial process of setting up for their DR test. Or, worse, they tested 15 years ago when they purchased a software tool for catalog recovery, but haven’t touched it since.
At most installations, technical teams already have their hands full with daily problems and processes. Since ICF catalog failures aren’t an everyday occurrence, planning for a catalog failure isn’t a priority task. The other reality is that management hasn’t made it a priority to create and maintain an ICF catalog recovery plan, nor have they been given an established requirement to regularly test it.
More to the point, it’s a safe bet senior management has never heard of ICF catalogs, and isn’t aware of the acute risk of data access failure that an ICF catalog outage presents. Without that awareness, there’s no drive to implement such a plan.
z/OS systems have hundreds of thousands, sometimes millions of data sets. The concept of “cataloging” has been around for many years to assist in locating these data sets. In the past, cataloging was optional, but widespread use of system-managed storage requires the cataloging of all data sets under its control. The ICF catalog is where all data sets are cataloged, and access to data sets, whether from a batch application or an online system such as CICS or DB2, is possible only through a successful catalog search. If the catalog isn’t available, access to the data isn’t possible.
Figure 1 shows a layout of the ICF catalog and its associated metadata structures (VTOC and VTOCIX) for a single VSAM data set. Note that there are multiple records in the BCS, plus several more in the VVDS, VTOC, and VTOCIX on the volume where the data set resides. All these records, across all metadata structures, must be intact, with synchronized information, for the data set to be accessible.
The aspect that makes catalog failure a high risk factor is the ratio of data sets to ICF catalogs. Most installations, even the largest ones, typically have fewer than 100 catalogs on a system—usually about 25. Let’s assume they catalog 1,000,000 data sets. If the data sets were distributed evenly across the catalogs, each catalog would contain, on average, approximately 40,000 data sets. If you lose access to any one of those catalogs, for even an hour or two, you lose access to a considerable number of data sets (probably one or more entire applications).
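The exposure arithmetic above is simple but worth making concrete. A minimal sketch, using the article’s illustrative round numbers (1,000,000 data sets spread evenly across 25 user catalogs):

```python
# Rough exposure arithmetic using the article's illustrative numbers:
# ~1,000,000 cataloged data sets spread evenly across ~25 user catalogs.
total_data_sets = 1_000_000
user_catalogs = 25

# Average number of data sets that become unreachable if one catalog fails:
avg_per_catalog = total_data_sets // user_catalogs
print(f"One catalog outage strands ~{avg_per_catalog:,} data sets")
```

Four percent of the catalogs holding four percent of the data sets sounds tolerable until you remember that those 40,000 data sets are typically concentrated in a handful of complete applications, not spread thinly across the business.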
If the number of data sets per catalog isn’t evenly distributed, the danger of a failure becomes even greater if one of the larger catalogs develops a problem.
Catalog Data Set Distribution
Why not have lots of catalogs, with cataloged data sets spread widely across them, to reduce the danger? Well, you can, and each installation has control over exactly how they design and set up this catalog and data set environment to work best for their particular needs. Unfortunately, there are factors working against this. The more catalogs you have, the more daily management is required to keep them clean and error-free, and the more catalog backups you have to manage and track, the greater the chance that one of them is missing or stale when you need it for recovery. Too few catalogs result in catalogs that are too big, too unwieldy, and also prone to failure. There’s simply no magic number that works for all installations.
As a rule, the deciding factor is the number of “applications” that run on your mainframe. An application can be defined as a collection of related data files and their associated processing programs. For example, human resources, payroll, or customer data could each be defined as an application. Within an application, there might be hundreds or thousands of data sets, and usually, all data sets within an application will be cataloged in the same catalog.
Here’s how it works. The path to locating the catalog is a technique called “alias match,” where the alias is a value that represents the application. The catalog is defined to the system as having an alias of that value and that same alias is assigned to the high-level node for all data sets in that application (see Figure 2). To locate any data set within that application, an alias table is searched to identify its assigned catalog, and then the catalog is searched for the fully qualified data set name.
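The alias-match search described above can be sketched conceptually. This is an illustration of the lookup logic only, not actual z/OS catalog internals; the alias names, catalog names, and data set names are all hypothetical:

```python
# Conceptual sketch of the alias-match search (not actual z/OS internals).
# An alias table maps a data set's high-level qualifier to a user catalog;
# the named catalog is then searched for the fully qualified data set name.

alias_table = {            # hypothetical alias -> owning user catalog
    "PAYROLL": "UCAT.PROD1",
    "HR": "UCAT.PROD1",    # many aliases commonly share one catalog
    "CUST": "UCAT.PROD2",
}

catalogs = {               # hypothetical catalog contents
    "UCAT.PROD1": {"PAYROLL.MASTER.DATA", "HR.EMPLOYEE.KSDS"},
    "UCAT.PROD2": {"CUST.ACCOUNTS.KSDS"},
}

def locate(dsname: str) -> str:
    """Return the catalog that resolves dsname, mimicking alias match."""
    hlq = dsname.split(".", 1)[0]       # high-level qualifier
    catalog = alias_table.get(hlq)      # alias match against the alias table
    if catalog is None or dsname not in catalogs.get(catalog, set()):
        raise LookupError(f"{dsname}: catalog search failed")
    return catalog

print(locate("PAYROLL.MASTER.DATA"))
```

Note what happens if `UCAT.PROD1` goes down in this sketch: every data set under both the PAYROLL and HR aliases becomes unreachable at once, which is exactly the multiplier effect the article describes.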
The problem is that most catalogs have multiple aliases, representing multiple applications, whose data sets are cataloged in the same catalog. It isn’t uncommon for a production catalog to have 50, 75, or even 100 or more aliases assigned. Over time, new aliases are assigned to existing catalogs and the number of data sets for an application grows. Before you know it, a critical catalog has bulged far out of proportion. If the catalog has an unplanned outage for any reason, many applications will lose access to their data sets until the outage is corrected.
Figure 3 shows a real-world example of the user catalog configuration on one system at a major z/OS installation. There are more than 1.3 million data sets on the system, and nominally 25 user catalogs, yet a full 79 percent of the data sets—almost 1.1 million—are cataloged in just five catalogs. The largest catalog has 377,441 data sets cataloged in it, representing 27 percent of the data sets on this system. If any of these five catalogs suffers an outage of any kind, a large number of data sets immediately become unavailable. Consider the 241 aliases on these five catalogs, representing the applications that will be affected by a catalog outage.
Figure 4 illustrates the catalog environment on one system at a major banking facility where 34 user catalogs contain entries for 2.7 million data sets. Here again, a huge percentage (74 percent) of these data sets are cataloged in just five catalogs, and a whopping 39 percent (more than 1 million) are cataloged in a single catalog! Lose any of these catalogs for a few hours and you have a major business disruption.
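The skew in Figures 3 and 4 is easy to measure on any system once you have per-catalog entry counts. A hedged sketch with hypothetical counts chosen to mirror the proportions of the banking example (34 catalogs, roughly 2.7 million entries):

```python
# Hypothetical per-catalog data set counts, chosen to resemble the skew
# in the article's Figure 4 (34 catalogs, ~2.7 million entries).
counts = [1_060_000, 420_000, 250_000, 160_000, 110_000] + [24_000] * 29

total = sum(counts)
ranked = sorted(counts, reverse=True)
top1_share = ranked[0] / total          # share held by the single largest catalog
top5_share = sum(ranked[:5]) / total    # share held by the five largest catalogs

print(f"{top1_share:.0%} of entries in the largest of {len(counts)} catalogs")
print(f"{top5_share:.0%} of entries in just the top five catalogs")
```

Running this kind of tally against your own catalogs (for example, from LISTCAT output counts) is a quick way to quantify how concentrated your exposure really is.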
Similar situations emerge at virtually every OS/390 or z/OS data center, large or small. The worst data centers have a single user catalog in which every data set in the installation is cataloged. That’s a disaster waiting to happen!
Obstacles to Overcome
There are reasons for this unfortunate situation:
- Many z/OS systems programmers cling to a decades-old belief that ICF catalogs don’t break. This is rubbish, but opinions die hard. The result is a lack of attention to this area.
- Taking any action to improve the catalog environment typically requires one or more catalogs to be taken out of service while corrective action occurs. Many systems programmers are afraid to touch catalogs for fear they’ll cause more problems than they fix. Actually, many catalogs are already broken in one way or another and either ignorance or band-aid procedures are in effect to sidestep the problem areas.
- Non-stop processing is also a barrier. Online systems such as CICS or DB2 run for weeks at a time. When the online systems aren’t accessing the data, batch jobs are. With so many aliases (applications) using any given catalog, it’s difficult to schedule downtime on a catalog.
For these reasons, the catalog environment doesn’t get cleaned up.
Why Are ICF Catalogs Breaking?
ICF catalog management code dates back to 1985 and has been stable for many years. Recently, a requirement for greater shared access integrity and performance improvements has led to significant code changes. Perhaps that’s the reason ICF catalogs seem to be failing now more than ever. Hardly a week goes by without a story of a major catalog failure surfacing. Often, it’s accompanied by details about how several hours were necessary to correct the problem, typically due to lack of staff expertise in how to perform a catalog recovery, or the lack of a plan or utility software to effect the recovery.
The number of complaints reported in the IBMLink database shows the frequency and severity of catalog-related problems.
Often, catalogs simply aren’t kept clean. Because it’s believed they never break, little attention is given to diagnostics across the breadth of the catalog environment. Typically, the available diagnostics are run only when a known problem exists, with little thought given to latent problems in the background: structures within the catalog that are already broken but don’t stop it from running on a given day. These latent errors are likely to jump out and bite you when it’s least expected and at the worst possible time.
System technicians need to recognize that the catalog environment is a single point of failure and that catalogs do fail. They then need to develop a catalog failure test plan, not on the scale of an overall DR plan, but a real plan nevertheless. They need to test this plan in realistic ways, not just going through superficial motions, but actually testing how to recover from a failure of a critical catalog.
In addition, an analysis should be made of the current catalog environment with an eye toward spreading the data set load across as many catalogs as possible. If a large percentage of your data sets are now cataloged in just a handful of catalogs, and multiple aliases (applications) are assigned to each of these catalogs, you need to create a workable number of additional catalogs and then move entries out of the overloaded catalogs into the new ones. This process is known as splitting and merging. Many systems programmers are afraid of running the standard split/merge utility, since the catalogs involved must already be clean and, from a practical standpoint, quiesced.
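Before running any split, you need a plan for which aliases move where. One simple planning heuristic is to assign each alias, largest first, to the least-loaded target catalog. A hypothetical sketch of that greedy balancing step (a planning aid only, not the split/merge utility itself; the alias names, counts, and `UCAT.NEW` names are all invented):

```python
import heapq

# Hypothetical alias -> data set count for one overloaded catalog.
alias_counts = {"PAYROLL": 90_000, "HR": 60_000, "CUST": 50_000,
                "BILLING": 40_000, "GL": 30_000, "MKTG": 20_000}

def plan_split(alias_counts, n_catalogs):
    """Greedily assign each alias (largest first) to the least-loaded catalog."""
    # Min-heap of (current load, catalog id, assigned aliases).
    heap = [(0, i, []) for i in range(n_catalogs)]
    heapq.heapify(heap)
    for alias, count in sorted(alias_counts.items(), key=lambda kv: -kv[1]):
        load, i, members = heapq.heappop(heap)   # least-loaded catalog so far
        members.append(alias)
        heapq.heappush(heap, (load + count, i, members))
    return sorted(heap)                          # (load, id, aliases) per catalog

for load, i, members in plan_split(alias_counts, 3):
    print(f"UCAT.NEW{i}: {load:>7,} entries  {members}")
```

The point of a plan like this is to keep any one new catalog from inheriting the concentration problem you are trying to fix; all data sets under a given alias still move together, preserving the one-catalog-per-application convention.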
Finally, the catalog environment should be regularly diagnosed and any errors found should be corrected as quickly as possible.
Senior Management Commitment
Senior managers at most installations aren’t even aware that ICF catalogs exist, let alone the risk and danger to business continuity that an unplanned ICF catalog outage presents. So, there isn’t a business commitment to plan and practice for this type of outage as there is for the overall DR plan.
Without commitment from senior management to mitigate this risk, it’s unlikely the technical staff will take it upon themselves to correct the situation. Just as the overall DR plan is the result of directives from the highest levels of IT management, ICF catalog cleanup and recovery will also have to be directed by senior management.
This requires creating a plan that analyzes the current risk and, possibly, a project to restructure the catalog environment to spread out the risk. A recovery testing plan must be developed and a schedule created for serious testing. Since a catalog outage can occur at any time, multiple staff members must be knowledgeable in the tools and procedures to achieve a catalog recovery, and they should be required to participate in the test plan.
ICF catalog outages are real and occur every day to z/OS installations. They result in significant disruptions of business continuity. Without serious recognition of the risks associated with ICF catalog outages, the development of an ICF catalog recovery plan, and a commitment to test this plan regularly, your installation faces the possibility that a local outage will occur with consequences that may be as devastating as a data center disaster.