Operating Systems

For these reasons, the catalog environment doesn’t get cleaned up.   

Why Are ICF Catalogs Breaking?  

ICF catalog management code dates back to 1985 and has been stable for many years. Recently, a requirement for greater shared access integrity and performance improvements has led to significant code changes. Perhaps that’s the reason ICF catalogs seem to be failing now more than ever. Hardly a week goes by without a story of a major catalog failure surfacing. Often, it’s accompanied by details about how several hours were needed to correct the problem, typically because staff lacked expertise in performing a catalog recovery, or because there was no plan or utility software to effect the recovery.

The number of complaints reported in the IBMLink database shows the frequency and severity of catalog-related problems. 

Often, catalogs simply aren’t kept clean. Because it’s believed they never break, little attention is given to diagnostics across the breadth of the catalog environment. Typically, the available diagnostics are run only when a known problem exists, with little thought given to latent problems: structures within the catalog that are broken but don’t stop it from running on a given day. This type of problem is likely to jump out and bite you when it’s least expected and at the worst possible time.

The Solution   

System technicians need to recognize that the catalog environment is a single point of failure and that catalogs do fail. They then need to develop a catalog failure test plan, not on the scale of an overall DR plan, but a real plan nevertheless. They need to test this plan in realistic ways, not just going through superficial motions, but actually testing how to recover from a failure of a critical catalog. 

In addition, the current catalog environment should be analyzed with an eye toward spreading the data set load across as many catalogs as possible. If a large percentage of your data sets are cataloged in just a handful of catalogs, and multiple aliases (applications) are assigned to each of them, create a workable number of additional catalogs and then move records out of the catalogs that have grown too large and into the new ones. This process is known as splitting and merging. Many systems programmers are afraid to run the standard split/merge utility because it requires that the catalogs already be clean and, from a practical standpoint, that they be quiesced.
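
The kind of load analysis described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it presumes you have already extracted, by whatever means your installation uses, one (catalog name, alias) pair per cataloged data set. The function name, the catalog and alias names, and the 25 percent threshold are all hypothetical and not taken from any IBM or vendor tool.

```python
from collections import Counter


def catalog_load_report(entries, threshold=0.25):
    """Summarize how cataloged data sets are spread across catalogs.

    entries   -- iterable of (catalog_name, alias) pairs, one pair per
                 cataloged data set (hypothetical input format)
    threshold -- share of all data sets above which a catalog holding
                 more than one alias is flagged (arbitrary assumption)
    """
    per_catalog = Counter()
    aliases = {}
    for catalog, alias in entries:
        per_catalog[catalog] += 1
        aliases.setdefault(catalog, set()).add(alias)

    total = sum(per_catalog.values())
    report = []
    # most_common() lists the most heavily loaded catalogs first.
    for catalog, count in per_catalog.most_common():
        share = count / total
        report.append({
            "catalog": catalog,
            "data_sets": count,
            "share": share,
            "aliases": sorted(aliases[catalog]),
            # Flag only catalogs that are both heavily loaded and shared
            # by several aliases; a single-alias catalog cannot be split
            # along alias boundaries.
            "split_candidate": share > threshold and len(aliases[catalog]) > 1,
        })
    return report


if __name__ == "__main__":
    # Made-up inventory: two production aliases share one catalog.
    sample = (
        [("UCAT.PROD1", "PAY")] * 6
        + [("UCAT.PROD1", "HR")] * 2
        + [("UCAT.TEST", "DEV")] * 2
    )
    for row in catalog_load_report(sample, threshold=0.5):
        print(row)
```

A report like this only identifies where the load is concentrated. Deciding which aliases to move, and actually moving them, still requires the split/merge process and the precautions discussed here.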

Finally, the catalog environment should be regularly diagnosed and any errors found should be corrected as quickly as possible. 

Senior Management Commitment  

Senior management at most installations isn’t even aware that ICF catalogs exist, let alone of the risk to business continuity that an unplanned ICF catalog outage presents. So, there isn’t a business commitment to plan and practice for this type of outage as there is for the overall DR plan.

Without commitment from senior management to mitigate this risk, it’s unlikely the technical staff will take it upon themselves to correct the situation. Just as the overall DR plan is the result of directives from the highest levels of IT management, ICF catalog cleanup and recovery will also have to be directed by senior management.   

This requires creating a plan that analyzes the current risk and, possibly, a project to restructure the catalog environment to spread out the risk. A recovery testing plan must be developed and a schedule created for serious testing. Since a catalog outage can occur at any time, multiple staff members must be knowledgeable in the tools and procedures to achieve a catalog recovery, and they should be required to participate in the test plan. 

Conclusion  

ICF catalog outages are real and occur every day at z/OS installations. They result in significant disruptions of business continuity. Without serious recognition of the risks associated with ICF catalog outages, the development of an ICF catalog recovery plan, and a commitment to test this plan regularly, your installation faces the possibility that a local outage will occur with consequences that may be as devastating as a data center disaster.
