Jul 18 ’14

Reducing the Potential for CICS Storage Violations

by Russ Evans in Enterprise Tech Journal

Storage violations in CICS are unpredictable, notoriously difficult to identify and resolve, and can cause data corruption and region failure. For this reason, it’s important to understand the common causes of storage violations and the enhancements IBM has made to CICS to reduce the potential for them.

The term “storage violation” has a very precise definition in CICS and is related to control blocks that existed prior to the new architecture introduced in CICS/ESA. In early releases of CICS, the control blocks that tracked task-related storage lived within the storage area they described. These Storage Accounting Areas (SAAs) contained information including the length of the storage area and the address of the next SAA on the chain. If this information was overwritten, CICS couldn’t track the next area of storage on the chain, meaning the storage couldn’t be freed when the task terminated. “Storage violation” was the term used to describe the specific condition where an 8-byte SAA had been altered. Because the SAA lived within task storage, it was common for a programmer to lose track of storage boundaries, and storage violations occurred frequently.

In addition to this strictly limited technical definition of a storage violation, the presence of an overlaid SAA almost always indicated that some additional storage had been corrupted. As CICS ran in a single address space and storage key, this corruption could occur anywhere in the region, including CICS and user load modules, control blocks and file buffers. The message that “A Storage Violation Has Occurred” was frequently followed by the CICS region failing in spectacular and inexplicable ways. It’s important to note the reverse isn’t always true; a CICS region can suffer huge storage overlays without generating a storage violation simply because no 8-byte SAA had been affected.

IBM recognized the need to reduce the frequency and severity of storage violations, while at the same time simplifying the process of problem determination, and has directed significant effort toward this goal. One of the significant enhancements IBM introduced was to relocate CICS’ critical control blocks outside the bounds of task-related storage—including the data held in the SAA. Rather than embedding SAAs in task storage, this critical data was moved into a new series of control blocks that live in 31-bit storage areas normally not addressable by a CICS task. The SAA was replaced by the storage check zone, a control block that’s almost universally referred to as the crumple zone. The crumple zone is an 8-byte area containing only a literal formed by concatenating the task’s task number with a 1-byte field indicating the type of storage; whenever CICS acquires task-related storage, it increases the size of the getmain by 16 bytes and places one crumple zone at the start of the storage and another at the end. The address returned to the requesting program is always the first byte following the leading crumple zone, so the program never has addressability to it. When CICS frees the storage, it checks the values of the leading and trailing crumple zones. Today, a storage violation is defined as the corruption of any crumple zone.
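
Conceptually, the result of a 4,096-byte getmain looks something like the sketch below; the exact contents of the storage check zone are internal to CICS, so the layout shown is only illustrative:

   offset 0      leading crumple zone (8 bytes: task number plus storage-type byte)
   offset 8      the 4,096 bytes requested; the getmain returns this address
   offset 4,104  trailing crumple zone (8 bytes: the same literal as the leading zone)

In other words, CICS actually getmains 4,112 bytes to satisfy the 4,096-byte request, and only the middle 4,096 bytes are ever addressable by the program.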

Intercepting Storage Violations

While storage violations are almost always the result of an application coding error, the actual cause of the overlay can be difficult or impossible to determine. In many cases, the task in control when the violation was detected isn’t the task that caused the violation; frequently, the offending task has completed and already been cleaned up by the time CICS notices the violation. Prior to the introduction of CICS/ESA, there were only two avenues available for reducing the number and severity of storage violations. The first—long, difficult and prone to disappointment—was reviewing CICS storage violation dumps to determine the cause of the violation; the second—proactive, long-term and with difficult-to-prove results—came through enforcement of proper coding practices. Applications programmers felt that the systems programmers didn’t make dump reading a priority but instead pushed for source code changes “just for the sake of standards,” while systems programmers felt applications programmers ignored standards and assumed the systems staff would always be available to read dumps. Both sides, meanwhile, complained to IBM.

IBM responded and has worked diligently to give CICS the capability of both preventing storage corruption attempts at run-time and improving the documentation provided when CICS produces a storage violation dump. The result is a set of four options that will abend a task that attempts certain types of storage violation.

CICS storage protection was the first enhancement released by IBM that actually prevented storage violations. Prior to storage protection, the entire CICS address space ran in the same z/OS storage key, meaning that any user program could overwrite any user or system storage at any time. Storage protection changed this by allowing the CICS address space to be divided into two different storage keys. Key 8 (CICS key) is the primary storage key and key 9 (user key) is the subordinate key. Tasks running in CICS key (which includes most of the CICS system code) can read and write both CICS key and user key storage; tasks running in user key can read and write user key storage and can read CICS key storage, but will abend with an S0C4 if they attempt to overwrite CICS key storage. With the primary CICS control blocks and system code loaded into CICS key storage, these critical areas of CICS were protected from storage overlays. Further, if a user key program attempted to overwrite CICS key storage, the S0C4 abend would generate a transaction dump showing exactly which line of code caused the error, significantly reducing the effort required for problem determination. New program and transaction definition parameters were added to allow the customer to define non-IBM resources as CICS key where desired.

Storage protection is activated at the region level by setting the STGPROT option in the DFHSIT to ‘YES’ (the default is ‘NO’). Once activated, task-related storage is allocated from CICS or user key based on parameters in the TRANSACTION and PROGRAM definitions.
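
As a sketch, the definitions involved might look like the following; STGPROT is a SIT parameter, while TASKDATAKEY on the TRANSACTION definition and EXECKEY on the PROGRAM definition assign the key for an individual transaction’s task storage and for an individual program (the transaction, program and group names here are invented):

   * SIT override: activate storage protection for the region
   STGPROT=YES

   * Resource definitions: run this application in user key (key 9)
   CEDA DEFINE TRANSACTION(TRN1) GROUP(APPLGRP) PROGRAM(APPLPGM1)
               TASKDATAKEY(USER)
   CEDA DEFINE PROGRAM(APPLPGM1) GROUP(APPLGRP) LANGUAGE(COBOL)
               EXECKEY(USER)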

Storage protection didn’t solve the storage violation issue. Because all user programs still ran in the same key (key 9), it provided no protection when one user program violated another user program’s storage. Worse, some older CICS applications couldn’t run in user key, leaving those applications unprotected and capable of overwriting any storage in the region. Still, storage protection was a major step toward actually controlling the storage violation problem.

Transaction isolation, introduced in CICS/ESA 4.1, provided significant improvement over the storage protection feature. Transaction isolation uses the subspace group facility of MVS and allows CICS to function as if every user task had its own z/OS storage key. Tasks that run with transaction isolation active have read and write access to their own task storage and certain shared storage areas, read-only access to CICS key storage and no access to the task-related storage of other user key tasks. This last feature—making other tasks’ storage fetch-protected—went far beyond storage protection by abending programs that would never have generated a storage violation. By casting a wider net, CICS could now identify programs with addressability issues that might have led to storage violations in the future.

Transaction isolation is activated at the region level by setting the TRANISO option in the DFHSIT to ‘YES’ (the default is ‘NO’). Once activated, transactions are run as isolated based on parameters in the TRANSACTION definition.
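
A comparable sketch for transaction isolation follows; note that TRANISO only takes effect when STGPROT=YES is also coded, and that ISOLATE on the TRANSACTION definition defaults to YES:

   * SIT overrides: transaction isolation requires storage protection
   STGPROT=YES,
   TRANISO=YES

   * ISOLATE(NO) would place the transaction in the common subspace,
   * where it shares user-key task storage with other ISOLATE(NO)
   * transactions
   CEDA ALTER TRANSACTION(TRN1) GROUP(APPLGRP) ISOLATE(YES)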

Unfortunately, some older applications rely on the ability of one task to access another task’s storage, meaning these applications couldn’t take advantage of transaction isolation without recoding. Transaction isolation also increases a task’s virtual storage utilization, making it difficult to implement in a region that’s already short on virtual storage.

Reentrant program protection expands storage protection by allowing reentrant programs to be loaded into read-only storage in one of the CICS read-only dynamic storage areas (RDSAs). If reentrant program protection is active, any attempt to overlay a reentrant program in CICS storage will result in an S0C4 abend.

Reentrant program protection is activated at the region level by setting the RENTPGM option in the DFHSIT to ‘PROTECT’ (this is the default; to deactivate it, set RENTPGM to ‘NOPROTECT’). Once activated, any load module that has been linked as RENT will be loaded into read-only storage. The best part of reentrant program protection is that it adds essentially no run-time overhead.

Because the program binder performs no validation of the RENT attribute, it’s quite common for programs that are actually non-reentrant to be linked as RENT. If this is the case, setting RENTPGM to PROTECT will cause any attempt by such a program to modify itself (for example, to store into a field defined within the load module) to abend with an S0C4, making the program unusable. This problem can be resolved without recompiling by relinking the program as NORENT. Only truly reentrant programs should be linked as RENT if reentrant program protection is to be used.
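
The pieces involved might look like this sketch: RENTPGM in the SIT, plus a binder job that relinks an incorrectly marked module as NORENT (the program and data set names are invented):

   * SIT override (PROTECT is the default)
   RENTPGM=PROTECT

   //RELINK   EXEC PGM=IEWL,PARM='LIST,XREF,NORENT'
   //SYSLMOD  DD  DISP=SHR,DSN=APPL.LOADLIB
   //SYSPRINT DD  SYSOUT=*
   //SYSLIN   DD  *
     INCLUDE SYSLMOD(APPLPGM1)
     ENTRY   APPLPGM1
     NAME    APPLPGM1(R)
   /*

A CEMT SET PROGRAM(APPLPGM1) NEWCOPY (or PHASEIN) would then be needed for a running region to pick up the relinked copy.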

Command protection was introduced to plug a loophole in storage protection and transaction isolation, commonly known as the “hired gun” effect. This comes into play when a program issues an EXEC CICS command that returns data, but passes CICS an invalid address for the receiving field. CICS is still in control and running in CICS key when the data is returned, so CICS will overlay whatever storage is found at the return address provided, regardless of the key the user task is assigned. When command protection is in effect, CICS will validate the first byte of the address passed with the EXEC CICS command to ensure the requesting program actually has authority to write to that storage area; if the program doesn’t have authority, CICS will abend the task with an AEYD abend. The key limitation to command protection is that only the first byte is tested; if the return data area starts at a valid address but then continues into an invalid area, CICS will allow the overlay to occur.

Command protection is activated at the region level by setting the CMDPROT option in the DFHSIT to ‘YES’ (this is the default; to deactivate command protection, set CMDPROT to ‘NO’). Once activated, CICS will validate the first byte of any return area on every EXEC CICS command that’s processed. Because CMDPROT offers limited protection at the expense of some additional overhead on every command, CICS regions without active storage violations tend not to make use of it.
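
A hypothetical sketch of the kind of error CMDPROT catches: the pointer used to address the receiving field is stale or was never set, so the program itself has no authority to write there, but without command protection CICS, running in CICS key, would perform the overlay on the task’s behalf (the file and field names are invented):

       WORKING-STORAGE SECTION.
       77  WS-ACCT-KEY          PIC X(8)  VALUE SPACES.
       77  WS-BAD-PTR           USAGE POINTER.
       LINKAGE SECTION.
       01  LK-ACCT-REC          PIC X(200).
       PROCEDURE DIVISION.
      *    WS-BAD-PTR holds a stale or uninitialized address, so
      *    LK-ACCT-REC does not map storage this task may update
           SET ADDRESS OF LK-ACCT-REC TO WS-BAD-PTR
      *    With CMDPROT=NO, CICS overlays whatever is at that address;
      *    with CMDPROT=YES, it validates the area first and abends
      *    the task AEYD
           EXEC CICS READ FILE('ACCTFILE')
                          INTO(LK-ACCT-REC)
                          RIDFLD(WS-ACCT-KEY)
           END-EXEC.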

Dealing With Storage Violations

Even with these protection options active, some storage violations may continue to occur, but now the majority of violations are self-inflicted. Even in a CICS region where every application transaction runs in user key with transaction isolation, storage protection, command protection and reentrant program protection active, each transaction still has the authority to overwrite its own user storage. Every transaction has a unique set of storage areas that are acquired as the storage is needed. Note that programs can acquire storage explicitly via the EXEC CICS GETMAIN command, but the bulk of a task’s storage is acquired by CICS to fulfill a request. For example, when an EXEC CICS LINK is issued, CICS must acquire storage to retain the current state of the transaction, and LE must acquire program-related storage, including working storage.
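
For reference, an explicit request looks something like the sketch below; CICS silently adds the two crumple zones around the 4,096 bytes the program asked for, and the area is released automatically at task end if the program never frees it (the field names are invented):

       WORKING-STORAGE SECTION.
       77  WS-GOT-PTR           USAGE POINTER.
       77  WS-GOT-LEN           PIC S9(8) COMP VALUE 4096.
       LINKAGE SECTION.
       01  LK-WORK-AREA         PIC X(4096).
       PROCEDURE DIVISION.
      *    Explicit task-storage request
           EXEC CICS GETMAIN SET(WS-GOT-PTR)
                             FLENGTH(WS-GOT-LEN)
           END-EXEC
           SET ADDRESS OF LK-WORK-AREA TO WS-GOT-PTR.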

Each of these storage areas is increased by 16 bytes to hold the leading and trailing crumple zones. While the program is never given addressability to the leading crumple zone, the trailing crumple zone can be inadvertently addressed by using more storage than was requested; for example, by overrunning a table past its OCCURS clause. Since the trailing crumple zone is often followed by the leading crumple zone of the next storage area, that zone can be overlaid in the same way. Even though CICS might not detect such a storage violation until task termination, the fact that the transaction in control when the violation is found is also the culprit means the dump CICS produces is much more likely to contain sufficient information, especially in the CICS trace table, to identify the cause of the violation.
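
A hypothetical example of such a table overrun (the program name is invented): the loop below runs to 150 although the table holds only 100 entries, so the last 50 iterations store 80 bytes each past the end of WS-TABLE, potentially flattening the trailing crumple zone and the leading crumple zone of whatever area follows.

       IDENTIFICATION DIVISION.
       PROGRAM-ID. OVERRUN1.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       77  WS-REC-COUNT         PIC S9(4) COMP VALUE 150.
       77  WS-IX                PIC S9(4) COMP.
       01  WS-TABLE.
           05  WS-ENTRY         PIC X(80) OCCURS 100 TIMES.
       PROCEDURE DIVISION.
      *    WS-REC-COUNT (150) exceeds the OCCURS value (100), so
      *    entries 101 through 150 are written beyond WS-TABLE
           PERFORM VARYING WS-IX FROM 1 BY 1
                   UNTIL WS-IX > WS-REC-COUNT
               MOVE SPACES TO WS-ENTRY(WS-IX)
           END-PERFORM
           EXEC CICS RETURN END-EXEC.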

Preventing Storage Violations

There are two COBOL coding errors that between them account for the vast majority of storage violations: DFHCOMMAREA length mismatches and table index overruns. In most cases, neither of these errors will be intercepted by the CICS protection functions described earlier.

The COBOL compile option SSRANGE, in conjunction with the LE run-time option CHECK, will intercept attempts by a COBOL program to access an out-of-range entry in a table. When SSRANGE is used, the compiler adds code to every table access in the program to ensure the current subscript or index value isn’t greater than the OCCURS value. The CHECK run-time option activates this checking. The combination of SSRANGE and CHECK can add significant CPU overhead to your processing, so in production it should be used only when other attempts to identify the failing code have failed. In development, however, it’s never a bad idea to compile all programs with SSRANGE and to run with CHECK active in all development CICS regions.

With SSRANGE and CHECK activated, an attempted access to an out-of-range table entry will generate an LE U1006 abend, and a message will be written to CEEMSG that describes the error in detail.
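
How the two options are specified varies by shop, but one sketch: SSRANGE can be coded on the CBL (PROCESS) statement at the top of the source or in the compile step’s PARM, while CHECK(ON) is an LE run-time option that can be assembled via the CEEXOPT macro into a region-wide CEEROPT module or a CEEUOPT module link-edited with a single program:

   * Compiler option, on the CBL (PROCESS) statement
    CBL SSRANGE

   * LE run-time option, coded when building CEEROPT (region-wide)
   * or CEEUOPT (one program)
    CEEXOPT CHECK=((ON),OVR)

In a CICS region, CEEROPT is typically loaded from the region’s library concatenation at initialization, which makes it a convenient way to turn CHECK on for an entire test region without relinking individual programs.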

There’s no automated method to identify storage overlays that are the result of DFHCOMMAREA length mismatches. When a program issues an EXEC CICS LINK, XCTL or RETURN with the COMMAREA option, it passes the desired length of the area to CICS. The called program, when activated, is passed the address of the DFHCOMMAREA, and the length specified by the caller is placed in EIBCALEN. The called program assigns the passed address to an 01 level in its LINKAGE SECTION, and can then either use the data description in linkage or copy the area to working storage. There’s no problem if the called program’s data layout describes an area shorter than the actual EIBCALEN; that simply leaves some unused storage at the end of the area. The problem occurs if the data layout describes an area longer than the EIBCALEN. For example, if the actual commarea length is 100 bytes but the 01 level in linkage includes a field that extends beyond byte 100, any attempt to update that field will cause a storage violation.

Commarea size mismatches are the leading cause of storage violations, yet they’re the simplest to prevent. Simply add code to the called program to compare the value in EIBCALEN to the length the program expects for the commarea; if the values differ, the transaction should be abended. There are no acceptable circumstances under which any program should reference a commarea without verifying the length first.
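
A minimal sketch of such a check follows; the commarea layout and the abend code are illustrative, and a program designed to be started without a commarea would also treat EIBCALEN = 0 as valid:

       LINKAGE SECTION.
       01  DFHCOMMAREA.
           05  CA-CUSTOMER-ID       PIC X(8).
           05  CA-ACTION-CODE       PIC X.
           05  CA-RETURN-CODE       PIC X(2).
       PROCEDURE DIVISION.
      *    Reject any caller whose commarea length does not match the
      *    layout this program was written against (11 bytes here)
           IF EIBCALEN NOT = LENGTH OF DFHCOMMAREA
               EXEC CICS ABEND ABCODE('CALN') END-EXEC
           END-IF
      *    Only now is it safe to reference the CA- fields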

Debugging Storage Violations

Reading CICS storage violation dumps is more of an art form than a science. To begin with, you must have a full system dump of the state of the region at the time the storage violation was detected. CICS produces a system dump by default unless you’ve used dump suppression to prevent it, but your internal procedures must ensure the dump data set is preserved. While it’s almost impossible to resolve a storage violation without the internal CICS trace table, IBM recognizes that many shops can’t support the CPU overhead required for full tracing, so CICS creates an “exception” trace record when a storage violation is detected. These exception entries are collected even when tracing is disabled and contain information that will greatly assist in evaluating the dump.

As a rule of thumb, start the problem determination process by finding the damaged crumple zone that’s located at the smallest address value. Because CICS only checks the crumple zones when an area is freemained, most storage violations are found at task termination time (i.e., when CICS frees all the remaining user storage areas associated with the task). It’s possible for a large number of crumple zones to be damaged before CICS recognizes that a storage violation has occurred, and the crumple zone that’s noticed first may not be at the beginning of the storage corruption.

If the first damaged crumple zone is a leading one, then the culprit is most likely not the task that owns the storage area in question. If the first damaged crumple zone is a trailing one, then the task owning the storage is likely the culprit. This is due to the nature of crumple zones. To take the second case first: As described earlier, programs are never given addressability to their leading crumple zones, but they can always reach their trailing crumple zones simply by overrunning the end of a storage area. While there may be a damaged leading crumple zone immediately following the damaged trailing one, that can still be the result of overrunning the end of a storage area. However, if a leading crumple zone is the first one damaged, it’s less likely this task caused the problem. Again, this task has never had addressability to the leading crumple zone, and if the corruption had started in this task’s own user storage, it most likely would have hit a trailing zone before it got to a leading zone.

On the other hand, storage violations are by their nature unpredictable. It’s quite possible for a program to loop through a routine that corrupts a few bytes of storage and then skips over a few kilobytes before its next overlay. In this case, the crumple zone that eventually is corrupted would be chosen completely at random and would be of no help in identifying the failing code.

The CICS trace table is invaluable when attempting to identify the cause of a storage violation. Once the damaged crumple zone has been identified, search the trace table backward to find the trace record that was written when the area was acquired. The trace will show who asked for the area to be acquired, and why. For example, a user program might make an explicit GETMAIN request, or the user program might have issued an EXEC CICS request that required storage to process, or LE may have needed the storage for some internal use. Since the damage must have occurred between the time the storage was acquired and the time the storage violation was noticed, use the trace entries between these two events to identify which other tasks were active during this period.

Conclusion

As the result of a combination of increased robustness of CICS application programs and the efforts IBM has made to reduce the ability of programs to corrupt storage, the number and severity of storage violations have dropped significantly. Despite this, it’s important to recognize the implications that lie behind the occurrence of even a single storage violation: that some unknown program has corrupted an unknown amount of storage. The only thing we know for sure about a CICS region that’s suffered a storage violation is that we know absolutely nothing about the state of the region at that time. Data buffers may have been overlaid while being written to DASD, other tasks’ working storage may have been altered in ways that change their logic paths and literally any portion of the CICS environment may have been compromised. By their nature, storage violations are random and unpredictable and should always be considered a high-severity problem.

If your shop never experiences storage violations, it’s still good practice to activate the low-overhead protection options, such as storage protection and reentrant program protection, as a preventive measure against future problems. If your shop sees storage violations on a regular basis, and the various protection features described here aren’t already in place, they should be reviewed and activated. If your shop is CPU-constrained, consider activating high-overhead options such as SSRANGE and command protection in the development environment only.

Many shops make the decision to run without trace to reduce the measurable overhead that trace requires. This is a reasonable decision. In fact, it may be a requirement if the processor is full or the CICS region is CPU-constrained, but it does have the negative side effect of drastically reducing the likelihood of resolving a storage violation. If your shop suffers from regular storage violations, it may be worth incurring the additional overhead of trace in production; again, activating full trace in test or development adds minimal overhead. When activating trace, be careful to set the size of the trace table large enough to actually capture the relevant data leading up to an abend; remember that increasing the size of the table uses more virtual storage but doesn’t add to the CPU overhead.
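
If you do run with internal trace, the SIT settings involved might look like this sketch (TRTABSZ is specified in kilobytes; the value shown is arbitrary):

   * SIT overrides: activate internal trace and enlarge the table
   INTTR=ON,
   TRTABSZ=10240
   * A bigger table consumes more virtual storage but no more CPU;
   * the CPU cost comes from writing trace entries, not table size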