Nov 5 ’13

CICS and the Open Transaction Environment: What to Do Without DB2

by Russ Evans in Enterprise Tech Journal

This series reviews the CICS Open Transaction Environment (OTE) and its effect on how CICS uses machine resources. The first article, “CICS and the Open Transaction Environment: A Retrospective” (available at http://esmpubs.com/q98ug), examined the history of CICS and IBM’s attempts to provide relief from CICS’ chronic CPU constraint, culminating in the introduction of the OTE. This second article concentrates on the potential benefits of the OTE to non-DB2 users, while the final article will focus on the potential CPU savings the OTE can provide to CICS/DB2 applications.

When the CICS OTE was announced, it was viewed primarily as a way to reduce CPU utilization; the potential for up to a 40 percent reduction in CPU for large CICS/DB2 applications overshadowed anything else the OTE might provide. This savings only occurs if the programs in use are written to threadsafe standards and are allowed to run on an L8 Task Control Block (TCB) following an SQL call; the longer a program remains on its L8 TCB, the greater the CPU savings. (While the OTE includes several different TCB types, this article only discusses the L8 and L9 TCBs. Usually, there’s no functional difference between an L8 and an L9 TCB, so this article uses the term “L* TCB.” The qualified TCB type [L8 or L9] is used only when required to distinguish the differences between these types.)

However, as more programs spent a greater portion of their processing time on L8 TCBs, it became apparent that OTE exploitation had benefits beyond DB2 CPU savings. Here, we will discuss how the use of the OTE can reduce hardware costs, increase multiprocessor exploitation, eliminate bottlenecks and simplify your CICS Multi-Region Operation (MRO) complex, all without any reference to DB2. We will also cover threadsafe non-compliance errors and how to distinguish them from “normal” abends. 

The Effect of OTE on CPU Constraint
CICS’ single-TCB architecture—the Quasi-Reentrant, or QR, TCB—has consistently tied large CICS users to high-capacity individual processors. An active CICS region would always perform better on a small capacity box with a single, fast processor than on a large capacity box with multiple, slower processors. As a region’s workload increased, the CPU requirement of the QR TCB would begin to approach the total capacity of a processor and performance would suffer; this condition, known as QR constraint, was to be avoided.

There are three ways to relieve QR constraint: reduce the workload, upgrade the CPU or offload work to another region. All three, however, have negative consequences.

To reduce workload, CICS has several tuning options available to throttle incoming transactions, smoothing out the normal short-term peaks and valleys of CPU utilization and preventing the region from bogging down. This has the unfortunate effect of increasing response times for all transactions and is only a temporary solution; eventually, there are no more valleys and response times become unacceptable.

While it’s a simple matter to suggest a CPU upgrade, financing it is a different story. Add the accompanying software costs and an upgrade quickly becomes a last resort. This is especially true if only CICS is constrained. It’s theoretically possible for CICS to be fully QR constrained on one processor of a triadic machine while the remaining processors are lightly used—leading to a situation where the machine is “maxed out” at less than 50 percent utilization.

Offloading work to another CICS region is the most commonly used option. A new CICS region, an Application Owning Region (AOR), is created and an application is removed from the existing region and installed in the new one. If there’s only one application in the constrained region, the region is “cloned,” the application is installed in both regions and a workload manager distributes the load between them. The resulting complex, called MRO, is used, to some degree, in most CICS shops. While MRO has benefits, its drawbacks include the additional overhead associated with:

• Adding a new CICS region
• Transaction routing
• CICS function shipping to share application resources
• Managing the complex.

When a transaction is executing in the OTE, it’s running under its own TCB, not the QR. A program executing on an L* TCB relieves QR constraint similarly to offloading to another CICS region, but without the drawbacks discussed earlier. If enough processing is moved into the OTE, QR constraint can be relieved to the point where existing AORs can be merged, eliminating the overhead of maintaining multiple regions and function shipping. The goal is to reduce QR processing requirements to the point where a single CICS region can fully occupy all the processors in a multiprocessor box before a hardware upgrade is required.

What About QR “Blocking”?
Unlike the z/OS dispatcher, the CICS dispatcher only receives control when a program issues an EXEC CICS command. From the time a CICS task is given control to the point it issues a CICS command, that task owns the TCB it’s running on. When the TCB is the QR, this can cause problems. If a program is highly CPU-intensive between CICS commands, it will occupy the QR TCB for an extended time; CICS can’t dispatch any other tasks or handle any incoming transaction requests while it’s running. The resulting spike in transaction response time—starting when the task takes over the QR and continuing past the point where it returns control to CICS while CICS recovers from the backlog of outstanding work—can exceed Service Level Agreement (SLA) limits.

Exploiting OTE is an obvious solution for QR blocking. If the offending program runs on an L* TCB, the QR TCB is unaffected. While overall CPU utilization in the Logical Partition (LPAR) will spike, the QR TCB remains available to service transactions as usual and no increase in response times occurs. An unanticipated benefit of moving a QR blocking program to the OTE is that the response time of the offending transaction is also improved due to the elimination of the region slowdown that follows a QR block.

OTE exploitation reduces hardware costs by removing CICS’ reliance on a single, fast processor via multiprocessor exploitation. Hardware upgrades are required when the capacity of the box is reached, not the capacity of a single processor. Multiprocessor exploitation also eliminates the bottlenecks caused by QR blocking, providing more consistent transaction response times, even during peak workload. The QR constraint relief OTE exploitation provides can remove or delay the introduction of additional CICS AORs, or accommodate collapsing an existing CICS complex into fewer regions—which saves additional CPU, further delaying a hardware upgrade.

TNSTAAFL
There is, indeed, no such thing as a free lunch. While the benefits of the OTE are real and readily achieved, care must be taken to ensure that only threadsafe-compliant code runs on an L* TCB, and changes are required to force programs from the QR onto an L* TCB. The potential pitfalls of OTE exploitation are few and can be avoided with proper planning and a firm understanding of the principles involved.

It’s not possible to test for threadsafe compliance; it’s a matter of intent that can only be determined through a code review by a programmer familiar with the application. There are three minimum standards that must be met before a program is threadsafe-compliant:

• If the program is COBOL or PL/1, it must be Language Environment (LE)-enabled.
• The program code must be reentrant.
• The load module must have been linked with the RENT (reentrant) option.

Achieving reentrancy in COBOL or PL/1 is a matter of using the RENT option on the compile step. Assembler programs must be manually reviewed to ensure reentrancy. Once programs have been recompiled/reviewed and linked with RENT, use the CICS start-up option RENTPGM=PROTECT to test for reentrancy. When RENTPGM=PROTECT is specified, CICS loads all programs linked as RENT into read-only storage. If the program turns out to not be truly reentrant, CICS will issue message DFHSR0622, and the task will abend with an ASRA. Programs that can’t run successfully with CICS reentrant program protection active shouldn’t be considered threadsafe-compliant.
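
As a quick reference, the options involved look like this; a minimal sketch assuming Enterprise COBOL and the z/OS binder, with shop-specific procedure details omitted:

   Compile step:    PARM='RENT'         compiler generates reentrant code
   Link-edit step:  PARM='RENT'         binder marks the load module reentrant
   CICS SIT:        RENTPGM=PROTECT     RENT-linked modules are loaded into read-only storage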

Identifying Non-Threadsafe Code
IBM defines a threadsafe program as “one that doesn’t modify any area of storage that can be modified by any other program at the same time, and doesn’t depend on any area of shared storage remaining consistent between machine instructions.”

If your program doesn’t access any “shared” storage areas in CICS, then it’s already threadsafe. If your program does access shared storage, someone must (manually) review the program to determine if the access is threadsafe, and if not, how to serialize the access to make it threadsafe. We will discuss the details of how to perform a threadsafe review and techniques to convert non-threadsafe programs in a follow-up article.
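
For a flavor of what such a review looks for, the fragment below is a minimal COBOL sketch (the field, counter and resource names are invented) of the classic shared-storage case: a counter kept in the CWA, with the update serialized by an ENQ/DEQ pair so that only one task at a time can change it.

       WORKING-STORAGE SECTION.
       01  WS-ENQ-NAME              PIC X(8)  VALUE 'CWACOUNT'.
       LINKAGE SECTION.
       01  CWA-AREA.
           05  CWA-RUN-COUNTER      PIC S9(8) COMP.
       PROCEDURE DIVISION.
      * The CWA is shared by every task in the region, so the update
      * must be serialized before this program can safely run on an
      * L* TCB alongside other tasks.
           EXEC CICS ADDRESS CWA(ADDRESS OF CWA-AREA) END-EXEC
           EXEC CICS ENQ RESOURCE(WS-ENQ-NAME) LENGTH(8) END-EXEC
           ADD 1 TO CWA-RUN-COUNTER
           EXEC CICS DEQ RESOURCE(WS-ENQ-NAME) LENGTH(8) END-EXEC

Without the ENQ/DEQ pair, two tasks running concurrently on separate L* TCBs could both fetch the old counter value, and one of the updates would be lost.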

Forcing Programs Into the OTE
Programs are directed to execute on an L* TCB either implicitly by calling a CICS Task Related User Exit (TRUE) that has been enabled with the OPENAPI parm, or explicitly based on parameters in the program’s CICS definition.

When a program issues a call to an OPENAPI TRUE, CICS moves the task to an L8 TCB before giving control to the TRUE program. If the program is defined to CICS as THREADSAFE, it will be allowed to remain on that L8 TCB until it issues a non-threadsafe CICS command. When the OTE was first introduced, DB2 was the only OPENAPI TRUE supported by IBM; since then, IBM has added OPENAPI support for several CICS products, including native sockets, MQ Series and CICS/DLI. This is an implicit OTE access because the program has no control over how the TRUE was enabled and can’t determine when it’s running on the L* TCB.
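
As an illustration of the implicit case (a sketch only; the data names are invented and the MQ call is shown without its surrounding setup), a program defined as THREADSAFE that issues an MQ call is switched to an L8 TCB for the TRUE and stays there afterward:

      * The MQ call drives the MQ OPENAPI TRUE; CICS moves this task
      * to an L8 TCB before the TRUE receives control.
           CALL 'MQPUT1' USING W-HCONN, W-OBJDESC, W-MSGDESC,
                               W-PUTOPTS, W-BUFLEN, W-BUFFER,
                               W-COMPCODE, W-REASON
      * Because the program is defined as THREADSAFE, it remains on
      * the L8 TCB here; the first non-threadsafe CICS command it
      * issues would force a switch back to the QR TCB.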

The CONCURRENCY and Application Program Interface (API) parameters of the PROGRAM definition are used to explicitly force a program to run in the OTE. If a program is defined as CONCURRENCY(THREADSAFE) and API(OPENAPI), all its code will run on an L* TCB. If a program is defined as CONCURRENCY(REQUIRED), all its code will run on an L8 TCB. Note that in the first case, the program could run under either an L8 or L9 TCB. The only time a program will run on an L9 TCB is if the CICS region has storage protection active, the program’s EXECKEY is set to USER and its API is OPENAPI. Before attempting to exploit the L9 TCB environment, be aware that significant performance problems may arise when programs on an L9 TCB either issue calls to OPENAPI TRUEs or issue many non-threadsafe CICS commands. To help customers using storage protection and the OTE avoid these problems, IBM added the CONCURRENCY(REQUIRED) option. If a program that would have run under an L9 TCB is defined as REQUIRED, it will instead be initialized on an L8 TCB.
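
By way of example, the resource definitions might look like this (the program and group names are invented):

   CEDA DEFINE PROGRAM(PGMA) GROUP(APPL1) CONCURRENCY(THREADSAFE) API(OPENAPI) EXECKEY(USER)
   CEDA DEFINE PROGRAM(PGMB) GROUP(APPL1) CONCURRENCY(REQUIRED)   API(CICSAPI)

Under the rules above, PGMA runs entirely on an open TCB, and that TCB will be an L9 if the region has storage protection active because of the EXECKEY(USER)/API(OPENAPI) combination; PGMB is initialized on an L8 TCB.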

Problem Determination With Non-Threadsafe Programs
Because it isn’t possible to test for threadsafe compliance, even the most thorough threadsafe conversion can result in a non-threadsafe program being allowed to run on an L* TCB. Unfortunately, when a program that isn’t coded to threadsafe standards runs in the OTE, truly unpredictable results will occur. These results will show up at random times (usually at peak processing) because they’re due to the actions of multiple, independent CICS tasks running programs that access the exact same storage area simultaneously. In some cases, the result of non-threadsafe activity will be invalid data results rather than an abend.

If non-threadsafe errors are suspected following a threadsafe conversion, use these guidelines to help identify the root cause (the settings and commands involved are summarized after the list):

• Ensure all the programs involved were compiled and linked as reentrant and that RENTPGM=PROTECT has been specified for the region. Use CEMT INQ SYS to determine the reentrant program protection value. Enforcing program reentrancy is the only threadsafe prerequisite that can be proved: with RENTPGM=PROTECT, any program that isn’t reentrant will abend when tested, providing the exact location of the invalid program code. Even so, non-reentrant programs running in the OTE are the number one cause of threadsafe errors.
• “Impossible” results are likely to be threadsafe errors. Following a threadsafe conversion, if a user or programmer reports results that seem impossible, a threadsafe error is the probable cause.
• The CICS trace table is invaluable in resolving suspected threadsafe errors; ensure your trace table has sufficient storage to provide trace entries from task initialization to task termination for your average task.
• If a program abend is suspected to be the result of a threadsafe error, it’s likely the transaction dump alone won’t contain sufficient information to resolve the problem. In that case, set CICS to take a full system dump and attempt to re-create the problem.
• If you suspect that incorrect data is the result of a non-abending threadsafe error, use CICS auxiliary trace to help narrow down the cause. You may find your current allocations for DFHAUXT and DFHBUXT aren’t large enough to hold the trace you need; if so, increase their allocations and run the trace again.
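
The settings and transactions these guidelines rely on are summarized below as a sketch; the trace table size is an example value only and should be tuned to your workload:

   RENTPGM=PROTECT       SIT: load RENT-linked programs into read-only storage
   TRTABSZ=10240         SIT: internal trace table size in KB (example value only)
   AUXTR=ON              SIT: start auxiliary trace at initialization
   AUXTRSW=ALL           SIT: switch between DFHAUXT and DFHBUXT whenever the active data set fills
   CEMT INQUIRE SYSTEM   Confirm the reentrant program protection setting in a running region
   CETR                  Adjust the trace table size and auxiliary trace status online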

Conclusion
Not every CICS environment will benefit from exploiting the OTE. In some cases, the cost of a threadsafe conversion might exceed any anticipated performance gains; in others, a combination of non-threadsafe commands and non-OPENAPI TRUEs might result in CPU increases when programs are defined to run in the OTE. Most shops, however, will see positive results from a threadsafe conversion in the areas described here.