Mar 21 ’14
CICS and the Open Transaction Environment: Are You Sure That Program Is Threadsafe?
Previous articles (see the “References” section) discussed the tremendous benefits accrued from using the CICS Open Transaction Environment (OTE), ranging from improved multiprocessor exploitation to reductions in software license charges. In this final article, we take a closer look at the CPU savings seen in large CICS/DB2 applications after they’re converted to threadsafe, as well as the critical elements of a successful conversion, including:
• How to identify programs that offer the greatest potential savings
• How to determine the actual savings available compared to the potential savings
• How to identify non-threadsafe application or system code
• Common methods to convert non-threadsafe code to meet threadsafe standards
• When (and why) a two-phase threadsafe conversion is appropriate.
It’s important to remember that the OTE by itself doesn’t reduce CPU consumption. CICS/DB2 programs that aren’t defined as threadsafe require two task control block (TCB) spins for each DB2 request issued, while CICS/DB2 programs that are defined as threadsafe can often avoid any TCB spins when issuing a DB2 request. Threadsafe CICS/DB2 programs running in the OTE use less CPU because CICS does less work to support them.
The operating system services DB2 system code uses can’t run on the quasi-reentrant (QR) TCB, so IBM uses a task-related user exit (TRUE) to intercept an SQL call and move the requesting task to another TCB where the DB2 code can be processed safely. (Note: The restrictions on what processing can be performed on the QR TCB are discussed in detail in the previous articles in this series.)
Once DB2 has processed the request, the requesting task is moved back to the QR to continue executing, resulting in two TCB spins per request. When IBM converted these TCBs to L8 TCBs, it allowed threadsafe programs to remain on the L8 TCB after DB2 completed the call. If the program is still running on an L8 TCB when it issues another SQL call, no TCB spinning is required, resulting in CPU savings. The program could potentially remain on the L8 until it terminates, eliminating all but the spins required to service the first SQL call. The program’s potential TCB spin reduction is equal to 2 x (number of SQL calls - 1). It’s important to remember that while the examples shown relate to DB2, any OPENAPI-enabled TRUE works the same way; if your shop uses IDMS, substitute the word “IDMS” wherever “DB2” appears in this article.
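As a rough illustration of this arithmetic, the Python sketch below restates the formula for the potential reduction. The function name is hypothetical, and the result is a best case that assumes the program issues no non-threadsafe commands between SQL calls.

```python
def potential_spin_reduction(sql_calls: int) -> int:
    """Best-case TCB spins avoided: 2 x (number of SQL calls - 1)."""
    if sql_calls < 0:
        raise ValueError("sql_calls must be non-negative")
    # With zero or one SQL call there is nothing to save.
    return 2 * max(sql_calls - 1, 0)


if __name__ == "__main__":
    for calls in (1, 10, 1000):
        print(calls, "SQL calls:", potential_spin_reduction(calls), "spins avoided")
```

A task that issues a single SQL call gains nothing by this measure; the benefit grows with the number of SQL calls issued per task.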
Unfortunately, these TCB spins aren’t always avoidable, because the task will be spun back to the QR if the application program issues a non-threadsafe CICS command. Not all CICS system code is written to threadsafe standards; for example, a portion of the system code used to process an EXEC CICS DELAY is non-threadsafe. CICS serializes non-threadsafe commands by always executing them on the QR TCB. If a program is running on an L8 and issues a DELAY, CICS recognizes the non-threadsafe command and spins the task to the QR. When control is returned to the program, it’s left on the QR, so a subsequent SQL call will result in the task being spun back to the L8. Threadsafe programs that issue non-threadsafe commands can’t achieve their full potential CPU savings. This is becoming less of a problem since the introduction of the OTE, as IBM has gradually converted the bulk of all EXEC CICS commands to be threadsafe, reducing the scope of this issue significantly.
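The effect of non-threadsafe commands on the switch count can be sketched with a small Python model. This is a simplification for illustration only, not CICS code: the command labels ("SQL", "TS", "NTS") and the function name are hypothetical, and the model assumes one final switch back to the QR at task end so that its totals agree with the 2 x (number of SQL calls - 1) formula.

```python
QR, L8 = "QR", "L8"


def count_tcb_switches(commands, threadsafe_program):
    """Count TCB switches for one task under a simplified model.

    commands: sequence of "SQL", "TS" (threadsafe CICS command) or
              "NTS" (non-threadsafe CICS command, such as DELAY).
    """
    tcb = QR                      # the task starts on the QR TCB
    switches = 0
    for cmd in commands:
        if cmd == "SQL":
            if tcb == QR:
                switches += 1     # QR -> L8 so the DB2 code can run
                tcb = L8
            if not threadsafe_program:
                switches += 1     # L8 -> QR once DB2 completes
                tcb = QR
        elif cmd == "NTS":
            if tcb == L8:
                switches += 1     # spun to the QR and left there
                tcb = QR
        elif cmd != "TS":
            raise ValueError(f"unknown command: {cmd}")
    if tcb == L8:
        switches += 1             # modeling assumption: return to QR at task end
    return switches


if __name__ == "__main__":
    five_sql = ["SQL"] * 5
    mixed = ["SQL", "NTS"] * 5
    print(count_tcb_switches(five_sql, threadsafe_program=False))
    print(count_tcb_switches(five_sql, threadsafe_program=True))
    print(count_tcb_switches(mixed, threadsafe_program=True))
```

In this model, a threadsafe program that follows every SQL call with a non-threadsafe command incurs as many switches as the same program defined as QUASIRENT, which is the worst case described above.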
Identifying Programs That Offer the Greatest Potential Savings
You can identify programs with the greatest potential savings by using CICS’s built-in monitoring capabilities, although any standard CICS monitoring tool will make the process faster and easier. CICS monitoring records are written to the SMF data sets and are processed using the IBM-supplied program DFH$MOLS. The output supplied by DFH$MOLS is much more detailed than required for our purposes here; we only want to compare the number of TCB switches this task required with the number of DB2 requests it made, as shown in Figure 1. The field “DB2REQCT” shows the number of DB2 requests made; “USRCPUT” shows the total CPU time consumed by this task along with the number of TCB switches it encountered (29,763). In this example, we have a potential savings of more than 29,700 TCB switches from converting this task’s DB2 programs to threadsafe.
Determining Actual Savings vs. Potential Savings
The potential savings we’ve identified can be quite different from the actual savings received based on the number and location of non-threadsafe EXEC CICS commands in the programs executed. It would be easy to find the actual savings available to this task if we could simply re-define all the programs it uses to be threadsafe and then generate an updated DFH$MOLS report. We can’t do that in production—never define a program as threadsafe until you’ve completed a thorough threadsafe analysis—but we can do so in test, using the following guidelines:
• Set up an isolated test CICS environment where this test will be the only activity.
• Ensure that CICS monitoring is on in this test region.
• Re-define the transaction(s) in question to be a member of a transaction class.
• Set that transaction class to a maximum of one.
• Execute a pre-defined set of transactions that can be repeated.
• Alter the PROGRAM definitions for all the programs involved to CONCURRENCY=THREADSAFE.
• Re-run the previously executed set of transactions.
• Compare the number of DB2 requests and TCB switches between the first (non-threadsafe) and second (threadsafe) runs.
Warning: Be sure to return the test CICS to its original condition by removing transaction class definitions and program threadsafe definitions.
Before comparing the two sets of results, verify that the number of DB2 calls is roughly the same in both runs. If the number of DB2 calls is different, that’s an indication the before and after runs were different and calls the test results into question.
The key metric is the ratio of TCB switches to DB2 calls, expressed as (TCB switches / 2) : DB2 calls. A ratio near one indicates that no actual savings was found: either the test was performed incorrectly or the programs issue non-threadsafe commands after almost every DB2 call. A ratio near zero indicates an ideal candidate; a threadsafe conversion will result in your actual savings coming very close to your potential savings.
A ratio greater than one indicates that non-threadsafe exits are active in the region. Under no circumstances should you start a threadsafe conversion in an environment where non-threadsafe exits are found. Use the IBM-supplied STAT transaction (program DFH0STAT) with the DB2, user exit and global user exit options selected to produce a complete list of all exits active in the region along with their threadsafe status. If a user exit program supplied by a vendor is defined as QUASIRENT, contact the vendor for an updated version of the exit program; if a threadsafe version of the exit program isn’t available from the vendor, it may indicate that the vendor has, for all practical purposes, stopped supporting the product in question. If a user exit program developed in-house is marked as QUASIRENT, you must convert it to threadsafe before implementing any threadsafe programs into production. Be careful when reviewing user exit programs, as the nature of the exit program environment lends itself to non-threadsafe programming.
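The comparison described above can be sketched in Python as follows. The function names, the 5 percent tolerance for "roughly the same" number of DB2 calls, and the 0.1 band used for "near" zero or one are all assumptions chosen for illustration; the "partial savings" label for in-between values is likewise an inference rather than a category named in the text.

```python
def runs_comparable(calls_before: int, calls_after: int,
                    tolerance: float = 0.05) -> bool:
    """True if both runs issued roughly the same number of DB2 calls."""
    largest = max(calls_before, calls_after)
    if largest == 0:
        return True
    return abs(calls_before - calls_after) <= tolerance * largest


def switch_ratio(tcb_switches: int, db2_calls: int) -> float:
    """Ratio of (TCB switches / 2) to DB2 calls for the threadsafe run."""
    if db2_calls <= 0:
        raise ValueError("db2_calls must be positive")
    return (tcb_switches / 2) / db2_calls


def interpret(ratio: float, near: float = 0.1) -> str:
    """Map the ratio onto the cases discussed in the text."""
    if ratio > 1 + near:
        return "non-threadsafe exits likely active"
    if ratio >= 1 - near:
        return "no actual savings"
    if ratio <= near:
        return "ideal candidate"
    return "partial savings"


if __name__ == "__main__":
    # Hypothetical counts from a second (threadsafe) test run.
    ratio = switch_ratio(tcb_switches=40, db2_calls=1000)
    print(round(ratio, 3), interpret(ratio))
```

If the two runs are not comparable, the ratio should be discarded and the test repeated rather than interpreted.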
Identifying Non-Threadsafe Application or System Code
After identifying the programs that will produce the greatest CPU savings, the next step is to determine if these programs are already threadsafe, or if a threadsafe conversion will be required. Unfortunately, there’s no automated method to prove code is threadsafe, partly because it can be difficult to tell if a given field is in shared storage, and partly because whether access to shared storage is threadsafe depends on how the storage is used.
Begin a threadsafe review by running the IBM load module scanner, DFHEISUP, with the IBM-supplied filter set of DFHEIDTH against the load library containing the programs in question. (Note: Documentation on the use of DFHEISUP is found in the CICS Operations and Utilities manual; the filter set, DFHEIDTH, is found in the SDFHSAMP library.) DFHEISUP will scan the load modules, looking for the EXEC CICS commands listed in the DFHEIDTH table: EXTRACT EXIT, GETMAIN SHARED and ADDRESS CWA. These three commands aren’t the only ways to acquire access to shared storage, but they’re the most common. Any program flagged by DFHEISUP as issuing these commands must be manually reviewed to determine what use is made of the shared area(s).
Figure 2 shows an example of non-threadsafe use of the CICS common work area (CWA). In this case, a field in the CWA is used as a counter to generate a unique field in a key sequenced data set (KSDS). When program PROG001 builds the key, it copies the current value of 0001 from the CWA to its key field in working storage, increments the counter in CWA by one and writes the record out. This processing isn’t threadsafe because it relies on the CWA counter field remaining unchanged during the period between extracting the current value and updating the counter to its new value.
The result of incorrectly defining program PROG001 as threadsafe while it contains this non-threadsafe routine is shown in Figure 3. In this case, two tasks (TASK1 and TASK2) are running PROG001 simultaneously. TASK1’s PROG001 starts the process by copying the current value of 0001 from the CWA to its key field in working storage. At the same time, TASK2’s PROG001 also copies the value of 0001 to its key field in its working storage. TASK1’s PROG001 then increments the counter in CWA by one to 0002; TASK2’s PROG001 also increments the counter by one—from 0002 to 0003. TASK1 then writes out its KSDS record and continues normally. When TASK2 attempts to write out its record, the key isn’t unique, and the write receives a DUPREC error condition. Since it’s “impossible” for a key not to be unique (after all, that’s what the CWA counter is used for—generating unique key values), it’s likely there’s no special program code to handle the DUPREC condition, and TASK2 fails with a condition that simply can’t occur. Worse, when nightly batch is run, there’s a “missing” record on the KSDS because the counter “jumped” from one to three, and the batch job also fails with an “impossible” condition.
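The interleaving described above can be replayed deterministically in a short Python sketch. It is only a model of the sequence of events: the class and function names are hypothetical, the CWA is represented by a plain object, and the KSDS by a set of keys.

```python
class Cwa:
    """Stand-in for the shared CICS common work area."""

    def __init__(self, counter=1):
        self.counter = counter


def run_interleaved(cwa):
    """Replay the TASK1/TASK2 interleaving and return both key values."""
    # Both tasks copy the current counter value to their own key field.
    task1_key = cwa.counter
    task2_key = cwa.counter
    # Both tasks then increment the shared counter.
    cwa.counter += 1    # TASK1: 0001 -> 0002
    cwa.counter += 1    # TASK2: 0002 -> 0003
    return task1_key, task2_key


def write_records(keys):
    """Model KSDS writes: a repeated key is reported as DUPREC."""
    ksds = set()
    results = []
    for key in keys:
        if key in ksds:
            results.append("DUPREC")
        else:
            ksds.add(key)
            results.append("NORMAL")
    return results


if __name__ == "__main__":
    cwa = Cwa()
    keys = run_interleaved(cwa)
    print("keys:", keys, "counter now:", cwa.counter)
    print("writes:", write_records(keys))
```

Both tasks end up holding key 0001 while the counter has advanced to 0003, which accounts for both the DUPREC condition and the record that later appears to be missing.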
It’s also possible to access shared storage in a manner that’s technically non-threadsafe but is acceptable in practice. Using this same CWA example, operations may want a program that would report how many records had been written so far. In that case, the program would simply extract the current counter value from the CWA, place it in an output message field in working storage and send it to the operator’s console. This activity is non-threadsafe in that the counter value can change between the time the program extracts it and the time the message is sent (in other words, the program could have moved a value of 0001 into its working storage, but the value in the CWA might be 0002 before it issues the send). However, in practice, although the process of building the message in working storage is automatically serialized when the program runs as QUASIRENT, the actual value of the CWA counter can change dramatically by the time the message is received and displayed on the console. In effect, running this non-threadsafe code in a program defined as threadsafe will result in a value that’s only slightly more out-of-date than if the code was converted to threadsafe.
Converting Non-Threadsafe Code to Meet Threadsafe Standards
If non-threadsafe code exists, it must be converted to meet threadsafe standards before the programs are marked as threadsafe. At this point, the benefit of the savings identified earlier must be weighed against the costs of modifying the code to be threadsafe. There are several mechanisms that can be used to serialize access to shared storage (often an EXEC CICS ENQ). However, it may be that using another facility, one that’s automatically serialized, to host the data is the best solution to the problem. For example, removing the counter field from the CWA and replacing it with a named counter would make this process threadsafe. In either case, the process of serialization will add overhead as compared to the in-core counter, reducing the CPU savings generated by the threadsafe conversion and changing the cost/benefit analysis that had previously justified the conversion.
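A minimal Python sketch of the serialized approach follows. The lock here only plays the role that an EXEC CICS ENQ/DEQ pair or a named counter would play in a real program; the class and function names are hypothetical, and Python threads stand in for concurrent CICS tasks.

```python
import threading


class SerializedCounter:
    """Shared counter whose read-and-increment is one serialized step."""

    def __init__(self, start=1):
        self._value = start
        self._lock = threading.Lock()   # stands in for ENQ/DEQ

    def next_key(self):
        with self._lock:                # only one task at a time gets past here
            key = self._value
            self._value += 1
            return key


def generate_keys(counter, tasks=8, keys_per_task=1000):
    """Run several concurrent tasks and collect every key they obtain."""
    keys = []
    keys_lock = threading.Lock()

    def worker():
        local = [counter.next_key() for _ in range(keys_per_task)]
        with keys_lock:
            keys.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return keys


if __name__ == "__main__":
    keys = generate_keys(SerializedCounter())
    print(len(keys), "keys,", len(set(keys)), "unique")
```

Every call now pays for acquiring and releasing the lock, which is the added overhead that has to be weighed against the savings from the conversion.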
The most efficient method of providing serialized access to an in-core counter value (like the CWA counter in our example) is the use of the Assembler Compare and Swap family of instructions. These instructions allow serialization at the hardware level and have an imperceptible level of CPU overhead. Their drawback is that they’re Assembler code; if the program that requires serialization is written in COBOL or PL/I, a new callable utility must be developed in Assembler to provide the Compare and Swap capability.
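The retry logic that typically surrounds a Compare and Swap can be sketched in Python. This is a model only: Python has no hardware compare-and-swap, so the lock inside the hypothetical Word class exists purely to imitate the atomicity the instruction provides, and none of these names correspond to real CICS or Assembler interfaces.

```python
import threading


class Word:
    """A shared storage word with a modeled compare-and-swap."""

    def __init__(self, value=1):
        self.value = value
        self._hw = threading.Lock()     # imitates hardware atomicity only

    def compare_and_swap(self, expected, new):
        """Store new only if the word still holds expected."""
        with self._hw:
            if self.value != expected:
                return False            # another task changed the word first
            self.value = new
            return True


def next_key(word):
    """Obtain a unique key by retrying until the swap succeeds."""
    while True:
        old = word.value                # read the current counter
        if word.compare_and_swap(old, old + 1):
            return old                  # this task now owns the value it read
        # The counter moved between the read and the swap; try again.


if __name__ == "__main__":
    counter = Word(1)
    print(next_key(counter), next_key(counter), counter.value)
```

The point of the loop is that a task never waits for another task; it simply rereads the counter and retries when the swap fails.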
While providing serialization of shared resources usually requires some kind of code change, it’s possible under some circumstances to achieve serialization without modifying the program in question. If all the non-threadsafe activity in an application is contained within a small subset of programs, these programs can be left as QUASIRENT. If a program running on an L8 TCB issues a CICS LINK or XCTL to a program that’s defined as QUASIRENT, CICS will spin the task to the QR TCB before giving control to the LINKed-to program; running single-threaded on the QR TCB will provide the level of serialization the program was originally designed around, so no coding changes are required.
When and Why a Two-Phased Threadsafe Conversion Is Appropriate
Allowing part of an application to remain QUASIRENT during a threadsafe conversion will reduce the level of CPU savings that might otherwise have been achieved, but the option of some savings now—without the expense of source code changes—vs. no savings until some time in the future is usually a valid trade-off. In the same way, implementing a threadsafe conversion with programs that still have a large number of non-threadsafe commands will provide partial CPU savings now vs. possibly larger savings if the non-threadsafe commands are modified. In both cases, the possibility of a two-phased threadsafe conversion is raised.
In a two-phased conversion, programs that meet threadsafe standards have their program definitions changed to THREADSAFE, even though their actual CPU savings is substantially below their potential savings. At some later point, when programming resources become available, these programs can be enhanced to (in the first case) bring those remaining programs up to threadsafe requirements, or (in the second case) reduce or eliminate non-threadsafe CICS command activity to further reduce TCB switches. The advantage of a two-phased conversion is that partial CPU savings is realized immediately, helping justify the cost of the programming effort required for phase 2, so the greatest possible savings is eventually achieved.
This series of articles was introduced after we observed that some CICS shops still aren’t fully aware of the mechanics of the CICS OTE or the many benefits of the OTE—both for DB2 and non-DB2 programs. Every CICS environment would see some improvement following a threadsafe conversion, and for many, this improvement would take the form of a reduction in peak CPU requirements that can lead directly to real savings in both hardware and software costs.
Because every threadsafe conversion project is unique, and every shop has its own programming and operational requirements, it’s probable the costs of a threadsafe conversion would sometimes exceed the benefits for certain applications. However, the potential benefits of OTE exploitation are so significant and widespread that it’s always reasonable to research the impact of OTE exploitation in your specific set of conditions, even when it appears unlikely that a threadsafe conversion would be approved. Put simply, the existence of the OTE changes many of the basic parameters around which CICS applications were designed and coded. Performance considerations that were once obvious are no longer obvious; coding conventions (quasi-reentrancy, for example) that were rigidly enforced aren’t good enough; and elaborate coding mechanisms designed to avoid QR CPU blocking can be eliminated. It’s time to accept these changes and start planning how to make them work for you.
References
“CICS and the Open Transaction Environment: A Retrospective,” http://esmpubs.com/q98ug
“CICS and the Open Transaction Environment: What to Do Without DB2,” http://esmpubs.com/ctmxx