Sep 1 ’03
Are You Reorganizing Your VSAM Files Too Often?
After reviewing VSAM usage for many years, one of the biggest unrecognized problems I’ve seen was (and still is) directly related to batch window processing and online performance. In fact, the issue is large enough that several vendors have created VSAM add-on enhancement products, but unfortunately, they may actually make this situation worse. What is this horrible problem, you ask? The Swami, knowing the answer even before you asked it, wrote this article to give you the answer!
BACKGROUND: BEFORE VSAM
More than 30 years ago, when we needed files indexed by a key, we fought with ISAM. ISAM had a relatively short life — from the early 1960s to the early 1970s — by which time it had been almost completely replaced by VSAM. One of the main problems with ISAM was how it handled inserted records — they were kept, unblocked, in chains in an “overflow area” at the end of the file. Adding only a few records to a file could produce very long service times for additional inserts and when processing records that had previously been inserted.
Late in ISAM’s life, “Cylinder Overflow Areas” were introduced, where you could reserve one or more tracks at the end of each cylinder, so that the chain processing might be shortened. The “Independent Overflow Area” at the end of the file was still available for use when the Cylinder Overflow Area was exhausted. Even with this enhancement, the inserted records were kept unblocked and in a chain in the overflow area(s), significantly impacting performance.
In those ISAM days, it was not uncommon to build a file with direct insert activity, and to pause after a few hundred inserts to reorganize the file in order to improve its performance. Some of the things we used to do with ISAM may have carried over into our ideas about VSAM, even though they are not generally applicable in the VSAM environment. Similarly, VSAM enhancement product features that will automatically reorganize your file for you based on VSAM insert and split activity should be monitored because in many cases, these reorganizations may not be helpful.
For the past 30 years, VSAM has provided a better way — actually several better ways. When defining a file (cluster in VSAM terminology), you can specify that some percentage of each Control Interval (CI) can be reserved as free space, and some percentage of the CIs in each Control Area (CA) can also be kept free. In the case of files with heavily clustered inserts, these techniques are not very effective, and another VSAM technique — splitting CIs and CAs to create additional free space precisely where needed — can be very useful. Often, we counter the benefits from splits by reorganizing our files more often than we should.
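To make the two free space percentages concrete, here is a minimal Python sketch. The function name, the sample values, and the round-down rule are assumptions for illustration only. It also ignores the control fields VSAM keeps inside each CI, so real figures will differ somewhat.

```python
def initial_free_space(ci_size, cis_per_ca, ci_pct, ca_pct):
    """Estimate the free space left at initial load.

    ci_pct: percentage of each CI reserved as free space
    ca_pct: percentage of the CIs in each CA left entirely free
    Rounding down is an assumption, not a statement of VSAM's exact rule.
    """
    free_bytes_per_ci = ci_size * ci_pct // 100
    free_cis_per_ca = cis_per_ca * ca_pct // 100
    return free_bytes_per_ci, free_cis_per_ca

# Hypothetical file: 4,096-byte CIs, 180 CIs per CA, 20% CI and 10% CA free space.
print(initial_free_space(4096, 180, 20, 10))
```

Note that this free space is spread evenly across the whole file, whether or not inserts will ever arrive in a given key range.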
VSAM HANDLES INSERTS WITH CI AND CA SPLITS
CI SPLIT PROCESSING
VSAM always keeps records within DASD blocks (or groups of blocks), called CIs, and if the record size permits, many records can be contained in a CI. It doesn’t matter if the records are all the same length or are variable length. If there is more than one record contained in a CI, the records in that CI will be kept in ascending sequence by their key values.
Figure 1 shows that VSAM has plenty of free space in this CI to permit the insertion of another logical record of the length illustrated. As long as there is sufficient free space to handle the size of the logical record being inserted, it will simply be inserted into the CI in order, and the amount of free space remaining will be reduced accordingly.
If, however, there is insufficient free space within the CI to hold the record being inserted, VSAM’s insert processing must create more free space. This process is called a Control Interval Split (CI Split). VSAM’s insertion strategy always attempts to do the following:
- Keep records of similar keys blocked together
- Avoid unblocked records
- Avoid chaining of individual records
- Create additional free space in the (key) vicinity if it is needed.
A CI Split is pretty inexpensive, as far as computer processing goes. VSAM performs the following steps to create more free space:
- Writes the CI being split with a split-in-progress indicator set in the CIDF field
- Moves (about) half of the records to a new CI buffer in storage
- Writes the new CI from that buffer
- Removes the records that were moved from the old CI buffer in storage
- Updates the Sequence Set (low-level) Index record to reflect the new CI and key changes
- Writes the old CI (without the moved records), resetting the split-in-progress indicator.
As you can see, this process required only four I/O operations to complete — the remainder of the processing was all in-storage activity.
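The steps above can be sketched in a few lines of Python. This is a simplified model, not VSAM's actual code: a CI is a sorted list of keys, `capacity` counts records rather than bytes, and the rule for where the new record lands after the split is an assumption. The I/O counter follows the four writes listed above.

```python
from bisect import insort

def insert_record(ci, key, capacity):
    """Insert key into a CI, splitting it when it is full.

    ci is a sorted list of keys; capacity (at least 1) is the most keys a CI
    can hold, an assumed stand-in for the byte-based free space check.
    Returns (old_ci, new_ci, split_ios); new_ci is None if no split occurred.
    """
    ci = list(ci)                      # work on a copy of the CI "buffer"
    if len(ci) < capacity:
        insort(ci, key)                # enough free space: insert in key order
        return ci, None, 0

    split_ios = 1                      # write old CI with split-in-progress set
    half = len(ci) // 2
    new_ci = ci[half:]                 # move about half the records to a new CI
    split_ios += 1                     # write the new CI
    old_ci = ci[:half]                 # remove the moved records from the old CI
    split_ios += 1                     # update the Sequence Set Index record
    # Assumption: the new record goes into whichever CI now covers its key.
    insort(new_ci if key >= new_ci[0] else old_ci, key)
    split_ios += 1                     # write old CI, resetting the indicator
    return old_ci, new_ci, split_ios
```

With a four-record CI holding keys 10, 20, 30, and 40, inserting key 25 leaves `[10, 20, 25]` in the old CI and `[30, 40]` in the new one, at a cost of four split I/Os. Both CIs now have room for further inserts in the same key range.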
EFFECTS OF CI SPLITS
There is little additional processing time involved in processing data after the CI Split has been completed. Two CIs now contain the records formerly contained in one, plus the newly inserted record. Direct processing, as is typical of CICS and other online activity, requires no additional I/O activity to process the records in either CI. Batch jobs will have to read an additional CI when processing sequentially, but that is a small expenditure; in a well-tuned job it will likely require little or no additional elapsed and I/O time given modern, cached DASD subsystems.
BENEFITS OF CI SPLIT PROCESSING
As you can see, when free space needed to handle the insertion of a new record was unavailable, VSAM’s insertion strategy caused additional free space to be created right where it was needed.
It is common (but not certain) that additional records will also be inserted in the same vicinity. This clustering of insert activity arises from the file and key designs in many cases, and from the natural processing flow in applications:
- Most inserts are at the end-of-file point as keys are created in continually ascending key sequence (using a time stamp or sequence number) — one principal insertion point.
- Most inserts are at the end of a range of keys (i.e., new accounts are opened in a branch banking situation) — there could be multiple insertion points in this case.
- Inserts are more scattered, but still clustered (i.e., new course information for several classes is added to a student’s record during college registration).
Split processing, then, is beneficial. Suppose we reorganize the file and restore all free space to the initial load configuration — what happens then? All the extra free space that we created through the CI Split process is removed, and future record insertions may have to re-create the free space over a period of time. Note that extending the length of a record is logically the same as adding a new record in this case.
CA SPLIT PROCESSING
In some cases, there is no free CI available in the CA. VSAM then needs to create free CIs at the next-highest level of its file organizational hierarchy, and it uses a CA Split to create the needed free space. A CA is a group of CIs stored physically near each other and indexed by a single Sequence Set (low-level) Index record. Within a CA, some number of whole CIs can be left as CA free space during initial load; these provide space for VSAM to use when CI Splits are needed.
When an insert is to be done, but no free space exists in the target CI, and no free CI exists in the CA, a CA Split is required. This is similar to a CI Split, but is much more time-consuming.
A CA is a group of CIs that can contain a large number of physical records or blocks. A CA can be as large as one cylinder of space on the DASD device in use. For example, for an IBM 3390 device (or one of its many equivalents) with 4,096-byte CIs, VSAM will write 12 CIs per track and 180 CIs per cylinder for a total of 720KB. If the primary and secondary allocation amounts for this file are each larger than one cylinder, then the size of one cylinder will be used for the CA and it will contain 180 CIs.
If you had a file with frequent sequential processing, you might have chosen a larger CI size, for example, 16,384 bytes. Then, the CA would contain only 45 CIs (three per track) for a total of 720KB. Had you chosen 18,432-byte CIs, VSE/VSAM could have also stored three of these on each track, still giving 45 CIs per CA, but 810KB per cylinder. DFP/VSAM may use extended format when writing DASD blocks, and therefore, may be restricted to three 16KB CIs per track. Larger CI sizes:
- Improve VSAM sequential processing performance
- Greatly speed processing of inserts when CA Splits are needed
- Use DASD space more efficiently.
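The capacity figures quoted above are easy to check with a little arithmetic. The sketch below is in Python; the function name is made up for illustration, and it relies on the 3390 having 15 tracks per cylinder, which is consistent with the 12-per-track and 180-per-cylinder figures given earlier.

```python
def ca_capacity(ci_size, cis_per_track, tracks_per_cyl=15):
    """Return (CIs per one-cylinder CA, data capacity in KB)."""
    cis_per_ca = cis_per_track * tracks_per_cyl
    return cis_per_ca, cis_per_ca * ci_size // 1024

# (CI size in bytes, CIs per track) pairs taken from the text.
for ci_size, per_track in [(4096, 12), (16384, 3), (18432, 3)]:
    cis, kb = ca_capacity(ci_size, per_track)
    print(f"{ci_size:>6}-byte CIs: {cis:>3} CIs per CA, {kb}KB per cylinder")
```

The three cases come out to 180 CIs and 720KB, 45 CIs and 720KB, and 45 CIs and 810KB, matching the numbers in the text.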
It is often thought that shorter CIs are critical to direct performance. With today’s DASD subsystems, FICON and ESCON channels, and the ability to cache large numbers of records in main storage buffers, the gain in I/O response time from shorter CI sizes is much less important than it was just 10 years ago.
First, let’s look at the contents of a VSAM Control Area (see Figure 2). This illustration shows that two free CIs exist in the CA and can be used for CI Splits without the need for CA Split processing. When two CI Splits have occurred in this CA, there will be no free CIs and then a CA split will be needed.
VSAM’s insertion strategy, described previously, applies to CA Splits as well as CI Splits — VSAM attempts to keep a significant number of CIs together in a group to improve processing performance.
Unlike CI Splits, CA Splits can take a lot of processing and I/O resources. The general outline of CA Split processing is similar to CI Split processing. In a CA Split, VSAM performs the following steps:
- Writes the Sequence Set Index CI with a split-in-progress indicator set
- Formats a new CA at the end of the data set (based on the High Used RBA value in the catalog)
- Moves (about) half of the CIs to the new CA — this requires reads and writes of each CI being moved and can amount to hundreds of I/O operations with small CI sizes (4,096 bytes and smaller)
- Creates a new Sequence Set (low-level) Index record for the new CA
- Removes the CIs, which were moved from the old CA
- Updates the Sequence Set (low-level) Index record to reflect removal of the CIs
- Updates the higher-level index records as needed.
As you can see, this process required many I/O operations to complete — hundreds when many small Data CIs exist in each CA. Larger Data CIs (and fewer Data CIs per CA) will reduce the cost of CA Splits.
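A rough Python estimate shows why CI size matters so much here. This sketch counts only the reads and writes needed to move Data CIs, at one I/O per read and one per write. That is a simplifying assumption: it ignores the index updates and the formatting of the new CA, so treat the results as a lower bound rather than a measurement.

```python
def ca_split_data_ios(cis_per_ca):
    """Estimate the I/Os needed to move about half of a CA's Data CIs."""
    moved = cis_per_ca // 2            # about half of the CIs are moved
    return 2 * moved                   # one read plus one write per moved CI

print(ca_split_data_ios(180))          # 4,096-byte CIs, 180 per CA
print(ca_split_data_ios(45))           # 16,384-byte CIs, 45 per CA
```

Under these assumptions, a CA of 180 small CIs needs about 180 I/Os to split, while a CA of 45 larger CIs needs about 44.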
A completed split, in contrast, has only a small impact on subsequent processing. A significant amount of data is still grouped into each of the (now two) CAs. Direct processing, the primary processing method for online systems, is not affected, and sequential processing will only be slightly impacted. Most important, however, is the positive effect on additional insert activity in the same key value vicinity. There are now two CAs, each with approximately half of its CIs free. Many future inserts in this vicinity will at most require CI Splits, and many inserts will be accommodated in this new free space before another CA Split is required (see Figure 3).
REORGANIZATION MAY NOT BE RIGHT FOR YOUR FILE
The purpose of this article is to encourage you to examine your file reorganization strategy. If splits actually can help subsequent insert processing, you should not use the number of CI or CA Splits that have occurred as a trigger for reorganization, whether you do this manually or with a vendor VSAM enhancement product. VSAM file reorganization (or reloading the file) will:
- Squeeze out any additional free space that was created in the CIs and CAs of the file by split activity
- Move the records in CAs that CA Splits placed at the end of the file (for example, CA 45 as illustrated in Figure 3) back into physical sequence
- Repopulate the file with the initial free space defined in the DEFINE CLUSTER command.
If your file has any clustering of insert activity, reorganization may do more harm than good. Future clustered inserts may cause splits again in the same places. You may be able to avoid some of these additional splits by reorganizing frequently, but frequent reorganization is the very problem we started with: it consumes the batch processing window.
HOW CAN I TELL IF I HAVE CLUSTERED INSERTS?
I can think of two ways:
- Ask the application owner, designer, or programmer about the file's key values and insert activity. This is somewhat less than perfect, as real systems often work differently than was assumed during their design.
- Check LISTCAT or other statistics that show the number of CI and CA Splits that have occurred. LISTCAT statistics are cumulative and you need to see how many new splits of each type have occurred each day. In the statistics for files with heavily clustered inserts, you will see the number of new splits start at a low level (depending on the amount of distributed free space), increase to a higher level, and decline over time.
If you track these statistics over a period of two or more reorganization cycles, you may see that the total number of splits performed increases after reorganization, compared with the number the file would have experienced if it had not been reorganized.
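Because the LISTCAT split counts are cumulative, the daily figures have to be derived by subtraction. Here is a minimal Python sketch of that bookkeeping. The function name and the sample readings are hypothetical, and treating a drop in the counter as a restart (for instance, after the file is redefined during a reorganization) is an assumption you should verify against your own statistics.

```python
def new_splits_per_day(cumulative):
    """Turn cumulative split counts (one reading per day) into daily new splits."""
    daily = []
    for prev, curr in zip(cumulative, cumulative[1:]):
        # Assumption: a drop means the counter restarted (file was redefined).
        daily.append(curr - prev if curr >= prev else curr)
    return daily

# Hypothetical daily readings of the cumulative CI Split count.
readings = [0, 2, 9, 14, 16, 17]
print(new_splits_per_day(readings))    # [2, 7, 5, 2, 1]
```

In this made-up series the daily count starts low, rises, and then declines, which is the pattern to look for in a file with heavily clustered inserts.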
Too-frequent reorganization may appear to reduce the number of splits required, but you need to weigh these savings against:
- The batch cycle and online unavailability costs during reorganization
- Excess disk space used by unusable distributed free space throughout a file with clustered insert activity
- CPU, I/O, and other system resources expended in the reorganization.
In many cases, you will find you are reorganizing files that must be backed up, but that do not need to be reorganized. Retaining the free space you paid for through CI and CA Split processing will be a better plan if the file has clustered insertion activity. Eliminating unproductive reorganization processing can shorten the batch window and result in higher online availability. Eliminating unusable distributed free space throughout a file when insert activity is clustered can also save significant amounts of disk space.