IT Management

“PLATUNE”: Taking Back the Data Center


Sometime later, Meyer tickled his covert assets and discovered a developer plot to probe our perimeter for weaknesses. He captured and interrogated a programmer trainee who revealed that developers were running clandestine jobs under alternate user IDs by way of the CA-7/Personal Scheduling product. These people love anarchy like they love to breathe. Their actions camouflaged the job queues, making target identification difficult for the scheduling department, which is responsible for tracking these queues. In response, we took advantage of a field in our ACF2 user LIDREC (Logon ID RECord; the anchor point for each defined user). In our shop, jobs submitted from the CA-7/Personal Scheduling product all have similar job names and user IDs, such as UCC####X. The UCC#### LIDREC contains a locally defined field where we store an alternate user ID. This alternate user ID matches a person’s normal working user ID. We laid down a smoke screen and, using mirrors, coded ThruPut Manager Exit 1 to extract this field and store it in a ThruPut Manager variable. Now, UCC jobs are tagged with the INITS.userid agent using the alternate user ID, closing the bulge in our front lines.
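
For the armchair tacticians, here is a rough sketch of the maneuver in Python rather than the exit’s native form. The LIDREC layout, the sample UCC entries, and the function name are illustrative assumptions, not ThruPut Manager or ACF2 interfaces.

    import re
    from typing import Optional

    # Hypothetical stand-ins for the ACF2 LIDREC lookup; the locally defined
    # field simply maps a scheduler user ID to a person's working user ID.
    LIDRECS = {
        "UCC0042": {"alt_userid": "JSMITH"},
        "UCC0107": {"alt_userid": "MBROWN"},
    }

    UCC_USERID = re.compile(r"^UCC\d{4}$")

    def inits_agent(userid: str) -> Optional[str]:
        """Return the INITS agent to tag onto a scheduler-submitted job.

        Mirrors the effect of our Exit 1 code: pull the alternate user ID
        out of the submitting LIDREC and tag the job as if that person had
        submitted it.
        """
        if not UCC_USERID.match(userid):
            return None                      # not a scheduler-submitted job
        lidrec = LIDRECS.get(userid)
        if lidrec is None:
            return None                      # no alternate user ID on file
        return "INITS." + lidrec["alt_userid"]

    print(inits_agent("UCC0042"))   # INITS.JSMITH
    print(inits_agent("TSOUSER1"))  # None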

After a brief furlough, we mounted our invasion of the tape library. Our earlier intelligence work and experience with initiator management revealed that our JLS agent maneuver would be extremely effective in the close combat that is tape management. We devised a tactic to tag jobs with JLS agents such as MAX3490.userid, MAX3590.userid, and MAXVIRT.userid, representing our 3490, 3590, and virtual tape pools, respectively. The agent weights were set to the maximum number of tape units of each type used in a job step. Thresholds for these agents were set to allow the user a maximum of eight concurrent 3490 tape units, four concurrent 3590 tape units, and 12 concurrent virtual tape units. We made a leaflet drop announcing the new standard and then fired for effect. We fought from silo to silo, but we didn’t lose a single drive unit. Occasional tweaking of agent thresholds is required when belligerents request additional resources for special-forces projects; this is a simple matter of issuing console commands.
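
To show the arithmetic of the maneuver, the following Python model gates a job on agent weights and thresholds. Only the eight/four/12 thresholds come from our standard; the data structures and function names are illustrative, not JLS syntax.

    from collections import defaultdict

    # Per-user ceilings from our standard: 8 concurrent 3490 units, 4
    # concurrent 3590 units, and 12 concurrent virtual tape units.
    THRESHOLDS = {"MAX3490": 8, "MAX3590": 4, "MAXVIRT": 12}

    # Agent weights already held by executing jobs, keyed by (pool, userid).
    in_use = defaultdict(int)

    def agent_weights(job_steps):
        """Weight each tape-pool agent by the largest unit count in any one step."""
        weights = defaultdict(int)
        for step in job_steps:
            for pool, units in step.items():
                weights[pool] = max(weights[pool], units)
        return dict(weights)

    def can_initiate(userid, job_steps):
        """True if starting the job keeps the user under every pool threshold."""
        return all(
            in_use[(pool, userid)] + weight <= THRESHOLDS[pool]
            for pool, weight in agent_weights(job_steps).items()
        )

    # A two-step job needing three 3490 drives, then two 3590 drives.
    job = [{"MAX3490": 3}, {"MAX3590": 2}]
    print(agent_weights(job))           # {'MAX3490': 3, 'MAX3590': 2}
    print(can_initiate("JSMITH", job))  # True while the user is under threshold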

VIVE LA LIBRAIRIE DE CARTOUCHE!

Having acquired command and control of the tape hardware, we now focused our efforts on liberating the tape librarians. Taking this objective would require overwhelming force, heavy ordnance . . . a bunker buster. Tape jobs typically request and mount tapes “by the ones.” Each tape mount has a corresponding tape fetch, where someone or something must retrieve the stored tape volume. Modern robotic hardware minimizes fetch time when the volume is located inside the silo. However, when the volume is located outside the silo, slow human intervention is necessary. Most shops cannot afford to automate every volume, and even the most astute use of silo content management techniques cannot ensure constant, cost-effective locality of reference. Manual tape handling is an unavoidable random access procedure.

We deployed the ThruPut Manager Robotics Setup Services (RSS) component to free the tape librarians. RSS ensures that all tape volumes necessary for the execution of each job are located where they are required prior to job initiation. Directed by console displays and/or picking lists, tape librarians fetch and enter tape volumes into the robotic silos. When all volumes are properly located, the job requiring them is allowed to initiate. RSS changes the manual, random access tape fetch into a sequential, batch-like operation, significantly improving the efficiency of the tape librarians.
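
The bunker buster, reduced to a sketch: a conceptual Python model of the prestaging idea, assuming a simple picking-list check. The function and variable names are ours, not the RSS interface.

    # Gather every volume a job needs, hand the librarian one consolidated
    # picking list for the volumes still outside the silo, and release the
    # job only when all of them are inside.

    def picking_list(required_volumes, volumes_in_silo):
        """Volumes the librarian must fetch and enter before the job may run."""
        return sorted(set(required_volumes) - set(volumes_in_silo))

    def ready_to_initiate(required_volumes, volumes_in_silo):
        return not picking_list(required_volumes, volumes_in_silo)

    job_volumes = ["VOL001", "VOL002", "VOL003", "VOL004"]
    silo_contents = {"VOL001", "VOL003"}

    print(picking_list(job_volumes, silo_contents))       # ['VOL002', 'VOL004']
    print(ready_to_initiate(job_volumes, silo_contents))  # False until both are entered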

The reduction of tape mount time to robotic speed minimizes job execution time. The robot immediately mounts the volumes as requested, avoiding any intervening manual operation to fetch the tape from the library stacks and enter it. The more volumes a job requires, the greater the improvement. For instance, the execution time of a 20-volume job is easily reduced by 20 to 40 minutes by prestaging tape volumes (roughly one to two minutes of manual fetch time saved per volume). Jobs fire through volumes like they were magazine-fed, instead of single-shot.

RSS also minimizes the amount of time that tape units are allocated and idle, thus maximizing physical and virtual tape hardware utilization. Expensive tape units no longer sit waiting for a tape librarian, who is off using the latrine, to fetch and enter the next volume requested.

THE BIG PUSH

This was the turning point of our campaign. We were on the outskirts of batch supremacy. We were still outnumbered, but they were a largely ineffective and poorly led force. We had limited their ability to move jobs freely through the JES2 subsystem, while we could run jobs on our terms, at the place and time of our choosing. We had reduced the enemy to singular guerrilla incursions, but they were still as dangerous as an unexploded mortar round. They had read-access to production data sets, and they could still cripple us with allocation waits.

We began our big push on a cold Monday morning. We flew in low, under their radar. We had a secret weapon: We would hit them without their ever knowing it. We deployed a package consisting of ThruPut Manager Dataset Contention Services (DCS) and CA-OPS/MVS automated operations scripts. We delivered the one-two punch.

DCS controls and prioritizes access to data sets by batch jobs. It ensures that no job initiates unless it can allocate all required data sets. DCS has three rules of engagement: Standby, Contend, and Claim. We use Standby service for nonproduction jobs and Contend service for production jobs. DCS is cognizant of a job’s data set requirements from analysis of the job’s JCL. It learns data set names and whether shared or exclusive control of the data set is required. Jobs with Standby service are allowed to run only when all data sets are available. Contend service is more aggressive. When a Contend service job requires exclusive control of a data set and that data set is in use, DCS dynamically queues itself for exclusive control of the data set on behalf of the job. When DCS receives control of the data set, it allows the job to initiate, passing control of the data set to the job during its allocation phase.
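
A rough Python model of the two service levels we use follows; the data set states, the queueing, and every name in it are simplified illustrations of the idea, not the DCS implementation.

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class DataSet:
        exclusive_holder: Optional[str] = None
        shared_holders: set = field(default_factory=set)
        wait_queue: list = field(default_factory=list)  # jobs queued for exclusive use

        def available(self, exclusive: bool) -> bool:
            if exclusive:
                return self.exclusive_holder is None and not self.shared_holders
            return self.exclusive_holder is None

    def standby_can_initiate(requirements, catalog):
        """Standby service: run only when every required data set is free right now."""
        return all(catalog[name].available(excl) for name, excl in requirements)

    def contend_request(jobname, requirements, catalog):
        """Contend service: queue for busy exclusive data sets on the job's behalf."""
        if standby_can_initiate(requirements, catalog):
            return "INITIATE"
        for name, exclusive in requirements:
            if exclusive and not catalog[name].available(True):
                catalog[name].wait_queue.append(jobname)
        return "QUEUED"

    # PROD.MASTER is held exclusively by a test job, so the production job queues.
    catalog = {"PROD.MASTER": DataSet(exclusive_holder="TESTJOB1")}
    print(standby_can_initiate([("PROD.MASTER", True)], catalog))         # False
    print(contend_request("PRODJOB1", [("PROD.MASTER", True)], catalog))  # QUEUED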

DCS has a sentry option to alert HQ when a job initiation is delayed for data set contention. We enable this option for production jobs. When a production job initiation is delayed, DCS barks a multi-line alert message to the console. This message indicates the name of the delayed job, the data set name(s) in contention, and the holder(s) of the contended data sets. OPS/MVS monitors this message. It scans the data set holder information and determines whether the holder(s) are production or nonproduction. We program OPS/MVS to automatically cancel any nonproduction job that delays a production job. As part of the cancellation process, we post messages in the cancelled job’s joblog, indicating why the job was cancelled and the pertinent data set name. This product synergy is our missile defense system. Any problem allocation that is lobbed at us is identified and immediately shot down.
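
The rule itself is short. The sketch below expresses it in ordinary Python rather than the OPS/MVS rule language; the alert fields, the production-job naming test, and the cancel/joblog hooks are assumptions for illustration.

    def is_production(jobname: str) -> bool:
        return jobname.startswith("P")      # assumed local naming convention

    def handle_contention_alert(delayed_job, dataset, holders, cancel, joblog):
        """Cancel any nonproduction job that is delaying a production job."""
        if not is_production(delayed_job):
            return                          # only production jobs get air cover
        for holder in holders:
            if not is_production(holder):
                cancel(holder)
                joblog(holder, "Cancelled: holding " + dataset +
                               " needed by " + delayed_job)

    handle_contention_alert(
        delayed_job="PBILL01",
        dataset="PROD.BILLING.MASTER",
        holders=["TUSER12"],
        cancel=lambda job: print("cancel issued for", job),
        joblog=lambda job, msg: print(job, "joblog:", msg),
    )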

We also enabled a DCS option to order TSO users to free allocated data sets when they cause contention. We use the NAG option, which issues a stiff diplomatic message to TSO users, indicating their culpability in the contention event. It specifies the contended data set name and the name of the job requiring the data set. We extract information from the ACF2 LIDREC and augment the message with the name and phone number of the job submitter. This makes the contention event hand-to-hand combat between users, taking us away from the hostilities.
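
Something like the following shows how the augmented message might be assembled; the LIDREC field names, phone format, and wording are assumed for illustration, not the actual NAG output.

    LIDRECS = {
        "JSMITH": {"name": "Jane Smith", "phone": "x4821"},
    }

    def nag_message(tso_user, dataset, waiting_job, submitter):
        who = LIDRECS.get(submitter, {"name": submitter, "phone": "unknown"})
        return (tso_user + ": you are holding " + dataset + ", which job " +
                waiting_job + " needs. Please free it, or contact " +
                who["name"] + " (" + who["phone"] + ").")

    print(nag_message("TUSER12", "PROD.BILLING.MASTER", "PBILL01", "JSMITH"))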

DEBRIEFING

We achieved all of our political objectives:

  • We ended the anarchical control of initiators and tape units by aggressive users via judicious use of JLS limiting agents. All users now receive fair and equal access to computing resources. Control is fully automatic, yet flexible when circumstances dictate.
  • We liberated the tape librarians and saved their valuable natural resources by deploying RSS. Thus, tape operators are more efficient, job elapsed time is minimized, and the return on our tape hardware investment is maximized.
  • Coalition forces secured and protected our production data sets from marauders through a combination of OPS/MVS and DCS. Contention waits are reduced to near zero. Production data sets are now available to test jobs whenever possible, as long as they do not interfere with production operation. Automation has proved itself a valuable ally in these joint operations.

Management supplied us with the world’s best training and software, and we made the best use of it. I now have stories to tell my grandchildren. When they ask, “What did you do in the war, Grandpa?” I don’t have to say, “I sat in a stinking cubicle and watched men die of thirst, waiting all day for their jobs to finish.” I can tell them we took back our data center!

This article is based on a paper and presentation that will be given at CMG 2003 in Dallas, December 7-12.
