Jan 1 ’04

“PLATUNE”: Taking Back the Data Center

by Editor in z/Journal

It was the worst ever. Month-end was here and they had laid siege to us. Egad! Now I know how they felt at Dunkirk. We fell back to the machine room. They charged, lobbing jobs at us at horrific rates. “Who is it?” cried management. “It’s the ad hoc users,” we replied, “and a cohort of developers on our flank!” Management turned and ran. “OK, men,” I rallied. “It’s up to us now. We’re going to take back the data center!”

This article examines how a squad of performance management specialists organized and streamlined batch operations using automation and other vendor products.

OVERRUN

“You guys hold this position,” I shouted over the crushing din. “I’m going out on reconnaissance.” I dashed out of the machine room, where we had holed up, and made my way to our office area. A phone went off near me; instinctively, I hit the deck. I crawled over to it and checked the caller ID. Sacred bovine! It was Bob Zimway, one of our ad hoc power users. I gritted my teeth and lifted the receiver off the hook as if I were disarming a booby trap. “Hello,” I croaked into the phone. He instantly commenced his verbal assault. I did my best to take the call, cramming my words in between his undeleted expletives. “Yes, Mr. Zimway, I know who you are,” I grated. He went off again like a phosphorus grenade. “We’re doing the best we can,” I rasped. He was having none of it. “A group of actuaries launched a surprise attack early this morning and submitted a whole raft of long-running jobs.” I was shouting by now. “There was nothing we could do!” I heard him fill his lungs deeply with a witty rejoinder, so I cut him off. “We’ll review the situation and do what we can,” I said, and hung up the phone as hard as I dared.

I quickly checked z/OS’ System Display and Search Facility (SDSF), and TMON, and then dodged my way back to the machine room to report. I signaled a huddle and my fellow squad members converged on my position. There was Sgt. Vincent, management liaison and group leader; Jurgenson, our electronic surveillance specialist, qualified on all systems monitors; Meyer, counterintelligence expert, who worked undercover in development back in ’88—he knew the deepest recesses of their minds; and my buddy, Brewer; he defected from open systems a few years ago. Fortunately, he got out in time, before any real damage was done. I was on point that day.

“It’s bad,” I spat out, gasping for air. “The queues are backed up with jobs from all departments. It looks like most of the initiators are tied up with long-running jobs the actuaries submitted this morning. If any more jobs like that initiate . . . throughput will go to zero.” We gave each other hard looks as I caught my breath. “There’s more,” I said in a warning tone. My team studied me intently. “They got the tape drives . . . all of them! They never had a chance.”

These were seasoned veterans, but this was hard to take. “Are there allocation waits?” Sgt. Vincent asked, dourly. I couldn’t look him in the eye. “Yes, Sarge,” I replied quietly to the raised floor. No one spoke for a few minutes. You could cut the tension with a hacksaw. When young Brewer suddenly broke the silence, his words tore the air. “We can’t let them do this!” he barked. “No, we can’t,” agreed Vincent. Jurgenson and Meyer perked up. “It’s time to prepare for battle!” Vincent ordered with savage determination. Suddenly, the room became bright with hope. “I have some ideas I want to work on,” I said. “I’ll check TMON and see what I can do to get some more cycles back,” Jurgenson said. “And I’ll make a few phone calls,” Meyer added. With shoulders back and chins up, we strode out of the machine room and returned to our burlap bunkers, filled with a sense of purpose. We had a mission.

REGROUP

Sgt. Vincent promptly ordered our squad out on patrol and we discovered a number of problems. Several opportunistic actuaries had submitted large numbers of jobs at dawn, monopolizing the initiators. Meyer’s intelligence report indicated they entered the office under the cover of darkness. It was hard not to admire such a dedicated adversary. Many of these jobs were long-running. Once a long-running job started, it rendered the initiator ineffective for processing other jobs. One by one, initiators fell to invading forces and their occupation began in earnest. Jobs piled up behind them like refugees at a border crossing. We were hard-pressed to even react to this dastardly tactic. Our only option was an end-run around Workload Manager (WLM), manually starting jobs with the JES2 $SJ command. This is an effective countermeasure, but not one we want to use too often: we run WLM-controlled initiators precisely so that WLM can optimize initiator management, and every manual start second-guesses it.

Initial media reports indicated that the actuaries had taken control of all tape resources, but these reports were generalized speculations, foisted upon us by embedded news correspondents. Brewer infiltrated the tape library to serve as forward observer and discovered that developers were also conducting extensive data recovery operations. Their jobs captured and controlled numerous tape units. Brewer also noted that the local citizens were conscripted to fetch countless tapes one at a time, dooming them to hopeless drudgery. He could see no way to free them without a full frontal assault that was certain to result in massive job cancellations.

Jurgenson made extensive use of his ASG-TMON electronic surveillance equipment and discovered that the initial, swift progress of many jobs had been summarily halted. His full-bore drilldown indicated significant data set contention between developer jobs. He made generous use of his RMF Monitor III to confirm his sighting. He also reported that hostilities had spilled over into the production area, causing delays in critical production batch processing, thus destabilizing the online regions. Jobs in the production area are plainly noncombatants. What did we have to do, rename them all UN* something? Don’t these barbaric developers have any respect for accepted codes of conduct? Our Shop Standards Manual, regulation UR2-L8, clearly states: “No nonproduction job shall allocate any production data set during such time that said production data set may be allocated for exclusive use by aforementioned production job.” There would be no avoiding casualties this day.

WAR PLANS

Once all bogies were identified and isolated, strategic command formed a task force to study the problems and devise plans of attack. Our initial thrust would be to out-flank the actuaries and regain control of the initiators. Once this ground was secured, we would advance on the tape library, scatter the developers, and liberate the local inhabitants. Our final push would be to secure and protect our production data sets. Supreme command issued strict orders that production data sets would be protected at all costs. This ultimate objective was critical to the overall success of our mission.

We recognized that the enemy had deeply infiltrated the company. It could be anyone: the polite fellow guarding the water cooler; the nice lady patrolling the hallways. Friend or foe? It was impossible to tell them apart just by looking at them. We needed intelligence, and we needed it fast. We turned to our CA NeuMICS database for information, where we kept detailed dossiers on every user and job in the system. We began mining data to separate the decent citizens from the irregular militia from the subversive groups. We were hunting for jobs of mass consumption.

Examining history data, we learned the routines of our users. We categorized their jobs by resource requirements and arrival rate. We classified jobs by CPU time. We grouped jobs by tape requirements. We knew more about our users’ habits than they themselves knew. Given all this information, we rendered our target coordinates. We knew just where to strike and how. We aimed our photonic cannon and began to fire!
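For the armchair strategists, the mining reduces to simple bucketing. Here is a rough Python sketch of the idea; the record fields and the five-mount tape cutoff are our own assumptions, not anything CA NeuMICS prescribes, and the 13-second CPU cutoff is the one that reappears in our job-limiting scheme below.

    # Hypothetical sketch: bucket each job from the history database by
    # CPU time and tape usage to find the jobs of mass consumption.
    # The field names and the five-mount cutoff are invented stand-ins.

    def classify(job_history):
        """job_history: dicts with 'cpu_seconds' and 'tape_mounts'."""
        buckets = {"fast": [], "long": [], "tape-heavy": []}
        for job in job_history:
            if job["tape_mounts"] >= 5:          # assumed cutoff
                buckets["tape-heavy"].append(job)
            if job["cpu_seconds"] <= 13:         # "fast" job cutoff
                buckets["fast"].append(job)
            else:
                buckets["long"].append(job)
        return buckets            # a job may land in two buckets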

COUNTER-ATTACK!

Actually, we don’t have a photonic cannon. But we do have a howitzer in our arsenal. It’s the mother of all batch-tuning tools—a product named ThruPut Manager from MVS Solutions.

We launched our first counter-strike using ThruPut Manager Job Limiting Services (JLS), with a scheme to control the number of concurrent initiators a user can occupy. JLS provides job-limiting agents, which can be used to represent system resources. A JLS agent has a threshold value associated with it. One or more JLS agents can be tagged to a job. When a JLS agent is tagged to a job, a weight is assigned, which represents that job’s use of the agent. JLS will allow a job to initiate only when the job’s added weight does not cause the agent’s threshold to be exceeded. For our initiator management strategy, each job was assigned a JLS agent, such as INITS.userid, with a weight of one and a conservative threshold of three. This effectively limited or capped the number of jobs a user could run concurrently to three.
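For those who want to sandbox the tactic before committing troops, here is a minimal Python sketch of the job-limiting idea. This is not ThruPut Manager code or syntax, only the admission logic it implements; the class and function names are ours, and the user ID is an invented stand-in.

    # Hypothetical sketch of JLS-style job limiting: each agent has a
    # threshold, each job carries (agent, weight) tags, and a job may
    # initiate only if no tagged agent's threshold would be exceeded.

    class JobLimiter:
        def __init__(self, thresholds):
            self.thresholds = thresholds              # agent name -> threshold
            self.in_use = {a: 0 for a in thresholds}  # summed running weights

        def try_initiate(self, tags):
            """tags: list of (agent_name, weight). True if the job may start."""
            if any(self.in_use[a] + w > self.thresholds[a] for a, w in tags):
                return False                          # job stays queued
            for a, w in tags:
                self.in_use[a] += w
            return True

        def terminate(self, tags):
            for a, w in tags:
                self.in_use[a] -= w

    # The opening scheme: INITS.userid at weight one, threshold three.
    jls = JobLimiter({"INITS.ZIMWAYB": 3})
    tags = [("INITS.ZIMWAYB", 1)]
    started = [jls.try_initiate(tags) for _ in range(5)]
    assert started == [True, True, True, False, False]  # jobs 4 and 5 wait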

So far, our operation proved quite successful. A few minor political skirmishes flared up, but they were quickly extinguished. Most people agreed that three concurrent jobs were fair and generally adequate. We received some complaints that once three long-running jobs initiated, the user was unable to do any other work—even a quick print job. This seemed a reasonable complaint, so we added another JLS agent named LARGEJOB.userid with a threshold of two. This agent was tagged depending upon the job’s CPU requirements. “Fast” jobs (those requiring 13 CPU seconds or less) did not receive this agent; all other jobs were tagged with it. The LARGEJOB.userid agent ensured that a user’s short-running jobs would always be processed in a timely manner. After this scheme burned in, we were able to increase the INITS.userid and LARGEJOB.userid thresholds to five and four, respectively.
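The tagging policy itself boils down to a few lines. Again a hedged sketch rather than product syntax; only the agent naming and the 13-second cutoff come from our scheme, and the user ID is an invented stand-in.

    # Hypothetical tagging policy: every job carries INITS.userid at
    # weight one; only jobs over 13 CPU seconds also carry
    # LARGEJOB.userid, so short work always finds an open slot.

    FAST_CPU_SECONDS = 13

    def tag_user_agents(userid, est_cpu_seconds):
        """Return (agent_name, weight) tags for one job."""
        tags = [(f"INITS.{userid}", 1)]
        if est_cpu_seconds > FAST_CPU_SECONDS:
            tags.append((f"LARGEJOB.{userid}", 1))
        return tags

    # A two-second print job initiates even with three long jobs running:
    assert tag_user_agents("ZIMWAYB", 2) == [("INITS.ZIMWAYB", 1)]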

SECURE THE PERIMETER

Sometime later, Meyer tickled his covert assets and discovered a developer plot to probe our perimeter for weaknesses. He captured and interrogated a programmer trainee who revealed that developers were running clandestine jobs under alternate user IDs by way of the CA-7/Personal Scheduling product. These people love anarchy like they love to breathe. Their actions propagandized the job queues, making target identification difficult for the scheduling department, which is responsible for tracking these queues. In response, we took advantage of a field in our ACF2 user LIDREC (Logon ID RECord; the anchor point for each defined user). In our shop, jobs submitted from the CA-7/Personal Scheduling product all have similar job names and user IDs, such as UCC####X. The UCC#### LIDREC contains a locally defined field where we store an alternate user ID. This alternate user ID matches a person’s normal working user ID. We laid down a smoke screen and, using mirrors, coded ThruPut Manager Exit 1 to extract this field and store it in a ThruPut Manager variable. Now, UCC jobs are tagged with the INITS.userid agent using the alternate user ID, which closed the bulge in our front lines.
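Stripped of the smoke and mirrors, the logic of the exit looks roughly like this. A Python sketch only: Exit 1 is written against ThruPut Manager’s own interfaces, the LIDREC field is locally defined, and the UCC0042/ZIMWAYB values below are invented stand-ins.

    # Hypothetical sketch of the Exit 1 idea: when a job arrives under a
    # scheduler user ID (UCC####), look up the locally defined alternate
    # user ID from the ACF2 LIDREC and tag agents with that ID instead.

    import re

    # Stand-in for the ACF2 LIDREC lookup; the field is locally defined.
    LIDREC_ALTERNATE_USERID = {"UCC0042": "ZIMWAYB"}

    def effective_userid(submitting_userid):
        if re.fullmatch(r"UCC\d{4}", submitting_userid):
            return LIDREC_ALTERNATE_USERID.get(submitting_userid,
                                               submitting_userid)
        return submitting_userid

    # A UCC job is now capped against its human owner's INITS agent:
    assert effective_userid("UCC0042") == "ZIMWAYB"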

After a brief furlough, we mounted our invasion of the tape library. Our earlier intelligence work and experience with initiator management revealed that our JLS agent maneuver would be extremely effective in the close combat that is tape management. We devised a tactic to tag jobs with JLS agents such as MAX3490.userid, MAX3590.userid, and MAXVIRT.userid, representing our 3490, 3590, and virtual tape pools, respectively. The agent weights were set to the maximum number of tape units of each type used in a job step. Thresholds for these agents were set to allow the user a maximum of eight concurrent 3490 tape units, four concurrent 3590 tape units, and 12 concurrent virtual tape units. We made a leaflet drop announcing the new standard and then fired for effect. We fought from silo to silo, but we didn’t lose a single drive unit. Some occasional tweaking of agent thresholds is required when belligerents request additional resources for special-forces projects. This is a simple matter of issuing console commands, when necessary.
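In sketch form, the tape capping works like this; the pool names and thresholds are the real ones from our scheme, while the step-requirement format and user ID are our own invention.

    # Hypothetical sketch: one agent per tape pool per user, weighted by
    # the most units of that type any single job step allocates.

    TAPE_THRESHOLDS = {"MAX3490": 8, "MAX3590": 4, "MAXVIRT": 12}

    def tape_tags(userid, steps):
        """steps: per-step unit counts, e.g. [{"MAX3590": 2}, {"MAX3590": 4}]."""
        tags = []
        for pool in TAPE_THRESHOLDS:
            weight = max((step.get(pool, 0) for step in steps), default=0)
            if weight:
                tags.append((f"{pool}.{userid}", weight))
        return tags

    # A job whose biggest step mounts four 3590 units ties up the user's
    # entire 3590 allowance while it runs:
    assert tape_tags("ZIMWAYB", [{"MAX3590": 2}, {"MAX3590": 4}]) == \
           [("MAX3590.ZIMWAYB", 4)]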

VIVE LA LIBRAIRIE DE CARTOUCHE!

Having acquired command and control of the tape hardware, we now focused our efforts on liberating the tape librarians. Taking this objective would require overwhelming force, heavy ordnance . . . a bunker buster. Tape jobs typically request and mount tapes “by the ones.” Each tape mount has a corresponding tape fetch, where someone or something must retrieve the stored tape volume. Modern robotic hardware minimizes fetch time when the volume is located inside the silo. However, when the volume is located outside the silo, slow human intervention is necessary. Most shops cannot afford to automate every volume, and even the most astute use of silo content management techniques cannot ensure constant, cost-effective locality of reference. Manual tape handling is an unavoidable random access procedure.

We deployed the ThruPut Manager Robotics Setup Services (RSS) component to free the tape librarians. RSS ensures that all tape volumes, necessary for the execution of each job, are located where they are required prior to job initiation. Directed by console displays and/or picking lists, tape librarians fetch and enter tape volumes into the robotic silos. When all volumes are properly located, the job requiring them is allowed to initiate. RSS changes the manual, random access tape fetch into a sequential batch-like operation, significantly improving the efficiency of the tape librarians.
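The gating idea is simple enough to sketch in a few lines of Python; the volume serials are invented, and the real RSS naturally does a great deal more (console displays, picking lists, silo awareness).

    # Hypothetical sketch of the RSS idea: hold each tape job until every
    # volume it needs is inside a silo; batch the rest onto a picking list.

    def prestage(job_volumes, silo_contents):
        """Return (ready, picking_list) for one job."""
        picking_list = sorted(set(job_volumes) - silo_contents)
        return (not picking_list, picking_list)

    silo = {"A00001", "A00002"}
    ready, to_fetch = prestage(["A00001", "A00003", "A00004"], silo)
    # ready is False; librarians fetch A00003 and A00004 in one trip,
    # then the job initiates and every mount runs at robotic speed.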

The reduction of tape mount time to robotic speed minimizes job execution time. The robot immediately mounts the volumes as requested, avoiding any intervening manual operation to fetch the tape from the library stacks and enter it. The more volumes a job requires, the greater the improvement. For instance, the execution time of a 20-volume job is easily reduced by 20 to 40 minutes by prestaging tape volumes (a minute or two of manual fetch time saved per mount). Jobs fire through volumes like they were magazine-fed, instead of single shot.

RSS also minimizes the amount of time that tape units are allocated and idle, thus maximizing physical and virtual tape hardware utilization. Expensive tape units no longer sit, waiting for a tape librarian, who is off using the latrine, to fetch and enter the next volume requested.

THE BIG PUSH

This was the turning point of our campaign. We were on the outskirts of batch supremacy. We were still outnumbered, but they were a largely ineffective and poorly led force. We had limited their ability to move jobs freely through the JES2 subsystem, while we could run our jobs on our own terms, at the place and time of our choosing. We had reduced the enemy to isolated guerrilla incursions, but they were still as dangerous as an unexploded mortar round. They had read-access to production data sets, and they could still cripple us with allocation waits.

We began our big push on a cold Monday morning. We flew in low, under their radar. We had a secret weapon: We would hit them without their ever knowing it. We deployed a package consisting of ThruPut Manager Dataset Contention Services (DCS) and CA-OPS/MVS automated operations scripts. We delivered the one-two punch.

DCS controls and prioritizes access to data sets by batch jobs. It ensures that no job initiates unless it can allocate all required data sets. DCS has three rules of engagement: Standby, Contend, and Claim. We use Standby service for nonproduction jobs and Contend service for production jobs. DCS is cognizant of a job’s data set requirements from analysis of the job’s JCL. It learns data set names and whether shared or exclusive control of the data set is required. Jobs with Standby service are allowed to run only when all data sets are available. Contend service is more aggressive. When a Contend service job requires exclusive control of a data set and that data set is in use, DCS dynamically queues itself for exclusive control of the data set on behalf of the job. When DCS receives control of the data set, it allows the job to initiate, passing control of the data set to the job during its allocation phase.
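A simplified sketch of the two services as we use them; only exclusive-versus-anything conflicts are modeled, and none of this is DCS’s actual implementation.

    # Hypothetical sketch: Standby jobs wait until every data set is
    # free; Contend jobs queue for busy data sets and initiate once
    # control is handed over at deallocation.

    from collections import deque

    class DataSet:
        def __init__(self, name):
            self.name = name
            self.holder = None        # job currently holding the data set
            self.waiters = deque()    # Contend-service jobs queued for it

    def standby_may_start(needs):
        """needs: list of DataSet. True only if all are free right now."""
        return all(ds.holder is None for ds in needs)

    def contend(job, needs):
        """Queue the job on busy data sets; start only if none are busy."""
        busy = [ds for ds in needs if ds.holder is not None]
        for ds in busy:
            ds.waiters.append(job)
        return not busy

    def release(ds):
        """On deallocation, pass control to the next queued Contend job."""
        ds.holder = ds.waiters.popleft() if ds.waiters else None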

DCS has a sentry option to alert HQ when a job initiation is delayed for data set contention. We enable this option for production jobs. When a production job initiation is delayed, DCS barks a multi-line alert message to the console. This message indicates the name of the delayed job, the data set name(s) in contention, and the holder(s) of the contended data sets. OPS/MVS monitors this message. It scans the data set holder information and determines whether each holder is production or nonproduction. We program OPS/MVS to automatically cancel any nonproduction job that delays a production job. As part of the cancellation process, we post messages in the cancelled job’s joblog, indicating why the job was cancelled and the pertinent data set name. This product synergy is our missile defense system. Any problem allocation that is lobbed at us is identified and immediately shot down.
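Our actual rule is written against the real alert text in OPS/MVS; this Python sketch only models the decision it makes, and every name in it is a stand-in.

    # Hypothetical sketch of the automation: when DCS reports a delayed
    # production job, cancel any nonproduction holder and leave a note
    # in its joblog explaining why.

    def on_dcs_alert(delayed_job, contentions, is_production, cancel, log):
        """contentions: list of (dataset_name, holder_job) pairs."""
        if not is_production(delayed_job):
            return                         # we only defend production jobs
        for dsname, holder in contentions:
            if not is_production(holder):
                cancel(holder)             # shoot down the offender
                log(holder, f"Cancelled: delayed production job "
                            f"{delayed_job} on {dsname}")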

We also enabled a DCS option to order TSO users to free allocated data sets when they cause contention. We use the NAG option, which issues a stiff diplomatic message to TSO users, indicating their culpability in the contention event. It specifies the contended data set name and the name of the job requiring the data set. We extract information from the ACF2 LIDREC and augment the message with the name and phone number of the job submitter. This makes the contention event hand-to-hand combat between users, taking us away from the hostilities.
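Composing the augmented message is the easy part. A sketch, with an assumed shape for the LIDREC lookup:

    # Hypothetical sketch: DCS supplies the data set and requesting job;
    # we add the submitter's name and phone from the ACF2 LIDREC.

    def nag_message(tso_user, dsname, job_name, submitter, lidrec):
        contact = lidrec[submitter]        # assumed lookup shape
        return (f"{tso_user}: you hold {dsname}, which job {job_name} "
                f"needs. Please free it. Contact {contact['name']} "
                f"at {contact['phone']}.")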

DEBRIEFING

We achieved all of our political objectives: we regained control of the initiators, liberated the tape library and its librarians, and secured our production data sets.

Management supplied us with the world’s best training and software, and we made the best use of it. I now have stories to tell my grandchildren. When they ask, “What did you do in the war, Grandpa?” I don’t have to say, “I sat in a stinking cubicle and watched men die of thirst, waiting all day for their jobs to finish.” I can tell them we took back our data center!

This article is based on a paper and presentation that will be given at CMG 2003 in Dallas, December 7-12. Z