Aug 16 ’10

Performance Management Essentials to IT Success

by Editor in z/Journal

The Airline Tariff Publishing Company (ATPCO)—the Washington, DC-based central clearinghouse and publisher of airline fares and fare-related data—has been focusing on controlling costs to ensure customer satisfaction. Its IT division has concentrated on improving system performance, which reduces costs and frees up resources for new products and services.

ATPCO processes fare data from more than 500 airlines and sends that data to global distribution systems, computer reservations systems, and other travel-related companies around the world. ATPCO processes three times as many transactions as it did 10 years ago. In recent years, the company has also expanded the depth of its data, grown its customer base, and broadened its distribution. However, many of ATPCO’s product applications are more than 10 years old and no longer meet all customer requirements.

ATPCO is leveraging its existing z/OS technology and infrastructure while rolling out new Web-based products to meet customer needs. Essential to these efforts is avoiding unnecessary hardware acquisition costs and reducing software maintenance costs. The company determined that implementing effective performance management was one of its best options for improving system performance.

At ATPCO, performance management encompasses more than software; it involves multiple organizational units, including the performance team.

The two-member performance team includes a system performance specialist and capacity planner, who tracks performance and does forecasting, and an application specialist, who proactively develops solutions. The system performance specialist monitors and manages system resources to meet business needs and maintain Service-Level Agreements (SLAs). Workload Manager (WLM) policies, which distribute resources to high-priority jobs so they can perform at the expected level, support this process. Since WLM doesn’t understand or resolve the reason why some jobs require more resources, the application specialist is responsible for finding ways to reduce resource consumption.

Improving the performance of an existing architecture is challenging, perhaps even more so than designing new applications, yet challenges create opportunities that can lead to innovative ideas. Here’s how we handled those challenges from an application perspective.

Steps Taken

We identified critical processes based on business value and visibility and targeted those that would secure the most cost savings. The nature of our business makes the load on our system highly volatile, especially since we can’t control the daily activity. Our goal is good performance with any reasonable load factor. To achieve this, we identified bottlenecks in the critical paths of strategic and business processes and located specific areas for code optimization.

We looked at scalability, focusing on applications that perform well at low load factors but crawl at high load factors, and vice versa. We optimized code for different load factors, created distinct execution paths, and implemented self-tuning applications with intelligence built into control sections to choose the optimal path.
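To illustrate the idea (not our actual control sections, which live in mainframe application code), the Python sketch below shows a dispatcher that samples the load factor and routes work down the path tuned for it; the threshold value and function names are hypothetical.

    # Hypothetical sketch of a self-tuning control section: sample the current
    # load factor and route work down the path optimized for it.

    def transform(item):
        return {**item, "processed": True}

    def process_low_load(batch):
        # Path tuned for small batches: simple item-at-a-time logic, minimal setup cost.
        return [transform(item) for item in batch]

    def process_high_load(batch):
        # Path tuned for large batches: sort once, then process sequentially to
        # cut random I/O and repeated lookups.
        return [transform(item) for item in sorted(batch, key=lambda item: item["key"])]

    HIGH_LOAD_THRESHOLD = 10_000   # hypothetical cutover point

    def dispatch(batch):
        """Choose the execution path based on the observed load factor."""
        if len(batch) >= HIGH_LOAD_THRESHOLD:
            return process_high_load(batch)
        return process_low_load(batch)

    print(len(dispatch([{"key": i} for i in range(5)])))   # small batch -> low-load path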

We profiled the performance of processes to understand the flow of logic and the behavior of functions. Tools help somewhat in gathering information, but turning it into actionable intelligence (such as traversing application logic to construct signal flow diagrams) is hard manual work. Understanding business functionality is extremely important, and it’s best not to disturb a business function even if it isn’t understood initially. We looked for duplicated business functionality and extraneous function calls, and we found ways to simplify the logic flow.

Throughout the process, we had to remember that the compiler or the environment may have inserted “invisible” functions or hidden code into the source; this meant we had to look beyond the source code. For example, while evaluating the high Task Control Block (TCB) switching rate of a CICS task, we learned that a third-party instrumentation facility that was pulled in at execution time was the culprit causing the high rate.

We took advantage of Parallel Sysplex for performance and load balancing, and to lower third-party software costs by routing workload to the machine that has the required software resource. We made only minor changes to our applications to accomplish this, but we achieved a significant financial return.
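Conceptually, the application change amounts to tagging work with the software resource it needs and sending it to a system image that licenses that resource. The Python sketch below is only an illustration of that routing decision; the system names, product names, and routing mechanism are invented for the example.

    # Illustrative routing sketch: send each job to a system image that carries
    # the third-party software it needs, so the product is licensed on fewer
    # machines. System and product names are hypothetical.

    SOFTWARE_BY_SYSTEM = {
        "SYSA": {"SORT_PKG", "COMPRESS_PKG"},
        "SYSB": {"SORT_PKG"},
    }

    def route(job_name, required_software):
        """Return the first system that carries every product the job requires."""
        for system, products in SOFTWARE_BY_SYSTEM.items():
            if required_software <= products:
                return system
        raise RuntimeError(f"No system carries {required_software} for {job_name}")

    print(route("FAREDIST", {"COMPRESS_PKG"}))   # -> SYSA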

Additionally, some of our databases don’t participate in database federation, forcing us to conduct multi-phase commits manually. When a commit failed, we had to retrace many steps and apply manual fixes, which called data integrity into question. Even though data integrity was never actually compromised, we considered this a performance issue because of the loss of productivity it caused. We carefully designed a homegrown agent that oversees the multiple phases of a commit and triggers an undo process in case of failure.
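The agent itself is proprietary, but the pattern is simple: apply each database’s phase in order, remember what has already been committed, and run compensating undo steps if any later phase fails. A minimal Python sketch of that pattern follows; the phase and undo callables are placeholders, not ATPCO’s actual agent.

    # Minimal sketch of a commit-overseer pattern: apply each phase in order and,
    # on any failure, undo the phases already applied in reverse order.

    def run_phases(phases):
        """phases: list of (name, apply_fn, undo_fn) tuples."""
        applied = []
        try:
            for name, apply_fn, undo_fn in phases:
                apply_fn()
                applied.append((name, undo_fn))
        except Exception:
            # Trigger the undo process for everything already committed.
            for name, undo_fn in reversed(applied):
                undo_fn()
            raise

    run_phases([
        ("update_fares_db", lambda: print("fares updated"), lambda: print("fares rolled back")),
        ("update_rules_db", lambda: print("rules updated"), lambda: print("rules rolled back")),
    ])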

Our back-end database is DB2 for z/OS, which manages about 10TB of data. Poorly performing SQL statements are the easiest to identify and correct. DB2 is well-equipped with accounting information at the correlation-ID level, package level, or even more granular levels. Only when an SQL solution is insufficient do we resort to other solutions.

Not all performance solutions are software solutions; some are procedural. Examples include deferring non-critical or time-insensitive updates to off-peak hours and producing several audit reports from a single sweep of the database. It may sound like a clear-cut solution, but it’s difficult to reach consensus when working across the layers of communication and time zone differences typical of a global community.
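The single-sweep idea can be shown in a few lines: rather than each audit report scanning the database on its own, one pass feeds every report’s findings. The audit rules and field names in this Python sketch are invented examples.

    # Sketch of the "sweep once" change: one pass over the data feeds several
    # audit reports instead of each report scanning the database itself.

    def missing_fare_audit(row, findings):
        if row.get("fare") is None:
            findings.append(("MISSING_FARE", row["id"]))

    def stale_record_audit(row, findings):
        if row.get("age_days", 0) > 365:
            findings.append(("STALE_RECORD", row["id"]))

    AUDITS = [missing_fare_audit, stale_record_audit]

    def sweep(rows):
        findings = []
        for row in rows:                 # single pass over the database extract
            for audit in AUDITS:
                audit(row, findings)
        return findings

    print(sweep([{"id": 1, "fare": None}, {"id": 2, "fare": 100, "age_days": 400}]))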

We also faced some interesting business functionality issues. We collect data from different sources (converging on one key) and distribute the same data to a different set of clients (diverging on another key). There’s an authentication process validating who has clearance to input the data and a filtering process at the distribution end. Both are resource-intensive procedures because of the granularity of authentication, but we proved we could reduce cost significantly by rewriting the algorithms. It was risky to change the decision-making modules that are the backbone of the business, but there was a greater risk in not trying. After careful evaluation and thorough testing, we rolled out the new algorithms and they were successful.

We solved some design issues with historical databases that mirrored operational database design. By changing the design, we could curtail exponential data growth. We keep dated information in our database, and old attributes become outdated when new ones are made effective. The old design involved explicitly applying a discontinue date to the old attribute; the new design assumes an implied discontinue date based on the presence of a new attribute. This yielded a more than 50 percent savings in the load operation.
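In other words, instead of updating the old row with an explicit discontinue date at load time, a row’s effective period now ends implicitly when the next row for the same key becomes effective. The Python sketch below illustrates the derivation; the field names are illustrative, not ATPCO’s schema.

    # Sketch of the "implied discontinue date" design: the end of an attribute's
    # effective period is derived from the effective date of the next row for the
    # same key, so the load never has to go back and update the old row.

    from datetime import date

    rows = [
        {"key": "FARE1", "attribute": "A", "effective": date(2009, 1, 1)},
        {"key": "FARE1", "attribute": "B", "effective": date(2009, 6, 1)},
        {"key": "FARE1", "attribute": "C", "effective": date(2010, 1, 1)},
    ]

    def with_implied_discontinue(history):
        history = sorted(history, key=lambda r: r["effective"])
        for current, nxt in zip(history, history[1:] + [None]):
            current["discontinue"] = nxt["effective"] if nxt else None  # still in effect
        return history

    for r in with_implied_discontinue(rows):
        print(r)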

Tools Used

We extracted accounting information from the Tivoli database using native SQL in a format that’s comparable to a DB2PM report. We used thread-level and package-level details to pinpoint likely candidates, then we used STROBE and Data Studio for in-depth analysis.
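The extraction itself is ordinary SQL against the performance database. The sketch below (a query string in Python, with made-up table and column names standing in for the Tivoli accounting tables) shows the general shape of a query that ranks packages by CPU to surface tuning candidates.

    # Illustrative only: the table and column names are placeholders for the
    # accounting tables, not the real schema. The point is the shape of the
    # query -- aggregate by package and rank by CPU to find tuning candidates.

    CANDIDATE_QUERY = """
    SELECT PACKAGE_NAME,
           SUM(CPU_SECONDS) AS TOTAL_CPU,
           SUM(GETPAGES)    AS TOTAL_GETPAGES,
           COUNT(*)         AS EXECUTIONS
    FROM   ACCOUNTING_PACKAGE_DATA      -- hypothetical accounting table
    GROUP  BY PACKAGE_NAME
    ORDER  BY TOTAL_CPU DESC
    FETCH  FIRST 20 ROWS ONLY
    """

    # A DB2 connection library (for example, ibm_db) would run this query; the
    # result feeds a DB2PM-style summary used to pick candidates for deeper
    # analysis with STROBE or Data Studio.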

Techniques

The keys to good performance are to issue as few I/Os as possible, reduce internal data movement, and minimize table processing. One methodology doesn’t suit all circumstances, so ATPCO used numerous methods:

DB2 Solutions

DB2 tuning efforts that gave us the most performance gains involved:

More Improvements

Here are some additional improvements we made and lessons learned in the process:

Conclusion

With an IT agenda that was dominated by cost cutting, ATPCO management had the foresight to invest in efficiencies. A CPU upgrade was deferred in 2007, major applications were installed with no adverse performance impact in 2008, and the momentum continued in 2009. In January 2010, CTO Steve Daniels explained that adjustments to ATPCO’s mainframe performance improved the entire IT services division and “literally saved the company millions of dollars.”

Figure 1 shows how the 2009 average work units per workday increased by 20 percent, yet mainframe usage declined by nearly 5 percent. Not all solutions have been installed yet; they’re being completed in stages based on resource availability and business needs.