z/Journal recently spoke with ConicIT founder and CEO Yoram Kariv about how production performance problems, and the technologies that exist to resolve them, continue to challenge IT.
z/Journal: You have a distinguished past and have been deeply involved with performance and leading-edge technologies for many years. What area do you believe still needs improvement?
Yoram Kariv: Although we’ve come a long way from the first OMEGAMON performance monitor more than 30 years ago, service management for large data centers is still an area that staff and management wrestle with daily. There are many tools, and many experts using those tools, but it seems the problem areas still exist. There are still service infractions and war rooms every Monday morning.
z/J: You’re right. Production service error alerts and analysis are a perennial pain point for IT shops. Why hasn’t the problem been solved?
Kariv: IT departments minimize production problems in two ways. The first way is through testing and QA in an attempt to make sure that deployed systems are as bug-free as possible. The second way is by providing tools and processes that quickly recognize production problems and enable them to be fixed as soon as possible after they occur. In other words, IT minimizes production problems through both prevention and treatment.
The problem is that neither of these methods works very well for resolving production performance issues. Most performance problems are caused by unexpected events and complex run-time component interactions, not bugs. An unexpected combination of the state of the system, the load, and the transactions causes the system to slow in some way. No matter how good your development staff or testing team, unexpected states will occur and production systems will always falter in some way. A performance slowdown will hinder users and keep them from achieving their business goals in an easy and timely fashion. This problem is even more acute for mainframes (or other highly virtualized transaction environments such as public, private, or hybrid clouds), where multiple tenants share resources on the same physical machine, creating unforeseen resource contention.
Fixing performance issues after they occur isn’t a satisfactory solution. By the time IT recognizes the problem, it has affected users and the root cause is hidden under layers and layers of symptoms. A sad statistic is that, on average, even for a top-notch IT shop, more than 50 percent of problems are found by users. The resulting war rooms, tiger teams, and other methods of getting all the possible relevant people together to try to diagnose and fix the problem are time-consuming, productivity-killing, and expensive.
z/J: Can you quantify the impact of these problems on the business?
Kariv: There are a number of ways to quantify the impact of production performance problems.
A well-run IT shop will have 25 to 40 application incidents per month, each taking seven to 11 hours to resolve. That’s a lot of person-hours and productivity lost in handling production problems. Anything that increases the mean time between failures and lowers the mean time to repair can significantly lower these costs.
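To put those figures in perspective, the arithmetic can be sketched in a few lines of Python. This is only a rough illustration: the function name and the `people_per_incident` parameter are hypothetical, and the default of one person per incident is an assumption (a war room would involve more people, which multiplies the total).

```python
def monthly_incident_hours(incidents_per_month, hours_per_incident,
                           people_per_incident=1):
    """Rough monthly hours spent resolving production incidents.

    people_per_incident is a hypothetical knob; the default of 1 is an
    assumption, not a figure from the interview.
    """
    return incidents_per_month * hours_per_incident * people_per_incident


# Low and high ends of the ranges quoted above:
# 25 to 40 incidents per month, 7 to 11 hours each.
low = monthly_incident_hours(25, 7)
high = monthly_incident_hours(40, 11)

print(low, high)  # 175 440
```

Even with a single person on each incident, that is roughly 175 to 440 hours a month, which is why fewer incidents or a shorter time to repair translates directly into lower cost.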