Jul 1 ’03
Growing Your Storage on a Limited (or Non-Existent) Budget
The realities of managing an IT center today are far more difficult than ever before. From 1998 to 2000, the dot.coms, Year 2000, and enterprise system growth drove the economy and budgets to record highs. In fact, according to the management consulting firm McKinsey & Co., total IT expenditures during that time rose from $374 billion to $455 billion, a handsome 10 percent Compounded Annual Growth Rate (CAGR). The money was rolling in and people were buying. In fact, they were buying large amounts of hardware to mask the more complex issues of managing the storage they already had, a far more onerous task.
In 2000, the economic engine began to falter and slow. Of course, September 11th was a massive catalyst, as were the faltering dot.coms. Suddenly, the nature of getting the IT job done became far more complex and difficult. Today, current spending projections are flatlining or showing a decline (roughly a 2 percent overall decrease, to $440 billion, in IT spending for 2003). This creates some interesting dynamics for prioritizing and rationalizing how IT must accomplish the job of growing the storage infrastructure annually.
This has created a new set of issues for IT management. First and foremost, nobody seems to have any money. Yet, executive management continues to look for ways to differentiate the company’s products and services from its competition to achieve top line growth. They understand that is the secret to shareholder value and stock price growth. Therefore, the business objectives create an interesting set of requirements within the IT organization. However, growth in the company translates to growth of server and storage requirements. According to Fred Moore, president of Horison, Inc., today’s average growth for storage in the commercial market segment is typically from 50 to 70 percent. However, how can you address this increase if you don’t have the budget?
This is the dilemma that has created another interesting dynamic. The half-life of a CIO is often less than 18 months, and then they expire. Why? Because if the CIO tries to use the same tools and tactics to run the IT center that were viable three or four years ago, he or she will fail to meet the current objectives of the corporation — to grow at no cost. Is it possible to grow the infrastructure at no cost? Yes. Is that an easy thing to do? No, it’s tricky and it’s hard work. However, that is the task facing many IT managers. Although there are many initiatives that will have to be started to get there, here are some guidelines to help you.
To Spend or Not To Spend, That Is the Question
It is nearly impossible to know how much you should spend to sustain an IT-supported function if you don’t know what the value of that process is to the organization. Vendors love that, and you will nearly always overbuy. The sad truth is that most organizations have more than enough storage available to meet their business growth needs. Unfortunately, many companies don’t know how to use this existing storage.
To understand this, let me provide an example: Every enterprise has a senior leadership team who creates a set of business objectives that includes such items as top line growth, bottom line disciplines, market penetration, etc. To achieve these goals, there is a business engineering effort that results in a set of processes, procedures, and protocols. To implement all of these items, the IT organization implements architected solutions. They buy large quantities of hardware, software, and services to make it all work. So far, so good.
However, things often fall short prior to the implementation phase, which of course, is where most organizations are today. What these companies need is a business value analysis that provides an economically based descriptor of the value that each IT-implemented process contributes to the organization when it is running properly; in other words, for every process you support, you would have an objective basis of rationalizing your spending to sustain the process. Elements to consider include capacity, performance, scalability, management, support, and operating expense. These elements account for about 20 percent of the purchase decision for storage (and server) hardware.
The remaining 80 percent of the decision is based on protecting the process; i.e., requirements for replication, re-creation, management, legal, and operating expense. Just look at how primary storage has been purchased for many years. Leading disk storage vendors have all been in a similar technology boat from the point of view of maintaining an application’s Service Level Agreement (SLA). However, one vendor realized that the majority of the selection criteria was based on protecting the data and catered to that need in a way that no other vendor could for years. Many customers didn’t like this vendor, but felt they had no choice. Protecting the data became their priority.
What does this mean to you?
Are there any opportunities here for you? Yes!
Consider the typical vendor’s sales approach. The vendor approaches you with a story of how they are going to save you money. That’s always interesting, right? They tell you they have the ability to consolidate the many primary storage systems that you currently have down to just a few. They explain the super performance and scalability capabilities they have. Because of that, they can take the various databases and applications that are spread among multiple subsystems and reduce that to one or at least a few. You get improvements in performance, fewer components to manage, etc. That may sound OK to you, but you wonder why you would want to put all of your eggs in one basket. They tell you about all the redundant parts, etc., and then they say that even with that, if there is a failure, they have a Redundant Array of Independent Disks level 1 (RAID 1) mirrored physical copy to protect you.
OK, but what about data corruption? For that, vendors will sell you what can generically be described as mirrored physical rotation copies. In this situation, you potentially buy as much as seven times more storage (in this case, disk) capacity. The value for you is that there is a RAID 1 relationship established, which means that for every change on the primary disk volume, an identical change is made to a redundant backup volume. Every three hours, you would rotate out the current copy, and re-establish the oldest copy from 24 hours ago. The controller will remember all of the changes that have been made in the past 21 hours and apply them to the newly established redundant backup volume.
Hence, if you have a data corruption problem, you simply figure out how long ago the corruption started and recover to a period of time just before that. Right. The problem with corruption is like standing at the bottom of a mountain: You look to the top, but you can’t see the snowball gathering speed, size, and kinetic energy, until it is nearly on top of you, and then it is too late. One very large Web-based auctioneer learned that the hard way, and guess what, its corruption exceeded the length of time the company had mirrored copies of data waiting on redundant disk.
Nevertheless, today, there are ways that you can migrate the data off of very expensive high-end disk with street prices of $90 per GB to a less expensive method. Here is how it can work. Imagine that you need 1TB of storage. Using the mirrored physical rotation approach of protection you could buy as much as 8TB for instantaneous recovery from a hardware failure and 24 hours of corruption protection. That will cost about $1,015,000.
OK, let’s start cutting costs. Another approach is to use a virtual storage device that can create mirrored logical copies using virtualization products. Without taking a nosedive into the technology, suffice it to say that the only data that is physically stored from one copy to the next is what changed, not a completely new physical volume times seven (seven extra copies of data for instantaneous protection, corruption, and file deletion protection). To you this means you get the same protection for a third of the price. Want more?
Another emerging technique employs what could be described as continuous real and logical protection. In this case, the redundant copy is kept on an appliance that is greatly reduced in price through the use of Advanced Technology Attachment (ATA) disk arrays. The cost for protection could be reduced to about $140,000 and it also has enough space to provide one to two weeks of corruption protection!
Great! Now go back to the extra disk capacity you have on expensive high-end disk, the other 7TB, move your backup data to the ATA device employing continuous real and logical protection, and use the space now open on the expensive disk for your growth requirements. Specifically, defer the cost of buying any new disk. Remember the business? Grow the business, but you have to do it with a flat or declining budget. By the way, if the terabyte of primary storage was that important, you need to have that backed up to tape and in a vault for safe keeping. You can do that for about $30,000. Even if you don’t want to put the backup copy on tape, you can still use ATA disk for about $15 per GB vs. $90 per GB.
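To put the three protection options side by side, here is a minimal sketch in Python using only the dollar figures quoted above for 1TB of primary data. The virtual-copy dollar amount is derived from the "a third of the price" ratio and is my arithmetic, not a quoted price; the names in the code are illustrative only.

```python
# Figures quoted in this article for protecting 1TB of primary data.
PHYSICAL_MIRROR_COST = 1_015_000  # mirrored physical rotation copies on high-end disk
ATA_CONTINUOUS_COST = 140_000     # continuous real and logical protection on ATA disk

# Assumption: "a third of the price" is a rough ratio, so this dollar
# figure is derived here rather than quoted.
VIRTUAL_MIRROR_COST = PHYSICAL_MIRROR_COST / 3


def savings(baseline, alternative):
    """Return (dollars saved, fraction saved) relative to the baseline cost."""
    saved = baseline - alternative
    return saved, saved / baseline


options = {
    "mirrored physical rotation": PHYSICAL_MIRROR_COST,
    "mirrored logical (virtual)": VIRTUAL_MIRROR_COST,
    "continuous real and logical": ATA_CONTINUOUS_COST,
}

for name, cost in options.items():
    saved, fraction = savings(PHYSICAL_MIRROR_COST, cost)
    print(f"{name:<28} ${cost:>12,.0f}  saves ${saved:>10,.0f} ({fraction:.0%})")
```

On these numbers, the ATA-based approach comes in at roughly one-seventh of the cost of the physical rotation copies.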
This is just the beginning! However, let’s shift our focus to examine the tactics to significantly reduce spending for managing the protection of primary copies of data.
Tactics for Reducing Spending
We have finished looking at the approach of protection mechanisms for primary disk storage. As I discussed, the most common practice is to create physical copies of disk to protect against either a primary failure (the proverbial piano that falls on your disk subsystem) or corruption. Many users have elected to implement systems that protect using redundant copies of data on expensive disk because they felt they had no choice but to provide the protection required to match the value of their applications’ well-being. While that may have been true not long ago, there are less expensive alternatives available today.
We all know and believe that a primary RAID 1 copy of data on disk can be re-established for recovery. This brings us to a very important point. Replication of data is a critical consideration, but it is generally the easy part of the problem. Anyone who is responsible for the uptime of a system knows the tricky part is re-creating a system or application. Going forward, we are going to focus on the re-creation side of things, both technically and economically.
It generally takes about two seconds per terabyte to recover from a RAID 1 physical mirrored copy. There are many ways you can do this, as well as a range of hardware and software products that can make it happen. One alternative to the most common way of purchasing extra RAID 1 physical disk is to use a virtual disk subsystem. In the virtual disk subsystem, there is a copy of the primary volume that is maintained, but it doesn’t occupy a one-for-one amount of physical storage unless you want it to. This gives you the same recovery capability, but at about a third of the price. That takes care of the piano falling on the first system problem, but what about corruption?
The virtual system can also keep consistency checkpoints. The main difference is that the volume checkpoint is virtual, not real. What that means to you is that the amount of physical storage purchased goes from 8TB to 1TB, and you still can have enough extra space to have 24 hours of protection, again at about one-third of the price. Bear in mind that corruption protection on primary disk storage is often over-played and over-valued for what the real-world problems typically are. One nice extra value of a virtual disk subsystem is that you can have far more checkpoints. Where you would typically have only seven volume checkpoints if you were using physical copies, or one for every three hours, with virtual you have the space to have one once an hour. The value to you is that if you have a point in time that you want to back up to, the granularity is one hour. Then, when you have to apply journals and logs, you will be three times faster.
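The effect of checkpoint granularity can be sketched in a few lines of Python. This is a simplified model, not a description of any product: it assumes checkpoints are taken at fixed intervals starting at hour zero, and that the time to apply journals and logs is proportional to the span being replayed. The function names are hypothetical.

```python
import math


def replay_hours(target_hour, checkpoint_interval_hours):
    """Hours of journals to apply after restoring the latest checkpoint
    taken at or before target_hour (checkpoints assumed to start at hour 0)."""
    last_checkpoint = (
        math.floor(target_hour / checkpoint_interval_hours) * checkpoint_interval_hours
    )
    return target_hour - last_checkpoint


def worst_case_replay_hours(checkpoint_interval_hours):
    """In the worst case the target sits just before the next checkpoint,
    so the replay span approaches one full interval."""
    return checkpoint_interval_hours


# Example: you want to get back to a point 11.5 hours into the day.
for interval in (3, 1):
    print(f"checkpoint every {interval}h: replay {replay_hours(11.5, interval)}h, "
          f"worst case {worst_case_replay_hours(interval)}h")
```

The worst-case ratio between the two schedules is three to one, which is where the "three times faster" figure comes from.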
What about open midrange disk? Protection and re-creation for open midrange disk is conceptually the same. A logical approach to protection and recreation from failure and corruption is required. This is most frequently done through application software, and there are a variety of vendors that provide the services.
The big question is how long does it take to replicate and to re-create, and what are the financial alternatives? Remember, budgets are generally flat or declining, and you need to grow storage roughly 50 percent per year. Therefore, you need to start thinking about this a little differently!
Take, for example, a typical midrange disk subsystem. Let’s say it has a backplane speed of 772MB per second and contains at least eight fibre channel interfaces. For ease of discussion, let’s assume the server is not doing anything else and you have dedicated storage. Let’s examine three examples; the first two are unlikely, but will illustrate a concept used to understand the last example.
Imagine a volume of 50GB. If we wanted to replicate or re-create it, and could drive the backplane at full performance, it would take 1.1 minutes. Not bad. If we decided that the data was critical, and we needed a third copy, we could move it from open midrange disk at about $40 per GB to automated tape at about $1.25 per GB, save some money, and get the job done in about 27 minutes. This assumes one 30MB per second tape drive. Is 27 minutes fast enough for your application? Earlier, I discussed the business impact analysis. Look at it this way: If today is Tuesday, and the application goes down, but it doesn’t need to run until Thursday, is 27 minutes fast enough? By the way, this speed doesn’t consider compression, so really you can divide the time by about three.
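The arithmetic behind these timings is simply size divided by throughput. Here is a minimal sketch, assuming decimal units (1GB = 1,000MB), which is the convention that reproduces the figures in this example, and assuming the device sustains its full rated speed. The function name and the compression parameter are illustrative.

```python
def transfer_minutes(size_gb, rate_mb_per_s, compression_ratio=1.0, mb_per_gb=1000):
    """Minutes to move size_gb at a sustained rate_mb_per_s.

    compression_ratio > 1 models drive compression by shrinking the
    amount of data that actually crosses the interface.
    """
    size_mb = size_gb * mb_per_gb / compression_ratio
    return size_mb / rate_mb_per_s / 60


backplane = transfer_minutes(50, 772)                    # disk to disk at full backplane speed
tape = transfer_minutes(50, 30)                          # one 30MB per second tape drive
tape_compressed = transfer_minutes(50, 30, compression_ratio=3)

print(f"backplane:        {backplane:.1f} minutes")
print(f"tape:             {tape:.1f} minutes")
print(f"tape, compressed: {tape_compressed:.1f} minutes")
```

The tape figure works out to a little under 28 minutes, in line with the rough 27 minutes used here.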
Many analysts have found that on average less than 20 percent of data needs to be protected instantaneously. Therefore, 27 minutes is probably OK in this case.
What if you had a single 1TB volume? I know that it doesn’t exist, but follow the logic here. Again, at full backplane performance it would take 21.6 minutes to replicate or re-create it. It would take 9.3 hours if we went to a single tape drive, again without compression. With compression you can divide the time by three.
Here is a more realistic case. If you have 1TB, you are more likely to see it spread over twenty 50GB volumes across two open midrange disk backplanes. In that case, it takes 10.8 minutes to replicate or re-create a full copy. For tape, you would run 10 drives in parallel and could get the job done in 55 minutes (divide by three for compression). That is plenty of time for most applications, though not all of them.
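Parallelism just multiplies the available throughput. The sketch below assumes the data is spread evenly, each backplane or tape drive runs independently at its full rated speed, and 1TB is 1,000,000MB; real systems rarely hit all three assumptions at once. The function name is illustrative.

```python
def parallel_minutes(total_gb, rate_mb_per_s, streams, mb_per_gb=1000):
    """Minutes to move total_gb when it is split evenly across `streams`
    independent devices, each sustaining rate_mb_per_s."""
    aggregate_rate = rate_mb_per_s * streams
    return total_gb * mb_per_gb / aggregate_rate / 60


one_backplane = parallel_minutes(1000, 772, streams=1)
two_backplanes = parallel_minutes(1000, 772, streams=2)
one_tape_drive_hours = parallel_minutes(1000, 30, streams=1) / 60
ten_tape_drives = parallel_minutes(1000, 30, streams=10)

print(f"1TB, one backplane:   {one_backplane:.1f} minutes")
print(f"1TB, two backplanes:  {two_backplanes:.1f} minutes")
print(f"1TB, one tape drive:  {one_tape_drive_hours:.1f} hours")
print(f"1TB, ten tape drives: {ten_tape_drives:.1f} minutes")
```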
However, storage administrators frequently want to use disk as the primary source for backup and restore for a variety of reasons. This is fine, unless the cost is greater than you can afford. So, here is an alternative. Instead of doing the copy back to the same disk subsystem for $40 per GB (open midrange), why not use an ATA array disk for $15 per GB? This gives you the ability to still have your primary backup and recovery on disk, but far more cost-effectively. Expect about 20 minutes to move a terabyte if you want a full copy back. Remember, you can also use the ATA array as a primary failover volume. If you used the ATA array for a differential copy restore, the recovery time would be minutes.
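To see what the per-GB prices mean for a full terabyte copy, here is a small sketch using the street prices quoted in this article. It assumes 1TB = 1,000GB and counts media cost only, with no software, controller, or operating expense; the dictionary and function names are illustrative.

```python
# Approximate street prices per GB quoted in this article (2003 dollars).
COST_PER_GB = {
    "open midrange disk": 40.00,
    "ATA array": 15.00,
    "automated tape": 1.25,
}


def copy_cost(size_gb, media):
    """Media cost in dollars of holding one copy of size_gb on the given media."""
    return size_gb * COST_PER_GB[media]


baseline = copy_cost(1000, "open midrange disk")
for media in COST_PER_GB:
    cost = copy_cost(1000, media)
    print(f"{media:<20} ${cost:>9,.2f}  ({1 - cost / baseline:.1%} less than midrange)")
```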
Don’t forget that if you free up space on the primary expensive disk by offloading to less expensive disks, you now have more room to grow your primary applications cost-free. You may be able to defer your expensive disk acquisition costs for years by using this strategy.
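How many years is "years"? A rough sketch, assuming the earlier example of 1TB of primary data sitting on 8TB of expensive disk, that all of the other 7TB can be freed by offloading copies, and that growth compounds at the 50 to 70 percent per year quoted earlier. These are illustrative assumptions, and the function name is hypothetical.

```python
def years_deferred(current_tb, available_tb, annual_growth):
    """Count the full years of compound growth that still fit in available_tb."""
    years = 0
    needed = current_tb
    # Keep growing while next year's requirement still fits on the existing disk.
    while needed * (1 + annual_growth) <= available_tb:
        needed *= 1 + annual_growth
        years += 1
    return years


for growth in (0.50, 0.70):
    years = years_deferred(current_tb=1, available_tb=8, annual_growth=growth)
    print(f"{growth:.0%} annual growth: {years} years before new disk is needed")
```

Even at the high end of the growth range, the freed capacity covers several years of growth under these assumptions.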
An emerging technology could be thought of as continuous real and logical. It is a technology that builds on an ATA array and precludes the need for an application to replicate or restore data. It has multiple Gigabit Ethernet interfaces and is driven by a driver and a 1U appliance. Imagine that a volume (database or file system) is defined to protect. Any write to the primary open midrange disk would also follow a RAID 1-like write to the ATA array. Any change to the primary disk has an identical change to the continuous real and logical disk. If you take a consistency checkpoint from a database, a new logical disk is created (again virtual technology).
The benefits are that you can have two to three weeks of corruption protection and instant failure protection — all for about $15 per GB. Is it fast? Assume a 1TB system. After two hours of write changes to a database, you could recover from a corruption failure in 2.9 minutes!
The economics of using expensive high-end disk storage are unsustainable in today’s economic and market climate. To meet the demands of maintaining SLAs and availability agreements on a flat budget, decision makers must approach the challenge differently from how they have in the past. Albert Einstein once said, “The significant problems we face cannot be solved at the same level of thinking we were at when we created them.” We think Albert was dead on.