Understanding what happens to data throughout its lifetime is becoming an increasingly important aspect of effective data management. What happens to data as it ages? Does usage decline as data ages? Does the value of data increase or decrease as it ages? Why are we keeping more data longer than ever before? What conditions indicate when data should be retired? Do storage management requirements change as data goes through its life cycle? If data is the most valuable asset of so many businesses, why do we know so little about it? Why doesn’t anyone consider their non-digital information as part of their Information Life Cycle Management (ILM) strategy? These important questions need answers so we can understand how data should be managed and where data should—ideally—reside during its existence.
What Is ILM?
ILM is not a new concept. Businesses have been attempting to manage their data throughout its life cycle for years with varying degrees of success, whether that meant backup, archive, migration, or deletion. As data ages, it's assigned different priorities and some gets stored on cheaper storage devices, depending on how often it's used or potentially needed. This has been a fundamental principle of the Hierarchical Storage Management (HSM) concept for years. That sort of hierarchical storage is only one piece of ILM, however. ILM also needs policy-driven data classification capabilities, as well as the ability to move data non-disruptively between storage devices as it becomes less active or is deleted. In some cases, the value of data changes based on circumstances having nothing to do with age or activity levels. Consider medical records that go unused for long periods and then become important in determining the cause of a new disease such as Severe Acute Respiratory Syndrome (SARS), West Nile Virus, or a new strain of flu. In such cases, historical data can become extremely valuable almost instantly.
As is the case for most advanced storage management functions, mainframe users have had the advantage of a robust, policy-based HSM software solution to perform many of these policy-based life cycle management tasks for more than 15 years. HSM is just now becoming a more popular choice for larger Unix and NT storage systems. For most data types other than system files, the number of references to data declines significantly as the data ages. This basic observation serves as the basis for more cost-effective storage management, as it enables the movement of less active data to lower-cost levels of storage. A policy-based ILM solution must support all major computing systems, not just the mainframe.
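The tiering logic described above can be sketched in a few lines. This is a minimal illustration, not any vendor's actual HSM product; the tier names, thresholds, and the `DataSet`/`assign_tier` names are all hypothetical, chosen only to show how age and reference activity might drive policy-based placement.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class DataSet:
    name: str
    last_access: datetime
    access_count_30d: int  # references in the past 30 days

def assign_tier(ds: DataSet, now: datetime) -> str:
    """Pick a storage tier from age and recent activity (illustrative thresholds)."""
    age = now - ds.last_access
    if age <= timedelta(days=3) or ds.access_count_30d > 10:
        return "primary_disk"   # hot: keep on fast disk
    if age <= timedelta(days=30):
        return "nearline"       # warm: cheaper, still online
    return "tape_archive"       # cold: lowest cost per unit stored

now = datetime(2004, 1, 31)
print(assign_tier(DataSet("email_q3", datetime(2003, 10, 1), 0), now))  # tape_archive
```

A real HSM policy engine would also weigh data classification, retention rules, and migration cost, but the core decision is this kind of age/activity test applied non-disruptively behind the scenes.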
We are now witnessing a new effect on the life cycle of data: The amount of data now increases as it ages (see Figure 1). Unlike the past, fixed content and archival storage have now become the fastest growing areas of the storage industry. Storage demand grew more than 100 percent per year during the dot-com boom of the late ’90s. Today, the storage industry is generating new data at a rate in the range of 40 to 70 percent per year, depending on the business. In addition, some of the current demand for storage is being absorbed by existing, unused capacity left over from the excessive buying habits of the past several years. Regardless of the growth rate, the continual increases in the amount of digital data have made storage management more difficult and, as a result, more data is being accumulated for longer periods of time. Much of this data lives digitally without effective storage management services.
The percentage of all digital data that has lost its value and, therefore, should be deleted is quickly declining, as obsolete data is often “just kept around forever.” In many cases, this approach is perceived to be easier than managing the data throughout its life cycle. The probability of reusing data typically falls by 50 percent once the data is three days old. Thirty days after creation, the probability of reuse normally falls below a few percentage points. E-mail and medical imaging applications are good examples of the data aging profile described here. Keeping very low-activity, archival, and inactive data on spinning disk for long periods is uneconomical: tape costs tangibly less per unit of storage than disk, and long-term disk residency also raises energy consumption and security concerns. The table in Figure 2 provides additional insight into numerous storage consumption and usage patterns that will place even more emphasis on effective data life cycle management.
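The aging profile above can be modeled as simple exponential decay. This is only an illustrative model consistent with the figures cited (a 50 percent drop at three days, well under a few percent at 30 days); the function name and the half-life parameter are assumptions, not measurements from any particular application.

```python
def reuse_probability(age_days: float, half_life_days: float = 3.0) -> float:
    """Illustrative decay model: the chance of reuse halves every
    `half_life_days` days after the data is created."""
    return 0.5 ** (age_days / half_life_days)

print(reuse_probability(3))             # 0.5 -- half after three days
print(round(reuse_probability(30), 4))  # 0.001 -- well below a few percent
```

A migration policy could compare this probability against a threshold to decide when data belongs on nearline or tape rather than primary disk.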
Data Retention Requirements Change
When the StorageTek Nearline concept was becoming widely accepted in the ’90s, the common belief was that archival status was the last stop for data before deletion or end of life. Then, one- to two-year retention periods were viewed as a reasonable amount of time to keep digital data accessible. Fifteen years later, the game and its rules are different. New government regulations, the Sarbanes-Oxley Act, and HIPAA requirements for the transmission and retention of data have changed the way we look at data as it ages. Several major healthcare providers face generating and storing in excess of 500TB of data over the next few years that will need to be managed and retained for a person’s lifetime plus seven years. This time period could often exceed 100 years.
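A “lifetime plus seven years” rule translates directly into a retention-expiry calculation. The sketch below is a hypothetical policy helper, not part of any regulation or product; the function name and the example dates are illustrative assumptions.

```python
from datetime import date

def retention_end(date_of_death: date, extra_years: int = 7) -> date:
    """Retain a medical record until `extra_years` after the patient's death
    (hypothetical 'lifetime plus seven years' policy)."""
    try:
        return date_of_death.replace(year=date_of_death.year + extra_years)
    except ValueError:  # Feb 29 landing in a non-leap target year
        return date_of_death.replace(year=date_of_death.year + extra_years, day=28)

record_created = date(2004, 6, 1)   # e.g., an imaging study at birth
end = retention_end(date(2094, 6, 1))  # patient lives 90 years
print(end)                                       # 2101-06-01
print((end - record_created).days // 365)        # 97 -- roughly a century of retention
```

A record created at birth for a long-lived patient easily pushes total retention toward, or past, the 100-year mark, which is why media migration and format longevity become part of the policy, not an afterthought.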
SEC Rule 17a-4 mandates digital archiving requirements as they relate to storage. These include what type of storage format should be used, how long data must be retained, and where and how long duplicate copies of data must be stored. The back-end of the data life cycle is swelling, not shrinking, as was previously the case, and retention policies are now being based on data value and legal issues, not just reference activity. For lifetime data management, “It doesn't matter if the data is ever used; it does matter if the data is there.” This change in the storage landscape calls for new management policies based on the value of data and requires that a universal, standard data classification scheme emerge. All data is not created equal.