IT Management

Deduplication technology has taken the data center by storm in the past five years. However, as with many hot topics—such as who’s who in the Middle East, or the underlying facts about global warming—one sometimes assumes that everyone understands the basics.

Deduplication is a method of reducing storage volumes by identifying duplicate data segments and, instead of storing the data again, simply storing a pointer to the data that has already been stored. For example, imagine that a server contains 150 copies of the same 10MB presentation, which includes the company’s standard “thank you” slide. The same server also contains that slide embedded in hundreds of other presentations. If that server were backed up to tape or plain disk, every copy of every presentation would be copied over the network and stored, requiring at least 2GB of capacity; the 150 identical copies alone account for 1.5GB. Over time, weekly full backups would quickly push backup storage requirements beyond 10GB. However, backing up this server to a deduplication storage system would require less than 1GB of storage, because only one copy of the presentation and only one copy of the thank you slide would be stored. As this example shows, a primary motivation to employ deduplication is to improve resource efficiency by not wasting storage capacity and, therefore, money.
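The arithmetic behind those figures is easy to check. The quick calculation below uses the numbers from the example; the count and average size of the “other presentations” and the retention period are illustrative assumptions, not figures from the example itself.

```python
# Back-of-the-envelope check of the example above. The number and average size
# of the "other presentations" and the 12-week retention are assumptions.
copies = 150
presentation_mb = 10
other_presentations = 300      # assumed: "hundreds" of other files containing the slide
avg_other_mb = 2               # assumed average size of those other presentations

full_backup_mb = copies * presentation_mb + other_presentations * avg_other_mb
print(f"One full backup: {full_backup_mb / 1024:.1f}GB")        # ~2.1GB, "at least 2GB"

weeks_retained = 12            # assumed: roughly three months of weekly fulls
print(f"Retained fulls: {full_backup_mb * weeks_retained / 1024:.0f}GB")  # ~25GB, well past 10GB

# Deduplicated: one copy of the presentation plus the unique content of the
# other presentations (which share only the slide with it).
deduped_mb = presentation_mb + other_presentations * avg_other_mb
print(f"Deduplicated: {deduped_mb / 1024:.2f}GB")                # ~0.6GB, under 1GB
```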

Backups are inherently redundant: repeated full backups capture the same data again and again, and each copy is then typically retained for three to four months. This extreme redundancy makes backup data sets an ideal use case for deduplication. Even incremental backups contain duplicate data segments, since in many backup applications a change of just a few bytes prompts the application to back up the entire file again. However, deduplication isn’t about storing unique files; it achieves truly impressive efficiency by storing unique subfile segments of data. Leading deduplication technology identifies and stores only new, unique segments, even when the duplicate segments appear in different files. The thank you slide represents a subfile segment that would be stored only once, no matter how many different files contained it. The amount of storage reduction deduplication can provide varies from one environment to the next, since deduplication ratios depend on the amount of duplicate data; however, a typical enterprise backup environment achieves a 10- to 30-times reduction in the backup storage required.

Deduplication Methods

The key to highly effective deduplication is the ability to identify unique segments quickly and efficiently. The best way to do this is to break a data stream into subfile segments and use a hashing algorithm to generate a unique identifier for each segment. That identifier is then compared with the identifiers of segments already on the system to determine whether the segment is unique. If the segment is unique, it is compressed and stored; if it’s already on the system, a pointer to the existing segment is stored in its place.
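As a concrete illustration, here is a minimal sketch of that identify, hash, and compare loop. It assumes fixed-size 8KB segments, SHA-256 identifiers, and zlib compression purely for readability; real products typically use variable-length segmentation and their own fingerprinting and compression, and the DedupStore class and its names are hypothetical.

```python
import hashlib
import zlib

SEGMENT_SIZE = 8 * 1024  # assumed fixed segment size; production systems often use variable-length segments


class DedupStore:
    """Toy in-memory deduplicating store: each unique segment is kept exactly once."""

    def __init__(self):
        self.segments = {}   # identifier -> compressed segment, stored once
        self.pointers = []   # ordered identifiers that reconstruct the incoming stream

    def write(self, data: bytes) -> None:
        for offset in range(0, len(data), SEGMENT_SIZE):
            segment = data[offset:offset + SEGMENT_SIZE]
            identifier = hashlib.sha256(segment).hexdigest()
            if identifier not in self.segments:
                # Unique segment: compress it and store it.
                self.segments[identifier] = zlib.compress(segment)
            # New or duplicate, the stream itself records only a pointer.
            self.pointers.append(identifier)

    def read(self) -> bytes:
        return b"".join(zlib.decompress(self.segments[i]) for i in self.pointers)


# Writing the same data twice adds pointers, not segments.
store = DedupStore()
presentation = b"slide-bytes" * 500_000   # stand-in for a repetitive file
store.write(presentation)
store.write(presentation)                 # second copy stores no new segments
assert store.read() == presentation + presentation
```

In a real system, the identifier index is far too large to keep as a simple in-memory dictionary, and making that lookup fast at scale is where much of the engineering effort in commercial deduplication systems goes.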

Deduplication can be done “inline,” as the data enters the device, or “post-process,” after it has landed on the system. Inline deduplication will invariably require less storage, because the deduplication happens during the backup process, so only unique deduplicated data is written to disk. Post-process deduplication requires that data first be written to disk and then deduplicated. As a result, deduplicating 10GB of data post-process requires an extra 10GB “landing zone,” while inline deduplication requires a fraction of that capacity, since it stores only the unique deduplicated data.
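The difference between the two approaches comes down to when the identify-and-compare step runs relative to the write. The sketch below reuses the hypothetical DedupStore from above; the landing-zone handling is deliberately simplified.

```python
import os
import tempfile


def inline_backup(store: DedupStore, data: bytes) -> None:
    """Inline: segments are identified as they arrive, so only unique data reaches disk."""
    store.write(data)


def post_process_backup(store: DedupStore, data: bytes) -> None:
    """Post-process: the full backup lands on disk first, then a second pass deduplicates it."""
    # The landing zone must hold the entire raw backup (10GB of capacity for 10GB of data).
    with tempfile.NamedTemporaryFile(delete=False) as landing:
        landing.write(data)
        landing_path = landing.name
    # Second pass: read the raw copy back and feed it through deduplication.
    with open(landing_path, "rb") as raw:
        store.write(raw.read())
    os.remove(landing_path)  # landing-zone space is freed only after this pass completes
```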

Mainframe Deduplication Benefits

So what can deduplication deliver in a mainframe environment? First, for backup data, which is naturally redundant, deduplication can dramatically reduce the storage footprint and, therefore, the cost of backup storage. In addition, smaller data volumes mean that backups can be retained onsite for longer periods of time, keeping them accessible for fast, reliable restores. Finally, deduplication of backup data can speed IP-based replication, since only unique segments are replicated to the remote site, making disaster recovery feasible without upgrading network infrastructures or interrupting business operations.