Aug 1 ’08

IT Sense: Don’t Be the Dupe

by Editor in z/Journal

A short while back, IBM announced its acquisition of Diligent Technologies, a player in the data de-duplication space. De-duplication is a marketecture umbrella describing a range of technologies used to squeeze data so that more of it fits on a disk spindle—especially if that spindle is part of a Virtual Tape Library (VTL)—or so it can be moved across a thin WAN link more efficiently.

On the announcement call, analysts asked the usual questions about the overlap between Diligent functionality and de-dupe functionality IBM already offered customers, courtesy of its relationship with Network Appliance (answer: Big Blue continues to offer both technologies), and what this meant for Diligent’s existing resale arrangements with competitors such as Hitachi Data Systems and Sun Microsystems (answer: both have continued).

The story might end here, were it not for some recent statements by Network Appliance about key differences between the two de-dupe technologies. According to NetApp, Diligent’s method of de-duplication puts data at risk of “noncompliance.” Responding to a survey of de-duplication vendors that I created on my blog, NetApp contrasted its approach with that of “inline de-duplicators” such as Diligent.

Inline de-duplication’s main benefit, NetApp’s spokesperson wrote, “is that it never requires the storage of redundant data; that data is eliminated before it is written. The drawback of inline, however, is that the decision to ‘store or throw away’ data must be made in real-time, which precludes any data validation to guarantee the data being thrown away is in fact unique. Inline de-duplication also is limited in scalability, since fingerprint compares are done ‘on the fly’; the preferred method is to store all fingerprints in memory to prevent disk look-ups. When the number of fingerprints exceeds the storage system’s memory capacity, inline de-duplication ingest speeds will become substantially degraded.”
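
To make that trade-off concrete, here is a minimal sketch of how an inline de-duplicator makes its real-time decision. It is purely illustrative: the block size, the choice of hash, and the in-memory dictionary are my assumptions, not a description of Diligent’s (or anyone else’s) actual product.

```python
import hashlib

# Illustrative inline de-duplication: each incoming block is fingerprinted and
# checked against an in-memory index before anything is written; on a
# fingerprint hit the new block is thrown away and replaced by a pointer to
# the copy already on disk.

BLOCK_SIZE = 4096        # assumed fixed block size
fingerprint_index = {}   # fingerprint -> location of the stored block
storage = []             # stand-in for the disk back end

def ingest_block(block: bytes) -> int:
    """Return the storage location that now represents this block."""
    fp = hashlib.sha1(block).digest()   # the "fingerprint"
    if fp in fingerprint_index:
        # The store-or-throw-away decision is made in real time: a matching
        # fingerprint is taken to mean identical data, with no byte-level
        # check. If two different blocks ever shared a fingerprint (a "false
        # compare"), the unique block would be silently lost.
        return fingerprint_index[fp]
    storage.append(block)
    fingerprint_index[fp] = len(storage) - 1
    return fingerprint_index[fp]

# Example: ingest a stream in fixed-size blocks.
data = b"A" * 8192 + b"B" * 4096
for i in range(0, len(data), BLOCK_SIZE):
    ingest_block(data[i:i + BLOCK_SIZE])
print(len(storage))   # 2 unique blocks stored instead of 3
```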

He continued, “Post-processing de-duplication, the method that NetApp uses, requires data to be stored first, and then de-duplicated. This allows the de-duplication process to run at a more leisurely pace. Since the data is stored and then examined, a higher level of validation can be done. Post-processing also consumes fewer system resources during the de-duplication process, since fingerprints can be stored on disk.”
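
Here is a comparable sketch of the post-processing approach: data lands on disk first, and a later pass fingerprints the stored blocks and verifies each match byte for byte before collapsing it, the “higher level of validation” referred to above. Again, the data structures and the verification step are illustrative assumptions, not NetApp’s actual implementation.

```python
import hashlib

# Illustrative post-processing de-duplication: blocks are already on disk; a
# later pass fingerprints them and verifies each match byte for byte before
# collapsing duplicates.

def dedupe_pass(stored_blocks):
    """Collapse duplicates; return (unique_blocks, reference_map)."""
    seen = {}            # fingerprint -> index into unique_blocks
    unique_blocks = []
    reference_map = []   # for each original block, where its data now lives
    for block in stored_blocks:
        fp = hashlib.sha1(block).digest()
        if fp in seen and unique_blocks[seen[fp]] == block:
            # Fingerprint hit *and* the bytes really match, so it is safe to
            # throw this copy away -- the validation step the real-time
            # inline path cannot afford.
            reference_map.append(seen[fp])
        else:
            unique_blocks.append(block)
            seen[fp] = len(unique_blocks) - 1
            reference_map.append(seen[fp])
    return unique_blocks, reference_map

# Example: three stored blocks, two of them identical.
blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]
unique, refs = dedupe_pass(blocks)
print(len(unique), refs)   # 2 [0, 1, 0]
```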

“Bottom line,” the fellow contended, “if your main goal is to never write duplicate data to the storage system, and you can accept ‘false fingerprint compares,’ inline de-duplication might be your best choice. If your main objective is to decrease storage consumption over time, while ensuring that unique data is never accidentally deleted, post-processing de-duplication would be the choice.”

That background is necessary to understand a key point raised by NetApp about the questionable acceptability of de-duplicated data to regulators and law enforcement entities concerned with the immutability of certain data. Wrote NetApp: “The regulators want proof that [certain] data has not been altered or tampered with. NetApp de-duplication does not alter one byte of data from its original form. [It is] just stored differently on disk. One interesting point though, is what happens if a ‘false fingerprint compare’ as previously described with inline de-duplication occurs. Now the data has been changed. Because of this, inline de-duplication may not be acceptable in regulatory environments.”

NetApp’s position earned it flames from many of the other survey respondents. While they all toed the party line that de-dupe simply describes data with fewer bits, they accused NetApp of flying a self-serving flag of Fear, Uncertainty and Doubt (FUD) by raising the issue of the acceptability of de-duplicated data—more specifically, data de-duplicated the Diligent way—to regulators and courts of law.

I’m beginning to wonder about this characterization. Some of my clients, especially financial institutions, are creating policies to exclude certain files from de-duplication processes on the off chance the data will be viewed as “altered” by folks at the Securities and Exchange Commission (SEC), Department of Justice (DOJ), and elsewhere. De-dupe vendors offer reassurances that the technologies are defensible, and IT folk are on a noble quest to “do more with less” by squeezing more data onto fewer spindles, but none of that means anything if the courts decide that data so squeezed is no longer “full, original and unaltered.” Thus far, there has been no test case to make or break the argument that de-duplicated data is OK.

My recommendation to IT practitioners who read this column is simple: Before deploying de-dupe technology, which is finding its way into mainframe VTLs as we speak, touch base with your legal or risk management department honchos. Explain (or, better yet, have your vendor explain) how de-dupe works on your data and get a written approval for deploying the technology. Keep an original copy in a waterproof, fireproof container—don’t file it electronically in a de-duplicated repository!

While there’s no case law about the acceptability of de-duplicated data, there’s plenty of precedent for legal disputes rolling downhill from the director’s or senior manager’s offices to the trenches of IT. When it comes to de-dupe, the objective of the IT practitioner must be one of self-defense. Don’t be “the dupe.”