The mind is a wonderful and powerful thing. When you read the title for this article, your mind identified duplicate strings in the data, parsed them out, and may have generated a chuckle interrupt. Clearly, there is an enormous quantity of duplicate sequential data in our enterprises and the maintenance of this data is a key driver for the hardware, energy, and facility costs of storage.
While z/OS users may be skeptical of the potential benefits of de-duplication, anyone who has an email account understands how large chunks of data can be duplicated countless times. In addition to the copies of meeting notices, presentations, and spreadsheets received from members of your work group, we all have a few friends whose primary purpose in life is to forward, to you and everyone else they know, countless images and video clips on subjects ranging from patriotic themes to the fashion trends of shoppers at big-box stores. Hence, it is easy for most of us to envision the de-duplication value proposition for an email server. Moreover, sequential backup processes that create generational copies of the file system structures amplify this data duplication.
In this article, we will examine seven primary topics:
- whether the hypothesis of data de-duplication is plausible,
- fixed-length versus variable-length segmentation for de-duplication,
- a high-level overview of how de-duplication works,
- de-duplication workload characterization,
- the potential benefits of de-duplication for DB2 tables,
- performance considerations and metrics, and
- closing comments and observations.
Each of these topics will be discussed in detail in the following sections.
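Before turning to those topics, the core mechanism is worth sketching: a de-duplicating store splits incoming data into segments, hashes each segment, and stores only segments whose hash has not been seen before. The following is a minimal, illustrative Python sketch of fixed-length de-duplication; the 4K segment size matches the discussion below, but the data structures are assumptions for illustration, not how any real product is implemented.

```python
import hashlib

# Minimal sketch of fixed-length de-duplication (illustrative only; real
# products use far more sophisticated indexing and collision handling).

SEGMENT_SIZE = 4096

def deduplicate(data: bytes):
    """Split data into fixed segments; store each unique segment once.

    Returns (unique-segment store, ordered list of hash keys that can
    reconstruct the original stream).
    """
    store = {}   # hash digest -> segment bytes
    recipe = []  # ordered digests describing the original stream
    for i in range(0, len(data), SEGMENT_SIZE):
        seg = data[i:i + SEGMENT_SIZE]
        key = hashlib.sha256(seg).digest()
        store.setdefault(key, seg)  # keep only the first copy
        recipe.append(key)
    return store, recipe

# Highly duplicated input: 100 copies of the same 4K pattern.
data = bytes(range(256)) * 16 * 100   # 16 * 256 = 4096 bytes, repeated 100x
store, recipe = deduplicate(data)
print(len(store), len(recipe))  # 1 unique segment stored, 100 references
```

The pay-off is visible in the last two lines: 100 logical segments collapse to a single physical copy plus a table of pointers, which is exactly the trade-off examined in the sections that follow.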
1. An Audacious Hypothesis
At first examination, the concept of data de-duplication presents an audacious hypothesis. Why should it work? Consider for a moment a 10 TB store of data that is managed in 4K segments. The store would comprise roughly 2.6 x 10^9 4K segments. While this is an impressive number, each of those 4K segments can be viewed as a binary number comprised of 32,768 bits. That is, each segment, considered as a number, could represent an integer value between 0 and 2^32768 - 1! Hence, if the content of the data segments were random at any level of significance, it would be likely that most of the 4K segments would be unique. While we have used a 4K segment size for this example, the results are equally implausible for segment sizes as small as 128 bytes (2^1024 potentially unique segments).
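The figures above are easy to verify. The short sketch below reproduces the arithmetic (the 10 TB store size and the 4K and 128-byte segment sizes come from the discussion; everything else is straightforward calculation):

```python
# Back-of-the-envelope check of the segment counts discussed above.

STORE_BYTES = 10 * 2**40  # a 10 TB backing store

def segment_stats(segment_bytes: int, store_bytes: int = STORE_BYTES):
    """Return (number of segments in the store, bits per segment)."""
    segments = store_bytes // segment_bytes
    bits = segment_bytes * 8
    return segments, bits

# 4K segments: ~2.7 x 10^9 of them, each one of 2^32768 possible values.
segs_4k, bits_4k = segment_stats(4096)
print(f"4K segments in 10 TB: {segs_4k:.3e}, bits per segment: {bits_4k}")

# 128-byte segments: each is still one of 2^1024 possible values.
segs_128, bits_128 = segment_stats(128)
print(f"128B segments in 10 TB: {segs_128:.3e}, bits per segment: {bits_128}")
```

Even at the smallest practical segment size, the space of possible segment values dwarfs the number of segments actually stored, which is why random data would de-duplicate essentially not at all.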
The reader should not conclude that the prior paragraph argues for smaller segment sizes. Rather, the size of the pointer tables required to virtualize the backing store grows inversely with the segment size: halving the segment size doubles the number of entries that must be maintained. Moreover, the 512 byte sectors employed by the backing store devices effectively put a floor on the size of a segment.
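The metadata trade-off can be illustrated with a simple calculation. Note that the 24-byte pointer-table entry size used here is an assumption chosen for illustration (roughly a hash key plus a location), not a figure from this article:

```python
# Hypothetical illustration of the pointer-table trade-off: the metadata
# needed to virtualize a 10 TB store grows inversely with segment size.
# The 24-byte entry size is an assumed figure for illustration only.

ENTRY_BYTES = 24          # assumed size of one pointer-table entry
STORE_BYTES = 10 * 2**40  # 10 TB backing store

def pointer_table_bytes(segment_bytes: int) -> int:
    """Metadata needed to map every segment in the store."""
    return (STORE_BYTES // segment_bytes) * ENTRY_BYTES

for seg in (512, 4096, 65536):
    mb = pointer_table_bytes(seg) / 2**20
    print(f"{seg:>6}-byte segments -> {mb:,.0f} MB of pointer entries")
```

Under these assumptions, dropping from 4K segments to the 512-byte sector floor multiplies the pointer-table footprint by eight, which is why segment size is a balance rather than a race to the bottom.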
For de-duplication to be effective for z/OS backups, there must be a very high probability of consecutive backup generations having significant commonality. While the author has no intent of offering a proof of this assertion, there is compelling anecdotal evidence to support it. Consider three classes of data common to most z/OS installations:
z/OS Residence Volumes: since the cowboy days of system programming have long since been replaced by draconian change control policies, the change rate of the data contained on these volumes is small. In addition, enterprises often freeze changes during key periods (like year-end processing) to avoid the potential of changes inadvertently compromising production processing.
Libraries: the libraries (PDS and PDSE datasets) that contain program products, control members, application source code members, and JCL streams also change at a glacial pace. Prudent programmers typically create a new member rather than change the content of an existing member name.