Compression is fascinating. It has been around in one form or another for almost as long as we’ve had data. In the land of DB2, we started out with software compression. A compression routine would be installed as the EDITPROC to compress your data as it was added to DB2. At one point, compression software could be a company’s primary claim to fame. It worked, it was impressive, and it saved disk space. However, it could be expensive in terms of CPU usage, so you used it only when you absolutely needed to reduce the size of a tablespace. For a while, DB2 even supplied a sample compression routine, based on the Huffman algorithm, that could be used as the EDITPROC. Compression was considered a necessary evil.
In December 1993, that changed. Hardware-assisted compression, or Enterprise Systems Architecture (ESA) compression, became a feature of DB2. Because of the hardware assist, the cost of the compression process dropped significantly, and compression began to gain in popularity. Turn the clock forward about 10 years, and although compression is now widely accepted, another problem reared its ugly head: short-on-storage conditions. One suggested method for reducing the amount of storage used by DB2 was to turn off compression, because the compression dictionary had to be loaded into memory, a resource that became increasingly valuable with newer releases of DB2. Unfortunately, the amount of data in DB2 had also increased significantly. It wasn’t just data warehousing and Enterprise Resource Planning (ERP) applications causing the growth, either; the amount of data supporting everyday Online Transaction Processing (OLTP) was also increasing at an outstanding rate. Compression has become a must-have, must-use tool.
Compression is a topic that needs to be revisited, given the number of customers planning or performing upgrades to DB2 Version 8. This article covers some compression basics, including why you’re sometimes told to avoid compression, tablespace compression, and finally, index compression.
Back in 1977, two information theorists, Abraham Lempel and Jacob Ziv, decided long strings of data should (and could) be shorter. The result was a pair of lossless data compression techniques, commonly referred to as LZ77 (LZ1) and LZ78 (LZ2), that remain widely used today. LZ stands for Lempel-Ziv; 77 and 78 are the years in which they published the original algorithm and its refinement. Various forms of LZ compression are employed in the Graphics Interchange Format (GIF), the Tagged Image File Format (TIFF), Adobe Acrobat Portable Document Format (PDF), ARC, PKZIP, COMPRESS and COMPACT on the Unix platform, and StuffIt on the Macintosh platform.
LZ77 is an adaptive, dictionary-based compression algorithm that works off a sliding window of data, using the data just read to compress the next data in the buffer. Not completely satisfied with the efficiency of LZ77, Lempel and Ziv developed LZ78, a variation that builds its dictionary from all the data seen so far rather than just a limited window.
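To make the sliding-window idea concrete, here is a toy LZ77-style encoder and decoder in Python. This is a minimal illustrative sketch, not DB2’s implementation or a production algorithm: each token records how far back in the window of already-seen data a match was found, how long it was, and the next literal byte.

```python
def lz77_compress(data: bytes, window: int = 255) -> list:
    """Toy LZ77 encoder: emits (offset, length, next_byte) tokens.

    Each token means: copy `length` bytes starting `offset` bytes back
    in the already-produced output, then append `next_byte`.
    """
    i, tokens = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        start = max(0, i - window)
        # Scan the sliding window for the longest match at position i.
        for j in range(start, i):
            length = 0
            while (i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        tokens.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return tokens


def lz77_decompress(tokens) -> bytes:
    """Rebuild the original bytes by replaying the back-references."""
    out = bytearray()
    for off, length, nxt in tokens:
        for _ in range(length):
            out.append(out[-off])  # copy byte-by-byte; overlaps are fine
        out.append(nxt)
    return bytes(out)
```

Running `lz77_decompress(lz77_compress(data))` returns the original bytes exactly, which is the lossless property discussed below; repetitive data produces fewer, longer back-references.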
In 1984, a third name joined the group when Terry Welch published an improved implementation of LZ78 known as LZW (Lempel-Ziv-Welch). The Welch variation improved the speed of implementation, but it’s not usually considered optimal because it does its analysis on only a portion of the data. There have been other variations over the years, but these three are the most significant. IBM documentation refers only to Lempel-Ziv; it doesn’t specify which variation DB2 uses.
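Welch’s dictionary-growing idea can be sketched in a few lines. This toy LZW coder is again an illustration, not whatever variant DB2’s hardware implements: it starts from the 256 single-byte entries and adds a new dictionary string each time it sees a sequence it hasn’t seen before.

```python
def lzw_compress(data: bytes) -> list:
    """Toy LZW encoder: outputs integer codes into a growing dictionary."""
    table = {bytes([b]): b for b in range(256)}  # seed: single bytes
    current, codes = b"", []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate          # keep extending the match
        else:
            codes.append(table[current])
            table[candidate] = len(table)  # learn the new sequence
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: list) -> bytes:
    """Rebuild the dictionary on the fly while decoding."""
    table = {b: bytes([b]) for b in range(256)}
    prev = table[codes[0]]
    out = bytearray(prev)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:                      # code not yet defined: the classic
            entry = prev + prev[:1]  # "string + its first byte" case
        out += entry
        table[len(table)] = prev + entry[:1]
        prev = entry
    return bytes(out)
```

Note that the decoder never needs the dictionary transmitted to it; it reconstructs the same entries in the same order the encoder created them.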
The term “lossless compression” is significant for our discussion. When you expand something that has been compressed and you end up with exactly what you started with, that’s lossless compression. It differs from “lossy compression,” which describes what occurs when images (photographs) are compressed in commonly used formats such as Joint Photographic Experts Group (JPEG). The more you save a JPEG (recompressing an already compressed image), the more information about that image you lose, until the image is no longer acceptable. That kind of loss would obviously be unacceptable for database data.
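A quick way to see the lossless round trip in action is Python’s standard zlib module, whose DEFLATE format combines LZ77-style matching with Huffman coding. The sample record below is purely illustrative, not real DB2 data:

```python
import zlib

# Repetitive, fixed-format records, much like typical OLTP rows,
# compress very well. (Hypothetical example data.)
row = b"customer: SMITH, balance: 00001234, status: ACTIVE  "
blob = row * 50

packed = zlib.compress(blob)
assert zlib.decompress(packed) == blob   # lossless: exact round trip
print(len(blob), "bytes ->", len(packed), "bytes")
```

No matter how many times you compress and decompress, the bytes that come back are identical to the originals, which is exactly what a database requires.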
DB2’s Use of Compression
Data compression today (and since DB2 V3) relies on hardware to assist in the compression and decompression process. The hardware is what prevents the high CPU cost that was once associated with compression. Hardware compression keeps getting faster because chip speeds increase with every new generation of hardware. The z9 Enterprise Class (EC) processors are even faster than zSeries machines, which were faster than their predecessors. Because compression support is built into a chip, compression speed gets faster as new processors get faster.