DB2’s optimizer generally does a good job of achieving the best possible performance for your SQL applications. However, normal RUNSTATS don’t always provide the data necessary to choose the best access path. DB2 Version 8 uses column distribution statistics effectively, as Tom Moulder explained in his August/September z/Journal article, “The Most Important Feature in DB2 Version 8.” This article will go into greater detail about the difficult trade-off between RUNSTATS costs and their benefits, as well as the pitfalls in data retrieval and maintenance. You should consider getting distribution statistics, but be sure to keep them tidy and useful.
Column Distribution Statistics
In database tables, some value combinations occur more frequently than others. Column distribution statistics are a means of quantifying this kind of “data skew” as the frequency of certain value combinations. In DB2 V8, you can get this information by specifying the COLGROUP option for RUNSTATS, whereas older versions only allowed FREQVAL specifications (for leading column groups of an index) or the use of the DSTATS. DSTATS has been available since DB2 V5 and is available as a free download from IBM at ftp://www.redbooks.ibm.com/redbooks/dstats/.
When you’re binding or preparing dynamic SQL, the DB2 optimizer figures out an access path based on the estimated costs of different access methods. For single table SELECTs, choices are limited, and though they may be wrong due to skewed statistical data, the extra costs are usually tolerable. When multiple tables are JOINed, data skew may cause serious performance degradation because the optimizer might choose the wrong outer table or join method for the size of the intermediate result set. This happens because the optimizer makes assumptions about the filter factor, which is the proportion of data returned from the evaluation of a predicate. It assumes, for example, the filter factor of an EQUALS predicate would be 1/CARDF.
For correlated columns and skewed data, most predicates have a lower selectiveness; they return more rows than DB2 would expect at that stage of the execution. Typical examples of correlated data are CITY, STATE and ZIPCODE, or FIRSTNAME and GENDER in a customer table. After applying a predicate such as CITY = :hostvar-city, the predicate STATE = :hostvar-state won’t exclude many columns.
Figure 1 shows the SQL I found on a customer’s production system (slightly modified for anonymity). Figure 2 shows the corresponding EXPLAIN output before and after applying column distribution statistics. In this example, you can see data skew on single columns. While HIGH2KEY could address the special TERM_DATE for a customer account from normal RUNSTATS, the optimizer would still assume that ENTRY_TYPE ‘B’ occurs as often as any other one and plan the access path accordingly. Since a nested loop join is efficient only on small result sets, the access path is changed to a merge scan join that has better characteristics on bigger result sets. The choice of the outer table may or may not be a consequence of the access method, but is less significant with merge scan joins anyway. Further examples and explanations can be found in Tom Moulder’s article.
Deleting Obsolete Statistics
No matter how you gather your distribution statistics, you should know how to get rid of them. Because column correlation statistics are comparatively expensive to create, you might not want to apply your normal maintenance cycle to them. That’s probably why IBM chose not to delete old column distribution statistics when you perform a RUNSTATS without special options. To remove statistics, specify the FREQVAL or COLGROUP options with a COUNT 0 or just delete the obsolete data from SYSCOLDIST. Since it only needs to scan index data in sorted order, RUNSTATS will be faster with the FREQVAL specifications than with the COLGROUP statistics.
When using the COUNT 0 method described above, you should take additional measures after dropping or changing indexes because SYSCOLDIST will still hold data related to the columns that used to be leading index columns. So you might end up with outdated statistics, since those entries aren’t automatically updated or deleted in all cases. (You typically change indexes via a DROP followed by a CREATE statement containing the refined design.) To be sure your statistics remain up-to-date, verify the content of your SYSCOLDIST table after index changes when you use RUNSTATS with FREQVAL.