Feb 1 ’09
Not Sure Who You Are? Ask DB2
Customers in all industries often experience challenges while searching for their customers’ names in a database. Sometimes, there may be a misspelling in the name. Sometimes, the customer may not remember what name they’re registered under, and sometimes, we simply need to search a range of names. When dealing with large amounts of data, SQL accuracy and search execution performance can be challenging. However, certain DB2 techniques can help improve performance for name searches of various sorts. This article discusses a few interesting scenarios, including one involving name searching against DB2 data using a legacy key, interesting data variations, and some more generic name search examples and tips.
Generic vs. Quick Hit Techniques
Often, when providing search support, developers take a one-stop shopping approach to search screens and subsequent search queries. Managers and users ask for the ability to search on a variety of fields and any combination of them. Developers look at the myriad of combinations and quiver at the prospect of the complex code to support these searches. The result typically is a big, fat, generic search query that spans many lines of code and has no efficient access path. Sometimes, these queries even contain switches that will activate or inactivate compound predicates, such as “AND :HV1 = 1.” This type of switching doesn’t belong in SQL statements and instead should be supported by if-then-else logic in the application program with branching to multiple queries.
With this in mind, consider your search screens with relevance to the data, the business needs, and common practices of users and customers. What are the most common search field combinations? Are there any search fields you can force to be required? At a minimum, there should be fields available that can be used to provide the best search performance. These should serve as “quick hit” fields and be the basis for your primary query. Yes, that means you’re going to code multiple search queries. This is usually necessary if you want to achieve good performance, especially when searching large databases.
Assuming customers are calling in (or logging on) without knowing their customer number, Figure 1 presents a generic search query that supports a variety of optional search columns. Each compound predicate, separated by an “OR,” supports a different combination of search criteria. In this example, it sup-ports a search on name, address and partial name, or phone number and partial name. This is a powerful search, but may not deliver optimal performance. At best, the optimizer may be able to get multi-index access with carefully designed indexes. When fast response and minimal CPU consumption are important, this may be the easy programming choice, but not the best performance choice.
An easy, economical solution is to build two queries with a little extra application code to support them. While more understanding of the data and common search input is required, the return in response time and CPU savings is worth the effort. For example, if most customers who call in can provide their last name and phone number, then the query in Figure 2 could be executed before the generic query in Figure 1. Coordinated training of customer service representatives to ask for these common fields could help achieve extremely fast search times.
The Need for Boolean Term Predicates
Generic searches yield generic performance. Often, application developers are coding predicates in queries that support scrolling through a search result set as well as a direct hit search result. Again, multiple cursors are better from a performance perspective than a do-all, multi-purpose cursor.
When it comes to scrolling cursors, it’s important to use Boolean term predicates in your queries. A Boolean term predicate is a simple or compound predicate that, when evaluated to false, makes the evaluation of the entire row false. Figures 3 and 4 show queries that contain Boolean term and non-Boolean term predicates. If you look at Figure 3 and ask yourself if any one of the simple predicates alone can eliminate a row, the answer is no. So, there are no Boolean term predicates in that query. In Figure 4, the simple predicate “AND EMPNO >= ?” can alone eliminate rows from the answer set. That predicate is a Boolean term.
The reason we care whether our queries contain Boolean term predicates is that DB2 can only consider Boolean term predicates for efficient single index access. If your query contains no Boolean term predicates, then the best access you can get is either non-matching index access (usually to avoid a sort) or multi-index access.
There’s a simple solution to inefficient index utilization for scrolling cursors—add a Boolean term predicate! Figures 4 and 5 show two examples of queries that produce the same result with both non-Boolean term and Boolean term predicates. Often, you may be adding a redundant predicate, but the improved index access is worth it. Figure 4 shows our query from Figure 3 with a redundant Boolean term predicate, “AND EMPNO >= ?”, added. This predicate provides no additional filtering because the query already contains simple predicates that filter for EMPNO equal or EMPNO greater than the input value. However, because it is a Boolean term, DB2 can use the index defined on EMPNO for potentially faster query performance. Figure 5 shows a batch cursor operating on a range of a combination of fields using Boolean term predicates for matching index access of one column. Since the predicates are all connected by “AND,” logically a SEG value outside the range of input values will eliminate rows without the need for DB2 to check the SUBSEG value. So, DB2 can use an index on SEG for potentially faster query response. A future release of DB2 may be able to automatically do this for us, but for now (as of DB2 9), we must do it ourselves.
Choices for Multiple Searches
Often, search operations must handle multiple combinations of input. This is common in situations such as name reversals (typing in the last name where the first name is supposed to be and vice versa). The flexibility of the SQL language gives us many choices for these types of searches. The query in Figure 6 obtains the data for the first combination of fields, and then uses a during join predicate to search on the second combination of fields only when the first returns no result. DB2 will execute the second join only when the first finds nothing. This is an efficient way to perform an optional search.
The three queries in Figure 7 perform exactly the same function. The first relies on the potential for multi-index access. The second uses a UNION to run both searches as separate queries for the potential for full index access on each separate sub-select. The third uses a common table expression to build a table containing the search criteria in multiple columns and then join that table to the table being searched. This could boost index access for queries that search based on multiple optional columns. Default values could be built into the search criteria for potential full index access on optional search fields.
Which one of these techniques illustrated in Figure 7 is best for you? You must find out for yourself. This means running EXPLAIN against each of the statements as well as benchmarks to determine the actual impact of the query under simulated realistic conditions. This may seem like a lot of work (it usually isn’t), but the response time and CPU benefits last for years.
Denormalizing the Common Data
In many search situations, there are common criteria and not-so-common criteria. When you think about searching for a person, would it be faster to find someone named John Smith or someone named Zaphod Beeblebrox? Chances are John Smith, being a more common name, would be a more expensive search. Now, add in such things as potential spelling errors and you need to add such things as LIKE predicates, compound predicates connected by OR’s, or the use of hash codes or soundex algorithms to make the search more generic. This combination of factors can cause extremely slow performance for generic searches against common names.
In some of these situations, people have resorted to denormalization for improved search performance. The data is stored in a relational database for specific reasons: flexibility, data consistency, and manageability. Denormalization defeats all of these reasons and, in some cases, is an admission of failure. You should definitely try all options before considering denormalizing your database.
If you find your back against a wall and must consider denormalization, then perhaps a partial denormalization is possible. In a situation such as a 10-minute search response for a common name against 1 billion stored names, denormalizing a data store of 1 billion names is unmanageable. The search performance problems were only with the common names. The solution in this situation was full denormalization for only common names. Analysis of the data was performed, and a table of the keys for the most common names was developed. This table was used as the driver for an extract and load process to pull the common data and move it into the denormalized table (see Figure 8). It also was used as the driver for database triggers that fired whenever changes to the base tables were made and data needed to be denormalized to the common name search table. The search process was modified to search the denormalized data first, and if nothing was found, then run the regular search query against the full set of normalized data (see Figure 9).
All this can be quite complicated, but the payoff is worth it; it’s sometimes the only way to meet Service-Level Agreements (SLAs).
Generic searching can be a performance killer. Specific searches are a performance advantage. DB2 provides several options for searching either generically or specifically. Taking advantage of these features with some clever experiments and coding can prove extremely beneficial and offer an extreme advantage over denormalization.