Best practices for transactional application systems aren’t always relevant to an enterprise data warehouse. There are structural differences, including database design and data access profiles. There are also different priorities. For example, data availability may be a higher priority than data recovery. Another significant difference is perceived performance. While an Online Transaction Processing (OLTP) system user might expect sub-second response, in the data warehouse, analytical queries may run for minutes or hours. Adding to this is the differing knowledge and expertise of the support staff, as well as a proliferation of departments managing various aspects of the data warehouse. What are the biggest differences? What are the biggest challenges for staff, Database Administrators (DBAs), and managers?
A Change in Mindset
The four Laws of Database Administration are a guide for DBAs when setting priorities. The following, in order, are the DBA’s most important goals:
• Data recoverability
• Data availability
• Data security
• Application performance.
These laws hold up well for transaction-oriented systems where database designs support and enforce business rules such as uniqueness or referential integrity. They also support enterprise requirements, including Disaster Recovery (DR), as a whole. When we consider the reasons the data warehouse exists and how the data is used, the laws must be re-examined:
Data recoverability. In reality, few data warehouse implementations are considered mission-critical. If disaster strikes, the importance of recovering the warehouse takes a back seat to accounting systems, external customer-facing systems, and the like. It’s common for warehouses to have a DR designation of “recoverable within seven business days.”
There are other considerations. For example, it’s common for data warehouse Extract, Transport, Load (ETL) processes to be executed with a DB2 load utility that designates “LOG NO,” meaning that table changes made during data load aren’t to be recorded on the DB2 logs. This makes table recovery almost impossible for the DBA, unless image copies are taken immediately after loads. But, again, it’s common to forgo image copies during the ETL process to minimize the number of simultaneous jobs running and reduce the amount of getpage and I/O work done in DB2. If image copies are desired, they’re commonly executed later. For these and other reasons, data recoverability isn’t an important consideration in the data warehouse environment (see the accompanying sidebar for a more in-depth discussion).
Data availability. This is the paramount issue in a data warehouse environment. If the data isn't available for loading or analytics, then you aren't serving your customers. The database design usually doesn’t contain primary and foreign key definitions since these already exist in the source systems. There’s little need to enforce business rules such as data validation; again, this is already done at the source. The database design of the data warehouse exists to support the analytics executed against its data. Secondarily, it must support robust ETL processes that populate warehouse tables.
Data security. Oddly, this isn’t commonly much of a concern. Data warehouses typically contain facts that are analyzed and aggregated, along with dimensions used for selection and sub-sets. For example, “analyze account balances across geographic regions X, Y, and Z, determine if average balances are higher in certain sub-regions.” Here, the facts are the account balances and the dimensions are the regions. With this type of analysis, you don’t find highly secure ID data such as birthdates or ID numbers. Yes, facts such as account balances should be secure. However, there’s rarely a complex, column-based, and accessor-based security mechanism. Rather, the warehouse is typically given an overall security designation such as “internal use,” and user access is granted, depending on the person’s need to use the data.
Application performance. Performance, like data availability, is important, but it falls below data availability.
Too Much to Manage?
Depending on the size, importance, and maturity of your data warehouse, there may be multiple departments involved in its support: