May 13 ’09

Canonical Database Architecture and DB2 Performance … Really?

by Editor in z/Journal

A global canonical database enterprise architecture adds a common layer to the enterprise design that lets heterogeneous systems coexist and interface through a common language. (Canonicality depends on context, but here it means the usual or standard state or manner of something.) Such an architecture supports a strong management desire to combine all related aspects of enterprise data-centric operations.

Service-Oriented Architecture (SOA) is supposed to address the disconnection that forms between new applications and legacy applications. The disconnection develops over the years as technology advances: new systems are built using new techniques, and the interfaces between them are typically custom-built, slow or delayed, or crudely constructed in ways that depend on manual activities. Custom interfaces built on differing technologies mean more human resources and less flexibility, while keeping the organization flexible means being able to adapt quickly to change at low cost. SOA can alleviate these problems, but it can be difficult to keep operational costs reasonable, and SOA implementations often underutilize and improperly manage DB2. However, a proper balance of mapping, caching, and efficient SQL design can help overcome most performance issues.

Enterprise Database Design

From a physical perspective, implementing a global canonical database enterprise architecture means using a common set of interfaces and tools to let disparate systems communicate quickly. Typically, a common enterprise architecture group is responsible for creating and maintaining these common objects, so a core set of people works with a common set of programming tools to bring it all together. The tools typically consist of:

• A common service architecture layer connecting all services that supply data and processes

• Messaging services, usually event-based, that handle the communication across services

• A common message format that’s used across the enterprise to carry information across services

• A set of maps that let the disparate systems and data stores convert data from the local format to the common format carried in the messages (a simple mapping sketch in SQL follows this list).
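One way the mapping layer is often realized at the database level is with views that rename and recast local columns into the canonical shape. The following is a minimal sketch only; the LEGACY.ORDER_MASTER table, its columns, and the canonical names are hypothetical rather than taken from any particular implementation.

-- Hypothetical legacy table exposed in the canonical (common) format.
-- The view renames and recasts local columns so the messaging layer
-- reads one shape regardless of how the local store defines its data.
CREATE VIEW CANON_ORDER AS
  SELECT ORD_NBR                        AS ORDER_ID,
         CUST_NBR                       AS CUSTOMER_ID,
         CAST(ORD_AMT AS DECIMAL(11,2)) AS ORDER_AMOUNT,
         DATE(ORD_TS)                   AS ORDER_DATE
  FROM   LEGACY.ORDER_MASTER;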

Management Challenges

There’s never an absolute right or wrong way to do things—only tradeoffs in the costs and benefits of a specific implementation or strategy. There’s always a price to pay. In the case of canonical database architecture, you must be aware of the implementation challenges; it’s often more difficult and costly than it initially seems. Poor implementation of any architecture can also result in performance problems that are nearly impossible to resolve.

Mapping Issues

You need a common message format to communicate with new and existing applications, and this common message must map to the various local database formats. Mapping can be efficient and easy to manage as long as the local and common formats remain stable. Unfortunately, stability is elusive in a modern, dynamic environment. So if an application is locked into a specific map, how can you manage change? Figure 1 shows two simple COBOL record layouts. In each example, if data movement is by individual attributes, there’s a certain amount of flexibility in mapping. However, if movement is at the record level, then a change to the main format, as shown in Figure 2, could break the local format.

The situation described in this simplistic example is real. Companies can find themselves locked into fixed message formats that make any mapping change extremely difficult to manage. Although our example uses fixed formats to demonstrate the problem, the problem isn’t limited to fixed message formats.

Possible solutions include creating maps of maps or versioning the message formats. Maps of maps establish a central cushion that gives some applications a layer of isolation from changes to the message formats. This allows more flexibility, but adds complexity to the architecture and possibly more processing cost. Versioning seems more practical, and some message brokering software accommodates it. There’s still increased management cost in maintaining multiple versions, which can lead to slow adoption of the messaging architecture and quick abandonment of it due to confusion.
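As a loose illustration of version-based isolation (in practice the cushion usually lives in the message broker rather than the database), the hypothetical CANON_ORDER mapping from the earlier sketch could be versioned so existing consumers keep a stable shape while new consumers pick up an added attribute:

-- Version 1: the original canonical shape; existing consumers are untouched.
CREATE VIEW CANON_ORDER_V1 AS
  SELECT ORD_NBR  AS ORDER_ID,
         CUST_NBR AS CUSTOMER_ID
  FROM   LEGACY.ORDER_MASTER;

-- Version 2: the canonical format gains an attribute (ORD_CHNL is hypothetical);
-- only the applications that need it migrate to this view.
CREATE VIEW CANON_ORDER_V2 AS
  SELECT ORD_NBR  AS ORDER_ID,
         CUST_NBR AS CUSTOMER_ID,
         ORD_CHNL AS ORDER_CHANNEL
  FROM   LEGACY.ORDER_MASTER;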

Common Object Management

Typically, a central group manages a set of common data access objects. The common objects are generic in nature and built with flexibility in mind, which usually comes at the cost of performance. When a request comes into the group for access to a certain piece of data, the group first checks the available objects to see if one can be reused.

This makes sense from an object management perspective: fewer objects mean lower management costs. The trade-off is that applications sharing an object are locked into its definition. They may be retrieving data they don’t need and simply dropping it on the floor, and returning data from DB2 unnecessarily carries a resource cost. Also, if an application needs information an object doesn’t currently provide, developers must go to the central group to change the object or create a new one. Any change to the central object could increase data access costs for every application using it, not just the one requesting the change. Versioning of objects could help resolve this performance issue, but at the price of higher management costs.
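The cost of that unused data shows up directly in the SQL the shared object issues. The statements below are a hedged sketch using a hypothetical CUSTOMER table: the first is what a generic object that always returns its full attribute set sends to DB2, the second is what a caller that only needs a name actually requires.

-- Generic shared object: every caller gets every column the object exposes,
-- even when most of them are dropped on the floor.
SELECT CUST_ID, CUST_NAME, ADDR_LINE1, ADDR_LINE2, CITY, STATE_CD,
       POSTAL_CD, PHONE, EMAIL, CREDIT_LIMIT, LAST_ORDER_DATE
FROM   CUSTOMER
WHERE  CUST_ID = ?;

-- Tailored access: only the column this caller uses, which is where the
-- column-reduction savings discussed later come from.
SELECT CUST_NAME
FROM   CUSTOMER
WHERE  CUST_ID = ?;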

Change Control

Having generic centralized objects, maps, and message formats can lead to a change management nightmare. Usually, application development doesn’t happen in a centralized fashion, but rather in various “stove-piped” groups. Once these groups begin using centralized objects, they become dependent upon them, and the centralized objects in turn develop dependencies on the applications using them. The dependency shows up in change control, regression testing, and application approvals. When an application uses a central object or format, that application becomes a stakeholder: any change to the centralized object requires coordinated testing and approval from the various application teams. Often, these teams have their own development and testing in process, and schedules don’t mesh. Scheduling and testing issues can create an environment where applications can be deployed only once per month, which can lead to eventual abandonment of the centralized data access objects.

Potential ways around the change control issue include centralizing the regression testing and giving the central data access team the authority to dictate changes to the application teams. At a minimum, you need strong communication that gives plenty of advance notice when changes are coming.

Database Performance Issues

As with any database design, the best performance comes from direct SQL access to the database objects. Getting only the data that’s needed, when it’s needed, is the key to high performance. Introducing layered access to your data stores can negatively impact database performance: a generic data access object lends itself to generic database performance. The performance issues created by generic data access can’t be resolved by tuning the database or adding system resources such as memory or CPU; this is solely an application problem.

Layers and Layers

The key to high-performance database design is to move data-intensive processes as close to the data as possible. Filtering then happens close to the data, so less data is passed around and processed by the other layers. Consider Figure 3. How many service layers must be crossed before a piece of data is actually viewed by a person or processed by an application? As more layers are added to the architecture, more issues arise:

• More layers of abstraction leave more people clueless about how the data is physically stored. This can lead to people choosing data service objects for ease of use rather than performance, or choosing an access path that crosses layers that aren’t required.

• More layers of processing objects mean more machine power is required to run the processes. While machines are cheaper than people, they still need to be managed by people, and machine upgrades typically mean significant increases in software costs. This is especially true for IBM mainframes, where software licensing is typically based on MIPS consumed.

• More layers also mean more complexity and confusion. The ultimate goal of the architecture is to provide a simpler view of the enterprise, but there’s considerable complexity involved in building the environment to support that. With common object management and mapping, migrations become more intense and involve more people. Ultimately, you’ll probably need more people to support the enterprise architecture than you anticipated.

Generic Object, Generic Performance

This is one of the biggest trade-offs when it comes to database performance. If data access objects are built on top of individual database objects, then your applications have no way of capitalizing on the specific performance features of the database you’re accessing. In tests:

• Coding a two-table SQL join instead of accessing the two tables in separate SQL statements has shown CPU savings of up to 30 percent (see the sketch after this list).

• Reducing the number of columns an SQL statement returns has demonstrated up to 50 percent CPU savings.
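As a hedged illustration of the join result, using hypothetical ORDERS and CUSTOMER tables, the two patterns compare as follows:

-- Pattern generic data access objects tend to produce: two separate
-- statements and two trips to DB2, with the matching done in the application.
SELECT ORDER_ID, CUST_ID, ORDER_AMOUNT
FROM   ORDERS
WHERE  ORDER_ID = ?;

SELECT CUST_NAME, CREDIT_LIMIT
FROM   CUSTOMER
WHERE  CUST_ID = ?;

-- A single two-table join: one statement, one trip, and DB2 can choose an
-- access path that performs the matching itself.
SELECT O.ORDER_ID, O.ORDER_AMOUNT, C.CUST_NAME, C.CREDIT_LIMIT
FROM   ORDERS   O
JOIN   CUSTOMER C ON C.CUST_ID = O.CUST_ID
WHERE  O.ORDER_ID = ?;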

Some organizations have considered using the database software as an access method rather than as a database. This has led to highly denormalized database designs and, in some cases, Binary Large Object (BLOB)-style storage of all data in a single database column (a simplified sketch of such a design follows this list). While these designs show promise on paper in lowering database access costs, they can have extreme database performance consequences, including:

• Increased backup and recovery costs because there’s no way to separate only the part of the database that changes. This means having to continuously back up all data, leading to increased backup footprint size, huge backup windows, and increased machine costs to support the strategy.

• Extreme costs in handling database updates. This means that while access costs go down, update costs go up because more data has to be updated in the generic design. If the 80/20 rule applies (80 percent read and 20 percent update), maybe the savings outweigh the costs, but if your activity varies, the costs can become excessive. Also, with a generic design, you could end up updating far more often than is necessary. That is, with the generic design the 80/20 rule may not apply, as there can be fewer controls minimizing the issuing of update statements.

• The potential for update anomalies in a generic or denormalized design can lead to data integrity issues. This can increase the costs associated with testing and troubleshooting.
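To make the update cost concrete, here is a hedged sketch contrasting a hypothetical single-BLOB “access method” design with a normalized table; the names and sizes are illustrative only.

-- Generic design: the entire serialized business object lives in one column.
CREATE TABLE GENERIC_STORE (
  OBJECT_KEY   VARCHAR(64) NOT NULL,
  OBJECT_DATA  BLOB(1M)    NOT NULL,
  PRIMARY KEY (OBJECT_KEY)
);

-- Changing a single attribute means rewriting (and logging) the whole object.
UPDATE GENERIC_STORE
SET    OBJECT_DATA = ?
WHERE  OBJECT_KEY = ?;

-- Normalized design: the same business change touches one small column.
UPDATE CUSTOMER
SET    CREDIT_LIMIT = ?
WHERE  CUST_ID = ?;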

You should seriously consider the costs associated with generic design and denormalization. Consistent use of denormalization as a solution to data access costs for a generic design can slow or even halt a company’s update operations. One company that experienced this problem eventually had to abandon its entire development effort and restart with a more specifically tuned design, after investing years of development time and money. Even if it could have achieved acceptable performance, its other goal of having a flexible, accessible database (via SQL and Web applications) could never have been realized.

Getting the Best of Both Worlds

A canonical database architecture can work: it gives the enterprise a common methodology for accessing disparate data stores and processes using messaging and a common communication pattern, and the ability to quickly build interfaces and react to changes in the business.

Is this easy to do? Maybe. Is it the best for performance? No way! You can, however, achieve both commonality and performance if exceptions to the enterprise rules can be made for high-performance situations. Most applications can be written or retrofitted to the enterprise architecture without significant performance cost if they’re generally low-volume applications that don’t move much data. Even most mission-critical, high-volume applications can use the enterprise layered design; for the high-volume portions of those applications, though, a custom design can deliver huge performance gains. These exceptions to the rule typically involve a small amount of code that’s accessed most of the time, and they deliver the desired performance while preserving the benefits of the enterprise architecture without the major performance implications.

As always, there are trade-offs to consider, and you must remain aware of the specific needs of your applications and database to succeed in these environments.