Oct 11 ’11

Storing and Retrieving XML in DB2 for z/OS

by Susan Lawson, Daniel L. Luksetich in z/Journal

You’ve probably heard a lot about IBM’s pureXML feature of DB2 and seen some examples of data stored and retrieved from XML columns. pureXML is the native XML storage and retrieval feature of DB2 that lets an XML document be stored in a format the database engine can manipulate. The ability to store XML natively in DB2 is vital, and DB2 is feature-rich in managing and manipulating XML. From a business perspective, however, there are several choices that need to be made when managing XML data with DB2. Which choice you make depends on your objectives, the perceived costs, the personnel and skills available, and the amount of time dedicated to establishing and maintaining the database housing the data.

When choosing to store XML in DB2, you have three choices. You can store the:

Once we understand how the XML data type is stored in DB2, and how with each option we can move the data in and out of DB2 and/or manipulate that data, then we can decide which storage option is best for a particular application or situation.

XML Usage Benefits and Concerns

With XML, you can represent multiple elements as a single object, such as a purchase order; with relational data, this may require several tables and normalization. An insert of an entire purchase order in XML would involve a single column and retrieval is from one column. Inserts with relational data would involve many columns (with consistent format) and queries for retrieval would require joins.

XML also offers a better data model for many new applications. It provides added flexibility, schema versatility, and is stored in a hierarchical nature. You can store both semi-structured or unstructured data. The data is inherently hierarchical, and may be nested or contain complex data. For data that’s changing or evolving, schemas can be supported easily. XML data is self-describing in that the data is mixed together with metadata (tags). You can query data and metadata simultaneously. The XML data is extensible and isn’t regular and homogeneous like tables, so every document can have a different structure. An element can have different data types in different documents.

Not all data should be stored in DB2 as XML, as there can be some overhead to XML storage in DB2 depending on your needs. Storing data in a relational database relationally is still the best and most flexible way to store data. With a proper relational design, you can minimize data anomalies, improve performance, reduce the cost of changing data, and properly store metadata with your user data. Most of this is lost with XML storage. There are additional storage and processing costs associated with XML, and the XML itself is more voluminous than storing the data relationally due partly to all the tags and namespace declarations.

Why Store XML in DB2?

Benefits of managing XML data with DB2 include lower development costs, reduced code and development complexity, improved developer productivity, faster solution development, and lower costs. DB2 XML support also provides for greater business agility because you can easily accommodate changes to data and schemas. You can also update applications rapidly, reduce maintenance costs, respond quickly to dynamic conditions, and realize value earlier.

SQL or XQuery (not covered here) can be used to retrieve data; you can join XML data to relational data. You can quickly and efficiently search XML documents using a proven search optimization engine of a mature database. You can store and manage both relational and XML using an existing technology infrastructure and skills. XML data can be inserted into a column and optionally validated against a schema. User-defined indexes on XML data facilitate searches. The XML data can be extracted and decomposed into relational data. You can construct XML documents from both relational and XML data.

Starting with DB2 9, there’s support for native XML hierarchical storage with no shredding and no CLOBs or Binary Large Objects (BLOBs) required. No parsing of XML data at query run-time is required and it’s fully integrated with relational processing, providing for seamless queries accessing various types of data simultaneously. DB2 XML support provides schema flexibility so changes don’t force unloading or reloading of data.

By using the XML native store instead of shredding the documents, you can reduce development time; shredding often results in complex mappings requiring significant knowledge of the data when developing SQL to retrieve it.

Storing XML in DB2 for z/OS

XML is stored similarly to the way Large Objects (LOBs) are stored. XML objects contain data not stored in the base table space, but rather in a separate table space. The XML objects are automatically created by DB2 when an XML column is defined to a table. All the utilities either support or tolerate XML, with some specific XML keyword support. Support and maintenance for XML have more considerations than LOBs, such as table space size limit, compression, and indexes.

To estimate basic XML storage after parsing, you can use a 1:1 average as a guideline. It could be as small as 0.3 (stripping whitespace and using compression) of the original document or as large as 1.5 (preserving whitespace without compression) of the original document. You can store XML data and complex data without mapping in DB2 with optional compression (with the potential for up to 70 percent savings). An XML table space always uses a 16KB page size; you must ensure a 16K buffer pool exists for this. For non-range-partitioned base table spaces, a Partition By Growth (PBG) table space is used for XML. XML table space partitioning follows base table partitioning for the range-partitioned table space. The number of rows that can fit into a base table partition is limited by the number of documents that fit into an XML partition. If the XML tree nodes can’t fit on one 16K record, they’re split into multiple records. The LOG, LOCKMAX, and COMPRESS settings are inherited from the base table space. Default values are provided for the other attributes.

XML data is a corporate asset just like relational data. It must be managed, shared, analyzed, and protected like any other critical data. There are additional considerations when working with XML objects, especially when it comes to recovery. To be able to recover a base table space, take image copies of all related objects. Use LISTDEF to define a list of related objects (keyword XML or ALL). For a point-in-time recovery (RECOVER TORBA, TOLOGPOINT), all related objects, including XML objects, must be recovered to a consistent point in time.

Similar to referential integrity, there’s a relationship between the base table space and the XML auxiliary table space. The CHECK DATA utility can be used on the base table space to determine consistency between the base table space and the XML table space. Use the CHECK INDEX utility to find problems with the DOCID and NODEID indexes. REBUILD INDEX can be used to repair the problems. The COPY and RECOVER utilities can be used together on dependent objects (XML table space, DOCID/NODEID indexes). QUIESCE TABLESPACESET can achieve a single point of consistency. REORG and RUNSTATS utilities for XML table spaces are executed independently from the base table spaces, but LISTDEF can be used to execute these utilities against XML table spaces together with base tables.

Relational to XML

Data stored relationally (as columns in rows) can be converted into XML by using a powerful set of built-in DB2 scalar and aggregate functions. Invoking these functions can be simple or complex; you can transform any data in DB2 into an XML document of any complexity.

Using the DB2 sample database, for example, we can take the basic information about our fictitious company, The Spiffy Computer Service Company, and generate XML documents that contain information about the company’s departments and employees. Figure 1 is a simple example that retrieves the names of our employees in a basic XML document.

 

You can take many combinations of the built-in XML functions to produce complex XML documents for transmission back to your application, or for communications to other information systems via such things as messaging or file transfer.

Figure 2 takes advantage of several XML functions to build a document containing the information for some departments of our company and all the employees in those departments.

 

If you look closely at the two “Department” documents generated, you can see the formation of a nice hierarchical structure that contains the relevant department and employee information. All this was generated from traditional DB2 relational table structures!

DB2 also provides for a simple way to move XML data that has been stored in string columns into columns with an XML data type. The XMLPARSE function can take a string representation of a well-formed XML document and convert it into an XML value. This can make the transition from storing XML as a string to storing it as pureXML as easy as executing a single statement (see Figure 3).

pureXML Storage and Retrieval

Along with the XML data type, IBM has delivered an extremely robust enhancement to the SQL language in the form of XML extensions (SQL/XML). In DB2 for z/OS, this extension comes in the form of XPath expressions. Through the IBM pureXML Guide (SC19-2981-01), you can quickly get an overwhelming impression of the programming capabilities possible. A simplified view would be to always think of an XML document as a hierarchy represented as a tree, with a path leading to each node of the tree. XPath expressions are used to declare or reference variables, perform function calls, perform comparisons, index documents, perform searches, and more. Figure 4 contains some examples of XPath expressions.

 

To understand how XPath expressions work with the XML data type, consider the PRODUCT table in the DB2 sample database. This table contains a primary key column Position Identifier (PID) and an XML document in an XML column called DESCRIPTION. Query 1 of Figure 5 reveals the document for a product with a PID of ‘100-100-01’.

DB2 XPath expressions can be used against this data to manipulate it in any way. In the simple example in Query 2 of Figure 5, we use an XMLQUERY function, which returns an XML value from the evaluation of an XPath expression, to return only the product name from the XML document. An XPath expression is used.

XPath expressions can be used to search for values in a document. Query 3 of Figure 5 returns the names of products from the PRODUCT sample table by restricting the data to only those products that sell for less than $10. Filtering is performed by using an XMLEXISTS function to specify an XPath comparison expression to search all the XML documents in the table.             

This is important, powerful functionality, but results in a complete table space scan of the PRODUCT table. You can, however, build indexes into an XML document using XPath expressions. Figure 6 shows a CREATE INDEX DDL statement that builds an index on the price element in the XML DESCRIPTION document of the PRODUCT table; this approach can improve XML search performance.

XML to Relational

Since DB2 can generate XML from more traditional rows and columns, and work with XML natively, it only makes sense that it can also parse XML data into individual columns. So you can generate traditional relational data from XML in DB2. There are two built-in functions that enable this capability:

Figure 7 contains two examples of queries that pull elements out of the XML DESCRIPTION column of the PRODUCT sample table and return them as columns. Since these are invocations of built-in functions, they can be executed virtually anywhere you can put an expression in a statement (i.e., SELECT, INSERT, UPDATE, DELETE, etc.).

 

Data Validation

By default, DB2 will only validate an XML value to the extent it’s well-formed. To provide more detailed validation of documents internally in DB2, you must create an XML schema. An XML schema defines structure, content, and data types for XML documents. It consists of one or more schema documents. A schema document can define a namespace (optionally) and data types for elements and attributes. There are basic data types (i.e., integer, date, decimal, string), user-defined types, and complex element types. The schema also describes the allowed length and patterns for string values and supports type inheritance and derived data types. An XML schema document is defined in XML syntax. Detailed occurrence and value range definitions are supported in XML schemas.

XML schemas must be registered with the database before they can be used for validation. DB2-supplied stored procedures are used to register XML schemas. The XML schemas are then stored in the XML Schema Repository (XSR) and used by DB2 only. A binary representation is generated for the entire XML schema and then stored as a BLOB in the XSR. This helps with performance for validation since there’s no need to read individual schema documents each time during validation.

XML schema validation is the process of determining whether the structure, content, and data types of an XML document are valid according to an XML schema. In addition, XML schema validation strips ignorable white space from the input document. XML schema validation is done by invoking the DSN_XMLVALIDATE function from within the XMLPARSE function. DB2 10 will perform validation and XML parsing “inside the engine” using XML System Services (XMLSS).

You can validate an XML document in DB2 10 automatically by including an XML type modifier in the XML column definition in a CREATE TABLE or ALTER TABLE statement. The XML type modifier specifies one or more XML schemas with the XML data type. When a column has an XML type modifier, DB2 implicitly validates documents that are inserted into the column or updated. Or, you can do it manually by executing the DSN_XMLVALIDATE built-in function when you insert a document into an XML column or update a document in an XML column. Validation is optional when you insert data into an XML column with no XML type modifier, but is mandatory if an XML type modifier exists. The LOAD utility will also validate XML data if an XML column being loaded is associated with an XML type modifier.

The XML type modifier is specified in a CREATE TABLE or ALTER TABLE statement and associates an XML column with an XML schema. Figure 8 shows an XML type modifier being applied in the Data Definition Language (DDL) for a table definition that contains an XML column.

 

Manual XML validation is enabled via the DSN_XMLVALIDATE built-in function. The input to the function would include an XML string expression and a reference to an XML schema. Using this function with a registered schema provides the necessary flexibility to validate the XML document. You don’t necessarily need a type modifier for the column in the table. Figure 9 shows manual validation.

All XML data stored in XML columns must be well-formed. There’s no difference in stored XML data whether you validate or not, unless there are default values defined in the XML schema. You must consider the costs of validation during application design; validation is two to three times as expensive in CPU utilization as well-formed parsing only. However, it’s eligible for processing with the System z Integrated Information Processor (zIIP) and System z Application Assist Processor (zAAP) specialty engines. Validation increases CPU time for inserts and reduces throughput. Use schema validation, if needed, but avoid schema validation for highest possible insert performance.

Setting a Direction

The method you choose for storing data (i.e., rows and columns, character string, XML data type) really depends on several factors. The most significant factors are where you want the XML to be processed and your staff’s skills.  

If the data doesn’t have to be shared with any other applications in the enterprise, and it doesn’t need to be manipulated or maintained in any way, then perhaps storing it as a string is sufficient. It requires the least amount of knowledge for your DBAs and will use the lowest amount of mainframe CPU. If the data must be manipulated or shared, then storing it as a string may become too expensive. The data will have to be interpreted and analyzed before being shared in the enterprise. This could result in many more trips across databases and applications, and can seriously impact network traffic because XML strings can become quite large. Moreover, if an XML schema or dictionary must be maintained, it must be maintained outside the database. This means you will need personnel capable of maintaining this information; perhaps keeping that information integrated with the enterprise model and personnel. Backup and recovery plans also must be coordinated between the database and whatever repository contains the dictionary or XML schema. Finally, with no data definitions stored in DB2, the XML data is virtually useless outside the application responsible for it. Not very flexible!

If you’re sharing the data with other applications and users across the enterprise, then it may be beneficial to store the data in rows and columns. This can be especially true for data in transaction environments that change frequently. This provides a high level of flexibility and the potential for high performance. The data definitions are stored in the DB2 system catalog; the database engine performs validation. A myriad of indexes can be built to support all sorts of data access paths. A wealth of built-in functions can generate XML from the DB2 tables. DBAs know how to deal with “normal” DB2 tables, and there would be little or no XML skills required for the DBA beyond knowledge of the built-in functions. This option can, however, be more performance-challenged than the other options when all the data is read all the time.

Storing the data natively as pureXML affords the best of both worlds. It offers the ability to share the data with other applications and databases, indexing capabilities, and the option for built-in element definition and validation. All this can happen inside the database engine so the data and definitions can be backed up and restored together. Powerful functionality can manipulate the data in many ways to afford compatibility across applications and users. To take advantage of all this functionality, your DBA staff must be skilled in the areas of storage, database maintenance, XML, and XML schema construction and management. Performance could be better than rows and columns when entire documents are always required, but could degrade as more manipulation affects performance and data is shared in pieces across many applications.

Summary

XML is here to stay and is becoming increasingly popular. The support and capabilities in DB2 for z/OS are providing us with ways to take advantage of XML and provide support for applications requiring it. IBM’s pureXML capabilities offer a secure, reliable, and familiar way to store and retrieve data. Understanding the benefits and costs of storing XML data in various forms in DB2 can help determine the best way to store and use this data within your organization.