DFSORT: Taming Wild and Unruly Data

5 Pages

In a perfect world, data in columns would be neatly aligned and easy to sort, edit, and display. But here in the real world, things are often messier. Data can get pretty wild and unruly: It can be unaligned, aligned differently in different records, and difficult to sort, edit, and display. This article shows and explains an example that illustrates some of the newer features of IBM’s DFSORT that can whip wild and unruly fixed fields into shape. Future articles will discuss additional features for manipulating data in various ways, including how to parse variable fields such as Comma-Separated Values (CSVs) and delimited fields into fixed fields, and how to deal with records with different layouts.

Input Data

Figure 1 shows the wild, unruly input data we want to tame in our example. The input data set has four fields:

  • Name: A 12-byte field containing character values that can appear anywhere in columns one through 12. (These are names of some of my adorable pet rats, past and present.) These unaligned values are difficult to sort because blanks and alphabetic characters can appear in the same column and can be difficult to read because they’re unaligned.
  • Amount1: A 10-byte field containing signed numeric values that are right-aligned in columns 13 through 22. Each value contains a floating plus or minus sign and a decimal point, and may or may not have a comma separator. Although the values are right-aligned, they’re difficult to sort because digits, blank, plus, minus, and comma characters can appear in the same column.
  • Change: A 14-byte field containing signed numeric values that are left-aligned in columns 23 through 34. Each value contains a decimal point, and each negative value is surrounded by a left and right parenthesis, making them difficult to use in numeric operations.
  • id: A 10-byte field containing unsigned numeric values that are left-aligned in columns 35 through 44. Each value contains dashes and may or may not contain leading zeros, making them difficult to read.

Goals for Our Example

We have several goals we want to accomplish for our example:

  • Sort the records by the name and amount1 values. If we tried to sort on these values using CH (character) format, the result would appear as in Figure 2. You can see the records aren’t sorted correctly. Katie is incorrectly sorted before April and the plus value for the first Katie record is incorrectly sorted before the minus value for the second Katie record. So, we need a special way to correctly sort these values.
  • Replace each change value with a new amount2 value derived from amount1 + change and displayed in the same form as the amount1 value. Since the encoding for these values doesn’t correspond to any of the standard numeric formats (i.e., ZD, PD, BI, FI) and they’re actually encoded in different ways, we need a special way to correctly add and display them.
  • Display each id value right-aligned with leading zeros. Since the encoding for these values doesn’t correspond to any of the standard numeric formats, we need a special way to correctly display them.
  • Create an output report with fixed data and averages for the amount1 and amount2 values. Since the encoding for these values doesn’t correspond to any of the standard numeric formats, we need a special way to correctly derive the averages.
  • Create a second output data set with CSVs like this: “name”,“amount1”, “amount2”,“id”, but with no plus signs in the amount1 and amount2 values. We need a special way to create these delimited values from our fixed values.

Control Statements for our Example

Figure 3 shows the DFSORT control statements we need to accomplish the goals for our example. We’ll discuss what each control statement does and identify the new functions used in that control statement. Then we’ll discuss the details of each new function. Although this particular example uses the new functions while sorting, keep in mind they also can be used while copying or merging.

We use the INREC statement to left-align the name values so we can correctly sort them. Since the INREC statement is processed before the SORT statement, the left-aligned name values will be used by the SORT statement rather than the original unaligned name values. The new functions here are OVERLAY and JFY.

5 Pages