According to some analysts, the quantity of file-based data produced by businesses surpassed the quantity of so-called “structured” data (industry speak for electronic information produced by databases) a year or two ago. The meaning of this change, they claim, is that files have become a bigger challenge for organizations from the standpoint of data management and storage. Another consequence: mainframes, which are optimized for transaction-oriented (aka database) data handling, are increasingly ill-suited to the workloads of contemporary business. I expect this message to creep into the rants of the anti-mainframe crowd in the near future, so I wanted to get out in front of the issue in this column.
Truth be told, data stored in a file format (the output of so-called “business productivity applications”) has been growing at a significant clip for some three decades now. Much of this growth stems from the introduction of distributed computing platforms and the middle- and front-office applications they host in a “user-friendly” way. You know what I’m talking about: word processing, spreadsheet, presentation-making, and other types of apps we use every day. Two other drivers are the Internet and email. The former sees users copying large swaths of Web-hosted files onto local systems for fear of never being able to find the files again, while the latter has become the dominant method of business-to-business and business-to-consumer communications.
Add to these the growing volume of data about data that’s being stored in file format to facilitate the analysis of automated business processes. Companies gather information about what visitors did on their Websites (number of clicks, hits, abandoned purchases, etc.) and spawn myriad files in the effort to reach consumers, both to promote sales and to measure satisfaction with purchases.
Then there’s the contraband data. Everyone spends their lunch hour downloading files, whether they’re Adobe Acrobat files containing white papers and presentations, photos of this week’s pop star, or videos and music files to play on their “iToys.”
Storing all this “unstructured” file data (actually, file systems are highly structured) is probably the biggest cost accelerator in IT today. There are two main reasons. First, users control their files directly. They name them, they assign them a level of importance that adheres to no external policy standards, and, for the most part, they quickly forget about them after recording them to disk. So, there’s no way for IT to decide what to do with the burgeoning data other than to store it all. Forever. That drives up capacity requirements for data storage, which in turn drives storage purchases higher year over year.
The second reason why file-based data accelerates IT costs is that storage processes aren’t optimized for files. Both mainframe DASD and Fibre Channel SANs in the distributed world are designed to store blocks. While both databases and files are essentially collections of blocks, files are block abusers. The structure of a file output by a productivity app may contain a lot of extra (wasted) space. On the other hand, “rich media” files such as video or audio may be tightly compressed but sensitive to issues such as layout on the storage media itself. Jitter when playing back a video can be caused by the time required to find the scattered locations where the blocks comprising the video file have been written across disks.
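To put a rough shape on the jitter point, here is a toy sketch in Python. It counts how many contiguous runs (extents) a file's blocks occupy and charges one seek for each jump between runs. The block addresses and the 8 ms seek time are invented for illustration, and real disks and file systems are far messier than this model.

```python
def count_extents(block_addresses):
    """Count runs of consecutive block addresses in a file's block list."""
    if not block_addresses:
        return 0
    extents = 1
    for prev, cur in zip(block_addresses, block_addresses[1:]):
        if cur != prev + 1:  # a jump means a new extent begins here
            extents += 1
    return extents


def estimated_seek_ms(block_addresses, seek_ms=8.0):
    """Charge one seek per jump between extents (seek_ms is an assumed figure)."""
    return max(count_extents(block_addresses) - 1, 0) * seek_ms


# Hypothetical block layouts for the same eight-block video file.
contiguous = list(range(1000, 1008))
scattered = [1000, 1001, 5400, 5401, 220, 9003, 9004, 310]

print(count_extents(contiguous), estimated_seek_ms(contiguous))
print(count_extents(scattered), estimated_seek_ms(scattered))
```

Under these assumptions, the scattered layout pays for several extra seeks that the contiguous one avoids, and that added latency is what shows up as stutter during playback.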
The industry is trying to address the wasted space issue of files through the application of technologies such as deduplication and compression. For storing files generally, the industry is driving customers to consider network storage appliances featuring clustered interconnects combined with a single global file namespace to provide a “horizontally scalable” architecture. File management tools are proliferating at the same time in a desperate effort to wrangle file data into some sort of searchable, manageable, archivable schema.
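For readers who want to see the basic idea behind those two techniques, here is a minimal sketch in Python. It is not any vendor's implementation. It assumes fixed 4 KB blocks, SHA-256 fingerprints, and an invented “document” made of one block of varied bytes followed by nine blocks of empty padding. Commercial products typically use more elaborate chunking and indexing.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


def dedupe_blocks(data, block_size=BLOCK_SIZE):
    """Store each unique block once, plus an ordered recipe of fingerprints."""
    store = {}   # fingerprint -> block bytes
    recipe = []  # fingerprints in file order, used to rebuild the file
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # keep only the first copy of a block
        recipe.append(digest)
    return store, recipe


def rebuild(store, recipe):
    """Reassemble the original bytes from the block store and the recipe."""
    return b"".join(store[digest] for digest in recipe)


# Hypothetical file: 4 KB of varied content, then 36 KB of zero padding.
header = bytes(range(256)) * 16
padding = bytes(BLOCK_SIZE) * 9
document = header + padding

store, recipe = dedupe_blocks(document)
stored_bytes = sum(len(block) for block in store.values())
compressed = zlib.compress(document)

print(len(document), "bytes as written")
print(stored_bytes, "bytes after block deduplication")
print(len(compressed), "bytes after zlib compression")
```

The nine identical padding blocks collapse to a single stored block, and the recipe is all that is needed to put the file back together. Compression attacks the same redundancy from a different angle, working within the byte stream rather than across whole blocks.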
Mainframers have never been terribly interested in the issues of file storage, in part because databases were (and remain) the crown jewels of corporate data. However, as distributed workloads shift to the mainframe, the convenient aphorism that mainframes do databases and open systems do files is becoming harder to defend. The mainframe vendors seem to get this.
With the purchase of Diligent Technologies’ deduplication gateway, IBM is endeavoring to bring dedupe to the mainframe world, which doesn’t really need it much for database data, but may need it a lot for file storage. Big Blue’s recent announcement of SONAS, described as a horizontally scalable, NAS-with-global-namespace architecture that connects Linux on System z to NAS “pods,” is another sign that IBM expects file storage and management to become a more common problem for big iron customers.
Without additional technologies for file management, all the bragging about superior storage allocation and utilization efficiency in the mainframe world may soon have a hollow ring. You’ve been warned.