It seems like a thousand years since file systems were first introduced as part of operating system software. Back then, designers made a few choices, predicated on the economics of storage at the time (primarily capacity and cost per KB), that were sensible in their day but have come back to haunt us.
Popular file systems are self-destructive. If you save a new file under the same name as an existing file, you overwrite the old with the new. One can speculate that this was not a bug but a feature when the file system was first created: a way to economize on extremely expensive data storage hardware.
Consider the dynamics of the hard disk market today: capacity is growing by approximately 100 percent per year while cost per GB is falling by 50 percent per year. Given those trends, it is legitimate to ask whether we still need or want a self-destructive file system, or whether some sort of versioning capability would be preferable. Think of all the “miniature data disasters” (some not so miniature in their consequences) that occur daily when users overwrite a valid file by accident, and of the productivity hit that accrues to each event. Is it time to change the file system?
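To make the versioning idea concrete, here is a minimal sketch in Python of a save routine that keeps the old file instead of destroying it. It is an illustration only, not how any shipping file system works: the function names `save_versioned` and `list_versions` and the `.v1`, `.v2` suffix scheme are assumptions invented for this example.

```python
import os
import tempfile


def save_versioned(path, data):
    """Write data to path without destroying an existing file.

    If path already exists, the old file is first renamed to the next
    free name in the series path.v1, path.v2, ... (a hypothetical naming
    scheme chosen for this sketch), and only then is the new data written.
    """
    if os.path.exists(path):
        n = 1
        while os.path.exists(f"{path}.v{n}"):
            n += 1
        os.rename(path, f"{path}.v{n}")
    with open(path, "wb") as f:
        f.write(data)
    return path


def list_versions(path):
    """Return the preserved older versions of path, oldest first."""
    versions = []
    n = 1
    while os.path.exists(f"{path}.v{n}"):
        versions.append(f"{path}.v{n}")
        n += 1
    return versions


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        report = os.path.join(d, "report.txt")
        save_versioned(report, b"first draft")
        save_versioned(report, b"second draft")  # the first draft survives
        print(list_versions(report))
```

The price of this approach is exactly the one early designers were avoiding: every accidental or deliberate overwrite now consumes additional capacity, which is why the falling cost per GB matters to the argument.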
Another feature of contemporary file systems is that file names are neither associative nor self-describing. Some applications do place an extension on the file name so you know what application generated the file, but this isn’t consistent across all apps or operating systems. As for self-description, the data democracy has created a data anarchy in which huge volumes of anonymous files are amassing, providing no clear idea of what the contents of the file are, whether the files are important, or whether they have gone stale.
The ramifications of these file system design choices can be seen in companies that are scrambling to put their data houses in order, whether driven by a desire to contain storage costs or to comply with regulations. In most shops I visit, storage has become one big junk drawer, with companies buying more and more capacity just to keep pace with impossible-to-manage data growth. Management is legitimately concerned about the accelerating costs of storage investments (disk prices may be falling on a dollar-per-GB basis, but array prices are actually rising at nearly 125 percent per year!) and about the huge potential legal exposure represented by unmanaged data in an increasingly aggressive regulatory climate.
How do we fix the file system? The idea of doing a wholesale rip and replace has been offered by IBM, Oracle, and Microsoft over the years. Nuances in their approaches notwithstanding, they are proposing the same basic concept: turn the file system into an object-oriented database. Microsoft’s WinFS, which is ultimately to be mounted on a clusterable version of its SQL Server database, is just the latest example.
This approach, however appealing to engineers, often meets with pushback both from users, who wonder how they will sort their existing junk drawers to bring that data into the new scheme, and from independent software developers, who fear such an innovation will isolate their output, and effectively their products, from the market. While Redmond might wrap itself in the flag of do-gooding, many suspect its ultimate objective might be like the one attributed to IBM in the last decade: to become the undisputed king of data.
Politics aside, another alternative is to build better indexing of data. EMC introduced Content Addressable Storage (CAS) in its Centera platform, which cobbled together cheap disk and some indexing technology from Filepool into a one-stop shop for data indexing. This will be replaced shortly by hardware-agnostic, software-based CAS. The point, however, is that CAS begins to open the door for enhanced metadata (data about data) that can be used to describe files more effectively than we can today, and to access files faster and with less manual searching.
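For readers who have not met the idea, the following Python sketch shows the general principle behind content addressing with attached metadata. It is a toy in-memory model under stated assumptions, not a description of how Centera or any vendor's product is built: the `ContentStore` class, its method names, and the choice of SHA-256 as the addressing hash are all hypothetical.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressable store (illustrative only)."""

    def __init__(self):
        self._blobs = {}     # address -> raw bytes
        self._metadata = {}  # address -> dict of descriptive fields

    def put(self, data, **metadata):
        """Store data and return its address, derived from the content itself."""
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data
        # Identical content always maps to the same address, so repeated
        # puts merge their metadata rather than creating duplicates.
        self._metadata.setdefault(address, {}).update(metadata)
        return address

    def get(self, address):
        """Retrieve content by address."""
        return self._blobs[address]

    def describe(self, address):
        """Return a copy of the metadata recorded for an address."""
        return dict(self._metadata[address])

    def find(self, **criteria):
        """Return sorted addresses whose metadata matches every criterion."""
        return sorted(
            address
            for address, meta in self._metadata.items()
            if all(meta.get(key) == value for key, value in criteria.items())
        )


if __name__ == "__main__":
    store = ContentStore()
    addr = store.put(b"Q3 forecast", owner="finance", retention="7 years")
    print(addr[:12], store.describe(addr))
    print(store.find(owner="finance") == [addr])
```

Two properties matter for the argument here. Because the address is computed from the content, a file can never be silently overwritten under the same name, and because descriptive fields travel with the address, a search can ask what a file is rather than what it happens to be called.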
Combine some sort of enhanced metadata indexing with Continuous Data Protection (CDP) technologies, such as those being pioneered by Revivio and others, and you might just have a solution to data self-destruction. Another way to add value is through specialized algorithms for searching data (think Google on steroids), which are appearing today from many companies.
Until such a cobbling together of CAS, CDP, and content-aware indexing happens, a lot of folks are investing in archive systems catering to specific types of data: databases, e-mail, workflow document management systems, and end-user files. This isn’t a solution to the problems of the file system itself, but only a delaying action. Over time, the archives also will become unwieldy and difficult to manage. Even with the hard work being done by Bridgehead Software, CA, and others to develop a “Manager of Managers” in data-type archiving, these silos still require a separate cadre of skilled operators and a lot of hard work to make them deliver order to the anarchical universe of data.