Oct 7 ’11

Linux on System z: Choices in File Systems

by David Boyes, Ph.D. in z/Journal

Oracle has publicly stated that OCFS2 1.6 (the latest version of its officially supported cluster file system, a critical part of Oracle’s high-availability solutions) will be available only on Oracle Unbreakable Linux, which runs only on Intel hardware. Naturally, this is a serious problem for those of us who are enlightened to the fact that the whole world is not Intel hardware, and there are a lot of people, including the Linux distribution vendors, madly scurrying around trying to find a useful alternative. Fortunately, there are a number of good choices for replacing OCFS2, and here we discuss four of the more interesting ones.

The key issues around cluster file systems are coordinating which system is writing to the data at any given point, keeping metadata consistent in an environment where multiple systems can update it, and making sure that individual system failures in the cluster don’t interrupt availability of data and processing on the rest of the nodes. The alternatives discussed below address all these key issues, but in different ways.
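To make the coordination problem concrete, here’s a toy sketch (a single-process Python model, purely illustrative; none of these file systems is actually built this way) in which every “node” must take a cluster-wide lock before it touches shared allocation metadata. The threading.Lock here just stands in for whatever distributed lock manager the real file system uses.

    # Toy illustration of the write-coordination problem.
    # Single-process sketch only; the Lock stands in for a cluster-wide
    # distributed lock manager, not any real file system's implementation.
    import threading

    metadata = {"blocks_used": 0}          # pretend on-disk allocation metadata
    cluster_lock = threading.Lock()        # stand-in for a distributed lock manager

    def allocate_blocks(node_name, count):
        # Each "node" must hold the cluster-wide lock before touching metadata;
        # otherwise two nodes could read the same value and write back stale updates.
        with cluster_lock:
            current = metadata["blocks_used"]
            metadata["blocks_used"] = current + count
            print(f"{node_name} allocated {count} blocks, total now {metadata['blocks_used']}")

    nodes = [threading.Thread(target=allocate_blocks, args=(f"node{i}", 10)) for i in range(4)]
    for t in nodes:
        t.start()
    for t in nodes:
        t.join()

Take the lock away and two nodes can read the same counter and both write back stale values, which is exactly the kind of corruption a cluster file system has to prevent.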

The first alternative is also probably the oldest of the set: Red Hat’s GFS2 (formerly a product of Sistina, later borged by Red Hat). GFS2 uses network-based lock coordination (usually over a private network segment) and SCSI reserve/release commands to ensure data integrity and manage who may write. It has been updated to have some awareness of System z FICON devices, but it still works best with FCP disk. GFS2 uses the standard Red Hat “cluster” tooling to manage nodes (plus some extra goodies), so we can consider it a useful candidate for both Intel and non-Intel systems. More information is available at www.redhat.com/gfs/.
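The reserve/release idea is simple enough to sketch. In this toy Python model (not GFS2 or SCSI code; the node names are hypothetical), only the node currently holding the reservation on the shared disk may write, and a second node is fenced out until the first releases it:

    # Toy model of reserve/release style write fencing.
    # Not GFS2 or SCSI code; node names are made up for illustration.
    class SharedDisk:
        def __init__(self):
            self.reserved_by = None
            self.data = {}

        def reserve(self, node):
            # Succeeds only if no other node currently holds the reservation.
            if self.reserved_by in (None, node):
                self.reserved_by = node
                return True
            return False

        def release(self, node):
            if self.reserved_by == node:
                self.reserved_by = None

        def write(self, node, block, value):
            if self.reserved_by != node:
                raise PermissionError(f"{node} tried to write without the reservation")
            self.data[block] = value

    disk = SharedDisk()
    assert disk.reserve("lnx01")
    disk.write("lnx01", 42, b"payload")
    assert not disk.reserve("lnx02")    # second node is fenced out until release
    disk.release("lnx01")
    assert disk.reserve("lnx02")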

Second, there’s Open Lustre. Lustre was originally an open source project but was purchased by Sun not long before the sale to Oracle. Lustre provides very high levels of scalability: dozens of nodes, dozens of file servers, and near wire-speed performance. It uses a matrix of file servers (which hold the data) and metadata servers (replicated for high availability and performance) that track the files stored within the Lustre system. It’s a great tool, widely used in the scientific community, but Oracle abandoned it, along with all of Sun’s other investments in high-performance computing, shortly after the acquisition. The original open source project has been revived and has taken up development again, producing an attractive option for shops that need to run both Solaris and Linux nodes. More goodness at http://insidehpc.com/2011/06/17/video-open-lustre-new-organizations-trends-and-future-developments/.
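A toy Python model of that split (the server names and round-robin striping here are invented for illustration; real Lustre layouts are far more sophisticated) shows why it scales: the metadata servers only hand out layouts, and the bulk data traffic goes straight to the file servers:

    # Toy model of the metadata-server / file-server split.
    # Server names and striping scheme are invented; this is not Lustre code.
    file_servers = {                    # each "file server" holds raw stripes
        "fs0": {}, "fs1": {}, "fs2": {},
    }
    metadata_server = {}                # filename -> list of (server, stripe_id)

    def write_file(name, payload, stripe_size=4):
        stripes = [payload[i:i + stripe_size] for i in range(0, len(payload), stripe_size)]
        layout = []
        for n, stripe in enumerate(stripes):
            server = f"fs{n % len(file_servers)}"   # round-robin placement
            file_servers[server][(name, n)] = stripe
            layout.append((server, n))
        metadata_server[name] = layout              # only the layout lives here

    def read_file(name):
        # The client asks the metadata server for the layout once, then talks
        # to the file servers directly for the data itself.
        return b"".join(file_servers[srv][(name, n)] for srv, n in metadata_server[name])

    write_file("results.dat", b"0123456789abcdef")
    assert read_file("results.dat") == b"0123456789abcdef"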

Third, there’s PVFS2, another great development project for massively scalable data storage (and recently ported to System z by Neale Ferguson). PVFS2 (aka OrangeFS) provides the features needed for a powerful cluster file system, with source code freely available. Great intros appear at www.orangefs.org/.

Last is Ceph, which tackles both cluster file system functions and policy-based storage management (kind of like DFSMS). Linus Torvalds has accepted the Ceph code into the mainline Linux kernel, so while Ceph isn’t yet stable enough to trust for really critical data, it’s definitely one to watch. See http://ceph.sourceforge.net for papers and other interesting material on performance and functions.
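One of the ideas behind Ceph’s storage management is that an object’s placement is computed from a policy rather than looked up in a central table, so any client can work it out independently. The sketch below is a deliberately simplified Python stand-in for that idea (a naive hash scheme, not Ceph’s actual CRUSH algorithm; the storage daemon names and replica count are made up):

    # Simplified illustration of policy-driven, computed placement.
    # A naive hash scheme, not Ceph's CRUSH algorithm; names are hypothetical.
    import hashlib

    osds = ["osd0", "osd1", "osd2", "osd3"]    # hypothetical storage daemons
    replicas = 2                               # policy: keep two copies

    def place(object_name):
        # Deterministic: every client computes the same placement independently.
        digest = int(hashlib.sha1(object_name.encode()).hexdigest(), 16)
        first = digest % len(osds)
        return [osds[(first + i) % len(osds)] for i in range(replicas)]

    print(place("db/redo-log.0001"))   # e.g. ['osd2', 'osd3'], same answer everywhere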

Which would I choose? GFS2 and PVFS2 are probably the most production-ready of the projects, and all of the options have strong support efforts behind them. That, plus the third-party support options available beyond the distribution vendors, adds up to a very strong set of replacements for OCFS2.

For those of you who missed the 2011 VM Workshop in Columbus, OH, a great time was had by all; it was the best $100 I’ve spent in years (since 1998, to be exact, the date of the last VM Workshop). If you can get to next year’s workshop, do it; it’s great fun. Visit http://vmworkshop.org if you want to get involved; everyone is welcome.