The Ceph filesystem

By Jake Edge
November 14, 2007

Ceph is a distributed filesystem that is described as scaling from gigabytes to petabytes of data with excellent performance and reliability. The project is LGPL-licensed, with plans to move its client from FUSE into the kernel. This led Sage Weil to post a message to linux-kernel describing the project and looking for filesystem developers who might be willing to help. There are quite a few interesting features in Ceph which might make it a nice addition to Linux.

Weil outlines why he thinks Ceph might be of interest to kernel hackers:

I periodically see frustration on this list with the lack of a scalable GPL distributed file system with sufficiently robust replication and failure recovery to run on commodity hardware, and would like to think that--with a little love--Ceph could fill that gap.

The filesystem is well described in a paper from the 2006 USENIX Symposium on Operating Systems Design and Implementation (OSDI). The project's homepage has the expected mailing list, wiki, and source code repository, along with a detailed overview of the feature set.

Ceph is designed to be extremely scalable, from both the storage and retrieval perspectives. One of its main innovations is splitting up operations on metadata from those on file data. With Ceph, there are two kinds of storage nodes, metadata servers (MDSs) and object storage devices (OSDs), with clients contacting the type appropriate for the kind of operation they are performing. The MDSs cache the metadata for files and directories, journal any changes, and periodically write the metadata out to the OSDs as data objects.
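
As a rough illustration of that split, here is a conceptual sketch in Python; the client classes and method names are made up for the example and are not Ceph's actual protocol or API:

    # Conceptual sketch only: the MDS/OSD client handles here are hypothetical.
    class CephLikeClient:
        def __init__(self, mds_cluster, osd_cluster):
            self.mds = mds_cluster    # names, permissions, directory trees
            self.osds = osd_cluster   # stores and retrieves the actual file data

        def stat(self, path):
            # Pure metadata: served entirely by the MDS cluster.
            return self.mds.lookup(path)

        def read(self, path, offset, length):
            # The MDS resolves the path to an inode and layout once; the bytes
            # themselves are fetched directly from the OSDs, so file I/O traffic
            # never funnels through the metadata servers.
            inode = self.mds.lookup(path)
            return self.osds.read_object(inode, offset, length)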

Data objects are distributed throughout the available OSDs using a hash-like function that allows all entities (clients, MDSs, and OSDs) to independently calculate the locations of an object. Coupled with an infrequently changing OSD cluster map, all the participants can figure out where the data is stored or where to store it.
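
A toy Python sketch of the idea follows; it is a deliberately simplified stand-in for Ceph's real placement function (CRUSH), but it shows how any party holding the same cluster map can compute an object's location without consulting a central allocation table:

    import hashlib

    def place_object(name, cluster_map, replicas=3):
        # Toy stand-in for Ceph's placement function: deterministically pick
        # `replicas` distinct OSDs for an object, using only the object name
        # and the shared cluster map.  (Unlike the real thing, this naive
        # hash reshuffles far more data than necessary when the map changes.)
        osds = sorted(cluster_map["osds"])
        chosen = []
        attempt = 0
        while len(chosen) < min(replicas, len(osds)):
            key = ("%s:%d" % (name, attempt)).encode()
            idx = int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % len(osds)
            if osds[idx] not in chosen:
                chosen.append(osds[idx])
            attempt += 1
        return chosen

    # Any client, MDS, or OSD holding the same map computes the same placement:
    cluster_map = {"epoch": 7, "osds": ["osd0", "osd1", "osd2", "osd3", "osd4"]}
    print(place_object("10000000f3a.00000000", cluster_map))  # three distinct OSD names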

Both the OSDs and MDSs rebalance themselves to accommodate changing conditions and usage patterns. The MDS cluster distributes the cached metadata among its nodes, possibly replicating the metadata of frequently used subtrees of the filesystem on multiple nodes, in order to keep the workload evenly balanced across the cluster. For similar reasons, the OSDs automatically migrate data objects onto storage devices newly added to the OSD cluster, spreading the workload rather than letting new devices sit idle.

Ceph does N-way replication of its data, spread throughout the cluster. When an OSD fails, the data is automatically re-replicated throughout the remaining OSDs. Recovery of the replicas can be parallelized because both the source and destination are spread over multiple disks. Unlike some other cluster filesystems, Ceph starts from the assumption that disk failure will be a regular occurrence. It does not require OSDs to have RAID or other reliable disk systems, which allows the use of commodity hardware for the OSD nodes.
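
Continuing the toy placement sketch from above (again, this is illustrative rather than Ceph's actual recovery algorithm), one can see why recovery parallelizes: the surviving copies of a failed OSD's objects, and the new homes chosen for them, are scattered across many different devices, so no single disk becomes a bottleneck:

    def recovery_plan(failed_osd, objects, cluster_map, replicas=3):
        # Illustrative only, reusing place_object() from the sketch above.
        # Re-place every object that had a replica on the failed OSD against
        # a map that no longer contains it, then copy from a survivor to the
        # new targets.  Sources and targets differ from object to object, so
        # re-replication traffic fans out across the whole cluster.
        new_map = {"epoch": cluster_map["epoch"] + 1,
                   "osds": [o for o in cluster_map["osds"] if o != failed_osd]}
        plan = []
        for obj in objects:
            old = place_object(obj, cluster_map, replicas)
            if failed_osd not in old:
                continue                        # this object lost no replica
            survivors = [o for o in old if o != failed_osd]
            targets = [o for o in place_object(obj, new_map, replicas)
                       if o not in survivors]
            plan.append((obj, survivors, targets))
        return plan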

In his linux-kernel posting, Weil describes the current status of Ceph:

I would describe the code base (weighing in at around 40,000 semicolon-lines) as early alpha quality: there is a healthy amount of debugging work to be done, but the basic features of the system are complete and can be tested and benchmarked.

In addition to creating an in-kernel filesystem client (the OSDs and MDSs run as userspace processes), several other features – notably snapshots and security – are listed as needing work.

Originally the topic of Weil's PhD thesis, Ceph is also something that he hopes to eventually use at a web hosting company he helped start before graduate school:

We spend a lot of money on storage, and the proprietary products out there are both expensive and largely unsatisfying. I think that any organization with a significant investment in storage in the data center should be interested [in Ceph]. There are few viable open source options once you scale beyond a few terabytes, unless you want to spend all your time moving data around between servers as volume sizes grow/contract over time.

Unlike other projects, especially those springing from academic backgrounds, Ceph has some financial backing that could help it get to a polished state more quickly. Weil is looking to hire kernel and filesystem hackers to get Ceph to a point where it can be used reliably in production systems. Currently, he is sponsoring the work through his web hosting company, though an independent foundation or other organization to foster Ceph is a possibility down the road.

Other filesystems with similar feature sets are available for Linux, but Ceph takes a fundamentally different approach from most of them. For those interested in filesystem hacking, or just looking for a reliable solution that scales to multiple petabytes, Ceph is worth a look.



Google fs?

Posted Nov 15, 2007 15:05 UTC (Thu) by dion (guest, #2764) [Link]

This sounds a lot like Google FS, to the point where I'm sort of missing a comparison.

Could it be that the Ceph author read the GFS paper <http://labs.google.com/papers/gfs.html>
and reimplemented the ideas?

... not that that is a bad thing; many great pieces of code have been written as implementations
of other people's ideas, sometimes better than the original.


Google fs?

Posted Nov 15, 2007 17:44 UTC (Thu) by i3839 (guest, #31386) [Link]

Probably not. I was thinking about a distributed filesystem too, and what I came up with
resembles Ceph quite a lot. That was a few years ago, never had the time or urge to implement
it. Point being, most design decisions are rather obvious when you set out the requirements.

Google fs?

Posted Nov 15, 2007 18:12 UTC (Thu) by dion (guest, #2764) [Link]

I wasn't trying to dismiss Ceph as being uncreative in any way, I was just missing the
comparison with google fs.


Google fs?

Posted Nov 15, 2007 18:34 UTC (Thu) by sayler (guest, #3164) [Link]

After reading the OSDI paper
(http://www.usenix.org/events/osdi06/tech/full_papers/weil...), it seems Ceph tries
to solve a more general problem than GFS.

Note that this makes neither GFS nor Ceph better than the other.

Like many solutions to problems in distributed computing these days, GFS optimizes for
specific  workloads (the Ceph authors claim "Similarly, the Google File System is optimized
for very large files and a workload consisting largely of reads and file appends." -- check
out section 8 of the OSDI paper for more comparisons).  

In general, Ceph takes a much more conservative approach with respect to the file system
interface. The claim is that the FS exported by Ceph is general purpose, exposes POSIX file
semantics, and performs well across a wider variety of workloads. It will be hard to say
whether this is *true* (whatever that means) until it is used much more widely.

I made some comments on the Kernel Trap discussion of Ceph, but I'll repeat the high point
here:  it's cool to see research software GPL'd and targeted toward a general audience.

There are quite a few interesting local and distributed filesystem projects going on right now
(meaning, ones that are attempting to be more than research vehicles).  I look forward to
seeing btrfs, Hadoop's cluster FS, Ceph, and other projects which I have no doubt forgotten
about.. :)

Google fs?

Posted Nov 15, 2007 20:02 UTC (Thu) by zooko (guest, #2589) [Link]

There's my project -- Tahoe (http://allmydata.org).  It is an open source distributed
filesystem with very nice security properties -- everything is encrypted and integrity-checked
and you can share or withhold access to any subtree of the filesystem with anyone.  It uses
erasure coding so that you can choose any M between 1 and 256 and any K between 1 and M such
that the file gets spread out onto M servers, where the availability of any K of the servers
is sufficient to make the file available to you.

Regards,

Zooko
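
To put rough numbers on the K-of-M scheme described in that comment, here is a small back-of-the-envelope sketch in Python; the function and field names are illustrative, not part of Tahoe's API:

    def erasure_profile(k, m, file_size_bytes):
        # Any k of the m shares suffice to rebuild the file, so up to m - k
        # servers can be lost; each share is roughly 1/k of the file, so the
        # total stored is about m/k times the original size.
        assert 1 <= k <= m <= 256
        share_size = file_size_bytes / k
        return {
            "share_size_bytes": share_size,
            "total_stored_bytes": share_size * m,
            "expansion_factor": m / k,
            "tolerated_failures": m - k,
        }

    # Example: a 100 MB file spread over 10 servers, any 3 of which suffice.
    print(erasure_profile(k=3, m=10, file_size_bytes=100 * 2**20))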

Google fs?

Posted Nov 15, 2007 23:31 UTC (Thu) by sayler (guest, #3164) [Link]

yay! erasure coding!

Google fs?

Posted Nov 16, 2007 3:40 UTC (Fri) by zooko (guest, #2589) [Link]

Yeah!  Also we've separately published the erasure coding library, which is derivative of
Luigi Rizzo's old library, but to which we added a Python interface, a command-line interface,
performance optimization, and other stuff.

http://pypi.python.org/pypi/zfec



Cluster file systems

Posted Nov 22, 2007 11:34 UTC (Thu) by ringerc (subscriber, #3071) [Link]

I've long been frustrated by the lack of a solid, general purpose distributed cluster file
system for Linux. There's a real need even if you *don't* have terabytes of data and hundreds
of servers.

Right now, there's no good way to distribute things like home directories on the network, even
just between a few Linux servers. You either land up using a central server or set of servers
(failure prone, harder to expand, a pain to maintain, etc) or using a cluster FS that relies
on an iSCSI/AoE/FC shared storage backend (expensive, complex). You'll be lucky to find a
network file system that'll give you reliable home directories, either, with correct handling
of lockfiles, various app-specific databases, etc.

In short, even for a common and simple problem like making sure that users have the same home
directory across several machines, all the existing options seem to stink.

For that matter, even traditional options like using NFS to export homedirs from a central
server have, in my experience, been less than reliable. I'm unimpressed with Linux's NFS
support on both the server & client side; I've seen too many unkillable processes,
unlinked-but-perpetually-undeleted files, etc, and I've had to *REBOOT* too many servers to
fix NFS issues. And that's without going into the issues with NFS's not-quite-POSIX FS
semantics.

A similar issue applies for virtualized servers. They need storage somewhere, and unless you
can afford a SAN that storage is going to be server based (whether internal or external, it
doesn't matter). There's always one box you can't bring down for maintenance without bringing
down some/all of your VMs. Being able to provide distributed storage for VMs would be quite
wonderful.

If I was really dreaming I'd want a native client for Mac OS X and for Windows, so that there
was no need to re-share the cluster FS through a gateway server. Experience suggests that such
native file system implementations are rarely solid enough to work live over the network with,
though, and are usually only good for copying files back and forth.

Even without that, just being able to collect the storage across the servers at my company
into a shared, internally redundant pool that would remain accessible if any one server went
down would be ... wonderful.

DRBD?

Posted Mar 19, 2010 18:20 UTC (Fri) by niner (subscriber, #26151) [Link]

You might want to have a look at DRBD (http://www.drbd.org). It's a
distributed, redundant block device which you can use instead of the
expensive iSCSI/AoE/FC shared storage backend to base your cluster FS
upon. It works quite well with good performance and excellent coverage of
possible error conditions.

Even without a cluster filesystem (and its complexities), you can at
least build a good shared nothing failover cluster for your central
storage server.

