Ceph Distributed Network File System

Submitted by Jeremy on November 15, 2007 - 8:02am.
Linux news

"Ceph is a distributed network file system designed to provide excellent performance, reliability, and scalability with POSIX semantics. I periodically see frustration on this list with the lack of a scalable GPL distributed file system with sufficiently robust replication and failure recovery to run on commodity hardware, and would like to think that--with a little love--Ceph could fill that gap," announced Sage Weil on the Linux Kernel mailing list. Originally developed as the subject of his PhD thesis, he went on to list the features of the new filesystem, including POSIX semantics, scalability from a few nodes to thousands of nodes, support for petabytes of data, a highly available design with no signle points of failure, n-way replication of data across multiple nodes, automatic data rebalancing as nodes are added and removed, and a Fuse-based client. He noted that a lightweight kernel client is in progress, as is flexible snapshoting, quotas, and improved security. Sage compared Ceph to other similar filesystems:

"In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely on symmetric access by all clients to shared block devices, Ceph separates data and metadata management into independent server clusters, similar to Lustre. Unlike Lustre, however, metadata and storage nodes run entirely in userspace and require no special kernel support. Storage nodes utilize either a raw block device or large image file to store data objects, or can utilize an existing file system (XFS, etc.) for local object storage (currently with weakened safety semantics). File data is striped across storage nodes in large chunks to distribute workload and facilitate high throughputs. When storage nodes fail, data is re-replicated in a distributed fashion by the storage nodes themselves (with some coordination from a cluster monitor), making the system extremely efficient and scalable."


From: Sage Weil <sage@...>
Subject: [ANNOUNCE] Ceph distributed file system
Date: Nov 12, 9:51 pm 2007

Hi everyone,

Ceph is a distributed network file system designed to provide excellent 
performance, reliability, and scalability with POSIX semantics.  I 
periodically see frustration on this list with the lack of a scalable GPL 
distributed file system with sufficiently robust replication and failure 
recovery to run on commodity hardware, and would like to think that--with 
a little love--Ceph could fill that gap.  Basic features include:

 * POSIX semantics.
 * Seamless scaling from a few nodes to many thousands.
 * Gigabytes to petabytes.
 * High availability and reliability.  No single points of failure.
 * N-way replication of all data across multiple nodes.
 * Automatic rebalancing of data on node addition/removal to efficiently 
   utilize device resources.
 * Easy deployment; most FS components are userspace daemons.
 * Fuse-based client.
 - Lightweight kernel client (in progress).
 - Flexible snapshots on arbitrary subdirectories (soon).
 - Quotas (soon).
 - Strong security (planned).

(* = current features, - = coming soon)

In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely on 
symmetric access by all clients to shared block devices, Ceph separates 
data and metadata management into independent server clusters, similar to 
Lustre.  Unlike Lustre, however, metadata and storage nodes run entirely 
in userspace and require no special kernel support.  Storage nodes utilize 
either a raw block device or large image file to store data objects, or 
can utilize an existing file system (XFS, etc.) for local object storage 
(currently with weakened safety semantics).  File data is striped across 
storage nodes in large chunks to distribute workload and facilitate high 
throughputs.  When storage nodes fail, data is re-replicated in a 
distributed fashion by the storage nodes themselves (with some 
coordination from a cluster monitor), making the system extremely 
efficient and scalable.  Currently only n-way replication is supported, 
although initial groundwork has been laid for parity-based data 
redundancy.
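
[Editor's note: to make the striping and placement description above concrete, here is a toy Python sketch. The chunk size, node names, and hash-based mapping are all hypothetical illustrations of the general idea that a client can compute where a chunk lives without asking a central allocation table; the real system derives placement from cluster maps and its own placement algorithm, not this naive hash.]

import hashlib

# Assumed values for the sketch; both are hypothetical, not Ceph defaults.
CHUNK_SIZE = 4 * 1024 * 1024              # stripe unit in bytes
NODES = ["osd%d" % i for i in range(8)]   # hypothetical storage node names
REPLICAS = 2                              # n-way replication factor

def place_chunk(file_id, chunk_index, nodes=NODES, replicas=REPLICAS):
    """Deterministically map one chunk to `replicas` distinct storage nodes.

    Any client can recompute the mapping from the file id and chunk index
    alone, so no central allocation table is consulted (the spirit of the
    design; the real system uses cluster maps rather than this naive hash)."""
    key = ("%s:%d" % (file_id, chunk_index)).encode()
    start = int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % len(nodes)
    return [nodes[(start + r) % len(nodes)] for r in range(replicas)]

def place_file(file_id, size_bytes):
    """Stripe a file into fixed-size chunks and place each chunk."""
    n_chunks = max(1, -(-size_bytes // CHUNK_SIZE))   # ceiling division
    return {i: place_chunk(file_id, i) for i in range(n_chunks)}

# Example: a 10 MB file becomes three chunks, each stored on two nodes.
print(place_file("inode-1234", 10 * 1024 * 1024))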

Metadata servers effectively form a large, consistent, distributed 
in-memory cache above the storage cluster that is extremely scalable, 
dynamically redistributes metadata in response to workload changes, and 
can tolerate arbitrary (well, non-Byzantine) node failures.  The metadata 
server takes a somewhat unconventional approach to metadata storage to 
significantly improve performance for common workloads.  In particular, 
inodes with only a single link are embedded in directories, allowing 
entire directories of dentries and inodes to be loaded into its cache with 
a single I/O operation.  The contents of extremely large directories can 
be fragmented and managed by independent metadata servers, allowing 
scalable concurrent access.
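
[Editor's note: the following is a rough data-structure sketch of the embedded-inode idea described above, not the metadata server's actual storage format. It shows how a directory stored as one object, with single-link inodes embedded next to their dentries, can be loaded in a single I/O, and how a very large directory could be fragmented across metadata servers.]

import hashlib
import json

class Inode(object):
    """Toy inode carrying only the fields needed for the illustration."""
    def __init__(self, ino, size, mode):
        self.ino, self.size, self.mode = ino, size, mode

class DirObject(object):
    """A directory stored as a single object: name -> embedded inode.

    Because single-link inodes are embedded in the directory itself,
    reading this one object brings every dentry and inode of the
    directory into the metadata server's cache in a single I/O."""
    def __init__(self):
        self.entries = {}

    def add(self, name, inode):
        self.entries[name] = inode

    def serialize(self):
        # One blob per directory; one read reloads the whole thing.
        return json.dumps({n: vars(i) for n, i in self.entries.items()})

    def fragment(self, n_frags):
        """Split a very large directory into fragments by hashing names,
        so that independent metadata servers can each manage one piece."""
        frags = [DirObject() for _ in range(n_frags)]
        for name, inode in self.entries.items():
            idx = int(hashlib.sha1(name.encode()).hexdigest(), 16) % n_frags
            frags[idx].add(name, inode)
        return frags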

Most importantly (in my mind, at least), the system offers automatic data 
rebalancing/migration when scaling from a small cluster of just a few 
nodes to many hundreds, without requiring an administrator to carve the data 
set into static volumes or go through the tedious process of migrating 
data between servers.  When the file system approaches full, new nodes can 
be easily added and things will "just work."  Moreover, there is an 
unfortunate scarcity of 'enterprise-grade' features among open source 
Linux file systems when it comes to scalability, failure recovery, load 
balancing, and snapshots.  Snapshot support is a particularly sore point 
for me as block device-based approaches tend to be slow, inefficient, 
tedious to use, and unusable by the cluster file systems.
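
[Editor's note: the rebalancing claim above can be illustrated with a toy consistent-hashing experiment. This is not Ceph's algorithm, just a sketch of the property it aims for: when a node is added, only a proportional share of the objects migrate, rather than almost everything reshuffling.]

import bisect
import hashlib

def h(key):
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

class Ring(object):
    """Toy consistent-hash ring: an object lives on the first node whose
    point follows the object's hash (virtual nodes smooth the balance)."""
    def __init__(self, nodes, vnodes=100):
        self.points = sorted((h("%s#%d" % (n, v)), n)
                             for n in nodes for v in range(vnodes))
        self.keys = [p for p, _ in self.points]

    def locate(self, obj):
        i = bisect.bisect(self.keys, h(obj)) % len(self.points)
        return self.points[i][1]

objects = ["obj%d" % i for i in range(10000)]
before = Ring(["osd%d" % i for i in range(9)])
after = Ring(["osd%d" % i for i in range(10)])    # one node added
moved = sum(before.locate(o) != after.locate(o) for o in objects)
print("%.1f%% of objects moved" % (100.0 * moved / len(objects)))
# Only roughly a tenth of the data migrates to the new node.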

The following paper provides a good overview of the architecture:  

 http://www.usenix.org/events/osdi06/tech/weil.html


Although I have been working on this project for some time (it was the 
subject of my PhD thesis), I am announcing it to linux-fsdevel and lkml 
now because we are just (finally) getting started with the implementation 
of an in-kernel client.  There is a prototype client based on FUSE, but it 
is limited both in terms of correctness (due to limited control over cache 
consistency) and performance.  I am hoping to generate interest among file 
system developers with more kernel experience than I to help guide 
development as we move forward with porting the client to the kernel.

The kernel-side client code is in its infancy, but the rest of the system 
is relatively mature.  We have already been able to demonstrate excellent 
performance and scalability (see the above referenced paper), and the 
storage and metadata clusters can both tolerate and recover from 
arbitrary/multiple node failures.  I would describe the code base 
(weighing in at around 40,000 semicolon-lines) as early alpha quality: 
there is a healthy amount of debugging work to be done, but the basic 
features of the system are complete and can be tested and 
benchmarked.

If you have any questions, I would be happy to discuss the architecture 
in more detail.  The paper mentioned above is probably the best overview.  
The source and more are available at the project web site,

 http://ceph.sourceforge.net/

Please let me know what you think.

sage


This *is* exciting

November 15, 2007 - 12:15pm

I'm always interested in watching research projects that get positioned as real-world solutions -- and my favorite real-world solutions are those that make it into Linux... ;-)

I really like brick-based storage systems, and it would be wonderful to get one that:
1) Works as a Real Filesystem(tm) (e.g., not GFS).
2) Has a sane license (though who knows what IP encumbrances systems like these face).
3) Actually gets used by real people.

If made production-ready, a system like this would be perfect for layering with Linux-Vserver/OpenVZ in an ISP environment with lots of independent virtualized systems.

Hosting Ceph on SourceForge, using Subversion

November 15, 2007 - 2:28pm
Jakub Narebski (not verified)

I see that Ceph is hosted on SourceForge and uses Subversion as its version control system. Wouldn't it be better to use Git as the SCM, since it is the version control system Linux uses? For kernel work, I think you can use kernel.org for hosting; for GPL work you can use Savannah (the non-GNU section), which, unlike SourceForge, has Git support.

Linux coupling not very tight

November 16, 2007 - 10:48am
Anonymous (not verified)

Ceph seems to largely live in user space, so the coupling to the Linux kernel is not that tight. In fact, the current alpha code only has a user-space client (optionally interfacing with FUSE). When it is production-ready and there is an in-kernel Linux client working, I assume the user-space/kernel-space code size ratio will still be something like 90/10.

This looks really cool.

November 15, 2007 - 3:42pm
Nony mouse (not verified)

This looks really cool.

GPFS does not rely on

November 15, 2007 - 8:02pm
Seo Sanghyeon (not verified)

GPFS does not rely on symmetric access.

Symmetric access to what?

November 16, 2007 - 7:11am

Symmetric access to what? If you're going to bother writing a comment, please make sure that someone else can actually understand what you're talking about.

If you're referring to symmetric access to a shared disk (as in SAN file systems), then neither does Ceph.

Sage Weil said, "In contrast

November 16, 2007 - 11:08am
Seo Sanghyeon (not verified)

Sage Weil said, "In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely on symmetric access by all clients to shared block devices, Ceph separates data and metadata management into independent server clusters, similar to Lustre."

I simply pointed out that he is wrong: GPFS can operate without symmetric access to a shared block device, similar to Lustre.

Ah, you're right, my

November 16, 2007 - 1:16pm
sage weil (not verified)

Ah, you're right, my mistake. It looks like GPFS doesn't always require SAN connectivity for all nodes... those with shared access to disks can act in a proxy-like role. Either that's relatively new or I missed that detail when first studying GPFS a few years back.

In any case, the more important distinction is that access to shared storage resources is not based on a block-based interface and a DLM. Instead, Ceph separates data and metadata management (like most other object-based systems, including IBM StorageTank). This makes it easier to distribute more work to the storage nodes (e.g., local block allocation decisions, data replication, failure detection and recovery), improving scalability.
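
[Editor's note: the distinction drawn here can be sketched as two hypothetical interfaces; neither is Ceph's actual protocol. A block-based cluster file system asks storage to "write these bytes at this offset" and must coordinate allocation and locking among clients, while an object-based design asks it to "store this named object," leaving block allocation, replication, and recovery to the storage node itself.]

class BlockDevice(object):
    """Block interface: clients address raw offsets, so block allocation
    and locking must be coordinated among all clients (e.g., via a DLM)."""
    def read(self, offset, length):
        raise NotImplementedError
    def write(self, offset, data):
        raise NotImplementedError

class ObjectStore(object):
    """Object interface: clients name whole objects; the storage node
    decides local block layout itself and can also take on replication,
    failure detection, and recovery for the objects it holds."""
    def get(self, name):
        raise NotImplementedError
    def put(self, name, data):
        raise NotImplementedError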
