
Last night I set up a test cluster to play around with Ceph, a clustered, distributed, fault-tolerant file-system. Unlike my last attempt, around 6 months ago, this time I was able to get a file-system up and running. :-) My testing was far from thorough, but I did try writing data, and stopping and starting daemons on multiple machines to see how it reacted. I never lost data, though at times I lost access to it until I restarted the cluster.

Read on for more details of the cluster, its configuration, and my testing.

Cluster Description

My test cluster is 4 Atom 330 machines (dual core 1.6GHz with hyperthreading) with 1 or 2GB of RAM and 500GB discs. I set them up with the Ubuntu Maverick beta because it includes a very recent kernel and btrfs. Ceph relies on btrfs for recoverability.

I then just did an “apt-get install ceph” – it's in the repos.

Ceph Configuration

For the configuration, I used the example “simple” configuration from the ceph package and modified it as follows:

  • I commented out the “auth” and “keyring” lines. For my test cluster, I didn't want to deal with any issues that might be related to auth.
  • I created 3 “mon” entries based on the example. You need an odd number of mon hosts.
  • I created 4 “mds” entries, one for each host, based on the example.
  • I created 4 “osd” entries based on the example.
  • In the “osd[0-4]” sections I used “btrfs devs = /dev/sda4”, which was a 450GB partition I created just for Ceph testing.
  • In the main “osd” section, I added “osd journal size = 100” below the “osd journal” line. Without this, I was getting an error while trying to create the file-system.
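
As a rough sketch of what the per-daemon sections end up looking like, the loop below writes them to a scratch file. The section names (`[mon0]`, `[mds.test1]`, `[osd0]`), the host names test1 through test4 (only test1 and test2 are named in this post), and the 192.168.0.x addresses are assumptions for illustration; check them against the example file shipped with your ceph package.

```shell
# Sketch: emit per-daemon ceph.conf sections for a four-node test cluster.
# Host names, section names, and IP addresses are illustrative assumptions.
out="${OUT:-${TMPDIR:-/tmp}/ceph.conf.sketch}"
hosts="test1 test2 test3 test4"

{
    # Shared OSD settings. The example's other [osd] lines are omitted here.
    printf '[osd]\n'
    printf '\tosd journal size = 100\n\n'

    i=0
    for host in $hosts; do
        # Only the first three hosts get a monitor, keeping the count odd.
        if [ "$i" -lt 3 ]; then
            printf '[mon%d]\n\thost = %s\n\tmon addr = 192.168.0.%d:6789\n\n' \
                "$i" "$host" "$((10 + i))"
        fi
        # One metadata server and one OSD per host.
        printf '[mds.%s]\n\thost = %s\n\n' "$host" "$host"
        printf '[osd%d]\n\thost = %s\n\tbtrfs devs = /dev/sda4\n\n' "$i" "$host"
        i=$((i + 1))
    done
} > "$out"
```

This only produces the per-host fragment, not a complete configuration file.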

I put this configuration file in “/etc/ceph/ceph.conf”.

I was using the standard data location (/data), so on each node I did a “mkdir /data”.

Making the File-system

You need SSH agent forwarding and the ability to log in to all machines in the cluster via SSH. I already had my keys in place, so I just logged into my main host (called “test1”) with “ssh -A root@test1”.

The Ceph configuration file needs to be copied to each of the other systems with a command like this: scp /etc/ceph/ceph.conf root@test2:/etc/ceph/
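
With more than a couple of nodes, a small loop saves retyping. This is only a sketch: it assumes the other nodes are named test2 through test4 (only test1 and test2 are named above), and `push_conf` is a made-up helper name. By default it just prints the commands; run it with RUN set to an empty string to actually copy.

```shell
# Sketch: copy the Ceph config from test1 to the other nodes.
# test3 and test4 are assumed host names; push_conf is a hypothetical helper.
# RUN defaults to "echo" (dry run); invoke with RUN= to really copy.
RUN="${RUN-echo}"

push_conf() {
    for host in "$@"; do
        $RUN scp /etc/ceph/ceph.conf "root@${host}:/etc/ceph/"
    done
}

push_conf test2 test3 test4
```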

Finally, I created and started the file-system with: mkcephfs -c /etc/ceph/ceph.conf -a --mkbtrfs -k /etc/ceph/keyring.bin

This makes around 30 to 40 SSH connections (my SSH agent is configured to ask me for confirmation for every attempted connection, so I had to acknowledge each of these). Once this completed, Ceph was up and running with 1.5TB of storage across 4 servers.


To access the file-systems I used the Ceph FUSE module: cfuse -c /etc/ceph/ceph.conf /root/ceph/


At this point I could go into /root/ceph and write files. One thing I ran into: I started off with “dd if=/dev/zero of=bigfile” and found it was running very slowly, around 4MB/sec. This is just an artifact of dd's small default block size, though. With “bs=1024k”, my write performance went up to 28MB/sec.
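
To see the block-size effect for yourself, a sketch like the following writes the same amount of data twice, first in 512-byte blocks (dd's default) and then in 1024k blocks. The target directory, file names, and 16 MiB size are arbitrary choices so that it finishes quickly; point TARGET at the Ceph mount and raise the counts for a meaningful number.

```shell
# Sketch: write 16 MiB with two different dd block sizes; dd prints its own
# transfer statistics on stderr after each run.
# TARGET, the file names, and the sizes are illustrative assumptions.
TARGET="${TARGET:-${TMPDIR:-/tmp}}"

write_test() {
    # $1 = block size, $2 = block count, $3 = output file
    dd if=/dev/zero of="$3" bs="$1" count="$2"
}

write_test 512   32768 "$TARGET/bigfile.small"   # 32768 x 512 bytes = 16 MiB
write_test 1024k 16    "$TARGET/bigfile.large"   # 16 x 1024 KiB     = 16 MiB
```

On a local disc the result will mostly reflect caching, so the comparison is only interesting on the Ceph mount itself.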

I had the Ceph FUSE mounted on multiple systems and was able to write on one and read on another. That worked quite nicely.

I also tried stopping one of the nodes via “/etc/init.d/ceph stop” to see how it handled a node falling out of the cluster. On one test it was able to continue after a small hiccup. In another test, it completely hung the cluster, which still wasn't responding after 10 minutes. But I had active writes going on when I killed the mon/osd/mds node.
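
When the cluster hangs, anything that touches the mount can block along with it. Here is a sketch of a guarded check, assuming the coreutils `timeout` command is available and using the /root/ceph mount point from above; `probe_mount` is a made-up helper name.

```shell
# Sketch: report whether a directory listing completes within a time limit.
# Assumes coreutils `timeout`; probe_mount is a hypothetical helper name.
probe_mount() {
    # $1 = directory to probe, $2 = seconds to wait (default 10)
    if timeout "${2:-10}" ls "$1" > /dev/null 2>&1; then
        echo "responsive: $1"
    else
        echo "no response or error: $1"
        return 1
    fi
}

probe_mount /root/ceph 10 || true
```

A process stuck in uninterruptible I/O may not exit even when the timeout fires, so treat this as a convenience rather than a guarantee.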

Taking two nodes down definitely did cause problems.
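
One possible factor, and this is an assumption on my part rather than something I verified: if the monitors need a strict majority to keep operating, and both stopped nodes were among the three running a mon, then losing two leaves no majority. The arithmetic, as a small sketch (the function names are just for illustration):

```shell
# Sketch: majority-quorum arithmetic for a group of n monitors.
quorum_size() {
    # Smallest strict majority of $1 members.
    echo $(( $1 / 2 + 1 ))
}

tolerated_failures() {
    # Members that can be lost while a strict majority remains.
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 3 4 5; do
    echo "mons=$n quorum=$(quorum_size "$n") tolerated=$(tolerated_failures "$n")"
done
```

Four mons tolerate no more failures than three, which is the usual argument for an odd count.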

However, in all cases a reboot of all the nodes resulted in Ceph coming back up with access to the data I had previously written.

Obviously, I'd need to spend more time working out how to take a node out for maintenance and whether there are problems there that need to be addressed.


Ceph is pretty new, but it's looking very promising. It definitely did seem to work as a clustered file-system, though I was hoping for it to just keep plodding along when I did a restart of a node.

I don't know how much of that hanging was Ceph itself and how much was the Ceph FUSE module I was using to access it. Using Ceph as a simple block-device (which qemu can use for storing virtual file-systems) or as a simple object store may be more resilient than the file-system layer. That'd be worth testing.
