ZFS Under Linux: A User Report (tummy.com, ltd. Journal Entry)

Saturday July 05, 2008 at 15:00
Subject: ZFS Under Linux: A User Report
Keywords: Technical, ZFS
Posted by: Sean Reifschneider

Related entries:
   Why I Like ZFS. by Sean Reifschneider, Saturday July 05, 2008 at 02:38
   Putting it all together: The Ultimate Storage Box by Sean Reifschneider, Monday July 07, 2008 at 02:44
   A month of ZFS under Linux by Sean Reifschneider, Friday August 08, 2008 at 16:57

As was pointed out by Daniel Webb in a comment to my previous post, under Linux you have to use FUSE to use ZFS. He just replied before I had a chance to get the next post in this series out. :-)

We've been using ZFS under OpenSolaris for the last year or two in our hosting business for backup servers. It has some really compelling features (beyond what I mentioned in my last post) when used for backups. While it has worked well, it hasn't been entirely trouble-free. For a home backup/storage server I wanted to use ZFS, but I absolutely have to keep the data encrypted.

ZFS under OpenSolaris doesn't currently support on-disc encryption, though they are working on it. Linux has very mature disc encryption support; it's in the stock kernels and many installers support it now. That, plus my being very familiar with Linux, prompted me to look at ZFS under Linux again. Read on for my user report.

I've built and tested ZFS on FUSE previously and it was working. This time I was planning on some serious use, and had several options (including running Linux+Crypto on the base machine, exporting the block devices via iSCSI, and running OpenSolaris on another machine or in a virtual machine). Because of this, I decided to start my research by looking at the mailing lists.

ZFS on FUSE hasn't gotten a real release in quite a long time, around 15 months at the time of this writing. However, on the mailing list I saw a healthy amount of discussion and regular fixes being applied to it. So for my current set of tests I started with the latest code from version control.

My test system was running CentOS 5 with ten 250GB hard drives. Because of CentOS I was running an older kernel and FUSE, but things worked relatively well. I had some problems initially because one of my hard drives was failing -- something I knew because of RAID issues that had trashed my file-system, which is how this system became available for testing ZFS. :-)

I tracked down that bad drive (over a year out of warranty), and things got better but I still ran into a couple of situations where the system would lock up while I was running multiple backups.

My theory on this was that with only 2GB of RAM I was just thrashing the system while running rsyncs. ZFS under FUSE is known to use a lot of memory in the first place, and rsync version 2 stores a full file list in memory. So I upgraded the test system to 3GB of RAM, and at that point I didn't have any problems. I also installed rsync version 3, which can do incremental file lists, which saved a ton of memory.

However, my performance was pretty limited. This had nothing to do with ZFS under FUSE. A Celeron 3GHz just doesn't have the huevos for keeping up with 10 encryption processes plus the ZFS checksumming, etc...

I finally decided that I was happy enough with my testing that I was ready to bite the bullet and start trying to deploy the final system.

I upgraded the system to 14 500GB drives and a quad core 2.4GHz Core 2 CPU, but with only 2GB of RAM currently. I left a 250GB drive in place for the system disc (previously I just saved 4GB on every drive and used the first two drives as a mirror for the system). On this I installed Ubuntu 8.04 (an LTS release similar to CentOS, but with more recent software since it was released only a few months ago).

I built the latest development checkout of ZFS for FUSE and set up the 14 500GB drives as a raidz2 (redundant storage with two parity drives) on top of the encrypted partitions.
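
As a back-of-the-envelope check on what that layout provides, here is a small Python sketch. It assumes raidz2 simply gives up two drives' worth of space to parity and ignores metadata overhead, reserved space and GB-vs-GiB rounding. The function name is made up for illustration.

```python
def raidz_usable_gb(drives: int, drive_gb: int, parity: int = 2) -> int:
    """Approximate usable space of a single raidz vdev, in GB.

    Assumption: capacity is simply (drives - parity) * drive size.
    Real pools lose a bit more to metadata and reserved space.
    """
    if drives <= parity:
        raise ValueError("need more drives than parity devices")
    return (drives - parity) * drive_gb


# The 14 x 500GB raidz2 described above.
print(raidz_usable_gb(14, 500))  # 6000, i.e. roughly 6TB before overhead
```

So the two parity drives cost about 1TB of raw space out of 7TB.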

I then copied over the "zfs send" copies of the file-systems I had created on my test system, around 400GB of data. These are low-level copies of the file-system snapshots, containing the backups I've been making of a bunch of our laptops. I loaded the dumps back into their respective ZFS mount-points with no problems, which I was happy about. These backups took weeks to complete, because they were coming over slim upload pipes and further rate-limited so that they wouldn't impact other use of our networks while backups were running.

I next copied 2.2TB of data over from my storage server. This took a couple of days, but copied over with no problems at all. I found my old storage server had only a 100Mbps network adapter in it. I took it down to add a gigabit adapter, and then realized I had no more free PCI slots (because several were being used for 4-port SATA adapters). That was probably just as well: the 3.2GHz Celeron in the sending computer could only handle around 30MB/sec with all the crypto going on...
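
The "couple of days" figure lines up with simple arithmetic. The sketch below is only an estimate under stated assumptions: decimal units, a constant transfer rate, and the theoretical 12.5MB/sec ceiling of a 100Mbps link. The function name is hypothetical.

```python
def transfer_hours(data_tb: float, rate_mb_per_sec: float) -> float:
    """Hours to move data_tb terabytes at a sustained rate in MB/sec.

    Assumption: decimal units (1TB = 1,000,000MB) and a constant rate.
    """
    seconds = data_tb * 1_000_000 / rate_mb_per_sec
    return seconds / 3600


# 100Mbps is at most 12.5MB/sec on the wire; real throughput is lower.
print(round(transfer_hours(2.2, 12.5), 1))  # 48.9 hours, about two days
# At the ~30MB/sec the Celeron could manage, a gigabit link would be CPU-bound.
print(round(transfer_hours(2.2, 30), 1))    # 20.4 hours
```

Even with a gigabit adapter, the sending machine's crypto ceiling would have kept the copy at most of a day.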

At this point I have around 3TB on the system, so I started a "scrub" -- the ZFS equivalent of a RAID verify. I've done a few of them with no problems. It's running over 240MB/sec during the verify, which is perfectly reasonable. The OpenSolaris machines running native ZFS are seeing around 100MB/sec to 200MB/sec with no crypto, but those are 12 250GB drives or 8 500GB drives.

The only real gotcha is that on OpenSolaris you can access the snapshots via "/pool/fsname/.zfs/snapshot/snapshotname". Under Linux you don't have access to the ".zfs" hidden directory. So to access a snapshot you have to clone it. This is a very workable solution, but I had to spend a bit of time hunting around before I figured out what the story was.
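
For anyone hunting for the same answer, the clone workaround looks roughly like the sketch below. It only builds the "zfs clone" and "zfs destroy" command lines rather than running them, and the pool, file-system and snapshot names are hypothetical placeholders.

```python
import shlex


def clone_snapshot_cmd(dataset: str, snapshot: str, clone_name: str) -> list[str]:
    """Build the argv for cloning dataset@snapshot into clone_name."""
    return ["zfs", "clone", f"{dataset}@{snapshot}", clone_name]


def destroy_clone_cmd(clone_name: str) -> list[str]:
    """Build the argv for removing the clone once you are done with it."""
    return ["zfs", "destroy", clone_name]


# Hypothetical names: pool "tank", file-system "laptops", snapshot "2008-07-01".
cmd = clone_snapshot_cmd("tank/laptops", "2008-07-01", "tank/laptops-restore")
print(shlex.join(cmd))
# To actually run it (as root): subprocess.run(cmd, check=True)
```

The clone shows up as an ordinary writable file-system, so files can be copied out of it and the clone destroyed afterwards.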

So far I've had absolutely no issues with it. It's been running great. I've only been using ZFS under Linux seriously for about the last month, but so far things are looking really good.

Am I considering switching our OpenSolaris systems to Linux? Absolutely. Part of that is that we are much more familiar with Linux than OpenSolaris. Another part is that the hardware support under OpenSolaris is much more limited than under Linux. It was fairly painful to find SATA cards that were supported under OpenSolaris.

One final parting note, and this is true of ZFS under both platforms... The "zpool scrub" restarts whenever you create or delete a snapshot. The scrub is an important part of ensuring your data is happy and healthy, but if the time required to do the scrub is longer than the interval between your snapshots, your scrub will never finish. With ZFS you can do neat things like create a snapshot, run a "zfs send" to replicate the data to another machine, then destroy the previous snapshot and repeat. However, if you do this frequently (say, via cron every minute), it's going to seriously mess with the scrub.
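
A quick way to sanity-check a snapshot schedule against this is sketched below. It is a rough estimate, not a measurement: it assumes the scrub runs at a steady rate and is restarted from scratch by every snapshot create or destroy, and the function name is made up.

```python
def scrub_can_finish(pool_used_gb: float, scrub_mb_per_sec: float,
                     snapshot_interval_min: float) -> bool:
    """True if one full scrub pass fits between two snapshot events.

    Assumptions: the scrub runs at a steady rate and is restarted from
    scratch by every snapshot create or destroy, as described above.
    """
    scrub_minutes = pool_used_gb * 1000 / scrub_mb_per_sec / 60
    return scrub_minutes < snapshot_interval_min


# About 3TB at 240MB/sec needs roughly 208 minutes per scrub.
print(scrub_can_finish(3000, 240, snapshot_interval_min=1))        # False
print(scrub_can_finish(3000, 240, snapshot_interval_min=24 * 60))  # True
```

With the numbers from this system, a daily snapshot leaves plenty of room, while anything more frequent than every three and a half hours or so would starve the scrub.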

Comment
Daniel Webb
Subject: Very promising
Wow, this sounds very promising. Normally I wouldn't even consider being an early adopter of a filesystem, but in this case, ZFS is so much better than other filesystems with respect to integrity/robustness that I am very tempted.
Comment
Larry Hastings
Subject: Just in time!
Just last night I started ordering the parts for my new 10TB RAID. I was planning on going with OpenSolaris--solely so I could use ZFS RAIDZ2. I've been keeping an eye on ZFS-for-FUSE but didn't get the impression it was production ready. If I can run Linux, that's just fabulous news.

The Sun HCL listed a SATA card with a SI 3114 chipset; do those not work in your experience? 'Cause those are a dime a dozen.

Comment
doug
Subject: 32 bit or 64 bit?

Just wondering whether your set-up is 32-bit or 64-bit? I've heard ZFS is less buggy in 64-bit.

Comment
Ken Roberts
Subject: Hardware?
Hi Sean,

Thanks for such an informative post. I have long been interested in ZFS+Linux, and really appreciate a post from someone who has worked with it in production.

I may be building a big fileserver soon-- would you mind sharing details about your hardware (case, mobo, and sata cards)?

Thank you.

Comment
Author: Sean Reifschneider
Subject: Answers about the configuration...
The system is running 64-bit, primarily so that I could run virtualization in 64-bit if I decide I want to but also so that I could run the SMP F@H client on it (which requires 64-bit).

As far as the Silicon Image 3114 chipset goes, I initially started with those but had problems. That chipset only works with the non-RAID version of those boards, but all the boards I found had the RAID bit set. This may have been fixed since, but getting it working under OpenSolaris at the time required re-flashing the BIOS on the board, which I had no luck with. This is one of the benefits of running ZFS under Linux that I mentioned: much more hardware support. If you want to have the option of running under OpenSolaris, you'll want to pick another board. I'd expect that my current system is not capable of running directly under OpenSolaris.

As far as details about the setup, I plan to write up more on this soon. However, one of the cards I'm using is the one I wrote about a few days ago. Another is the Supermicro 8-port SATA PCI-X card. And finally I'm using 4 ports on the motherboard.

Sean