
It's been a month since I set up and started heavily using ZFS under Linux on my storage box, and it's been working quite well. So well, in fact, that I set up a new machine to migrate 3TB worth of ZFS snapshots from an OpenSolaris system we've been having problems with. That didn't go well at all. Read on for more details.

The storage box has been running without any problems. I have around 10 ZFS file-systems (under one pool) which I use for backups. Snapshots are created on those file-systems every day, and old snapshots are deleted after 7 days for dailies, 6 weeks for weeklies, etc… So far I've only really hit the daily removals. Also note that I create a clone of every one of those snapshots, because that's the only way you can access the snapshots under Linux: there is no ".zfs" directory.
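The day-to-day rotation boils down to something like the following sketch. The pool name "tank", the file-system names, and the exact retention here are just examples, not my actual script:

    #!/bin/sh
    # Daily snapshot/clone rotation for one backup file-system (example names).
    FS=tank/backups
    TODAY=$(date +%Y%m%d)
    OLD=$(date -d '7 days ago' +%Y%m%d)

    # Take today's snapshot, then clone it so it can be browsed
    # (zfs-fuse has no ".zfs/snapshot" directory to look in).
    zfs snapshot "${FS}@daily-${TODAY}"
    zfs clone "${FS}@daily-${TODAY}" "tank/clones/backups-daily-${TODAY}"

    # Drop the week-old clone first (a snapshot can't be destroyed while
    # a clone depends on it), then the snapshot itself.
    zfs destroy "tank/clones/backups-daily-${OLD}"
    zfs destroy "${FS}@daily-${OLD}"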

I also have another “storage” file-system which stores just a whole pile of other information, like copies of my photographs, misc data that we don't really need on our laptops, etc…

Every Sunday I run a "zpool scrub" on the pool. It's taking just under 4 hours to run with 2.5TB used across fourteen 500GB drives.
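The scrub is just a cron job; something along these lines (the schedule, pool name, and path are illustrative and may differ on your system):

    # /etc/cron.d/zpool-scrub -- weekly scrub, early Sunday morning
    0 3 * * 0   root   /usr/sbin/zpool scrub tank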

The storage server has really been quite solid. The only problem I've had was related to a power outage. We had a brief power outage (which we often do at our house), and somehow it made it past the UPS the system was on and freaked out either the drives or the SATA port multiplier. Once I rebooted the external SATA enclosure and the server, everything came back up, and a "zpool scrub" cleared up some checksum errors.

So I decided to set up an 8x500GB system to mirror this and copy over a bunch of snapshots from an existing backup server at our facility. That server was used for backing up the client machines we host through the hosting branch of our company.

I was using "zfs send" and "zfs recv" to copy the existing data, around 3TB of it, from the old system to the new one. Things were going along quite smoothly for a day or two, and I managed to copy over around 2TB of the data…
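The transfer itself was just the usual send/receive pipe over ssh, roughly like this (the host, pool, and snapshot names are made up for the example):

    # Full stream of the oldest snapshot, then incrementals up to the newest.
    zfs send tank/backups@snap1 | ssh newbox zfs recv -d newpool
    zfs send -i tank/backups@snap1 tank/backups@snap2 | ssh newbox zfs recv -d newpool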

Then the zfs-fuse daemon died during one of the sends, possibly because I attempted to kill a "zfs send" that was in progress. That has never been a problem on the OpenSolaris systems before.

It just goes downhill from there… I restarted zfs-fuse, but the majority of the devices were showing up as corrupted; I think one or two were ONLINE and the rest were just toast. The "zpool status" output basically said "time to recover from backups"…

So something about how zfs-fuse was writing the data resulted in the majority of the volumes being trashed. I was running a pair of 4-drive RAID-Z1 volumes, FYI.
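For reference, that layout is roughly what you'd get from a create command like this (assuming both RAID-Z1 sets live in one pool; the pool and device names are placeholders):

    # One pool built from two 4-drive RAID-Z1 vdevs (8 drives total).
    zpool create tank \
        raidz1 sda sdb sdc sdd \
        raidz1 sde sdf sdg sdh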

So, there definitely seems to be an issue related to zfs send/recv, and the failure mode is pretty dramatic.

I checked to see if any newer versions of zfs-fuse were available, because when I started testing the storage box there had been recent development activity. Sadly, that activity is still the most recent, so there's no newer release that might have fixed this problem.

So, I've had some mixed results. The more typical use case is working very well, but my attempted deployment on another machine completely fell over. Of course, you should always have backups, even if you aren't playing with fairly experimental technology, so this failure mode may not be a big deal in real use. I'm leaving my storage server as-is, but am definitely making sure I have backups. :-)

