Rsync and rdiff-backup: Two great tastes that go great together. (tummy.com, ltd. Journal Entry)

Thursday December 10, 2009 at 08:18
Subject: Rsync and rdiff-backup: Two great tastes that go great together.
Keywords: Backups, Technical
Posted by: Sean Reifschneider

I really like the idea of rdiff-backup, but the drawbacks kept stopping me from deploying it more widely. The nicest thing is that it stores deltas as files change, so if you have a large file that changes a little bit every day, rdiff-backup only stores the little bit that changes. The rsync hard-link trick for keeping historic data around, by contrast, stores a whole new copy of the file every day it changes, which can quickly add up on a slowly changing multi-gigabyte database file.

Problems with rdiff-backup include:

  • Server and client versions need to be the same. In a mixed environment, this means you're going to have to maintain your own packages for many of CentOS 4, CentOS 5, Hardy, Karmic, Fedora 12... My personal backup server has clients that are most of those...
  • rdiff-backup doesn't deal very well with intermittently connected systems. If you have a big set of changes that takes several days to push up at a throttled rate, and you are disconnected part way through, it needs to start over from scratch.
  • Worse, this failure may require running the next rdiff-backup with a special option to clean up the broken backup directory.
  • No throttling like the "--bwlimit" option in rsync.

After giving this all some thought, I came up with the idea of using rsync to pull the data over the network, and using rdiff-backup to maintain the historic backup information. Read below for my experiences with this.

rsync is very good at pulling data from remote sites via slow cable modems and the like, which is exactly what backing up our laptops to a central location requires. And using the "--partial --inplace" options means that a single huge file (like an ISO or virtual machine image) will eventually make it across, even if you rate limit the transfer so the slow outbound connection at home doesn't get saturated.
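As a rough sketch of what such a pull could look like, here is a small Python helper that assembles the rsync command line. The host name, paths, and the 40 KB/s limit are made-up placeholders, and "-a" and "--delete" are my own assumptions about sensible mirror options rather than anything described above.

```python
import shlex


def build_rsync_pull(remote_host, remote_path, dest_dir, bwlimit_kbps=40):
    """Return the argv list for a throttled, resumable rsync pull.

    All argument values are hypothetical; adjust them to your own setup.
    """
    return [
        "rsync",
        "-a",                            # assumption: archive mode (recursive, keeps perms/times)
        "--partial",                     # keep partially transferred files after an interruption
        "--inplace",                     # update the destination file directly instead of a temp copy
        "--bwlimit=%d" % bwlimit_kbps,   # throttle, in kilobytes per second
        "--delete",                      # assumption: mirror deletions from the client
        "%s:%s" % (remote_host, remote_path),
        dest_dir,
    ]


# Print the command so it can be inspected before running it for real.
print(shlex.join(build_rsync_pull("laptop.example.com", "/home/", "/backups/laptop/rsync/")))
```

Because an interrupted run leaves the partial file in place, simply re-running the same command later picks up where the previous attempt stopped.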

Once the rsync finishes, run an rdiff-backup from the rsync destination directory, to another directory. This means that you don't have to worry about syncing the version of rdiff-backup, since it only runs on one machine. Unfortunately, it does mean that you have another copy of all the files, doubling the space used.
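A minimal sketch of that two-step sequence, assuming the classic "rdiff-backup SOURCE DEST" invocation with both directories on the backup server. The function names and the injectable `runner` are hypothetical helpers for illustration, not the controlling script I actually use.

```python
import subprocess


def build_rdiff_backup(rsync_dir, history_dir):
    """Classic 'rdiff-backup SOURCE DEST' form; both paths are local."""
    return ["rdiff-backup", rsync_dir, history_dir]


def run_backup(rsync_cmd, rdiff_cmd, runner=subprocess.call):
    """Run rsync first and only record history if it exited cleanly.

    `runner` takes an argv list and returns an exit status; it defaults
    to subprocess.call and can be swapped out for a dry run.
    """
    if runner(rsync_cmd) != 0:
        # The mirror may be incomplete, so skip the rdiff-backup step
        # rather than recording a half-synced state as an increment.
        return False
    return runner(rdiff_cmd) == 0
```

Skipping the rdiff-backup step after a failed rsync is a design choice in this sketch: the next successful pull simply produces one larger increment.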

I had this clever idea of hard-linking the rdiff-backup directory files to the rsync destination files, but that's incompatible with "rsync --inplace", and also rdiff-backup seems to go through and break the hard links anyway.

So, the question is whether the deltas save enough space to make up for the duplicate copy of the current system data. My experience with rsync hardlinks and also with BackupPC makes me think that it probably does in most cases.

Because it's running locally, I should never have to worry about an incomplete rdiff-backup run. If I'm running rdiff-backup directly to my laptop, I do have to worry about it not being complete before I do a reboot, go to or from the coffee shop, etc...

Another thing I'm trying out is having multiple rdiff-backup directories, so that I get different intervals for backups. So on the first day of the month I'll rdiff-backup to a "monthly" directory, and other days I'll go to a "daily" directory. I could do the same for the first day of the week, but currently I'm just doing daily (keeping for 30 days) and monthly in case I need to go back a really long time.
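The directory choice and the 30-day trimming could be sketched like this. The base path is a placeholder and the helper names are hypothetical; the pruning command uses rdiff-backup's "--remove-older-than" option, with "--force" because more than one increment may be removed at once.

```python
import datetime


def history_dir_for(base_dir, today):
    """Pick the 'monthly' directory on the 1st, 'daily' otherwise."""
    kind = "monthly" if today.day == 1 else "daily"
    return "%s/%s" % (base_dir, kind)


def prune_cmd(history_dir, keep="30D"):
    """argv for dropping increments older than `keep` (30D = 30 days)."""
    return ["rdiff-backup", "--remove-older-than", keep, "--force", history_dir]


target = history_dir_for("/backups/laptop", datetime.date.today())
print(target)
print(prune_cmd("/backups/laptop/daily"))
```

Only the "daily" directory gets pruned in this sketch; the "monthly" one is left to grow, on the assumption that monthly increments are small enough to keep for a long time.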

With the hard link trick and rsync, or even BackupPC (which uses its own rsync implementation so that it can do deduplication and compression of the stored files), the amount of space required for keeping a reasonable amount of history just explodes. For servers in particular I like to keep a week or two of daily backups, at least 6 weeks of weekly, and at least a year of monthly history.

So far I've been running this configuration on around a dozen machines for 3 weeks. It's worked as well as I was hoping it would. Once I got the controlling script worked out, it's been maintenance free and has just worked. There have been no issues with rdiff-backup getting upset.

Of course, what I really want is to use snapshots to manage the deltas between rsync backups. But, LVM snapshots just won't cut it there, so I'm kind of stuck waiting until btrfs matures.

Comment
Paul Mack
Subject: rsync with --link-dest
You could use rsync with --link-dest "link destination"

I have rsync set up with LVM snapshots and it works very well. I have been running this for about 6 months without any issue.

Check out the following URL for more info: http://www.mikerubel.org/computers/rsync_snapshots/

Paul...

Comment
Author: Sean Reifschneider
Subject: Nope, hard-links are what I'm trying to avoid.

--link-dest is exactly the sort of "linking trick" that I reference in my original post. It works fine for fairly small sets of data, but it's quite expensive to remove these archived backups (requiring hundreds of thousands or millions or more file delete operations, directory traversal, etc).

The biggest issue is, say you have a 20GB file that every night has 100KB appended to it. Or 100KB within it updated. You know, a fairly typical database file. And you are keeping 30 incremental copies...

With the hard links, this requires 600GB of storage.

Using rdiff-backup or ZFS snapshots, you end up using more like 20.003GB for the same dataset (the 20GB file plus 30 deltas of 100KB each). Or a space savings of about 97%.
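The arithmetic behind that comparison can be checked with a few lines of Python. This is a back-of-the-envelope sketch: it uses decimal units and assumes each delta is exactly the size of the changed data, ignoring metadata and delta-format overhead.

```python
def hardlink_storage_gb(file_gb, copies):
    """Every changed day stores a complete new copy of the file."""
    return file_gb * copies


def delta_storage_gb(file_gb, daily_change_kb, copies):
    """One full copy plus one small delta per kept increment.

    Decimal units: 1 GB = 1,000,000 KB. Overhead is ignored (assumption).
    """
    return file_gb + copies * daily_change_kb / 1_000_000


full = hardlink_storage_gb(20, 30)
delta = delta_storage_gb(20, 100, 30)
savings = 1 - delta / full

print("hard links: %.0f GB" % full)
print("deltas:     %.3f GB" % delta)
print("savings:    %.1f%%" % (savings * 100))
```

The full copy dominates the delta case, so the result is about 20GB against 600GB no matter how the small daily changes are rounded.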

Disk space is cheap, but it's not that cheap when you're talking about 5TB versus 150TB...

As far as LVM snapshots go, you have to know beforehand how much space one of these snapshots is going to require, or you have to overcommit and make the snapshot volume larger than you ever expect it to reach, or you have to set up the snapshots to automatically extend when they run short (but not out) of space.

As far as overallocating, I'm looking at having a thousand backup copies on one of our larger systems. Over-allocating by even just 1GB results in a wasted terabyte right there. And I probably can't guess that close to right. Or I have to snapshot the whole backup file-system, and count on each snapshot being rather large, but also that if I move a backup from one host to another, it's still going to have all of that data reserved (probably through the old snapshot copies) for the next year.

Oh, and if you have 30 snapshots of a piece of data and it changes, you have to write and keep 30 copies of that changed data (one for each snapshot). Or you have to have rolling snapshots (one snapshot snapshotting another snapshot; can you even do that?) and then only be able to trim off the ends, so no trimming from the middle, which means you need a separate chain of rolling snapshots for each backup type.

If you have had good luck with setting up and managing 1,000 LVM based snapshots, I'd love to hear about it. However, it seems like it would be a maintenance and performance nightmare. I've toyed with trying it out, since I think LVM snapshots are more robust than ZFS or btrfs are right now, but I just haven't gotten up the urge to try it.

So, the mechanism I wrote about is similar in ideas to many of these, but I believe it's dramatically simpler than LVM snapshots while saving more space than hard links.

The target I'm shooting for is like ZFS snapshots. They automatically manage their space, so you don't have to guess at how much space is going to be used by a snapshot, and it gets allocated out of the main file-system. This is because as blocks are written, they are copied elsewhere (copy on write). And you can create many "light weight" sub-file-systems within the ZFS file-system.

So, I'd create a "backups" ZFS, and then within that make one file-system for each system. For each of those, every night I would take a snapshot. I would delete the snapshots as time went on such that I ended up keeping monthly interval snapshots beyond 6 weeks, and weekly beyond 14 days.
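That thinning schedule could be expressed roughly as follows. This is a sketch under my own assumptions: a "weekly" snapshot is the one taken on a Sunday, a "monthly" snapshot is the one taken on the 1st, and monthly snapshots are never expired here. The function names are hypothetical, and the actual snapshot removal (for example with "zfs destroy") is left out.

```python
import datetime


def keep_snapshot(taken, today):
    """Decide whether a nightly snapshot taken on `taken` is still kept."""
    age = (today - taken).days
    if age <= 14:
        return True                  # every nightly snapshot for 14 days
    if taken.day == 1:
        return True                  # monthly anchor (assumption: the 1st), kept indefinitely
    if taken.weekday() == 6 and age <= 42:
        return True                  # weekly anchor (assumption: Sunday) for 6 weeks
    return False


def snapshots_to_destroy(snapshot_dates, today):
    """Return the dates whose snapshots can be removed, in input order."""
    return [d for d in snapshot_dates if not keep_snapshot(d, today)]


today = datetime.date(2009, 12, 10)
dates = [
    datetime.date(2009, 12, 1),    # recent, kept
    datetime.date(2009, 11, 22),   # Sunday within 6 weeks, kept
    datetime.date(2009, 11, 20),   # ordinary day older than 14 days
    datetime.date(2009, 10, 4),    # Sunday older than 6 weeks
    datetime.date(2009, 10, 1),    # monthly anchor, kept
]
print(snapshots_to_destroy(dates, today))
```

Note that the monthly and weekly anchors have to be exempted from the earlier cut-offs; a snapshot deleted at day 15 is not around to become the monthly copy later.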

ZFS manages all the complexity behind the scenes.

Sean

Comment
Chris G.
Subject: Various thoughts
Hi. I read your blog from time to time and always find something interesting. With my startup, I'm planning out backup strategies for several boxes, and we hope to have more of them in the future. One important point is which backup strategy and medium to use. I've gathered my thoughts into some questions that I hope you can find the time to answer briefly.

- I've read earlier you are using BackupPC. For a few hosts it's working well for us. Did you not find this tool sufficient, having to look for alternatives (besides the ability of partial updates with other mentioned BU tools)?

- Have you ever thought of using tapes? The equipment is relatively expensive, but tapes are relatively cheap. And no worries of many files and hardlinks. There are worries of other issues, of course. But with that many hosts you're dealing with daily, I'd consider using them...

- Have you thought of using agent-based backups? Many of them can solve the ever-increasing monolithic file backup problems too (well, some of it, most particular being the backup of various database files on MS hosts and Outlook PST files, etc). There are free tools to choose from, too, like A(Z?)manda, Bacula, etc.

- I'm also wondering why you like to back up database files as-is (this being one of your primary reasons for looking at diffing backup tools). A database datafile or tablespace file might not be in sync with disk contents (TBH on even a moderately intensive workload, it never is), so it's practically impossible to create a consistent, working copy. A mechanism should be used that either syncs/locks database files during backup (which in turn causes a stall in DB access), or other methods should be considered, IMO, like dumping consistent backups of the DB in question, and backing up that dump. Or use facilities that are meant for consistent offline backups, like the binary logs of MySQL, etc.

- I've noticed you changed your VPS infrastructure to VMware some time ago (what was your company using before that, btw?). Do you have any recommendations on what to look for when doing incremental backups of VMs? We too plan on providing a VPS solution, most probably based on OpenVZ and/or KVM virtualization. The latter uses images similar to VMware's.

- chris

Comment
Author: Sean Reifschneider
Subject: BackupPC is good for some things
BackupPC is a very nice tool for some situations. I was planning on deploying it for our laptop backups, but the workflow I developed for our normal backups is built for Ubuntu, and Ubuntu was having issues on the system I deployed for our laptop backups. CentOS 5, which I ended up putting on that host, uses packages that would have required quite a lot more work beyond the existing workflow.

So, that was the motivation that got me to spend some time trying this other backup mechanism. rdiff-backup has some very nice features.

As you say, some database files do need to be dumped or flushed to get a consistent backup. I was speaking primarily of database files that are append-only, or similar files like log files.

Yes, we have thought of using tapes, but they do not really fit our needs. We have also considered agent-based backup systems, but have dismissed them for similar reasons.

For backing up virtual systems we basically just treat them as if they were normal machines. That has served us well.

Sean

Comment
Chris G.
Subject: rdiff-backup
Sean, thanks for your replies. Well, I have basically 2 other problems with rdiff-backup (besides the ones you already noted): it doesn't compress, and it can't currently back up ACLs and EAs. And on several setups we need to use them. OTOH I'm somewhat more familiar with tools that create monolithic archives + increments (yes, they're more vulnerable to disk errors), like dar, which I started experimenting with lately. And, if possible, I'm for a free, open source solution. Anyway, your insight and experience discussed here shed some more light on some other aspects too, thanks again.

Comment
Paul Mack
Subject: rdiff-backup
I took a look at rdiff-backup and made the move to it. This makes a lot more sense than rsync with hard links.

Thanx for turning me on to rdiff-backup.

Comment
Scott
Subject: Keep Rsyncs?
This looks intriguing to me, but I'm curious if you are keeping the rsyncs around (so you only need to xmit the rsync diffs before taking rdiffs) and incurring a ~100% storage requirement increase, or are you removing the rsync copy every time?

Thanks

Comment
Author: Sean Reifschneider
Subject: Yes, there are several rsync copies.
Yes, I keep several rsync copies: one for the daily rdiff incremental source and one for the monthly rsync source. I don't have the bandwidth to the backup server to do a full rsync every day -- it's limited by the 768kbps on my home network, and I have 4 or 5 machines to do every day.

So far it has been working extremely well. Once I got the initial setup issues ironed out, it hasn't had any problems.