Thursday December 10, 2009 at 08:18
Subject: Rsync and rdiff-backup: Two great tastes that go great together.
Keywords:
Backups, Technical
Posted by: Sean Reifschneider
I really like the idea of rdiff-backup, but the drawbacks kept
stopping me from deploying it more widely. The nicest thing is that it
stores deltas as files change, so if you have a large file that changes a
little bit every day, rdiff-backup only stores the little bit that changes.
If you use the rsync hard-link trick to keep historic data around, it
duplicates the whole file every day which can quickly add up on a slowly
changing multi-gigabyte database file.
Problems with rdiff-backup include:
- Server and client versions need to be the same. In a mixed
  environment, this means you're going to have to maintain your own
  packages for many of CentOS 4, CentOS 5, Hardy, Karmic, Fedora 12...
  My personal backup server has clients that are most of those...
- rdiff-backup doesn't deal very well with intermittently connected
  systems. If you have a big set of changes that takes several days to
  push up at a throttled rate, and you are disconnected part way through,
  it needs to start over from scratch.
  Worse, this failure may require running the next rdiff-backup with
  a special option to clean up the broken backup directory.
- No throttling like the "--bwlimit" option in rsync.
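One way the two tools can be combined to sidestep these problems is sketched below: rsync does the throttled, resumable network transfer into a staging copy on the backup server, and rdiff-backup then runs purely locally from that staging copy into the archive. This is a minimal sketch, not the exact setup from this post; the host name, paths, and bandwidth figure are placeholders, and `build_commands` and `run_backup` are hypothetical helper names.

```python
import subprocess

def build_commands(source, staging, archive, bwlimit_kbytes=64):
    """Return the rsync and rdiff-backup command lines as argument lists."""
    rsync_cmd = [
        "rsync", "-a", "--delete",
        "--partial",                    # keep partially transferred files after a disconnect
        f"--bwlimit={bwlimit_kbytes}",  # throttle the network transfer
        source, staging,
    ]
    # rdiff-backup runs locally on the backup server (classic "source dest" form),
    # so only the server's own rdiff-backup version is involved.
    rdiff_cmd = ["rdiff-backup", staging, archive]
    return rsync_cmd, rdiff_cmd

def run_backup(source, staging, archive):
    """Run the transfer first; only take the rdiff increment if it succeeded."""
    rsync_cmd, rdiff_cmd = build_commands(source, staging, archive)
    subprocess.run(rsync_cmd, check=True)
    subprocess.run(rdiff_cmd, check=True)

# Placeholder host and paths, for illustration only.
rsync_cmd, rdiff_cmd = build_commands(
    "client.example.com:/home/",
    "/backups/staging/client/home/",
    "/backups/archive/client/home/",
)
print(" ".join(rsync_cmd))
print(" ".join(rdiff_cmd))
```

Because an interrupted rsync can simply be rerun, the rdiff-backup step only ever sees a complete staging copy.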
| Comment |
Paul Mack Subject: rsync with --link-dest |
You could use rsync with --link-dest ("link destination").
I have rsync set up with LVM snapshots and it works very well. I have been running this for about 6 months without any issue.
Check out the following URL for more info:
http://www.mikerubel.org/computers/rsync_snapshots/
Paul...
| Comment |
Author:
Sean Reifschneider Subject: Nope, hard-links are what I'm trying to avoid. |
--link-dest is exactly the sort of "linking trick" that I reference in my original post. It works fine for fairly small sets of data, but it's quite expensive to remove these archived backups (requiring hundreds of thousands or millions or more file delete operations, directory traversal, etc).
The biggest issue is, say you have a 20GB file that every night has 100KB appended to it. Or 100KB within it updated. You know, a fairly typical database file. And you are keeping 30 incremental copies...
With the hard links, this requires 600GB of storage.
Using rdiff-backup or ZFS snapshots, you end up using more like 20.003GB for the same dataset (the 20GB file plus 30 deltas of 100KB each). Or a space savings of 97%.
Disk space is cheap, but it's not that cheap when you're talking about 5TB versus 150TB...
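The arithmetic behind that comparison can be checked with a few lines of Python. This is a rough sketch that assumes decimal units and that a delta costs about as much as the data that changed; real tools add some overhead on top.

```python
# Rough storage comparison for the example above (decimal units assumed).
GB = 10**9
KB = 10**3

file_size = 20 * GB        # the large database file
nightly_change = 100 * KB  # data appended or updated each night
copies = 30                # incremental copies kept

# Hard-link rotation: a changed file cannot be shared, so each copy is a full file.
hard_link_total = file_size * copies

# Delta storage: one full copy plus one small delta per night.
delta_total = file_size + nightly_change * copies

savings = 1 - delta_total / hard_link_total

print(f"hard links: {hard_link_total / GB:.0f} GB")  # 600 GB
print(f"deltas:     {delta_total / GB:.3f} GB")      # 20.003 GB
print(f"savings:    {savings:.0%}")                  # 97%
```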
As far as LVM snapshots go, you have to know beforehand how much space one of these snapshots is going to require, or you have to overcommit and make the snapshot volume larger than you ever expect it to reach, or you have to set up the snapshots to automatically extend when they run short (but not out) of space.
As far as overallocating, I'm looking at having a thousand backup copies on one of our larger systems. Over-allocating by even just 1GB results in a wasted terabyte right there. And I probably can't guess that close to right. Or I have to snapshot the whole backup file-system, and count on each snapshot being rather large, but also that if I move a backup from one host to another, it's still going to have all of that data reserved (probably through the old snapshot copies) for the next year.
Oh, and if you have 30 snapshots of a piece of data and it changes, you have to write and keep 30 copies of that changed volume (one for each snapshot). Or you have to have rolling snapshots (one snapshot snapshotting another snapshot; can you do that?) and then only be able to trim off the ends, so no trimming from the middle -- you need rolling snapshots for each backup type.
If you have had good luck with setting up and managing 1,000 LVM-based snapshots, I'd love to hear about it. However, it seems like it would be a maintenance and performance nightmare. I've toyed with trying it out, since I think LVM snapshots are more robust than ZFS or btrfs are right now, but I just haven't gotten up the urge to try it.
So, the mechanism I wrote about is similar in ideas to many of these, but I believe it's dramatically simpler than LVM snapshots while saving more space than hard links.
The target I'm shooting for is like ZFS snapshots. They automatically manage their space, so you don't have to guess at how much space is going to be used by a snapshot, and it gets allocated out of the main file-system. This is because as blocks are written, they are copied elsewhere (copy on write). And you can create many "light weight" sub-file-systems within the ZFS file-system.
So, I'd create a "backups" ZFS, and then within that make one file-system for each system. For each of those, every night I would take a snapshot. I would delete the snapshots as time went on such that I ended up keeping monthly interval snapshots beyond 6 weeks, and weekly beyond 14 days.
ZFS manages all the complexity behind the scenes.
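The thinning schedule described above (every nightly snapshot up to 14 days, weekly up to 6 weeks, monthly beyond that) could be sketched like this. It only decides which snapshot dates to keep; it does not talk to ZFS. Keeping the oldest snapshot in each ISO week and each calendar month is my assumption for the sketch, and `snapshots_to_keep` is a hypothetical helper name.

```python
from datetime import date, timedelta

def snapshots_to_keep(snapshots, today):
    """Return the set of snapshot dates to keep under the thinning schedule."""
    keep = set()
    seen_weeks = set()
    seen_months = set()
    for snap in sorted(snapshots):  # oldest first, so the earliest in each bucket wins
        age = (today - snap).days
        if age <= 14:
            keep.add(snap)  # keep every nightly snapshot
        elif age <= 42:
            week = tuple(snap.isocalendar()[:2])  # (ISO year, ISO week)
            if week not in seen_weeks:
                seen_weeks.add(week)
                keep.add(snap)
        else:
            month = (snap.year, snap.month)
            if month not in seen_months:
                seen_months.add(month)
                keep.add(snap)
    return keep

today = date(2009, 12, 10)
nightly = [today - timedelta(days=n) for n in range(120)]
kept = snapshots_to_keep(nightly, today)
print(len(nightly), "snapshots taken,", len(kept), "kept")
```

Anything not in the returned set is a candidate for deletion on that night's run.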
Sean
| Comment |
Chris G. Subject: Various thoughts |
Hi. I read your blog from time to time and always find something interesting. With my startup, I'm planning backup strategies for several boxes, and we hope to have more of them in the future. One important point is which backup strategy and medium to use. I'll gather my thoughts in the form of some questions that I hope you can find time to answer briefly.
- I've read earlier you are using BackupPC. For a few hosts it's working well for us. Did you not find this tool sufficient, having to look for alternatives (besides the ability of partial updates with other mentioned BU tools)?
- Have you ever thought of using tapes? The equipment is relatively expensive, but tapes are relatively cheap. And no worries of many files and hardlinks. There are worries of other issues, of course. But with that many hosts you're dealing with daily, I'd consider using them...
- Have you thought of using agent-based backups? Many of them can solve the ever-increasing monolithic file backup problems too (well, some of it, most particular being the backup of various database files on MS hosts and Outlook PST files, etc). There are free tools to choose from, too, like A(Z?)manda, Bacula, etc.
- I'm also wondering why you like to backup database files as-is (being one of your primary reasons looking for diffing backup tools). A database datafile or tablespace file might not be in sync with disk contents (TBH on even a moderately intensive workload, it never is), so it's practically impossible to create a consistent, working copy. A mechanism should be used that either syncs/locks database files during backup (which in turn causes a stall in DB access), or other methods should be considered, IMO, like dumping consistent backups of the DB in question, and backing up that dump. Or use facilities that are meant for consistent offline backups, like the binary logs of MySQL, etc.
- I've noticed you changed your VPS infrastructure to VMWare some time ago (what was your company using before that, btw?). Do you have any recommendations on what to look for when doing incremental backups of VMs? We too plan on providing some VPS solution; the most probable solution we'll provide is OpenVZ and/or KVM-based virtualization. The latter uses images similar to VMWare.
- chris
| Comment |
Author:
Sean Reifschneider Subject: BackupPC is good for some things |
BackupPC is a very nice tool for some situations. I was planning on deploying it for our laptop backups, but the workflow I developed for our normal backups is for Ubuntu and Ubuntu was having issues on the system I deployed for our laptop backups. CentOS 5, which I ended up putting on that host, uses packages that required quite a lot more work above what the existing workflow was.
So, that was the motivation that got me to spend some time trying this other backup mechanism. rdiff-backup has some very nice features.
As you say, some database files do need to be dumped or flushed to get a consistent backup. I was speaking primarily of database files that are append-only, or similar files like log files.
Yes, we have thought of using tapes, but they do not really fit our needs. We have also considered agent-based backup systems, but have dismissed them for similar reasons.
For backing up virtual systems we basically just treat them as if they were normal machines. That has served us well.
Sean
| Comment |
Chris G. Subject: rdiff-backup |
Sean, thanks for your replies. Well, I have basically 2 other problems with rdiff-backup (besides the ones you already noted): it doesn't compress and it can't currently back up ACLs and EAs. And on several setups we need to use them. OTOH I'm somewhat more familiar with tools that create monolithic archives + increments (yes, they're more vulnerable to disk errors), like dar, which I started experimenting with lately. And, if possible, I'm for a free, open source solution. Anyway, your insight and experience discussed here shed some more light on some other aspects too, thanks again.
| Comment |
Paul Mack Subject: rdiff-backup |
I took a look at rdiff-backup and made the move to it. This makes a lot more sense than rsync with hard links.
Thanx for turning me on to rdiff-backup.
| Comment |
Scott Subject: Keep Rsyncs? |
This looks intriguing to me, but I'm curious if you are keeping the rsyncs around (so you only need to xmit the rsync diffs before taking rdiffs) and incurring a ~100% storage requirement increase, or are you removing the rsync copy every time?
Thanks
| Comment |
Author:
Sean Reifschneider Subject: Yes, there are several rsync copies. |
Yes, I keep several rsync copies: one for the daily rdiff incremental source and one for the monthly rsync source. I don't have the bandwidth to the backup server to do a full rsync every day -- it's limited by the 768kbps on my home network, and I have 4 or 5 machines to do every day.
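As a back-of-the-envelope check on that bandwidth limit, here is the most the link could move in a day. It is a rough sketch that assumes 768kbps means 768,000 bits per second, that the link is saturated around the clock, and that protocol overhead is ignored; the machine count of 5 is the upper figure mentioned above.

```python
link_bits_per_second = 768_000   # 768kbps, taken as 768,000 bits per second
seconds_per_day = 24 * 60 * 60
machines = 5

# Best case: the link carries nothing but backup traffic all day.
bytes_per_day = link_bits_per_second // 8 * seconds_per_day

print(f"{bytes_per_day / 10**9:.1f} GB per day at best")                   # 8.3
print(f"{bytes_per_day / machines / 10**9:.1f} GB per machine per day")    # 1.7
```

That is far less than a full copy of even one machine, which is why only the rsync differences go over the wire each day.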
So far it has been working extremely well. Once I got the initial setup issues ironed out, it hasn't had any problems.