Rdiff-backup Real Use

Posted by JD 10/20/2013 at 14:00

I like rdiff-backup. It isn’t perfect, but for my needs, it fits. I’ve written about it, mostly in abstract ways, over the years. It seemed like time to show a non-trivial example. Below is the command used to back up a major storage server here.

nice /usr/bin/rdiff-backup --include /root --exclude /Data/1TB/Other/VMs/Win7Ult/Win7Ult-Data2.img --exclude /Data/1TB/Other/VMs/Win7Ult/Win7Ult-os.img --include /etc --include /usr/local --include /export --include /Data/1TB --include /raid/algoloma_mirror --include /Data/win7ult/Quicken12 --include /var/www --include /var/lib/libvirt --include /var/log/nginx --include /var/lib/awstats --exclude-special-files --exclude / --exclude /Backups --exclude **/.gnupg --exclude **/.gvfs --exclude **/.cpan --exclude **/romulus --exclude **/regulus --exclude **/.cache --exclude **/.kde/share/apps/kpdf --exclude **/.gkrellm2 --exclude **/.miro/icon-cache --exclude **/.miro/mozilla --exclude **/.adobe --exclude **/.smplayer/screenshots --exclude **/.mozilla-thunderbird --exclude **/.vlc --exclude **/.mozilla/firefox/*/chrome/sagetoo --exclude **/.mozilla/firefox/*/sessionstore --exclude **/.mozilla/firefox/*/Cache / /Backups/romulus/
Sorry it is ugly. Inside the script, it is very easy to see – includes, excludes, and the target to receive the backup. To be certain the exact command used was put here, a print/echo was added to the script mirroring the exact statement built by the script. Most of the excludes are to prevent accidental inclusion of very large files.

Extra care was required to prevent the Win7 Ultimate VM files from being included. That is why those 2 exclude statements are near the start. Order matters.
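
rdiff-backup checks the --include/--exclude options in the order given, and the first one that matches a file decides whether it is backed up. Here is a cut-down sketch of that idea using paths from the command above. The function name build_backup_cmd is made up for this example, and it only prints the command instead of running it.

```shell
#!/bin/bash
# Sketch only: shows why the narrow --exclude comes before the broader
# --include that would otherwise match the same file.
# build_backup_cmd is a hypothetical name, not part of the real script.

build_backup_cmd() {
    # Selection options are checked in order; the first match wins.
    set -- \
        --exclude /Data/1TB/Other/VMs/Win7Ult/Win7Ult-os.img \
        --include /Data/1TB \
        --include /etc \
        --exclude-special-files \
        --exclude / \
        / /Backups/romulus/
    # Print instead of running, like the echo added to the real script.
    echo nice /usr/bin/rdiff-backup "$@"
}

build_backup_cmd
```

If the --include /Data/1TB line came first, the VM image would already be matched as included before the exclude was ever looked at.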

This does not show the few setup steps needed to capture all the metadata necessary for a restore, nor does it include many of the largest storage areas, which hold media files.

Metadata such as the list of installed packages, CPAN and Ruby gem modules, and crontabs all gets dropped under /root/backup/ before the backup starts. The script also cleans up temporary objects that aren’t wanted in any backups.
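
A minimal sketch of what such a metadata step might look like. This is not the real script: it assumes a Debian-style system with dpkg, and the helper names and output file names are invented for the example. Commands that are not installed are simply skipped.

```shell
#!/bin/bash
# Sketch only. Helper and file names below are invented for this example.

# save_if_available DIR NAME CMD [ARGS...]
# Runs CMD only if it is installed; writes its stdout to DIR/NAME.
save_if_available() {
    local dir=$1 name=$2
    shift 2
    if command -v "$1" >/dev/null 2>&1; then
        "$@" > "$dir/$name" 2>/dev/null || true
    fi
}

# collect_metadata DIR: save package, gem and crontab lists into DIR.
collect_metadata() {
    local dest=$1
    mkdir -p "$dest" || return 1
    save_if_available "$dest" dpkg-selections.txt dpkg --get-selections
    save_if_available "$dest" gem-list.txt        gem list
    save_if_available "$dest" crontab-root.txt    crontab -l
}

# Run as root before the backup starts:
#   collect_metadata /root/backup
```

Because the output is plain text under /root/backup/, it rides along with the --include /root in the main command.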

Large media files don’t get the rdiff-backup treatment. rsync is used for areas with media files that do not change much. Things get moved around, added and removed there, and keeping versioned history of all those changes isn’t worth the backup storage, hence the use of rsync.
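
For the media areas, a plain mirror is enough. A sketch, assuming the goal is an exact copy with no history; the function name and both paths in the usage line are hypothetical, not taken from the real script.

```shell
#!/bin/bash
# mirror_media SRC DEST: make DEST an exact copy of SRC.
# -a keeps permissions, times and symlinks; --delete removes files from
# DEST that no longer exist in SRC, so nothing old is kept around.
mirror_media() {
    rsync -a --delete "$1/" "$2/"
}

# Hypothetical usage:
#   mirror_media /Data/media /Backups/media
```

The trade-off is that a file deleted from the source is gone from the mirror after the next run, which is acceptable here but would not be for the areas handled by rdiff-backup.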

Don’t be afraid of rdiff-backup based on the command above. The command really could be as simple as:
rdiff-backup --exclude-special-files / user@remote::/Backups/{hostname}

You might test it using just /etc as the source. /etc is a critical directory with system settings AND relatively small. It is very easy to back up and lets you see how rdiff-backup stores files like a mirror with reverse-incremental data off the most recent backup.
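
A small sketch of that /etc trial run. The function name and the default target directory are made up for the example; the rdiff-backup options are the classic command-line ones, and newer releases may prefer a different syntax.

```shell
#!/bin/bash
# try_etc_backup [DEST]: back up /etc, then list the stored increments.
# The default DEST is a hypothetical scratch location.
try_etc_backup() {
    local dest=${1:-/Backups/etc-test}
    rdiff-backup /etc "$dest" &&
        rdiff-backup --list-increments "$dest"
}

# Usage (as root, so every file under /etc is readable):
#   try_etc_backup /Backups/etc-test
```

After the first run, the target holds a plain copy of /etc plus an rdiff-backup-data directory. Change a file, run it again, and the second increment shows up in the listing.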

Being able to restore a system from yesterday, last week or last month is very important to security.
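
Restoring an older version is one command. A sketch using the classic --restore-as-of option; the function name and the paths in the usage lines are hypothetical.

```shell
#!/bin/bash
# restore_as_of WHEN BACKUP_PATH TARGET
# WHEN is an rdiff-backup time spec such as 1D, 1W or 1M.
# TARGET should not already exist; restoring to a scratch location
# first makes it easy to compare before overwriting anything.
restore_as_of() {
    rdiff-backup --restore-as-of "$1" "$2" "$3"
}

# Hypothetical usage:
#   restore_as_of 1D /Backups/romulus/etc/fstab /tmp/fstab.yesterday
#   restore_as_of 1W /Backups/romulus/etc       /tmp/etc.lastweek
```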

Please give it a try, especially if you aren’t doing backups today. You will thank me, I’m certain.

Cleanup

Keeping backups forever would be nice, but isn’t realistic. Eventually, backup storage runs out, so managing that by removing older backups will be needed.
rdiff-backup --remove-older-than 60D /Backups/{hostname}
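
With several machines backed up to the same place, the cleanup can be looped. A sketch that assumes one directory per host under a common root, as in /Backups/{hostname}; the function name is made up. --force is included because rdiff-backup refuses to remove more than one old session at a time without it, so leave it off if you prefer that safety check.

```shell
#!/bin/bash
# prune_backups ROOT AGE: remove increments older than AGE from every
# rdiff-backup directory directly under ROOT.
prune_backups() {
    local root=$1 age=$2 repo
    for repo in "$root"/*/; do
        # Skip anything that is not an rdiff-backup target.
        [ -d "${repo}rdiff-backup-data" ] || continue
        rdiff-backup --force --remove-older-than "$age" "$repo"
    done
}

# Usage:
#   prune_backups /Backups 60D
```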

How much space is used?

A single backup takes the same storage as the original files, as expected, but every day files are added, modified, and removed. Those changes are tracked efficiently. Around here, 30 days of changes requires about 10% more storage than the first backup. So, if the source area is 20G, the target backup storage would be about 22G for 30 days’ worth of backups. Your mileage may vary, based on the amount of data changed.

A few actual examples might be useful.

Tag         - mirror size - 30 day size
DMS         - 4.85G       - 4.99G
Desktop     - 5.12G       - 5.83G (60 days)
Redmine     - 16MB        - 18.9MB (60 days)
rorDev      - 86.7MB      - 208MB
email-gw    - 11.9MB      - 11.9MB
email       - 10.8G       - 11.7G
rmt-desktop - 1.67G       - 1.74G
blog        - 911MB       - 1030MB
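
To turn a row into a percentage, divide the growth by the mirror size. A small sketch; overhead_pct is a made-up helper name, and the inputs are rows from the table above.

```shell
#!/bin/bash
# overhead_pct MIRROR TOTAL: percent growth of TOTAL over MIRROR.
# Both values must be in the same unit.
overhead_pct() {
    awk -v m="$1" -v t="$2" 'BEGIN { printf "%.1f\n", (t - m) / m * 100 }'
}

overhead_pct 4.85 4.99    # DMS      -> 2.9
overhead_pct 86.7 208     # rorDev   -> 139.9
overhead_pct 11.9 11.9    # email-gw -> 0.0
```

The spread is wide: a box that barely changes costs almost nothing extra, while a development box with lots of churn can more than double. rdiff-backup's --list-increment-sizes option reports this kind of size information for a backup directory.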

This gives an idea of how efficient it can be to back up OS metadata plus the settings and data for important servers. The numbers above are REAL data for real Linux servers here, pulled this morning.

Here is the real backup script (slightly modified for general use).

Get those backups going!

There are lots of other tools that can do similar things. Even if rdiff-backup doesn’t fit your needs, some other tool will. Keep looking; it is likely that there is a tool out there that meets your needs.

  1. JD 10/21/2013 at 10:28

    Sometimes we need to back up a DB. Here’s a script that does that and places the output in a date-stamped file that will be included in the rdiff-backup areas above. Simple. There’s no need to take the DB or webapp offline to get a perfectly recoverable DB backup.

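    #!/bin/bash
    # Dump one MySQL DB to a date-stamped, gzipped file under $BACKDIR.
    DB_NAME=redmine_default
    BACKDIR=/root/backup
    /usr/bin/mysqldump -u root \
        -p"$(cat ~root/bin/"$DB_NAME".passwd.root)" "$DB_NAME" | \
        gzip > "$BACKDIR/${DB_NAME}_$(date +%Y%m%d).gz"
    # Remove dumps last modified more than 60 days ago.
    find "$BACKDIR" -type f -name "${DB_NAME}*" -mtime +60 -exec rm {} \;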

    If there is more than 1 DB on the machine, make the DB_NAME parameter one of the passed in arguments to the script, $1 looks good for that.

  2. Will 10/21/2013 at 14:41

    Hi John,
    Sorry for hijacking this post to leave an unrelated comment, but I didn’t see any other contact methods here. I found this blog by going through your comments on Gawdfazasdfasdfo (*). I really just wanted to ask a general question regarding a lot of your Linux posts…I’ve been slowly migrating into the world of Linux, and I’m trying to find a good resource for “Best Practices”, and I am really enjoying the content I’ve found here. However, a lot of the information here seems to come from pre-2012 times, and I’m wondering how relevant they still are, or if there aren’t newer, better practices now.

    Anyways, a lot of rambling, tl;dr…Can you comment on the relevance of a lot of your posts from pre-2012?

  3. JD 10/21/2013 at 18:14

    I’m always learning new things, so some articles may not be my current advice.

    CLI stuff for Linux doesn’t change much over the years, so I’m using the same commands today as I did in 1994. A few tools are on my shit list due to failures – fdisk – is one. Others have been replaced by newer commands … or rather, the old commands have been deprecated and we are stuck with the newer versions. nslookup is one of those – use dig. Upstart is slowly replacing the init.d/ scripts that have been working for 40+ yrs. Is it better with the newer methods? Is forcing the use of another layer over the /etc/resolv.conf better? I don’t believe it is, but …. everyone has an opinion.

    Ask a specific question about a specific post if you want a specific answer. At least my blog provides a published date. Sites that don’t, drive me crazy!

  4. dbp 05/30/2014 at 04:37

    What I suggest is using LVM snapshots to back up the whole system as a tarball or whole VM image, to get (1) zero downtime and (2) a simple restore. Of course, this comes at the cost of much more storage space.

  5. JD 05/30/2014 at 12:02

    dbp – I’ve looked at snapshots and like the idea for HA servers.

    Where snapshots don’t make sense to me is when the RPO is 24 hrs and the RTO is over 1 hr.

    As an example, our email front-end server doesn’t have any data on it really. It is just the first line in bouncing unwanted email. 120 days of backups (the rdiff-backup way) is 27MB. The entire running system uses just 3.1G.

    How much storage is needed for 120 days of LVM snapshots for that backup?