Optimized Backups for Physical and Virtual Machines
My old backup method was a little cumbersome. To ensure a good backup set, I’d take down the virtual machine, mount the VM storage on the host (Xen), then perform an rdiff-backup of the entire file system, before bringing the VM back up again. This happened daily, automatically, around 3:30am. It has been working for over 3 years with very few hiccups. I’ve had to restore entire VMs and that has worked too. One day I needed to restore the Zimbra system ASAP. From the time I decided to do the restore until end-users could make use of the system was 20 minutes. That’s pretty sweet in my book.
There are some issues with the current setup.
- Backups are performed locally, to a different physical disk before being rsync’ed to the backup server. This is necessary because the backup tool versions are different and incompatible between Ubuntu 8.04 and 10.04 LTS servers.
- Each system is completely shut down for some period of time during the backup process. It is usually 1-4 minutes, but that is still downtime.
- Most of the systems are still 8.04 paravirtual machines under Xen. Some type of migration to newer OSes is needed. I should use this opportunity to make things better.
- Some of the systems are running old versions of software which are not up to current patch levels. I guess this happens in all IT shops. None of that is available outside the VPN, so the risks are pretty low.
I think I can do better.
Better How?
- More efficient on storage
- More efficient on networking
- More efficient on backup time, with reduced downtime
- Back up only the necessary files, not the entire VM
- Be able to restore easily. This is key. Backups without restores are just like swimming against a waterfall. Useless.
- Use HVM virtualization. This further decouples the VM from the hypervisor, allowing vastly different OSes and kernels to run. System upgrades and migrations should be easier.
- Support all the newer VM capabilities: live migration, easier replication.
KVM VMs and New rdiff-backup Techniques
Most of the VMs here have a single purpose. Zimbra runs in 1 VM. Only the things that Zimbra needs are installed on that system, besides firewalls and attack/system monitoring tools. It has Zimbra on it and that’s pretty much it. Why do I need to back up 1-3GB of OS files at all? The important files are all related to running Zimbra. These include:
- crontab for Zimbra and root
- currently installed list of packages
- currently installed list of CPAN modules
- currently installed list of Ruby Gems
- current hardware list (any specialized disk and network setups)
- /etc
- /var/www (if needed)
- /usr/local (monitoring software and DBs are here)
- /opt (Zimbra and all supporting files for it)
- /var/log/special (remote access log files)
- /var/lib/special (remotely used RRD files)
- HOME for root (HOME for other important accounts too)
That’s it. With these areas backed up, I think a complete restore to a stock OS image will completely recover the system. Tests will validate this belief.
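For the metadata items in that list (package, CPAN and gem inventories, crontabs, hardware), a small collection script can dump everything into one directory that then rides along in the regular backup. This is only a minimal sketch; the /var/backups/meta path and the zimbra account are assumptions, not the exact script used here.

```
#!/bin/sh
# Minimal sketch: collect system metadata into one directory that the
# regular rdiff-backup run will pick up. Paths and account names are
# illustrative assumptions.
set -e
META=/var/backups/meta
mkdir -p "$META"

# crontabs for root and the service account
for u in root zimbra; do
    crontab -l -u "$u" > "$META/crontab.$u" 2>/dev/null || true
done

# currently installed packages
dpkg --get-selections > "$META/packages.list"

# locally installed CPAN modules (older cpan clients lack -l)
cpan -l > "$META/cpan.list" 2>/dev/null || true

# installed Ruby gems
gem list > "$META/gems.list" 2>/dev/null || true

# hardware summary (lshw may need to be installed first)
lshw -short > "$META/hardware.list" 2>/dev/null || true
```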
Storage Efficient
Obviously, if we don’t back up the entire OS, apps and data, then we will be more efficient than the other backup method being used. How much more efficient?
This blog runs on a KVM VM. It is using 3.2GB of storage for everything.
Backing up just the important areas listed above requires just 760MB. I suspect that over time, say with 30-90 days of daily incremental backups, only 1GB will be required.
Network Efficient
I’m still using rdiff-backup, so only changed files are backed up, as reverse increments. The main areas included are listed above; each VM differs a little, but the same script is used. The largest file that changes is the DBMS. The backup script keeps track of the elapsed time; the last backup for this blog VM took about 90 seconds. Under Xen, the last backup log shows the same VM backup took 3m:25s, which includes shutting down the VM, doing the full system (incremental) backup and restarting the OS.
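As a rough sketch of what one timed area looks like, assuming a hypothetical backup host and repository path:

```
# Hypothetical example of one timed rdiff-backup push; the host and
# paths are placeholders, not the real setup.
START=$(date +%s)
rdiff-backup /opt backup@backuphost.example.com::/srv/backups/blogvm/opt
echo "/opt backup took $(( $(date +%s) - START )) seconds" >> /var/log/backup-times.log
```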
Safety
Before backing up a DBMS, we need to automatically either
- shutdown the DBMS to ensure it is quiesced or
- create a DBMS dump file with all the data which can be restored
Most other programs and files can safely be backed up while still running. It is just the DBMSs that could end up corrupted; for those, the dump file will be needed for any restore.
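A minimal sketch of the dump-file approach, assuming MySQL and made-up paths and credentials:

```
# Quiesce the DBMS by dumping it just before the file-level backup.
# Database flavour, credentials file and dump location are assumptions.
mysqldump --defaults-extra-file=/root/.my.backup.cnf --single-transaction \
    --all-databases | gzip > /var/backups/mysql-all.sql.gz

# PostgreSQL equivalent, if that were in use:
# su -c "pg_dumpall" postgres | gzip > /var/backups/pgsql-all.sql.gz
```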
Less Downtime
Under this new method (new to us anyway), we won’t be taking the entire VM down for a backup. That will save 30 to 60 seconds, but it will introduce other issues that we haven’t had to deal with due to long running servers. We will deal with those issues as they arise.
Because we are backing up about a quarter as much data, the unavailable periods will be shorter too.
Because the VMs and backup server are running the same OS, they have the same version of the backup tool, rdiff-backup. No local backups are needed. Pushing directly to the backup storage will take less overall time than the way we did it previously. This will free up an entire 500GB HDD in a VM server. It will probably be used for RAID1 storage for higher redundancy than we currently have. In the future, version incompatibilities may become an issue again, for example if/when we migrate the hypervisor to the 12.04 LTS release while each VM is still on 10.04 or 8.04. There are lots of possible solutions to address rdiff-backup incompatibilities, including:
- NFS mount the backup area
- sshfs mount the backup area
- local rdiff-backup repos, then rsync to the storage server (similar to what we do now)
- Others …
Practicing with unimportant VMs to find workable solutions is our plan of attack.
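The sshfs option, for example, might look something like the sketch below, with placeholder host and paths; the point is that only the VM's local rdiff-backup version ever touches the repository.

```
# Mount the backup server's storage locally so only one rdiff-backup
# version is involved. Host and paths are placeholders.
sshfs backup@backuphost.example.com:/srv/backups/blogvm /mnt/backup
rdiff-backup --include /opt --include /etc --exclude '**' / /mnt/backup
fusermount -u /mnt/backup
```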
Restoration
Well, this is the big question that is still to be answered. If there’s a failure and I need to rebuild the VM from scratch, how quickly can that be completed, and then how quickly can the backed up data be restored? Can all that be scripted? I’m certain there’s an optimal way to handle this. Perhaps:
- a monthly complete VM image (complete snapshot) is manually stored with the backups or
- the base OS instance that all our servers are built from is used or
- reinstall the base OS (ssh-server only), all listed packages, then restore the backed up areas and finally install all listed cpan, ruby-gems and other modules. Doing that will probably take more time than the current restore method, but probably not more than an hour (I’m guessing). The restore process needs to be script-able, so it works at 4am. I’ve been using a method similar to this for my desktop OS for a few years. It works fine for a new desktop, but may not work well enough for a production server; the scripting will be critical.
- deploy something like Puppet to dynamically rebuild the full VM configuration. Puppet has a big problem: Ruby. We don’t run 50 of the same servers here; each is completely unique, so having a Puppet config for each server seems like lots of overhead for very little payback.
I’m leaning towards the first choice since that will have more of the base system already installed.
Am I crazy for switching methods? Should I just find a way to keep doing the current method? I still have doubts about both methods under KVM. OTOH, my desktop backups have been working that way for a while. I’ve migrated to new OSes multiple times using those methods. About 2 yrs ago, I posted an article about rdiff-backups for HOME. That method does work.
Update 2/2014:
So I’ve been doing backups this way for a few years. It works well and is space, network and time efficient. When restoring from these minimal backups, I do something like this:
- reinstall the base OS (ssh-server only),
- update/dist-upgrade with aptitude,
- restore the backed up data areas (/usr/local, /home, whatever is in the backups)
- reinstall all listed packages,
- purge any packages like nano, gedit, compiz, unity-asset-pool, geoclue* (this is automatic via Ansible),
- finally install all listed cpan, ruby-gems and other modules (if perlbrew or rvm aren’t used)
Doing that takes about an hour. The restore process needs to be script-able, so it works at 4am. That script can be a program or simple steps to follow.
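Scripted, those steps might look roughly like this. The backup host, repository layout and list file locations are assumptions rather than the actual script, and a real version needs far more error handling.

```
#!/bin/sh
# Rough restore sketch following the steps above; run on a freshly
# installed base OS (ssh-server only). All hosts and paths are
# illustrative placeholders.
set -e

apt-get update && apt-get -y dist-upgrade

# restore the backed-up data areas from the rdiff-backup repository
for area in etc opt usr/local root; do
    rdiff-backup -r now \
        "backup@backuphost.example.com::/srv/backups/blogvm/$area" "/$area"
done

# reinstall the recorded package set
dpkg --set-selections < /var/backups/meta/packages.list
apt-get -y dselect-upgrade

# purge the unwanted packages (done via Ansible in the real setup)
apt-get -y purge nano gedit compiz unity-asset-pool 'geoclue*'
```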
I’m highly sceptical the increased complexity is worth the increased efficiency. I’m with Warren Buffett here: If my success depends on a chain of things working correctly, I prefer mono-linked chains.
[ed: Almost removed this comment – very spam-like by not clearly referencing the article.]
FS: I’m beginning to agree with your statement concerning the added risk of failure as the restore process complexity increases. Working through deployment of this new backup method on a few systems has me thinking it may not be worthwhile in the long run. What’s 3GB extra per server backup?
I have it running on 2 new VMs now. Adding the 2nd one really cleaned up the perl script controlling everything. It’s the little things at this point. I’m pushing the backups to a central server. I think I need to pull to better manage I/O contention on the backup server. I’d feel better about pulling from a security standpoint too since some of the servers are internet facing.
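A pull from the backup server could be as simple as the sketch below, with placeholder names; a restricted ssh key on each VM would limit what the backup host may run.

```
# Run on the backup server: it initiates the transfer, so the VMs never
# hold credentials for the backup storage. Host and paths are placeholders.
rdiff-backup --include /etc --include /opt --exclude '**' \
    root@blogvm.example.com::/ /srv/backups/blogvm
```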
Running multiple rdiff-backup commands per server is working, but getting reports about successful or failed completion and the total storage used is less than ideal. I need to switch over to a single command with judicious include and exclude sections. This will increase the downtime for the primary service on each machine, from just what is required to back up the DBMS and program areas to also covering all the other system settings under /var and /etc. In theory, that shouldn’t be too long. In theory.
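One way to get a single command per VM is a globbing filelist plus a statistics report. This is only a sketch; the filelist path and contents are assumptions.

```
# Keep the include/exclude logic in one file and get one report per run.
cat > /etc/backup-includes.txt <<'EOF'
+ /etc
+ /opt
+ /usr/local
+ /var/log/special
+ /var/lib/special
+ /root
- **
EOF
rdiff-backup --print-statistics \
    --include-globbing-filelist /etc/backup-includes.txt \
    / backup@backuphost.example.com::/srv/backups/blogvm
```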
I’ve also discovered that there are different versions of CPAN installed on the different systems here. The only reason that’s an issue at all is because the older Ubuntu packages don’t include the -l switch to list locally installed cpan packages. It may be safe to assume that if I didn’t use perlbrew to control the specific version of perl, then we probably are not using cpan on that system. That just sounds like a big assumption that might not turn out to be true.
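Where the packaged cpan client is too old for -l, a one-liner using ExtUtils::Installed (which ships with perl itself) should produce an equivalent list:

```
# Fallback for older cpan clients without the -l switch.
perl -MExtUtils::Installed -le 'print for ExtUtils::Installed->new->modules'
```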
Still, backing up just 1/4th the total data and being able to rebuild a system is enticing.
Seeing what others over at Debian Admin are doing for backups has me thinking we are all acting like software developers and not administrators. Being lazy is a core value for a good system administrator.
You might need to add your username to the Backup Operators group. It sounds like you are NOT a local administrator. You should log in to the computer as the Administrator and then follow these steps to add a user to the Backup Operators group:
- Open Computer Management (right click on My Computer and choose Manage).
- In the console tree, click Groups (Computer Management, Local Users and Groups, Groups).
- In the details pane, double-click Backup Operators.
- On the Backup Operators Properties dialog box, click Add.
- On the Select Users or Groups dialog box, next to Name, type the domain and user name of the person you want to make a backup operator. Type the domain name first, then a backslash, then the user name. For example, London\William (in this case this is your computer name\user).
- Click Add, and then click OK.
Important: to add a user to the Backup Operators group on a domain controller you must use Active Directory Users and Computers. Notes: you must be a member of the Administrators group to designate users as Backup Operators. To open Computer Management, click Start, point to Settings, and then click Control Panel, then double-click Administrative Tools, and then double-click Computer Management.
@Akiko: Good idea ... if I were actually running MS-Windows. Further, since local groups are not part of Windows editions below Professional, and Active Directory (AD) definitely is not, this is impossible, even on Windows.
From the User groups in Windows help page:
OTOH, for people running Windows Professional or higher, adding yourself to the Backup Operators group might be necessary. I dunno.