Last updated 2008/03/18
Suffield Academy provides networked disk space for all of its users, and encourages its use as part of a regular backup strategy. We back up the entire disk array nightly to another hard disk as part of our disaster recovery plan.
Unfortunately, we do not have enough space to archive each of these full nightly backups. So, while we are protected against the server crashing, we do not have the ability to recover files from a particular point in time.
To remedy this problem, we designed a custom backup script to satisfy the following criteria:
Snapshot backups are supported by most backup packages. However, the need to integrate user directory exclusion and reporting led us to write our own backup script.
Our script uses the rsync utility to perform the actual backups,
though other utilities (such as cpio) could be substituted with
relative ease. The script is written in Perl, and uses a few
external modules for user reporting (LDAP and SMTP).
Note that this script is intended for backups to hard disk (not tape). Because it stores full snapshots of the backed up files, yet excludes common files, the backups must all reside on a single disk so that links (inodes) may be shared between files.
(This section deals with the design considerations for the script. If you just want to start using the script, skip down to the usage section.)
While searching for methods to create snapshot backups, we found an excellent strategy for backing up only incremental changes. It uses hard links on the filesystem to store redundant (i.e., unchanged) files, and rsync to transfer only the files that change between backup sessions. Please read the paper for more information; a full discussion of the technique is beyond the scope of this document.
We investigated existing solutions that use this strategy (including a program called rsnapshot, which looked very promising). Based on the features and existing code base, we decided to adopt this strategy for our backups.
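The hard-link strategy can be sketched as a single rsync invocation per snapshot. The example below is written in Python purely for illustration (the actual script is Perl). It assumes an rsync version that supports the --link-dest option; whether our script uses that option or another linking method is not stated here. The function name and the directory names are hypothetical.

```python
import posixpath


def snapshot_command(source, backup_root, new_name, previous_name=None, rsync="rsync"):
    """Build the argument list for one hard-linked snapshot (illustrative only)."""
    dest = posixpath.join(backup_root, new_name)
    cmd = [rsync, "-a"]
    if previous_name is not None:
        # Files identical to those in the previous snapshot are hard-linked
        # into the new snapshot instead of being copied again.
        cmd.append("--link-dest=" + posixpath.join(backup_root, previous_name))
    # The trailing slash tells rsync to copy the contents of the source directory.
    cmd += [source.rstrip("/") + "/", dest]
    return cmd


print(snapshot_command("/Volumes/r1/Web", "/Volumes/Snapshots",
                       "2008-03-18", "2008-03-17"))
```

The first snapshot has nothing to link against, so previous_name is simply left out and rsync performs a full copy.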
We would have liked to use an existing solution for backing up, but encountered a few significant issues that could not be resolved without writing custom code:
Because of these issues, we needed to write our own code to perform the backups. The project started as glue code to tie together existing solutions, but it quickly became clear that a fully custom solution would be needed to address all of the issues.
In the sections below, we discuss each of the issues in more detail.
Most Macintosh computers running OS X use the HFS+ filesystem for storing data. HFS+ has two features that frustrate file transfers, especially to non-Mac OS X systems: file metadata (especially type and creator codes), which has no counterpart on other filesystems, and forked files, which split a file into several distinct parts. Because other filesystems do not use this approach, storing HFS+ data correctly on them is much more difficult.
Additionally, many of the common Unix-based utilities that come with
Mac OS X 10.3 and earlier are not aware of these extra HFS+
attributes. This includes rsync, cp, tar, and cpio,
which are all common backup utilities on Unix machines.
With the arrival of Mac OS X 10.4 (Tiger), many of these issues have been mitigated. 10.4 comes with new versions of the common Unix utilities that are compatible with HFS+ files and metadata. Therefore, if you are running that version of the operating system, these concerns become less important.
If you cannot upgrade to OS X 10.4, some of the functionality can be emulated using special versions of the Unix utilities:
We ended up choosing RsyncX as our core backup utility. All machines involved in the backups ran Mac OS X, so it was not a problem to install the patched binary on all the servers. It preserved the most data of any rsync-based project, and it provided the cleanest upgrade to a Mac OS X 10.4-based solution (where the binaries are compatible with HFS+).
As we migrate to Mac OS X 10.4, we simply change a configuration parameter to use the new native rsync package, instead of RsyncX.
Manual patch to Apple's sources:
http://www.lartmaker.nl/rsync/
Vanilla patches that handle HFS (call them EA):
http://www.onthenet.com.au/~q/rsync/
MacOSXHints article with instructions on patching your own vanilla sources with the rsync+hfsmode patches above:
http://www.afp548.com/article.php?story=20050219192044818
To save space in our backups, we needed the ability to exclude files from the backup based upon a user's quota status. We do not enforce a hard quota limit on our fileserver (to allow users to store large files temporarily), but we didn't want to waste space with large files that didn't need to be backed up.
When backing up user home directories, the script communicates with the LDAP server to find users that are over quota. If a user is over their quota, files are excluded from the backup (until the non-excluded files fit within their quota). When a user's files are excluded, their e-mail address is queried from the LDAP database and the user is notified that certain files were not backed up.
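The exclusion step can be pictured as follows. This is a Python sketch for illustration only, not the Perl code from the script. The text above does not say which files are excluded first, so the largest-first order used here is an assumption, as are the function name and the input format.

```python
def choose_exclusions(files, quota_bytes):
    """Pick files to exclude until the remaining files fit within the quota.

    files is a list of (path, size_in_bytes) pairs. Largest files are
    excluded first; that ordering is an assumption made for this sketch.
    """
    remaining = sum(size for _, size in files)
    excluded = []
    for path, size in sorted(files, key=lambda item: item[1], reverse=True):
        if remaining <= quota_bytes:
            break  # everything still included now fits within the quota
        excluded.append(path)
        remaining -= size
    return excluded


home = [("movie.mov", 700), ("thesis.doc", 40), ("notes.txt", 5)]
print(choose_exclusions(home, quota_bytes=100))
```

The returned paths would then be turned into rsync exclude patterns and listed in the notification e-mail.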
As we began testing the backup scripts, we encountered a problem when syncing large file trees. Rsync stores the list of all files to synchronize in memory before it begins the transfer. Due to the large number of files we were backing up, we exhausted available memory on the server before the backups completed. (Rsync uses about 100 bytes per file, and we were routinely backing up around ten million files, requiring on the order of 1 GB of RAM to process.)
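The arithmetic behind that estimate is simple enough to check directly. The short Python sketch below uses the approximate figures quoted above; the function name is ours, and the per-file cost is a rough figure rather than a measured value.

```python
def file_list_memory_bytes(file_count, bytes_per_file=100):
    """Rough memory needed for rsync's in-memory file list."""
    return file_count * bytes_per_file


estimate = file_list_memory_bytes(10_000_000)
print(f"{estimate / 1e9:.1f} GB")
```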
To resolve this issue, our script performs a two-phase sync using rsync. The first phase synchronizes the directory tree to a specified depth (usually two levels). Then, the script iterates over all the leaves at that level of the directory tree and synchronizes each one separately. In the case of user home directories, this reduced the file list size by two to three orders of magnitude.
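The second phase depends on finding the directories that sit at the chosen depth. Here is a minimal Python sketch of that step, for illustration only (the script itself is Perl, and the function name is hypothetical). It only lists the leaf directories; the rsync invocations for each phase are left out because their exact options are not described here.

```python
import os


def dirs_at_depth(root, depth):
    """Return the directories exactly `depth` levels below root, as relative paths."""
    level = [""]
    for _ in range(depth):
        next_level = []
        for rel in level:
            with os.scandir(os.path.join(root, rel)) as entries:
                for entry in entries:
                    # Symlinked directories are skipped so the walk stays inside root.
                    if entry.is_dir(follow_symlinks=False):
                        next_level.append(os.path.join(rel, entry.name))
        level = next_level
    return sorted(level)


# Each returned directory would get its own rsync run in phase two,
# so no single run has to hold the file list for the whole tree.
```

With a depth of two and a layout such as Users/<group>/<user>, each user's home directory becomes one small sync job.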
The rsync maintainers are aware of this problem in the software, and we hope that the issue will be resolved soon so we can perform a standard sync on large trees.
Rsync can perform backups via the network, and we have designed our scripts to allow this behavior as well.
Because our script performs various housekeeping duties (rotating directories, locating old directories to link against, etc.), remote backups must adhere to a specific way of doing things in order to work properly. The following conditions must be met for remote backups:
More information is available in the usage section of this document.
The scripts depend on the following packages to work correctly:
RsyncSnapshot.pm.
The scripts should be run as root, or a user with root-like privileges
(as might be run via sudo). While root is not strictly necessary,
many of the privilege-preservation and hard-link options of rsync require
root in order to function properly. We have only tested the script with
root privileges; any other configuration may or may not work.
Finally, Mac OS X users should ensure that the destination directory for their backups has Ignore Ownership and Permissions turned OFF. You can check this by choosing "Get Info..." on the destination volume. If privileges are not preserved, then rsync will assume that all the files have changed ownership (since, as far as it knows, they have), and every file will be retransmitted, making the backup non-incremental.
The script bundle has been broken up into several pieces in order to facilitate distribution and functionality:
RsyncSnapshot.pm (Download): Perl must be able to find this module. You must either invoke perl with the include flag (-I) to tell it where the module is, or you must move the
module into Perl's standard search path. perl -V will print the
standard search path (the @INC array); you can add the module to
the appropriate directory.
rsync_snapshot_sender (Download)
rsync_snapshot_receiver (Download)
rsync_snapshot_local (Download)
The backup scripts have been written to support customized operation based on configuration files provided to the script at runtime. The script should be invoked with the path to one or more configuration files as its arguments (note: the "receiver" script can only take one configuration file; it is assumed that a separate instance of the script will be invoked for each backup). The script will process each configuration file in turn, and back up the files based on the settings in that configuration file.
The configuration files are actually valid Perl files that will be evaluated in the scope of the main backup program. In their simplest form, the config files can override global values in the script that define which files to back up and where to back them up to. Additionally, the files may contain arbitrary Perl instructions for more complicated backups.
At the very minimum, each config file must override three variables:
Additionally, you can override variables that define the timestamp format of the date folders, SMTP and LDAP settings, and default exclude patterns. Here is a sample config file that overrides the default timestamp format:
# Description of this config
$CONFIG_DESCRIPTION = "Backup of Web tree";
# Backup root dir
$BACKUP_ROOT = '/Volumes/Snapshots';
# Source directories to back up
@SOURCES = ('/Volumes/r1/Web');
# Use a shorter format for the daily backups
$DEST = strftime("%Y-%m-%d", localtime);
A full list of global variables that can be overridden can be found in
the documentation for the script. The script contains embedded
documentation, so running perldoc RsyncSnapshot.pm should
format the documentation for reading on your terminal. The section
titled GLOBAL VARIABLES contains the information on the config
file variables.
For more complicated setups, you can insert Perl code to perform other operations before the backup runs. For example, when we run a backup of the user directories, we first check to see which users are over quota, and exclude them from the backup procedure.
Our users.conf
(see the source repository) gives an example of calling extra code to
build an exclude list before performing the full backup. It includes
a custom exclude list, as well as a call to the quota routines to
calculate users over quota and notify them via e-mail.
Scripts can easily be run via cron or another automated utility.
If the debug level is set to a reasonable level (1 or 0), the output
from the script should only contain a summary of transfer statistics
and a line or two for each configuration. Setting the debug level
higher will give more incremental progress, but may result in
hundreds (or thousands) of lines of output.
For backups that do not leave the machine (e.g., go from one disk
to another), you should use the rsync_snapshot_local script. The
script takes one or more configuration files as its arguments, and it
will process each in turn.
The script should be run with enough privileges to access the original files and create the destination files.
For a remote backup, the rsync_snapshot_sender script must be run
locally on the machine that contains the original files. The script
takes one or more configuration files as its arguments, and it will
process each in turn.
The script should be run with enough privileges to access the original files.
Because the sender side does not perform housekeeping tasks, not all of the configuration options are strictly required (e.g., destination directory). For simplicity, however, we recommend specifying the complete configuration and using the same file on both the sending and receiving hosts.
Note that you must specify a remote host (SSH_HOST) and encryption
key (SSH_KEY) for remote backups to send correctly. Otherwise,
the script can be run just like the local version.
The next section describes encryption keys in more detail.
To receive files sent by the "sender" version of the script, you must
run rsync_snapshot_receiver locally on the machine that will store
the backups. The script takes exactly one configuration file as its
argument (if the machine receives more than one backup, it must be
called with a different config file in each case).
The script should be run with enough privileges to create and store the backup files. Note that certain options (e.g. privilege preservation) require root access to use.
Because the receiving side does not choose the files to back up, not all of the configuration options are strictly required (e.g., source directories). For simplicity, however, we recommend specifying the complete configuration and using the same file on both the sending and receiving hosts.
In some cases, you may wish to back up files to more than one location. While you could simply schedule two independent backups, this may be impractical for large backups. For example, our nightly user backups at Suffield involve scanning over 1TB of space for changed files. Scanning the whole disk again for a second backup is too time-consuming.
The scripts support what we call piggyback backups, where the script feeds a list of only the files that have changed to an external process. This way, a second (or third, or fourth) process can copy the changed files, without needing to scan the entire set of files.
Piggybacking works in one of two ways. The first is to specify a piggyback of type "piggyback". In this case, the script spawns a new rsync_snapshot script with another configuration file. You can use this to copy the changed files to another location using rsync_snapshot.
The second type is an external exec. The script will call the program you specify, appending a final argument that is the name of a temporary file containing all the changed paths.
All of these options are specified in the @LISTENERS array in a
config file. See the built-in documentation for more information.
The piggyback backup is only fed a list of changed files. Therefore, it is not a full snapshot, as the unchanged files are not considered (again, we do this to cut down on file scanning time). If you wish to have a full snapshot backup, you must either use a non-piggyback backup, or you'll have to manually link to previous backups on the receiving side.
We have a short script that will hard-link files from an existing
directory into another directory, allowing you to propagate unchanged
files from one directory to another. The script is called
propagate_hard_links, and you may
download it from the website.
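To give an idea of what such propagation involves, here is a minimal Python sketch. It is not the propagate_hard_links script itself; the function name and its behavior (link only files that are missing from the destination, skip symlinks) are assumptions made for illustration.

```python
import os


def link_missing_files(src_dir, dst_dir):
    """Hard-link each regular file under src_dir into dst_dir if it is absent there.

    Returns the number of links created. Both directories must be on the
    same filesystem, since hard links cannot cross filesystems.
    """
    linked = 0
    for dirpath, _dirnames, filenames in os.walk(src_dir):
        rel = os.path.relpath(dirpath, src_dir)
        target_dir = dst_dir if rel == "." else os.path.join(dst_dir, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            # Leave files that the piggyback backup already copied untouched.
            if os.path.islink(src) or os.path.lexists(dst):
                continue
            os.link(src, dst)
            linked += 1
    return linked
```

Run against the previous snapshot as src_dir and the new piggyback directory as dst_dir, this fills in the unchanged files without copying any data.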
As an example, if the following were included in a configuration for a backup, it would launch a piggyback backup with the given configuration file for any changed files:
@LISTENERS = ( ["piggyback", "/mumble/my_piggyback_config.conf"] );
You may list more than one listener (it's a two-dimensional array), and each will be run concurrently.
To launch an external program, pass the program and its arguments as a Perl-style exec() array:
@LISTENERS = ( ["/usr/local/bin/listener", "arg1", "arg2", "arg3"] );
The path to the changed-path file will be appended to the end of the list of arguments before the program is called.
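An external listener therefore only needs to treat its last argument as the list file. The Python sketch below shows one way to write such a program. The format of the list file is not documented here, so one path per line is an assumption, and the function names are hypothetical.

```python
def read_changed_paths(list_file):
    """Read the changed paths, assuming one path per line (an assumption)."""
    with open(list_file, encoding="utf-8", errors="surrogateescape") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


def handle_changes(argv):
    """Split argv into the listener's own arguments and the changed paths.

    The backup script appends the list file as the final argument, so
    everything between the program name and that argument is our own.
    """
    own_args = argv[1:-1]
    changed = read_changed_paths(argv[-1])
    return own_args, changed
```

A real listener would call handle_changes(sys.argv) and then copy, index, or otherwise process each returned path.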
Right now, we assume that any sender/receiver pair will communicate by tunnelling over an SSH connection. This is done both for privacy (SSH encrypts all traffic) and for security (SSH keypairs enable us to specify a different configuration for different backup sets).
This document assumes you are familiar with SSH public/private keypairs, their generation, and their use. For more information, see the ssh-keygen manual page.
For each backup configuration, you must generate a public/private keypair using ssh-keygen. When prompted for a passphrase, leave it empty (the script runs unattended and cannot answer a passphrase prompt).
Once the keypair has been generated, the private key should be moved to the sender machine, while the public key should be stored on the receiver machine.
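Key generation can be scripted so that no prompt appears. The Python sketch below only builds the ssh-keygen argument list, using the standard -t, -N, -f, and -C options. The key type, file path, and comment are placeholders, and this is not how generate_autologin_keys necessarily does it.

```python
def keygen_command(key_path, comment, key_type="rsa"):
    """Build an ssh-keygen command that creates a keypair with no passphrase."""
    return [
        "ssh-keygen",
        "-t", key_type,   # key type (placeholder choice)
        "-N", "",         # empty passphrase, so the script is never prompted
        "-f", key_path,   # private key path; the public key gets a .pub suffix
        "-C", comment,    # free-form label stored with the public key
    ]


print(keygen_command("/var/root/.ssh/web_backup_key", "web backup"))
```

The resulting list can be passed to subprocess.run() on the machine where the keys are generated.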
On the receiver, the public key must be added to the file
~/.ssh/authorized_keys. Additionally, the key should be prefaced
with options that restrict its use to a specific machine and a
specific command. You should preface the key with something similar
to the following (the options must appear on a single line, separated
by commas with no spaces between them):
command="sudo /usr/local/bin/rsync_snapshot_receiver foo_host.conf",from="192.168.0.1",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty <SSH PUBLIC KEY HERE>
This configuration only accepts the key from the listed IP address (for security), and also ensures that only the listed command gets run (with the proper configuration). In this way, you force the receiver to use a particular configuration for each keypair. We use sudo to execute the script as root.
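Because a stray space between options makes the entry invalid, it can help to assemble the line programmatically. The following Python sketch is illustrative only; the function name is hypothetical and the public key shown is a placeholder, not a real key.

```python
def authorized_keys_entry(command, from_host, public_key):
    """Build one authorized_keys line that restricts a key to a host and command."""
    options = [
        f'command="{command}"',   # forced command; assumes no double quotes inside
        f'from="{from_host}"',    # only accept connections from this address
        "no-port-forwarding",
        "no-X11-forwarding",
        "no-agent-forwarding",
        "no-pty",
    ]
    # Options are joined with commas and no spaces; one space precedes the key.
    return ",".join(options) + " " + public_key.strip()


print(authorized_keys_entry(
    "sudo /usr/local/bin/rsync_snapshot_receiver foo_host.conf",
    "192.168.0.1",
    "ssh-rsa AAAA...placeholder... backup@sender",
))
```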
We have created a simple script,
generate_autologin_keys,
which automates much of this process through an interactive shell
script. It is configured for use on our network, though it should be
simple to modify for your own site.
Recovering files from the backup is extremely simple. Because the files are copied directly to another hard drive, they can be recovered simply by copying them from the backup directories.
The script creates a timestamped directory each time it runs. Through the use of hard links, each backup directory appears to contain a complete copy of all the files as they existed at that time, yet only the files that changed since the previous backup consume additional space.
To restore files, simply locate the files under the desired timestamp
directory. To recover files from an earlier date, simply use the
folder with the desired date. The symlink current-backup should
always reference the most recent backup directory, though sometimes
this may refer to a backup in progress (the previous-backup
symlink always refers to a backup that is guaranteed to have
completed, though it may not be the most up-to-date).
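When looking for a file to restore, it is often useful to see which snapshots contain it. The Python sketch below is illustrative only. It assumes that the timestamped directories and the previous-backup symlink all live directly under the backup root, and the function names are hypothetical.

```python
import os


def snapshots_containing(backup_root, relative_path):
    """List the snapshot directories under backup_root that contain relative_path."""
    found = []
    for name in sorted(os.listdir(backup_root)):
        snapshot = os.path.join(backup_root, name)
        # Skip the current-backup and previous-backup symlinks and any stray files.
        if os.path.islink(snapshot) or not os.path.isdir(snapshot):
            continue
        if os.path.lexists(os.path.join(snapshot, relative_path)):
            found.append(name)
    return found


def last_completed_backup(backup_root):
    """Resolve the previous-backup symlink to the directory it points to."""
    return os.path.realpath(os.path.join(backup_root, "previous-backup"))
```

Once the right snapshot is identified, restoring is an ordinary copy from that directory.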
Because only changed files are stored in the backup, you should be able to store a large number of backups before needing to remove old ones.
To delete a backup, simply remove the directory you no longer need. Because of the way hard links work, only files that changed during the named backup will actually be permanently removed. Consider the following example:
Suppose we have five backup directories, numbered 0 through 4. If a
file named foo exists in all five directories, and did not change
between any of the backups, then only one copy actually takes up
space. The flip side is that as long as any of those directories
contains a reference to foo, it will continue to take up
space. The hard links do not require a continuous sequence of
directories to work; each backup directory can be deleted without
disturbing the links from those before or after it.
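The behavior in this example can be reproduced in a scratch directory. The Python sketch below is for illustration and uses a temporary directory rather than real backups. It creates five directories that share one file through hard links, deletes a directory from the middle, and checks that the file is still intact elsewhere.

```python
import os
import tempfile


def demo_hard_link_deletion():
    """Return (links before, links after, surviving content) for a shared file."""
    with tempfile.TemporaryDirectory() as root:
        dirs = [os.path.join(root, str(i)) for i in range(5)]
        for directory in dirs:
            os.mkdir(directory)
        first = os.path.join(dirs[0], "foo")
        with open(first, "w") as handle:
            handle.write("unchanged data\n")
        # Every later "backup" links to the same data instead of copying it.
        for directory in dirs[1:]:
            os.link(first, os.path.join(directory, "foo"))
        links_before = os.stat(first).st_nlink
        # Delete a backup from the middle of the sequence.
        os.remove(os.path.join(dirs[2], "foo"))
        os.rmdir(dirs[2])
        survivor = os.path.join(dirs[4], "foo")
        links_after = os.stat(survivor).st_nlink
        with open(survivor) as handle:
            content = handle.read()
        return links_before, links_after, content


print(demo_hard_link_deletion())
```

The link count drops by one, but the data stays on disk until the last directory that references it is removed.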
So, deleting backups will save some space, but the space saved will only equal the size of the files that were changed or deleted after the backup (because the references persist into the current backups).
You might consider "thinning" the backup directories over time. For example, if you keep nightly backups, you might thin them out to weekly backups, and then monthly backups. Doing so reclaims the space used by any files that changed or were deleted during that time, while still preserving major point-in-time recovery options. Because the hard links do not require continuity, this "thinning" will not affect the integrity of the remaining backups.
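One possible thinning policy is sketched below in Python, for illustration only. It assumes backup directories are named with the %Y-%m-%d format, and the retention windows (14 days of nightly backups, then weekly up to 90 days, then monthly) are arbitrary example values, not a recommendation from our setup.

```python
from datetime import date, datetime


def backups_to_keep(names, today, daily_days=14, weekly_days=90):
    """Choose which dated backup directories survive thinning.

    Recent backups are all kept; older ones are reduced to the earliest
    backup in each ISO week, and the oldest to the earliest in each month.
    """
    dated = sorted((datetime.strptime(name, "%Y-%m-%d").date(), name) for name in names)
    keep, seen = [], set()
    for day, name in dated:
        age = (today - day).days
        if age <= daily_days:
            bucket = ("day", day)
        elif age <= weekly_days:
            iso = day.isocalendar()
            bucket = ("week", iso[0], iso[1])
        else:
            bucket = ("month", day.year, day.month)
        if bucket not in seen:
            seen.add(bucket)
            keep.append(name)
    return keep


names = ["2007-11-01", "2007-11-15", "2008-02-11", "2008-02-12", "2008-03-17"]
print(backups_to_keep(names, today=date(2008, 3, 18)))
```

Any directory not in the returned list is a candidate for deletion; the remaining snapshots stay complete because each holds its own hard links.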