Incremental Backup Script

Last updated 2008/03/18

Introduction

Suffield Academy provides networked disk space for all of its users, and encourages its use as part of a regular backup strategy. We back up the entire disk array nightly to another hard disk as part of our disaster recovery plan.

Unfortunately, we do not have enough space to archive each of these full nightly backups. So, while we are protected against the server crashing, we do not have the ability to recover files from a particular point in time.

To remedy this problem, we designed a custom backup script to satisfy the following criteria:

Snapshot backups are supported by most backup packages. However, the need to integrate user directory exclusion and reporting led us to write our own backup script.

Our script uses the rsync utility to perform the actual backups, though other utilities (such as cpio) could be substituted with relative ease. The script is written in Perl, and uses a few external modules for user reporting (LDAP and SMTP).

Note that this script is intended for backups to hard disk (not tape). Because it stores full snapshots of the backed-up files, yet stores files that have not changed only once, the backups must all reside on a single disk (a single filesystem) so that hard links (inodes) may be shared between snapshots.

Design Issues

(This section deals with the design considerations for the script. If you just want to start using the script, skip down to the usage section.)

While searching for methods to create snapshot backups, we found an excellent strategy for backing up only incremental changes. It involves using hard links on the filesystem to store redundant (i.e., unchanged) files, and rsync to transfer only files that change between backup sessions. Please read the paper for more information; a discussion of the technique is beyond the scope of this document.

We investigated existing solutions that use this strategy (including a program called rsnapshot, which looked very promising). Based on the features and existing code base, we decided to adopt this strategy for our backups.

We would have liked to use an existing solution for backing up, but encountered a few significant issues that could not be resolved without writing custom code:

Because of these issues, we needed to write our own code to perform the backups. The project started as glue code to tie together existing solutions, but it quickly became clear that a fully custom solution would be needed to address all of the issues.

In the sections below, we discuss each of the issues in more detail.

HFS Metadata

Most Macintosh computers running OS X use the HFS+ filesystem for storing data. HFS+ has two features that frustrate file transfers, especially to non-Mac OS X systems: file metadata (especially type and creator codes), which has no counterpart on other filesystems, and forked files, which split a file into several distinct parts. Because other filesystems do not use this approach, storing HFS+ data correctly becomes much more difficult.

Additionally, many of the common Unix-based utilities that come with Mac OS X 10.3 and earlier are not aware of these extra HFS+ attributes. This includes rsync, cp, tar, and cpio, which are all common backup utilities on Unix machines.

With the arrival of Mac OS X 10.4 (Tiger), many of these issues have been mitigated. 10.4 comes with new versions of the common Unix utilities that are compatible with HFS+ files and metadata. Therefore, if you are running that version of the operating system, these concerns become less important.

If you cannot upgrade to OS X 10.4, some of the functionality can be emulated using special versions of the Unix utilities:

RsyncX (http://archive.macosxlabs.org/rsyncx/rsyncx.html)

A patched version of rsync that preserves HFS+ forks and type/creator codes. Does not preserve other HFS+ metadata, however. Only runs on Mac OS X, and requires that all machines (sending and receiving) use the patched binary.

rsync+hfsmode (http://www.quesera.com/reynhout/misc/rsync+hfsmode/)

A patched version of rsync that supports HFS+ forks. Does not handle type/creator codes. Will work with an unpatched receiver; forked files are automatically split into a format that other filesystems can store. If you don't require this cross-platform capability, RsyncX is probably a better choice.

hfs-tar (http://www.metaobject.com/downloads/macos-x/)

A patched version of tar that correctly stores HFS+ forks and metadata.

rdiff-backup (http://www.nongnu.org/rdiff-backup/)

A cross-platform script for performing snapshot backups. As of this writing, HFS+ support is present, but still in the experimental stage.

We ended up choosing RsyncX as our core backup utility. All machines involved in the backups ran Mac OS X, so it was not a problem to install the patched binary on all the servers. It preserved the most data of any rsync-based project, and it provided the cleanest upgrade to a Mac OS X 10.4-based solution (where the binaries are compatible with HFS+).

As we migrate to Mac OS X 10.4, we simply change a configuration parameter to use the new native rsync package, instead of RsyncX.

New Tiger Builds

Manual patch to Apple's sources:

http://www.lartmaker.nl/rsync/

Vanilla patches that handle HFS (call them EA):

http://www.onthenet.com.au/~q/rsync/

MacOSXHints article with instructions on patching your own vanilla sources with the rsync+hfsmode patches above:

http://www.afp548.com/article.php?story=20050219192044818

LDAP Integration

To save space in our backups, we needed the ability to exclude files from the backup based upon a user's quota status. We do not enforce a hard quota limit on our fileserver (to allow users to store large files temporarily), but we didn't want to waste space with large files that didn't need to be backed up.

When backing up user home directories, the script communicates with the LDAP server to find users that are over quota. If a user is over their quota, files are excluded from the backup (until the non-excluded files fit within their quota). When a user's files are excluded, their e-mail address is queried from the LDAP database and the user is notified that certain files were not backed up.
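
The exclusion step can be pictured with a short Python sketch. It is only an illustration, not the script's Perl code: the function name is hypothetical, and the largest-first ordering is our assumption, since the text above says only that files are excluded until the remainder fits within the quota.

```python
def choose_exclusions(file_sizes, quota_bytes):
    """Return the paths to exclude so the remaining files fit in quota_bytes.

    Assumption: the largest files are dropped first. The real script may
    order its exclusions differently.
    """
    remaining = sum(file_sizes.values())
    excluded = []
    # Walk from largest to smallest (ties broken by name for stable output).
    for path, size in sorted(file_sizes.items(), key=lambda item: (-item[1], item[0])):
        if remaining <= quota_bytes:
            break
        excluded.append(path)
        remaining -= size
    return excluded


# Hypothetical home directory: sizes and quota in arbitrary units.
sizes = {"movie.mov": 700, "thesis.doc": 50, "photos.zip": 300}
print(choose_exclusions(sizes, quota_bytes=400))
```

With these numbers only movie.mov is excluded, because the remaining 350 units fit within the quota of 400.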

Rsync Limitations

As we began testing the backup scripts, we encountered a problem when syncing large file trees. Rsync builds the list of all files to synchronize in memory before it begins the transfer. Due to the large number of files we were backing up, we exhausted available memory on the server before the backups completed. (Rsync uses about 100 bytes per file, and we were routinely backing up around ten million files, requiring roughly 1 GB of RAM for the file list alone.)
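
The memory figure is simple arithmetic, shown here as a small Python calculation. The two constants are the approximate figures quoted above, not measurements.

```python
BYTES_PER_FILE = 100       # approximate per-file cost quoted above
FILE_COUNT = 10_000_000    # "around ten million files"


def file_list_bytes(file_count, bytes_per_file=BYTES_PER_FILE):
    """Estimate the memory rsync needs to hold its in-memory file list."""
    return file_count * bytes_per_file


total = file_list_bytes(FILE_COUNT)
print(f"{total / 1e9:.2f} GB ({total / 2**30:.2f} GiB)")
```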

To resolve this issue, our script performs a two-phase sync using rsync. The first phase synchronizes the directory tree to a specified depth (usually two levels). Then, the script iterates over all the leaves at that level of the directory tree and synchronizes each one separately. In the case of user home directories, this reduced the size of each file list by two to three orders of magnitude.
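
The partitioning step of the two-phase sync might look like the following Python sketch. The function name and the toy directory layout are hypothetical, and the sketch only enumerates the leaves. In the real script each leaf is handed to its own rsync run.

```python
import os
import tempfile


def leaves_at_depth(root, depth):
    """Return the directories exactly `depth` levels below root, sorted."""
    level = [root]
    for _ in range(depth):
        next_level = []
        for directory in level:
            with os.scandir(directory) as entries:
                for entry in entries:
                    # Symlinked directories are not descended into.
                    if entry.is_dir(follow_symlinks=False):
                        next_level.append(entry.path)
        level = next_level
    return sorted(level)


# Toy tree standing in for a home-directory volume: root/<group>/<user>/...
root = tempfile.mkdtemp()
for user in ("2008/alice", "2008/bob", "2009/carol"):
    os.makedirs(os.path.join(root, *user.split("/"), "Documents"))

# Phase 1 (not shown) would sync only the top two levels of the tree.
# Phase 2: each leaf gets a separate sync, so only that subtree's file
# list has to be held in memory at any one time.
for leaf in leaves_at_depth(root, 2):
    print(os.path.relpath(leaf, root))
```

Splitting at the user level means the largest file list is bounded by the largest single home directory rather than by the whole volume.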

The rsync maintainers are aware of this problem in the software, and we hope that the issue will be resolved soon so we can perform a standard sync on large trees.

Remote Backups

Rsync can perform backups via the network, and we have designed our scripts to allow this behavior as well.

Because our script performs various housekeeping duties (rotating directories, locating old directories to link against, etc.), remote backups must adhere to a specific way of doing things in order to work properly. The following conditions must be met for remote backups:

  1. The rsync connection must be tunnelled over an SSH connection.

  2. The SSH connection must not prompt for a password. Passwordless SSH keys are a simple way to do this.

  3. The receiving side must intercept the call to rsync and invoke our script instead (this ensures that directories get rotated, and other housekeeping work takes place before the files are received). Again, SSH keys allow this functionality to be specified.

More information is available in the usage section of this document.

Script Usage

The scripts depend on the following packages to work correctly:

rsync

The rsync binary must be installed on the system somewhere. The name does not have to be "rsync"; you may simply change the name used in the configuration file.

Unix::Syslog.pm

This is a reasonably standard perl module which allows the scripts to record debugging information directly to syslog on your system.

LDAP.pm (optional)

We include functionality to query an LDAP server for usernames, e-mail addresses, and quotas (to notify users when their files have been excluded from backups). If you do not wish to install the LDAP modules, simply comment out the "use LDAP;" line in RsyncSnapshot.pm.

The scripts should be run as root, or a user with root-like privileges (as might be run via sudo). While root is not strictly necessary, many of the privilege-preservation and hard-link options of rsync require root in order to function properly. We have only tested the script with root privileges; any other configuration may or may not work.

Finally, Mac OS X users should ensure that the destination directory for their backups has Ignore Ownership and Permissions turned OFF. You can check this by choosing "Get Info..." on the destination volume. If privileges are not preserved, then rsync will assume that all the files have changed ownership (since, as far as it knows, they have), and every file will be retransmitted, making the backup non-incremental.

Organization

The script bundle has been broken up into several pieces in order to facilitate distribution and functionality:

RsyncSnapshot.pm (Download)

This is the perl module that contains most of the functionality of the scripts. You should not need to edit the library file, unless you're correcting bugs or disabling functionality (such as commenting out LDAP.pm, which may not be available or desirable on all systems).

For proper operation, the module must be installed on your Perl library search path. You may either pass a command line switch to Perl (-I) to tell it where the module is, or move the module into Perl's standard search path. Running perl -V will print the standard search path (the @INC array); you can add the module to the appropriate directory.

rsync_snapshot_sender (Download)

This script uses the library to back up files on the local machine and send them to a "receiver" running on a remote machine. It does not attempt to rotate directories or perform other housekeeping; it simply finds the files that need backing up and sends them.

rsync_snapshot_receiver (Download)

This script uses the library code to receive files onto the local machine that have been sent from a remote machine. Before receiving any files, the script attempts to perform housekeeping tasks (rotating directories, pointing symlinks, etc). It then launches a special version of rsync that waits to receive files from the remote copy.

rsync_snapshot_local (Download)

This script is a combination of the scripts above, and is used when the sender and receiver are the same host. No network communication is attempted; files are copied directly using a local version of rsync.

Configuration Files

The backup scripts have been written to support customized operation based on configuration files provided to the script at runtime. The script should be invoked with the path to one or more configuration files as its arguments (note: the "receiver" script can only take one configuration file; it is assumed that a separate instance of the script will be invoked for each backup). The script will process each configuration file in turn, and back up the files based on the settings in the configuration file.

The configuration files are actually valid Perl files that will be evaluated in the scope of the main backup program. In their simplest form, the config files can override global values in the script that define which files to back up and where to back them up to. Additionally, the files may contain arbitrary Perl instructions for more complicated backups.

At the very minimum, each config file must override three variables:

  1. The description of this backup configuration.

  2. An array of source directories where files will be copied from.

  3. The backup root directory, where saved files will be stored (all files are automatically stored in a subfolder by date). For backups that send data to a remote machine, the remote hostname and SSH key information are also required.

Additionally, you can override variables that define the timestamp format of the date folders, SMTP and LDAP settings, and default exclude patterns. Here is a sample config file that overrides the default timestamp format:

# Description of this config
$CONFIG_DESCRIPTION = "Backup of Web tree";

# Backup root dir
$BACKUP_ROOT = '/Volumes/Snapshots';

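# Source directories to back up
@SOURCES = ('/Volumes/r1/Web');

# Use a shorter format for the daily backups
# (strftime is provided by the POSIX module; this import is harmless
# if the main script has already loaded it)
use POSIX qw(strftime);
$DEST = strftime("%Y-%m-%d", localtime);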

A full list of global variables that can be overridden can be found in the documentation for the script. The script contains embedded documentation, so running perldoc RsyncSnapshot.pm should format the documentation for reading on your terminal. The section titled GLOBAL VARIABLES contains the information on the config file variables.

For more complicated setups, you can insert Perl code to perform other operations before the backup runs. For example, when we run a backup of the user directories, we first check to see which users are over quota, and exclude them from the backup procedure.

Our users.conf (see the source repository) gives an example of calling extra code to build an exclude list before performing the full backup. It includes a custom exclude list, as well as a call to the quota routines to calculate users over quota and notify them via e-mail.

Scripts can easily be run via cron or other automated utility. If the debug level is set to a reasonable level (1 or 0), the output from the script should only contain a summary of transfer statistics and a line or two for each configuration. Setting the debug level higher will give more incremental progress, but may result in hundreds (or thousands) of lines of output.

Local Backups

For backups that do not leave the machine (e.g., go from one disk to another), you should use the rsync_snapshot_local script. The script takes one or more configuration files as its arguments, and it will process each in turn.

The script should be run with enough privileges to access the original files and create the destination files.

Remote Backups: Sending Side

For a remote backup, the rsync_snapshot_sender script must be run locally on the machine that contains the original files. The script takes one or more configuration files as its arguments, and it will process each in turn.

The script should be run with enough privileges to access the original files.

Because the sender side does not perform housekeeping tasks, not all of the configuration options are strictly required (e.g., destination directory). For simplicity, however, we recommend specifying the complete configuration and using the same file on both the sending and receiving hosts.

Note that you must specify a remote host (SSH_HOST) and encryption key (SSH_KEY) for remote backups to send correctly. Otherwise, the script can be run just like the local version.

The next section describes encryption keys in more detail.

Remote Backups: Receiving Side

To receive files sent by the "sender" version of the script, you must run rsync_snapshot_receiver locally on the machine that will store the backups. The script takes exactly one configuration file as its argument (if the machine receives more than one backup, it must be called with a different config file in each case).

The script should be run with enough privileges to create and store the backup files. Note that certain options (e.g. privilege preservation) require root access to use.

Because the receiving side does not choose the files to back up, not all of the configuration options are strictly required (e.g., source directories). For simplicity, however, we recommend specifying the complete configuration and using the same file on both the sending and receiving hosts.

Piggyback Backups

In some cases, you may wish to back up files to more than one location. While you could simply schedule two independent backups, this may be impractical for large backups. For example, our nightly user backups at Suffield involve scanning over 1TB of space for changed files. Scanning the whole disk again for a second backup is too time-consuming.

The scripts support what we call piggyback backups, where the script feeds a list of only the files that have changed to an external process. This way, a second (or third, or fourth) process can copy the changed files, without needing to scan the entire set of files.

Piggybacking works in one of two ways. The first is to specify a piggyback of type "piggyback". In this case, the script spawns a new rsync_snapshot script with another configuration file. You can use this to copy the changed files to another location using rsync_snapshot.

The second type is an external exec. The script will call the program you specify, appending a final argument that is the name of a temporary file containing all the changed paths.

All of these options are specified in the @LISTENERS array in a config file. See the built-in documentation for more information.

Notes

The piggyback backup is only fed a list of changed files. Therefore, it is not a full snapshot, as the unchanged files are not considered (again, we do this to cut down on file scanning time). If you wish to have a full snapshot backup, you must either use a non-piggyback backup, or you'll have to manually link to previous backups on the receiving side.

We have a short script that will hard-link files from an existing directory into another directory, allowing you to propagate unchanged files from one directory to another. The script is called propagate_hard_links, and you may download it from the website.
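
To make the idea concrete, here is a minimal Python sketch of propagating hard links. It is not the propagate_hard_links script itself; the function name, the directory names, and the rule "link only where the target has nothing at that path" are assumptions made for illustration.

```python
import os
import tempfile


def propagate_links(source_dir, target_dir):
    """Hard-link each file under source_dir into target_dir unless the
    target already has an entry at that path. Returns the number of links
    created. Symlinks and special files are not treated specially here."""
    made = 0
    for current, _dirs, files in os.walk(source_dir):
        relative = os.path.relpath(current, source_dir)
        destination = os.path.normpath(os.path.join(target_dir, relative))
        os.makedirs(destination, exist_ok=True)
        for name in files:
            target = os.path.join(destination, name)
            if not os.path.lexists(target):
                os.link(os.path.join(current, name), target)
                made += 1
    return made


def write(path, text):
    with open(path, "w") as handle:
        handle.write(text)


# "previous" is a complete snapshot; "new" received only the changed file.
work = tempfile.mkdtemp()
previous = os.path.join(work, "previous")
new = os.path.join(work, "new")
os.makedirs(os.path.join(previous, "docs"))
os.makedirs(new)
write(os.path.join(previous, "docs", "a.txt"), "old a")
write(os.path.join(previous, "b.txt"), "old b")
write(os.path.join(new, "b.txt"), "new b")

made = propagate_links(previous, new)
print(made, "file(s) linked from the previous snapshot")
```

After the call, the new directory holds the changed b.txt plus a hard link to the unchanged docs/a.txt, so it reads as a full snapshot without storing a.txt twice. Both directories must be on the same filesystem for the links to succeed.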

Examples

As an example, if the following were included in a configuration for a backup, it would launch a piggyback backup with the given configuration file for any changed files:

@LISTENERS = (
  ["piggyback", "/mumble/my_piggyback_config.conf"]
  );

You may list more than one listener (it's a two-dimensional array), and each will be run concurrently.

To launch an external program, just pass the program in a perl-style exec() array:

@LISTENERS = (
  ["/usr/local/bin/listener", "arg1", "arg2", "arg3"]
  );

The path to the changed-path file will be appended to the end of the list of arguments before the program is called.
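
An external listener could be as small as the following Python sketch. The file format is an assumption (one path per line; see the built-in documentation for the actual format), and the function names and sample paths are hypothetical.

```python
import os
import tempfile


def read_changed_paths(list_file):
    """Return the non-empty lines of the changed-path file."""
    with open(list_file, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


def handle_listener_call(argv):
    """Split an argument vector into the listener's own arguments and the
    changed paths named by the final argument."""
    *own_args, list_file = argv[1:]
    return own_args, read_changed_paths(list_file)


# Simulate the call the backup script would make for the example above.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Users/alice/notes.txt\nUsers/bob/report.doc\n")

own_args, paths = handle_listener_call(
    ["/usr/local/bin/listener", "arg1", "arg2", "arg3", tmp.name]
)
print(own_args, paths)
os.unlink(tmp.name)
```

A real listener would pass sys.argv to handle_listener_call and then copy or otherwise process each path.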

Encryption Keys

Right now, we assume that any sender/receiver pair will communicate by tunnelling over an SSH connection. This is done both for privacy (SSH encrypts all traffic) and for security (SSH keypairs enable us to specify a different configuration for different backup sets).

This document assumes you are familiar with SSH public/private keypairs, their generation, and their use. For more information, see the ssh-keygen manual page.

For each backup configuration, you must generate a public/private keypair using ssh-keygen. When prompted for a password, you should use none (the script cannot use a password, so it must not be prompted for one when it is run).

Once the keypair has been generated, the private key should be moved to the sender machine, while the public key should be stored on the receiver machine.

On the receiver, the public key must be added to the file ~/.ssh/authorized_keys. Additionally, the key should be prefaced with information that restricts its use to a specific machine and a specific command. You should preface the key with something similar to the following (lines wrapped for clarity; all options should appear on a single line):

command="sudo /usr/local/bin/rsync_snapshot_receiver foo_host.conf",
from="192.168.0.1",no-port-forwarding,no-X11-forwarding,
no-agent-forwarding,no-pty <SSH PUBLIC KEY HERE>

This configuration restricts the key so that it is accepted only from the listed IP address (for security), and also ensures that only the listed command gets run (with the proper configuration). In this way, you force the receiver to use a particular configuration for each keypair. We use sudo to execute the script as root.
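
Because the wrapped example above must end up on a single line, it can help to assemble the entry programmatically. The following Python sketch does that; the function name is hypothetical and the public key shown is a placeholder, not a real key.

```python
def authorized_keys_entry(command, from_host, public_key):
    """Build a one-line authorized_keys entry with the restrictions above.

    Options are joined with commas and no spaces, then a single space
    separates them from the key. A command containing double quotes would
    need extra escaping, which this sketch does not attempt.
    """
    options = [
        f'command="{command}"',
        f'from="{from_host}"',
        "no-port-forwarding",
        "no-X11-forwarding",
        "no-agent-forwarding",
        "no-pty",
    ]
    return ",".join(options) + " " + public_key.strip()


entry = authorized_keys_entry(
    "sudo /usr/local/bin/rsync_snapshot_receiver foo_host.conf",
    "192.168.0.1",
    "ssh-rsa AAAA...placeholder sender@example",  # placeholder key
)
print(entry)
```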

We have created a simple script, generate_autologin_keys, which automates much of this process through an interactive shell script. It is configured for use on our network, though it should be simple to modify for your own site.

Recovery

Recovering files from the backup is extremely simple. Because the files are copied directly to another hard drive, they can be recovered simply by copying them from the backup directories.

The script creates a timestamped directory each time it runs. Through the use of hard links, each backup directory appears to contain a complete copy of all the files at the given time, yet only the files that changed since the previous backup actually take up additional space.
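
The effect can be reproduced on a small scale with the Python sketch below. It is a simplified stand-in for what the script does with rsync: the function names are hypothetical, and "unchanged" is approximated here by comparing size and modification time.

```python
import os
import shutil
import tempfile


def unchanged(source, old):
    """Treat a file as unchanged if size and whole-second mtime match."""
    a, b = os.stat(source), os.stat(old)
    return a.st_size == b.st_size and int(a.st_mtime) == int(b.st_mtime)


def make_snapshot(source_dir, previous_snapshot, new_snapshot):
    """Copy source_dir into new_snapshot, hard-linking unchanged files
    to their copies in previous_snapshot (which may be None)."""
    for current, _dirs, files in os.walk(source_dir):
        relative = os.path.relpath(current, source_dir)
        os.makedirs(os.path.normpath(os.path.join(new_snapshot, relative)), exist_ok=True)
        for name in files:
            source = os.path.join(current, name)
            target = os.path.normpath(os.path.join(new_snapshot, relative, name))
            old = None
            if previous_snapshot:
                old = os.path.normpath(os.path.join(previous_snapshot, relative, name))
            if old and os.path.isfile(old) and unchanged(source, old):
                os.link(old, target)          # unchanged: share the inode
            else:
                shutil.copy2(source, target)  # new or changed: fresh copy


def write(path, text):
    with open(path, "w") as handle:
        handle.write(text)


work = tempfile.mkdtemp()
source = os.path.join(work, "source")
os.mkdir(source)
write(os.path.join(source, "a.txt"), "stays the same")
write(os.path.join(source, "b.txt"), "version 1")

first = os.path.join(work, "2008-03-17")
second = os.path.join(work, "2008-03-18")
make_snapshot(source, None, first)
write(os.path.join(source, "b.txt"), "version 2, now longer")
make_snapshot(source, first, second)

# a.txt is one inode shared by both snapshots; b.txt is stored twice.
print(os.path.samefile(os.path.join(first, "a.txt"), os.path.join(second, "a.txt")))
print(os.path.samefile(os.path.join(first, "b.txt"), os.path.join(second, "b.txt")))
```

Both snapshot directories list a.txt and b.txt, so either can be browsed as a complete copy, but only b.txt consumes space twice.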

To restore files, simply locate the files under the desired timestamp directory. To recover files from an earlier date, simply use the folder with the desired date. The symlink current-backup should always reference the most recent backup directory, though sometimes this may refer to a backup in progress (the previous-backup symlink always refers to a backup that is guaranteed to have completed, though it may not be the most up-to-date).
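
If you need the state of a file as of a particular day, the right directory is the newest snapshot dated on or before that day. The Python sketch below shows one way to find it. It assumes the directories are named with the %Y-%m-%d format from the sample configuration (your timestamp format may differ), and the function name is hypothetical.

```python
import os
import tempfile
from datetime import date, datetime


def snapshot_for(backup_root, wanted, fmt="%Y-%m-%d"):
    """Return the newest dated snapshot directory at or before `wanted`,
    or None if there is no such snapshot."""
    candidates = []
    for name in os.listdir(backup_root):
        try:
            stamp = datetime.strptime(name, fmt).date()
        except ValueError:
            continue  # not a dated snapshot, e.g. the current-backup symlink
        if stamp <= wanted:
            candidates.append((stamp, name))
    if not candidates:
        return None
    return os.path.join(backup_root, max(candidates)[1])


# Toy backup root with three snapshots and the current-backup symlink.
root = tempfile.mkdtemp()
for name in ("2008-03-15", "2008-03-16", "2008-03-18"):
    os.mkdir(os.path.join(root, name))
os.symlink("2008-03-18", os.path.join(root, "current-backup"))

print(snapshot_for(root, date(2008, 3, 17)))
```

No backup ran on the 17th in this toy layout, so the lookup falls back to the snapshot from the 16th.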

Deleting Backups

Because only changed files are stored in the backup, you should be able to store a large number of backups before needing to remove old ones.

To delete a backup, simply remove the directory you no longer need. Because of the way hard links work, only files that changed during the named backup will actually be permanently removed. Consider the following example:

Suppose we have five backup directories, numbered 1 through 5. If a file named foo exists in all five directories, and did not change between any of the backups, then only one copy actually takes up space. The flip side is that as long as any of those directories contains a reference to foo, it will continue to take up space. The hard links do not require a continuous sequence of directories to work; each backup directory can be deleted without disturbing the links from those before or after it.
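
This behavior is easy to verify. The Python sketch below recreates the five-directory example in a temporary location (the directory names are made up) and watches the link count of foo as backups are removed.

```python
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
backups = [os.path.join(root, f"backup{n}") for n in range(1, 6)]
for directory in backups:
    os.mkdir(directory)

# Store foo once, then hard-link it into the other four backups.
original = os.path.join(backups[0], "foo")
with open(original, "w") as handle:
    handle.write("unchanged contents")
for directory in backups[1:]:
    os.link(original, os.path.join(directory, "foo"))


def link_count(directory):
    """Number of directory entries that point at this backup's foo."""
    return os.stat(os.path.join(directory, "foo")).st_nlink


newest = backups[4]
print(link_count(newest))            # five names, one copy on disk

shutil.rmtree(backups[2])            # delete a backup from the middle
print(link_count(newest))            # remaining backups are undisturbed

for directory in backups[:2] + backups[3:4]:
    shutil.rmtree(directory)         # delete everything except the newest
print(link_count(newest))

with open(os.path.join(newest, "foo")) as handle:
    print(handle.read())             # data survives until the last link goes
```

The disk space used by foo is released only when the last directory that references it is removed.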

So, deleting a backup will save some space, but only the space used by file versions that no other backup directory references. Files that are unchanged in the surrounding backups stay on disk, because those backups still hold links to them.

You might consider "thinning" the backup directories over time. For example, if you keep nightly backups, you might thin them out to weekly backups, and then monthly backups. Doing so will save space of any files that changed or were deleted during that time, while still preserving major point-in-time recovery options. Because the hard links do not require continuity, this "thinning" will not affect the integrity of the remaining backups.
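
One possible thinning policy is sketched below in Python. The thresholds (keep everything for 14 days, one snapshot per week up to 90 days, then one per month) and the function name are purely illustrative assumptions; choose values that match your own retention needs.

```python
from datetime import date, timedelta


def snapshots_to_keep(snapshot_dates, today, daily_days=14, weekly_days=90):
    """Return the snapshot dates to keep, oldest first.

    Keeps every snapshot younger than daily_days, the earliest snapshot of
    each ISO week younger than weekly_days, and the earliest snapshot of
    each calendar month beyond that.
    """
    keep = []
    seen_weeks, seen_months = set(), set()
    for day in sorted(snapshot_dates):
        age = (today - day).days
        if age < daily_days:
            keep.append(day)
        elif age < weekly_days:
            week = tuple(day.isocalendar())[:2]   # (ISO year, ISO week)
            if week not in seen_weeks:
                seen_weeks.add(week)
                keep.append(day)
        else:
            month = (day.year, day.month)
            if month not in seen_months:
                seen_months.add(month)
                keep.append(day)
    return keep


today = date(2008, 3, 18)
nightly = [today - timedelta(days=n) for n in range(120)]
kept = snapshots_to_keep(nightly, today)
print(f"keeping {len(kept)} of {len(nightly)} snapshots")
```

Any snapshot directory whose date is not in the returned list could then be removed as described above, without affecting the ones that remain.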