Failover and High Availability

Last updated 2008/03/18

Introduction

Network infrastructure occasionally requires maintenance, no matter how well-designed it is. However, many organizations have come to depend on network services to such a degree that downtime adversely affects the entire organization. Thus, we must find a way to increase the availability of service, even when parts of the infrastructure are not operating.

In the network world, increased availability usually requires duplicate hardware and additional expense. We don't have a lot of money to throw at the problem, so the approaches outlined in this document use a minimum of hardware and expense to accomplish their task. You must decide what level of downtime your organization can withstand, and balance this against the added cost and complexity of a suitable failover system.

Terms and Definitions

Before we begin, some quick discussion of terminology:

Mac OS X IPFailover

Mac OS X Server 10.4 ships with built-in IP-based failover support. Out of the box, it's a warm-spare failover setup; the backup server will take over the IP address of the failed server automatically, but it is up to the administrator to configure the server to start any additional services that may be required.

Failover is implemented using two system-level daemons:

  1. heartbeatd runs on the primary server, and broadcasts "availability" packets at regular intervals. These packets serve as an announcement that the machine is up and running.

  2. failoverd runs on the secondary server, and listens for the broadcast packets from the primary. In the event that the broadcasts are not received, the secondary takes over the primary's IP address.
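The decision the secondary makes can be sketched in a few lines of shell. This is a conceptual illustration only: the real failoverd is a daemon whose internals are not described here, and the TIMEOUT value and the function name are our own assumptions.

```shell
#!/bin/sh
# Conceptual sketch only; not Apple's implementation.
# TIMEOUT is a hypothetical value, not Apple's actual setting.
TIMEOUT=10

# should_take_over LAST_HEARD NOW
# Succeeds (exit status 0) when no heartbeat has arrived within
# TIMEOUT seconds. Both arguments are timestamps in seconds.
should_take_over() {
    last_heard=$1
    now=$2
    [ $((now - last_heard)) -gt "$TIMEOUT" ]
}
```

For example, should_take_over 100 115 succeeds (15 seconds of silence), while should_take_over 100 105 does not.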

Apple requires the following to implement IP-based failover:

Again, Apple's solution only transfers the IP address from one server to another; you are responsible for ensuring that the secondary server contains the data and configuration necessary to actually take over the network services.

To help with this, Apple provides a customizable scripting framework that lets you launch processes during a failover situation (more on this below).

Configuring Failover

For the purposes of this document, we'll be using the following sample names. You must replace the names and IP addresses with those of your actual equipment.

Public Network

First, ensure that both machines are on the same public subnet, and are able to reach each other. In our example, both machines are on the 10.0.0.0/24 subnet.

If you have enabled firewall software, ensure that it allows UDP traffic destined for port 1694 to reach the server. Note: Apple's built-in firewall entry for "IP Failover" incorrectly defaults to TCP traffic for this rule; you must also enable UDP.
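If you manage ipfw rules by hand rather than through Apple's firewall interface, a rule along the following lines should admit the heartbeat traffic. This is a sketch under assumptions: the rule number is hypothetical, the rule is deliberately broad (any source, any destination), and Apple's firewall tools may rewrite manually added rules. The function only prints the command so that you can review it first.

```shell
#!/bin/sh
# failover_fw_rule RULE_NUMBER
# Prints (does not run) an ipfw command that allows UDP traffic to
# port 1694. RULE_NUMBER is whatever free slot suits your rule set.
failover_fw_rule() {
    printf 'ipfw add %s allow udp from any to any dst-port 1694\n' "$1"
}
```

Run failover_fw_rule 12300 to see the command, then execute it as root once you are satisfied that it fits your existing rules.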

Private Network

Next, connect both servers together via an independent private network. This can be a second Ethernet connection, a FireWire cable, or another type of network link.

Ensure that the secondary network interface appears below the primary interface in the Network Settings preference pane; this ensures that the machine will only use the private network when the public network is down.

Also ensure that the private network has no DNS information specified. All DNS information should be obtained from the public network.

If you have enabled firewall software, ensure that it also allows UDP traffic destined for port 1694 to reach the server over the private interface. As on the public network, Apple's built-in firewall entry for "IP Failover" defaults to TCP only, so you must enable UDP as well.

Configuring the Primary Server

On the primary server, edit the /etc/hostconfig file and add a line containing the broadcast addresses of both the public and private subnets. Using our sample IPs from above, our line looks like:

FAILOVER_BCAST_IPS="192.168.0.255 10.0.0.255"
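If you are unsure of a subnet's broadcast address, it can be derived from any host address on that subnet and its netmask by setting every host bit to 1. The helper below is a small sketch of that arithmetic; the function name is our own, and the 192.168.0.1 address used in the example is a hypothetical private address.

```shell
#!/bin/sh
# broadcast_addr IP NETMASK
# Prints the broadcast address: each octet of IP OR'd with the
# inverted netmask octet. The body runs in a subshell so that the
# IFS change does not leak into the caller.
broadcast_addr() (
    IFS=.
    set -- $1 $2    # split both dotted quads into $1..$8
    printf '%d.%d.%d.%d\n' \
        "$(( $1 | (255 - $5) ))" "$(( $2 | (255 - $6) ))" \
        "$(( $3 | (255 - $7) ))" "$(( $4 | (255 - $8) ))"
)
```

For our sample public subnet, broadcast_addr 10.0.0.100 255.255.255.0 prints 10.0.0.255, and broadcast_addr 192.168.0.1 255.255.255.0 prints 192.168.0.255.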

Save the file and reboot the primary server (or manually start the IPFailover service using SystemStarter).

The primary should now have a heartbeatd process running, and running tcpdump with a filter for port 1694 should show regular traffic from the server on both its public and private interfaces. If this is happening, move on to the next step.
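The two checks can be wrapped in small helper functions, sketched below. The function names are our own, and the interface names in the usage note (en0 for public, en1 for private) are assumptions; substitute the interfaces on your machine.

```shell
#!/bin/sh
# daemon_running NAME
# Succeeds if some process's command name contains NAME.
daemon_running() {
    ps -A -o comm | grep -q "$1"
}

# watch_heartbeat INTERFACE
# Prints UDP port 1694 traffic seen on INTERFACE until interrupted.
# tcpdump normally needs root privileges to capture packets.
watch_heartbeat() {
    tcpdump -n -i "$1" udp port 1694
}
```

On the primary, daemon_running heartbeatd should succeed, and running watch_heartbeat en0 and then watch_heartbeat en1 as root should each show packets at regular intervals. If the daemon is not running and you would rather not reboot, sudo SystemStarter start IPFailover is the manual start mentioned above.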

Configuring the Secondary Server

On the secondary server, edit the /etc/hostconfig file and add lines defining the primary server's IP address, and the interface that should assume the address in the event of a failover. Additionally, you may specify an e-mail address to send notifications to when a failover occurs:

FAILOVER_PEER_IP="10.0.0.100"
FAILOVER_PEER_IP_PAIRS="en0:10.0.0.100"
FAILOVER_EMAIL_RECIPIENT="root@example.org"

The second line's syntax says that the address 10.0.0.100 should be added to the en0 interface. If your machine has multiple interfaces, specify the one that should take over the primary server's address.
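To make the interface:address syntax concrete, the sketch below splits one such pair into its two parts using shell parameter expansion. It only illustrates the format; it is not how failoverd itself parses the setting, and the function name is our own.

```shell
#!/bin/sh
# split_pair PAIR
# Splits "interface:address" at the first colon and prints the
# interface and the address separated by a space.
split_pair() {
    iface=${1%%:*}   # everything before the first colon
    addr=${1#*:}     # everything after the first colon
    printf '%s %s\n' "$iface" "$addr"
}
```

With our sample value, split_pair en0:10.0.0.100 prints en0 10.0.0.100.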

Save the file and reboot the secondary server (or manually start the IPFailover service using SystemStarter).

You should now have a failoverd process running on the secondary server, ready to take over from the primary. You may test this by unplugging both network interfaces from the primary (or simply shutting it down). The secondary server should notice that the primary is unavailable and take over the IP address. Additionally, you should receive an e-mail notification about the takeover.

If you reconnect the primary server, the secondary should notice and relinquish its address within 15 seconds. Again, an e-mail notification is sent to confirm the change.

Failover Transition Scripting

The process described above handles the takeover of an IP address from a primary server to a secondary one. However, that's all it does; if your secondary server is not already running all the services that the primary uses, then the takeover won't help you.

For this reason, Apple has provided a scriptable framework so you can take specific actions whenever a secondary server takes over (or gives up) the primary's address.

File Locations

Apple's IPFailover scripts look for a directory in /Library/IPFailover named after the public IP address of the primary server. In our example, the secondary server would have a directory named:

/Library/IPFailover/10.0.0.100

This directory can contain several scripts, outlined below:

You may have as many of each script as you want, and they may perform any scriptable tasks. For example, you might use the scripts to start a service after acquiring the primary's address, and then stop the service when the primary comes back.
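As a rough sketch of the start-then-stop example, the helper below starts a service on acquisition and stops it on release. Everything named here is an assumption made for illustration: the afp service, the DRY_RUN switch, and the function names. It uses the serveradmin command-line tool that ships with Mac OS X Server, which must be run as root. With DRY_RUN left at 1, it only prints what it would do.

```shell
#!/bin/sh
# Hypothetical transition helper; adapt before using it in a real script.
SERVICE=${SERVICE:-afp}    # assumed service name
DRY_RUN=${DRY_RUN:-1}      # 1 = print commands instead of running them

# run COMMAND [ARGS...]
# Executes the command, or just prints it when DRY_RUN is 1.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Call from a script that runs after the address has been acquired.
on_acquire() { run serveradmin start "$SERVICE"; }

# Call from a script that runs when the primary comes back.
on_release() { run serveradmin stop "$SERVICE"; }
```

Keeping the dry-run switch lets you test the scripts on the secondary without actually starting or stopping anything.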

We've built a template directory containing simple scripts that are easily customized:

Suffield IPFailover Config Template

Peering Failover

It is possible to set up a pair of machines in a peering configuration, such that each acts as the backup for the other. To do so, add both the primary and the secondary lines to /etc/hostconfig on each machine, with each machine naming the other as its peer.
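A peering setup might look like the following. This is a sketch under assumptions: the second machine's address (10.0.0.101) is hypothetical, and both machines are assumed to use en0 as their public interface. Each machine gets only its own section.

```shell
# --- /etc/hostconfig on machine A (10.0.0.100) ---
# Announce A's availability on both subnets.
FAILOVER_BCAST_IPS="192.168.0.255 10.0.0.255"
# Watch machine B and add its address to en0 if B disappears.
FAILOVER_PEER_IP="10.0.0.101"
FAILOVER_PEER_IP_PAIRS="en0:10.0.0.101"

# --- /etc/hostconfig on machine B (10.0.0.101, hypothetical) ---
# Announce B's availability on both subnets.
FAILOVER_BCAST_IPS="192.168.0.255 10.0.0.255"
# Watch machine A and add its address to en0 if A disappears.
FAILOVER_PEER_IP="10.0.0.100"
FAILOVER_PEER_IP_PAIRS="en0:10.0.0.100"
```

In this arrangement each machine runs both daemons, so remember that each one also needs a script directory under /Library/IPFailover named after its peer's address.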

Using the Suffield Configs

We keep all of our script directories under version control. To use them, you must check out the repository for the primary host you're interested in onto the secondary server that backs it up.

Note: This document assumes you've set up basic IPFailover as described above. That means you've edited /etc/hostconfig and "vanilla" failover is working.

Check out the config from our Subversion repository, substituting the actual primary IP address for <PRIMARY_IP>:

cd /Library/IPFailover/

sudo svn checkout \
svn://svn.suffieldacademy.org/netadmin/trunk/software/failover/host_configs/<PRIMARY_IP>

This will check out the configuration directory named after the IP address and place a working copy in the /Library/IPFailover directory. You may change into this directory at any time and run svn up to merge in any updates to the configuration.
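The checkout and update steps can be combined into one helper that picks the right command depending on whether a working copy already exists. This is a sketch: the function name and the two variables are our own, and the function only prints the command rather than running it.

```shell
#!/bin/sh
REPO_BASE="svn://svn.suffieldacademy.org/netadmin/trunk/software/failover/host_configs"
CONFIG_ROOT="/Library/IPFailover"

# sync_cmd PRIMARY_IP
# Prints "svn up" for an existing working copy, or a fresh
# "svn checkout" into CONFIG_ROOT when none exists yet.
sync_cmd() {
    if [ -d "$CONFIG_ROOT/$1/.svn" ]; then
        echo "svn up $CONFIG_ROOT/$1"
    else
        echo "svn checkout $REPO_BASE/$1 $CONFIG_ROOT/$1"
    fi
}
```

For our sample primary, sync_cmd 10.0.0.100 prints the command to run; execute it with sudo once you have confirmed that it is what you expect.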