Last updated 2008/03/18
Network infrastructure occasionally requires maintenance, no matter how well-designed it is. However, many organizations have come to depend on network services to such a degree that downtime adversely affects the entire organization. Thus, we must find a way to increase the availability of service, even when parts of the infrastructure are not operating.
In the network world, increased availability usually requires duplicate hardware, and additional expense. We don't have a lot of money to throw at the problem, so the approaches outlined in this document use a minimum of hardware and expense to accomplish their task. Obviously, you must decide what level of downtime your organization can withstand, and balance this with the added cost and complexity of a suitable failover system.
Before we begin, some quick discussion of terminology:
Mac OS X Server 10.4 ships with built-in IP-based failover support. Out of the box, it's a warm-spare failover setup; the backup server will take over the IP address of the failed server automatically, but it is up to the administrator to configure the server to start any additional services that may be required.
Failover is implemented using two system-level daemons:
heartbeatd runs on the primary server, and broadcasts
"availability" packets at regular intervals. These packets serve as
an announcement that the machine is up and running.
failoverd runs on the secondary server, and listens for the
broadcast packets from the primary. In the event that the broadcasts
are not received, the secondary takes over the primary's IP address.
Apple requires the following to implement IP-based failover:
Again, Apple's solution only transfers the IP address from one server to another; you are responsible for ensuring that the secondary server contains the data and configuration necessary to actually take over the network services.
To help with this, Apple provides a customizable scripting framework for launching processes during a failover (more on this below).
For the purposes of this document, we'll be using the following sample names. You must replace the names and IP addresses with those of your actual equipment.
The primary server has public IP address 10.0.0.100 and
private IP address 192.168.0.1.
The secondary server has public IP address 10.0.0.200 and
private IP address 192.168.0.2.
First, ensure that both machines are on the same public subnet, and
are able to reach each other. In our example, both machines are on the
10.0.0.0/24 subnet.
If you have enabled firewall software, ensure that it allows UDP
traffic destined for port 1694 to reach the server. Note:
Apple's built-in firewall entry for "IP Failover" incorrectly defaults
to TCP traffic for this rule; you must also enable UDP.
Next, connect both servers together via an independent private network. This can be a second Ethernet connection, a FireWire cable, or another network link.
Ensure that the secondary network interface appears below the primary interface in the Network Settings preference pane; this ensures that the machine will only use the private network when the public network is down.
Also ensure that the private network has no DNS information specified. All DNS information should be obtained from the public network.
On the primary server, edit the /etc/hostconfig file and add a
line containing the broadcast addresses of both the public and
private subnets. Using our sample IPs from above, our line looks like:
FAILOVER_BCAST_IPS="192.168.0.255 10.0.0.255"
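If you're unsure of a subnet's broadcast address, it can be derived from any address on the subnet plus the prefix length. The following is a minimal sketch, not part of Apple's tools: broadcast_addr is a hypothetical helper name, and the /24 prefix used for the private subnet is an assumption (only the public subnet's prefix is given above).

```shell
# Sketch: compute the broadcast address for an IPv4 address and prefix length.
# broadcast_addr is a hypothetical helper, not part of Apple's IPFailover tools.
broadcast_addr() {
    ip=$1
    prefix=$2
    # Split the dotted quad into four octets ($1..$4).
    old_ifs=$IFS
    IFS=.
    set -- $ip
    IFS=$old_ifs
    # Pack the octets into one 32-bit integer.
    addr=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    # The host bits are the low (32 - prefix) bits; setting them all
    # yields the broadcast address.
    hostmask=$(( (1 << (32 - prefix)) - 1 ))
    bcast=$(( addr | hostmask ))
    echo "$(( (bcast >> 24) & 255 )).$(( (bcast >> 16) & 255 )).$(( (bcast >> 8) & 255 )).$(( bcast & 255 ))"
}
```

With the sample addresses, `broadcast_addr 10.0.0.100 24` and `broadcast_addr 192.168.0.1 24` give the two values used in the FAILOVER_BCAST_IPS line above.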
Save the file and reboot the primary server (or manually start the
IPFailover service using SystemStarter).
The primary should now have a heartbeatd process running, and a
tcpdump listening for port 1694 should show regular traffic
from the server on both its public and private interfaces. If this is
happening, move on to the next step.
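The two checks just described can be wrapped in a small function. This is only a sketch: check_heartbeat is a hypothetical name, and en0 is an assumed interface name, so substitute your own public and private interfaces.

```shell
# Sketch: confirm heartbeatd is running, then watch for heartbeat packets.
# check_heartbeat is a hypothetical helper; en0 is an assumed interface name.
check_heartbeat() {
    iface=${1:-en0}
    # The [h] bracket keeps grep from matching its own process entry.
    if ps ax | grep '[h]eartbeatd' >/dev/null; then
        echo "heartbeatd is running"
    else
        echo "heartbeatd is NOT running" >&2
        return 1
    fi
    # Capture five packets on UDP port 1694 (no name resolution), then exit.
    sudo tcpdump -n -c 5 -i "$iface" udp port 1694
}
```

Run it once per interface (for example `check_heartbeat en0`, then again with the private interface). If tcpdump sits silent, the broadcast addresses in /etc/hostconfig and the firewall rules are the first things to recheck.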
On the secondary server, edit the /etc/hostconfig file and add
lines defining the primary server's IP address, and the interface that
should assume the address in the event of a failover. Additionally,
you may specify an e-mail address to send notifications to when a
failover occurs:
FAILOVER_PEER_IP="10.0.0.100"
FAILOVER_PEER_IP_PAIRS="en0:10.0.0.100"
FAILOVER_EMAIL_RECIPIENT="root@example.org"
The second line's syntax says that the address 10.0.0.100 should
be added to the en0 interface. If your machine has multiple
interfaces, specify the one that should take over the primary server's
address.
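Because a typo in the pair value is easy to miss, it may be worth sanity-checking it before rebooting. The sketch below simply splits an "interface:address" string at the first colon; split_pair is a hypothetical helper and does not reflect how Apple's scripts parse the value.

```shell
# Sketch: split an "interface:address" pair as used in FAILOVER_PEER_IP_PAIRS.
# split_pair is a hypothetical helper for checking the value by eye.
split_pair() {
    pair=$1
    iface=${pair%%:*}   # text before the first colon
    addr=${pair#*:}     # text after the first colon
    # With no colon present, iface equals the whole string.
    if [ "$iface" = "$pair" ] || [ -z "$iface" ] || [ -z "$addr" ]; then
        echo "malformed pair: $pair" >&2
        return 1
    fi
    echo "interface=$iface address=$addr"
}
```

For our sample value, `split_pair en0:10.0.0.100` prints `interface=en0 address=10.0.0.100`. Running `ifconfig en0` afterwards confirms that the named interface actually exists on the secondary.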
Save the file and reboot the secondary server (or manually start the
IPFailover service using SystemStarter).
You should now have a failoverd process running on the secondary
server, ready to take over from the primary. You may test this by
unplugging both network interfaces from the primary (or simply
shutting it down). The secondary server should notice that the server
is unavailable and take over the IP address. Additionally, you should
receive an e-mail notification about the takeover.
If you reconnect the primary server, the secondary should notice and relinquish its address within 15 seconds. Again, an e-mail notification is sent to confirm the change.
The process described above handles the takeover of an IP address from a primary server to a secondary one. However, that's all it does; if your secondary server is not already running all the services that the primary uses, then the takeover won't help you.
For this reason, Apple has provided a scriptable framework so you can take specific actions whenever a secondary server takes over (or gives up) the primary's address.
Apple's IPFailover scripts look for a directory in
/Library/IPFailover named after the public IP address of the
primary server. In our example, the secondary server would have a
directory named:
/Library/IPFailover/10.0.0.100
This directory can contain several scripts, outlined below:
Test, which is run before any takeover is attempted. If the
script returns with zero status, the takeover continues; if it returns
non-zero status the takeover is aborted. This allows for conditional
takeover depending on other external factors.
PreAcq.* scripts get run before the primary address is added to
the secondary. Any script starting with the prefix PreAcq is run,
and the ordering is determined by the name of the file (e.g.,
PreAcq-1 would run before PreAcq-2).
PostAcq.* scripts get run after the primary address has been
acquired by the secondary server. Execution rules are the same as
with PreAcq scripts above.
PreRel.* scripts get run just before the secondary server gives
up the primary address. Execution rules are the same as
with PreAcq scripts above.
PostRel.* scripts get run after the primary address has been
relinquished by the secondary server. Execution rules are the same as
with PreAcq scripts above.
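As an illustration of the Test script's zero/non-zero contract, here is a sketch of logic that refuses a takeover while an administrator-created "hold" file exists (useful during planned maintenance of the primary). The function name takeover_allowed and the HOLD file name are both hypothetical; nothing in Apple's framework defines them.

```shell
# Sketch of logic a Test script might use: refuse takeover while a hold file exists.
# takeover_allowed and the HOLD file name are hypothetical conventions.
takeover_allowed() {
    config_dir=${1:-/Library/IPFailover/10.0.0.100}
    if [ -e "$config_dir/HOLD" ]; then
        echo "hold file present; aborting takeover" >&2
        return 1    # non-zero status: takeover is aborted
    fi
    return 0        # zero status: takeover continues
}
```

A Test script built on this would call the function and exit with its status (`takeover_allowed; exit $?`).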
You may have as many of each script as you want, and they may perform any scriptable tasks. For example, you might use the scripts to start a service after acquiring the primary's address, and then stop the service when the primary comes back.
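Since ordering depends on file names, it can help to preview the order before relying on it. The sketch below lists the files for one phase in the order the shell's own glob sorting produces; list_hooks is a hypothetical helper, and it assumes (rather than reproduces) that Apple's scripts sort names the same way.

```shell
# Sketch: list the scripts for one phase in name-sorted order.
# list_hooks is a hypothetical preview helper, not Apple's own code.
list_hooks() {
    dir=$1
    prefix=$2
    # Glob results are already sorted by name.
    for script in "$dir"/"$prefix"*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -f "$script" ] || continue
        basename "$script"
    done
}
```

For example, `list_hooks /Library/IPFailover/10.0.0.100 PreAcq` prints the PreAcq scripts one per line. Note that a plain name sort places PreAcq-10 before PreAcq-2, so zero-padded names (PreAcq-02, PreAcq-10) are a safer habit once you have more than nine scripts.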
We've built a template directory containing simple scripts that are easily customized:
Suffield IPFailover Config Template
It is possible to set pairs of machines up in a peering configuration
such that each acts as the backup for the other. Simply add the
requisite lines to /etc/hostconfig on both machines and they'll
act as a backup for each other.
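To make the peering setup concrete, the sketch below prints the lines one machine of a pair would add, combining the primary-side and secondary-side settings shown earlier. peer_hostconfig is a hypothetical helper; the en0 interface and the broadcast list are just the sample values from this document, and we assume 10.0.0.200 is the second machine's public address.

```shell
# Sketch: print the /etc/hostconfig lines for one machine of a peered pair.
# peer_hostconfig is a hypothetical helper; defaults are this document's samples.
peer_hostconfig() {
    peer_ip=$1                                   # public IP of the OTHER machine
    iface=${2:-en0}                              # local interface that would assume it
    bcast=${3:-"192.168.0.255 10.0.0.255"}       # subnets to broadcast heartbeats on
    echo "FAILOVER_BCAST_IPS=\"$bcast\""
    echo "FAILOVER_PEER_IP=\"$peer_ip\""
    echo "FAILOVER_PEER_IP_PAIRS=\"$iface:$peer_ip\""
}
```

On 10.0.0.100 you would use the output of `peer_hostconfig 10.0.0.200`, and on 10.0.0.200 the output of `peer_hostconfig 10.0.0.100`. Each machine then also needs its own /Library/IPFailover directory named after its peer's address.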
We keep all of our script directories under version control. To use them, you must check out the repository for the primary host you're interested in onto the secondary server that backs it up.
Note: This document assumes you've set up basic IPFailover as
described above. That means you've edited /etc/hostconfig and
"vanilla" failover is working.
Check out the config from our Subversion repository, substituting the
actual primary IP address for <PRIMARY_IP>:
cd /Library/IPFailover/
sudo svn checkout \
  svn://svn.suffieldacademy.org/netadmin/trunk/software/failover/host_configs/<PRIMARY_IP>
This will check out the configuration directory named after the IP
address and place a working copy in the /Library/IPFailover
directory. You may change into this directory at any time and run
svn up to merge in any updates to the configuration.