Keeping Your Web Content in Sync
Adam Olson
This article explains how to keep the content in your Web server
farm synchronized with rsync. rsync is a very handy program that
provides a simple way to mirror content across a number of machines.
I'll show how to design a straightforward content push system to
keep front-end Web server content synchronized. There are plenty of
ways to use a program like rsync; this is just one of them.
Obtaining and Building rsync
rsync was written by Andrew Tridgell and Paul Mackerras; the
current version is 2.4.6. Download the compressed source
tarball from http://rsync.samba.org. I ran the
following commands on a system running Solaris 2.7, and the
compilation went smoothly:
# gzip -dc rsync-2.4.6.tar.gz | tar xvf -
# cd rsync-2.4.6
# ./configure
# make
# make install
This installs the rsync binary in /usr/local/bin and also
installs the man pages. You will need to go through this process on
all the involved hosts.
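One quick way to confirm the install on a given host is a check like the following sketch. The function name check_binary is only illustrative, and the default path assumes the /usr/local/bin location mentioned above.

```shell
#!/bin/sh
# Sketch: report whether an executable exists at the given path.
# check_binary is a hypothetical helper name; the default path assumes
# rsync was installed under /usr/local/bin as described above.
check_binary() {
    bin="${1:-/usr/local/bin/rsync}"
    if [ -x "$bin" ]; then
        echo "found: $bin"
    else
        echo "missing: $bin"
        return 1
    fi
}
```

Running check_binary with no arguments on each host (or passing a different path where the install prefix differs) shows at a glance which machines still need the build steps.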
More on Our Goal
One example of a cookie-cutter Web tier is a design where a
number of front-end Web servers all serve up identical content and
the rest is handled via calls to a back-end database of some kind.
Traffic is load balanced across the Web servers using a method such
as DNS round robin or, if possible, a hardware solution. Because the
Web servers all have the same content tree, using rsync to maintain
these structures from a central distribution point provides a clean
and easy way to maintain the content.
More on rsync
Why does rsync work so well in this configuration? Here are some
of the key factors:
1. You can use ssh as the underlying transport mechanism.
This means you get added security without a lot of extra work.
ssh handles all of the authentication, which is far better
than leaving it to a clear-text protocol like rlogin.
2. Entire filesystems or individual directories can be updated,
which makes it easy to mirror your document root and
subdirectories to a number of destination hosts.
3. It preserves symbolic and hard links, ownership, permissions,
etc. For example, if rsync is preserving file ownership, the UIDs of
the transferred files will remain the same instead of being owned by
the account initiating the transfer.
rsync also includes an algorithm for determining which portions
of a file need to be synchronized, so it can be more efficient
over slow transmission lines. Personally, I don't usually benefit
from this feature because high-bandwidth paths are increasingly
common. As the following example shows, I am more concerned with the
act of synchronizing our hosts than with doing it in
the most efficient manner. If you are interested in learning more
about the rsync algorithm, a detailed description is provided in the
distribution.
Let's Do Some Syncing
I'll now walk through how to build a basic configuration that can
be expanded to support a multitude of hosts. The following is an
example of using ssh to transfer the files. You need
ssh (http://www.ssh.com) installed on both hosts, or
you can use rsh.
The central distribution point will be located on a host named
dev, and our front-end Web server will be on a host named
www1. The distribution root on dev will be located at
/usr/local/webroot, and the document root on www1 will
be located at /usr/local/webroot as well.
The basic command to synchronize www1 to dev looks
like this:
dev# rsync -vrlHpog --delete --rsh=/usr/local/bin/ssh /usr/local/webroot/ www1:/usr/local/webroot/
Here is a breakdown of this command that shows what each part
does:
- -v -- Run in verbose mode. Displays the files being
transferred, as well as statistics on how much data was written,
read, and how long it took.
- -r -- Recurse into directories.
- -l -- Preserve soft links.
- -H -- Preserve hard links.
- -p -- Preserve permissions.
- -o -- Preserve owner.
- -g -- Preserve group.
- --delete -- Delete any files on the
destination host that do not exist on the distribution host.
Without this option, files removed from the content in a new
revision linger on the front-end Web servers, which could
conceivably have bad effects on your application.
- --rsh=/usr/local/bin/ssh -- The path to ssh.
- /usr/local/webroot/ -- The local content source
directory.
- www1:/usr/local/webroot/ -- The remote host and the
document root on that host.
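Because --delete removes files on the destination, it can be worth previewing a push before running it for real. The sketch below assembles the command described above for one host and prints it rather than executing it. The function name push_cmd and the three variables are illustrative assumptions; -n is rsync's dry-run flag, which reports what would be transferred without changing anything.

```shell
#!/bin/sh
# Sketch: assemble the rsync push command for one host and print it.
# push_cmd, RSYNC, SSH, and WEBROOT are hypothetical names; the paths
# follow the layout used in this article.
RSYNC=/usr/local/bin/rsync
SSH=/usr/local/bin/ssh
WEBROOT=/usr/local/webroot/

push_cmd() {
    host="$1"
    extra="${2:-}"    # optional extra flag, such as -n for a dry run
    echo "$RSYNC -vrlHpog --delete ${extra:+$extra }--rsh=$SSH $WEBROOT $host:$WEBROOT"
}

# Print the dry-run form of the command for www1.
push_cmd www1 -n
```

Note the trailing slash on the source directory: it tells rsync to copy the contents of the directory rather than the directory itself, which is what keeps the two document roots aligned.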
Another argument you may use often is --exclude. For
example, adding --exclude="*.log" or --exclude="*.old"
would exclude any file ending in .log or .old from
being pushed to the front-end Web servers. Log files or backups made
while on the development server are of little use when synchronized
into production. For a list of all the arguments to rsync, run rsync
without any arguments or check out the man page.
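When the list of patterns grows, it can be convenient to generate the options instead of typing them. The following sketch prints one quoted --exclude option per pattern, ready to paste into the rsync command line. The helper name exclude_opts is hypothetical.

```shell
#!/bin/sh
# Sketch: print one quoted --exclude option for each pattern given.
# exclude_opts is a hypothetical helper name, not part of rsync.
exclude_opts() {
    sep=""
    for pattern in "$@"; do
        printf '%s--exclude="%s"' "$sep" "$pattern"
        sep=" "
    done
    printf '\n'
}

# Patterns are quoted so the local shell does not expand them.
exclude_opts '*.log' '*.old'
```

The quotes in the printed options matter: they keep the shell on the distribution host from expanding the wildcard before rsync sees it.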
Sprucing It Up
Typing the command discussed above works well when you are
dealing with only a few front-end Web servers. Even then, it is
always easier to write a script to do it for you! I am always
happier when I have eliminated repetitious tasks.
Here is a basic script that gets the job done. A useful addition,
if you use RSA authentication in your ssh setup, is to add
support for ssh-agent so a passphrase only needs to be
entered once:
#!/usr/local/bin/perl
#
# a basic script utilizing rsync that will synchronize
# content to a number of front end servers.
#
# [email protected] 10/31/00
#
use strict;
use warnings;

#### DEFINE ####
# array of servers, add your hosts here.
my @servers = qw(www1 www2 www3 www4 www5 www6 www7 www8);
# distribution directory
my $distdir = "/usr/local/webroot/";
# destination directory
my $destdir = "/usr/local/webroot/";
#### END ####

foreach my $server (@servers) {
    print "Initiating content synchronization on $server.\n";
    # The list form of system() runs rsync directly, without a shell,
    # so the arguments need no extra quoting or line continuation.
    system("/usr/local/bin/rsync", "-vrlHpog", "--delete",
           "--rsh=/usr/local/bin/ssh", $distdir, "$server:$destdir");
    # $? holds the exit status of rsync; zero means success.
    if ($? == 0) {
        print "Content synchronization successful on $server.\n";
    } else {
        print "Content synchronization failed on $server.\n";
    }
}
Conclusion
This article covered a relatively painless way to keep the
content on your front-end Web servers synchronized. It can be
expanded upon to synchronize content across a wide range of different
services, as well. rsync's seamless integration with ssh and its
ability to mirror entire directory trees while keeping permissions
and ownership intact make it a good solution to the problem of
content management.
Adam Olson has helped build a successful ISP
(http://www.humboldt1.com), designed and
configured portions of the California Power Network while working at
MCI WorldCom, and is currently working for a startup in Silicon
Valley (http://www.quaartz.com). Adam hopes to
be sailing a lot soon. He can be contacted at:
[email protected].