OpenDarwin's Panther Upgrade
The OpenDarwin transition from 10.2 to 10.3 was a painful one. We started out by preparing a brand new drive with 10.3 on it, in the hope of doing all the work up front and having a smooth transition. Unfortunately, things fell apart once the 10.3 drive was installed and booted. After 23 hours of configuring, we believe the system is fully upgraded with minimal data loss.
Our plan of attack for the Panther transition was to take a spare drive, do a fresh install of 10.3 Server on it, and migrate everything over to that drive. We had first tried 10.2 to 10.3 upgrade installs, with disastrous results. These upgrade installs were attempted prior to 10.3 GM, and about two dozen bugs were filed. Most of the bugs were not fixed for 10.3, and we weren't about to try the upgrade on a production machine while known issues remained unresolved. Moving ahead with the 10.3 fresh install approach, we found many migration issues. While these migration headaches aren't exactly bugs, they definitely tripped us up quite a bit. We'll get to the migration issues later. We were not able to fully replicate the production machine's configuration locally, as a number of outside services depend on this machine, and we did not have the resources to replicate the entire infrastructure built up around this server. This came back to haunt us later.
Once the machine appeared to be fully installed locally and ready to be put into production, we scheduled the downtime and made arrangements with the ISC to visit the machine. The main OpenDarwin server is located in a data center in San Francisco. We don't have easy physical access to the machine, so we like to keep the visits to a minimum. When we arrived, we tried to connect one of the floating VGA displays to the machine, along with our stored USB keyboard. As usual, the VGA display didn't work with the Xserve, so we were stuck doing things "remotely" over a wireless network. The machine was configured not to launch the window server, since there was no need for the overhead of having GUI processes running. Unfortunately, whatever resolution the console defaults to doesn't work with the CRTs floating around the data center. So, we shut down the machine remotely, put the disk in, and had to boot back into the old OS so that we could select the new disk to boot from. We would have liked to use the console to configure the network on the new system, but that wasn't practical, since the Xserve and the CRT didn't get along. Once the system rebooted into 10.3, we wouldn't have a console and we wouldn't be on the net, which wouldn't be a good situation to be in.
Every interface for configuring the network seemed to go through configd, so chrooting into the 10.3 system and running configuration tools wasn't going to help us. So, we copied SystemConfiguration's preferences.xml file from 10.2 into 10.3's preferences.plist location, crossed our fingers, used bless to select the new drive for booting, and rebooted.
Fortunately, that worked: the machine was up on the net, and we were able to log in. The first major hurdle was crossed. Then we decided to do one last quick sync of home directories and mail onto the new partition. The machine immediately became unresponsive. Upon reboot, the machine's disk drive light blinked excessively, but it refused all network connections. No one knows what the machine was doing. We gave it the usual 45 minutes it takes to fsck, and it was still grinding away without a network. So, we power cycled it. This time, the machine's network was active, but apparently we couldn't authenticate to the machine until it was done fsck'ing our UFS cvs drive. Under 10.2, we were at least able to log in while the cvs drive fsck'd. This apparent regression in 10.3 is most annoying. Anyway, once it came back up this time, everything seemed fine, and we were able to continue our syncing. At this point, the machine was up and on the network, and everyone was tired of having been in the noisy data center for 4 hours. The vast majority of our time was spent waiting for the machine while it was in some unknown state (or fsck'ing).
After getting the machine onto the net with remote logins possible, we went back to our offices and continued configuring the machine. Here is where we hit a number of different migration problems that were difficult to diagnose.
The sendmail to postfix migration was a serious pain for us. We were used to sendmail, and had it fairly customized under 10.2. No one knew anything about postfix, so we had to spend a lot of time coming up to speed on it. While there isn't really a "bug" here, Apple could have eased the transition, perhaps by letting us continue to run sendmail, by providing a sendmail.mc to postfix config translator, or even by supplying a simple document describing common migrations.
RedHat had already done a sendmail to postfix transition in their distribution, and had migration documentation. We ended up relying mostly on RedHat's documentation, since Apple had none. Beyond the normal transition pain, Apple threw some kinks into the problem. First, we wanted to shut down postfix until it was configured. That way, mail would remain queued on the sender's machine, rather than getting bounced or just plain lost in our configuration experiments. However, the ever-helpful watchdog process on OS X Server refused to let postfix die. Hint: if the process exits with status 0, it exited because it was asked to. We ended up having to disable watchdog's watching of postfix for this transition. Also, we have found that something keeps writing configuration options to the end of our /etc/postfix/main.cf. Something on the system REALLY wants us to be using cyrus for mail delivery. We don't need something that large and complex, especially when we're just trying to understand this new system. So, we have a cron job that periodically checks whether anything has mangled our postfix config file and, if so, replaces it with a known good copy. We currently believe this only happens on a reboot, but we haven't fully diagnosed the problem yet. We also found that Apple added a couple of configuration directives. With most postfix configuration options, leaving them undefined (not set on or off, just never mentioned in the config file) means the option is either disabled or uses a default setting. Unfortunately, this is not the case with Apple's options. They must be configured, and turned off if you don't want to use them. Note that this isn't really documented anywhere; you just find out by trial and error.
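As an illustration of the kind of check such a cron job performs, here is a minimal Python sketch. It is not the actual script used on the OpenDarwin server; the function name restore_if_changed, the backup directory, and the ".mangled" suffix are hypothetical, and the demonstration runs on throwaway files rather than on /etc/postfix/main.cf.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path


def restore_if_changed(live: Path, known_good: Path, backup_dir: Path) -> bool:
    """Restore `live` from `known_good` if their contents differ.

    Returns True when a restore happened, False when the file was intact.
    All three paths are supplied by the caller; nothing is hard-coded.
    """
    # shallow=False forces a byte-for-byte comparison instead of
    # trusting size and modification time alone.
    if live.exists() and filecmp.cmp(live, known_good, shallow=False):
        return False

    # Keep the modified copy so the appended options can be inspected later.
    if live.exists():
        backup_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(live, backup_dir / (live.name + ".mangled"))

    shutil.copy2(known_good, live)
    return True


if __name__ == "__main__":
    # Demonstration on temporary files (assumed contents, not a real config).
    workdir = Path(tempfile.mkdtemp())
    good = workdir / "main.cf.good"
    live = workdir / "main.cf"
    good.write_text("myhostname = mail.example.org\n")
    # Simulate something appending options to the end of the live file.
    live.write_text(good.read_text() + "appended_option = yes\n")

    print("restored:", restore_if_changed(live, good, workdir / "backup"))
    print("restored again:", restore_if_changed(live, good, workdir / "backup"))
```

Run from cron with the real paths substituted, a script like this would put the known good copy back whenever the live file drifts. Saving the mangled copy first is a design choice meant to help diagnose what keeps rewriting the file. A real deployment would probably also need to make postfix reread its configuration after a restore; the text above does not say how that was handled.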
Once we got postfix running, we moved on to the next mail issue: mailman. We had a custom install of mailman before, and wanted to migrate it to the stock 10.3 mailman. Apple seemed to tuck away much of the mailman directories into /var/mailman and /usr/share/mailman, with a couple of things in /usr/share/httpd. They also helpfully provided an /etc/httpd/httpd_mailman.conf file to assist with the web portion of the configuration. Once we were able to find all the pieces of mailman spread out over the system, we began trying to migrate things to it. First, we were using a documented mailman feature: a wrapper program run from /etc/aliases that delivers mail to the appropriate mailing list. This wrapper program didn't exist on OS X. So, we tried bypassing the wrapper program and invoking mailman directly from /etc/aliases. This generated errors about a gid mismatch. Postfix was running mailman with gid nobody, and mailman wanted to be run with gid mailman. Mailman told us to rebuild mailman to expect the gid nobody instead. It was either that, or have every mail delivery command be run with gid mailman. Since it appeared the default install of mailman just wasn't going to work, we went back to having a custom install. We downloaded the source, built it, and installed it in less time than it took to track down all the pieces of the default mailman. We also found a note shipped with the default 10.3 mailman that said: "Mailman should run on MacOSX, although I have not personally had time to try it yet."
After mail seemed fairly well taken care of, we turned to Apache to get our web services configured. The only problem we encountered here was understanding why our httpd.conf wouldn't run. We discovered that OS X Server comes with a cache that injects itself between port 80 and httpd. We needed to figure out where to disable this "feature" before Apache would run.
The perl transition to 5.8.1 was painful as well. OpenDarwin makes extensive use of perl modules, particularly for bugzilla, our bug tracking system. Bugzilla requires a dozen or so perl modules, and none of them could be brought forward to 10.3 due to binary incompatibility between perl 5.6 and 5.8.1. The binary incompatibility wasn't all. Cvsweb, a cgi script that provides a web interface to cvs repositories, had issues with 5.8.1's changes to the handling of tainted data. The temporary workaround is to run the script without perl's taint checking, an insecure way to run a publicly accessible perl script.
There were two other big migration headaches we encountered. The bind8 to bind9 transition worked well, but was completely undocumented by Apple. Our named.conf worked well enough for bind9 to answer queries, but wouldn't allow zone transfers. As a result, none of our slave servers were answering queries, because they couldn't update their zone information. The problem was that our config file didn't work properly with bind9. We just needed to update a few entries, and everything worked. A migration document from Apple would have helped immensely. However, the only reference to bind that we could find on Apple's site was that it was included in OS X.
The last issue was the migration of snmpd. Apparently, the snmpd configuration file moved from /etc/snmpd.conf to /usr/share/snmp/snmpd.conf. There wasn't any documentation giving us a heads up on this. Also, the new snmpd defaulted to using SNMPv3, whereas the old version of snmpd didn't even support SNMPv3. This was at least documented in the snmpd man pages.
Now that the transition from 10.2 to 10.3 is complete, the system seems to be up and fairly stable. Right now, though, we don't really notice any gain for all the pain we went through. 10.3 does include a case sensitive version of HFS, which we will benefit from, but after all the disturbance caused by the rest of the upgrade, we need a break before we rearrange our disks to accommodate a new filesystem for cvs. For a server, where you don't need all the GUI enhancements, and where upgrades of things like mailman, postfix, or named can easily be done on your own, one really needs to question the value provided by 10.3. It was a very painful transition, with no guidance from Apple, and we can't identify any clear benefits at this point.
If you do decide to go with Panther Server, here's my advice: 1) Buy an Apple display. Even if you're going to have a headless server, buy an Apple display and lug it up to the machine every time you want to play with it. 2) Don't expect any help from Apple during the upgrade. There are zero migration documents available on Apple's site. Many of your migration headaches were already had by RedHat users a year ago, and they have many good migration documents. They don't work 100%, but at least it's something.