OpenDarwin's Panther Upgrade

 

            The OpenDarwin transition from 10.2 to 10.3 was a painful one.  We started out by preparing a brand new drive with 10.3 on it, in the hopes of doing all the work up front and having a smooth transition.  Unfortunately, things fell apart once the 10.3 drive was installed and booted.  After 23 hours of configuring, we believe the system is fully upgraded, with minimal data loss.

            Our plan of attack on the Panther transition was to have a spare drive, do a fresh install of 10.3 Server on it, and migrate everything over to this drive.  We had first tried 10.2 to 10.3 upgrade installs, with disastrous results.  These upgrade installs were attempted prior to 10.3 GM, and about two dozen bugs were filed.  Most of the bugs were not fixed for 10.3, and we weren't about to try the upgrade on a production machine without the known issues being resolved.  Moving ahead with the 10.3 fresh install approach, we found many migration issues.  While these migration headaches aren't exactly bugs, they definitely tripped us up quite a bit.  We'll get to the migration issues later.  We were not able to fully replicate the installed machine's configuration locally, as there were a number of outside services depending on this machine.  We did not have the resources to replicate the entire infrastructure built up around this server.  This came back to haunt us later.

            Once the machine appeared to be fully installed locally and ready to be put into production, we scheduled the downtime and made arrangements with the ISC to visit the machine.  The main OpenDarwin server is located in a data center in San Francisco.  We don't have easy physical access to the machine, so we like to keep the visits to a minimum.  When we arrived, we tried to connect one of the floating VGA displays to the machine, along with our stored USB keyboard.  As usual, the VGA display didn't work with the Xserve, so we were stuck doing things "remotely" over a wireless network.  The machine was configured not to launch the window server, since there was no need for the overhead of having GUI processes running.  Unfortunately, whatever resolution the console defaults to doesn't work with the CRTs floating around the data center.  So, we shut down the machine remotely, put the disk in, and had to boot back into the old OS so that we could select the new disk to boot from.  We would have liked to use the console to configure the network on the new system, but that wasn't practical, since the Xserve and the CRT didn't get along.  Once the system rebooted into 10.3, we would have neither a console nor a network connection, which is not a good situation to be in.  Every interface for configuring the network seemed to go through configd, so chrooting into the 10.3 system and running configuration tools wasn't going to help us.  So, we copied SystemConfiguration's preferences.xml file on 10.2 into 10.3's preferences.plist location, crossed our fingers, used bless to select the new drive for booting, and rebooted.  Fortunately, that worked: the machine was up on the net, and we were able to log in.  The first major hurdle was crossed.  Then, we decided to do one last quick sync of home directories and mail onto the new partition.  The machine immediately became unresponsive.
Upon reboot, the machine's disk drive light blinked excessively, but the machine refused all network connections.  No one knows what it was doing.  We gave it the usual 45 minutes it takes to fsck, and it was still grinding away without a network.  So, we power cycled it.  This time, the machine's network was active, but apparently we couldn't authenticate to the machine until it was done fsck'ing our UFS cvs drive.  Under 10.2, we were at least able to log in while the cvs drive fsck'd.  This apparent regression in 10.3 is most annoying.  Anyway, once it came back up this time, everything seemed fine, and we were able to continue our syncing.  At this point, the machine was up and on the network, and everyone was tired of being in the noisy data center for 4 hours.  The vast majority of our time was spent waiting for the machine while it was in some unknown state (or fsck'ing).

            Once the machine was on the net and remote logins were possible, we went back to our offices and continued configuring it.  Here is where we hit a number of different migration problems that were difficult to diagnose.

            The sendmail to postfix migration was a serious pain for us.  We were used to sendmail, and had it fairly customized under 10.2.  No one knew anything about postfix, so we had to spend a lot of time coming up to speed on it.  While there isn't really a "bug" here, Apple could have eased the transition.  It could have let us continue to run sendmail, provided a sendmail.mc to postfix config translator, or even published a simple document describing common migrations.  RedHat had already done a sendmail to postfix transition in their distribution, and had migration documentation.  We ended up mostly relying on RedHat's documentation, since Apple had none.  Beyond the normal transition pain, Apple threw some kinks into the problem.  First, we wanted to shut down postfix until it was configured.  That way, mail would remain queued on the sender's machine, rather than getting bounced or just plain lost in our configuration experiments.  However, the ever-helpful watchdog process on OS X Server refused to let postfix die.  Hint: if the process exits with status 0, it exited because it was asked to.  We ended up having to disable watchdog's watching of postfix for this transition.  Also, we have found that something keeps writing configuration options to the end of our /etc/postfix/main.cf.  Something on the system REALLY wants us to be using cyrus for mail delivery.  We don't need something that large and complex, especially when we're just trying to understand this new system.  So, we have a cron job that periodically checks to see if anything has mangled our postfix config file, and if so, replaces it with a known good copy.  We currently believe this only happens on a reboot, but we haven't fully diagnosed the problem yet.  We also found that Apple added a couple of configuration directives.
With most postfix configuration options, leaving them undefined (not set on or off, just never mentioned in the config file) means the option is either disabled or uses a default setting.  Unfortunately, this is not the case with Apple's options.  They must be configured, and explicitly turned off if you don't want to use them.  Note that this isn't really documented anywhere; you just find out by trial and error.
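
As a rough illustration of the cron job mentioned above, here is a minimal Python sketch of the "compare against a known good copy and restore if different" idea.  It is not the script we actually run; the function name and the command-line convention (live file first, known good copy second) are assumptions made up for this example.

```python
#!/usr/bin/env python3
"""Restore a config file from a known good copy if its contents have changed.

Illustrative sketch only. Intended to be run from cron as:
    restore_config.py LIVE_FILE KNOWN_GOOD_COPY
(the script name and argument order are hypothetical).
"""
import filecmp
import shutil
import sys
from pathlib import Path


def restore_if_changed(live: Path, known_good: Path) -> bool:
    """Copy known_good over live when their contents differ.

    Returns True if the live file was replaced, False if it already matched.
    """
    # shallow=False forces a byte-by-byte comparison instead of trusting
    # file size and modification time alone.
    if live.exists() and filecmp.cmp(live, known_good, shallow=False):
        return False
    shutil.copyfile(known_good, live)
    return True


# Only act when both paths are given explicitly on the command line.
if __name__ == "__main__" and len(sys.argv) == 3:
    live_path, good_path = Path(sys.argv[1]), Path(sys.argv[2])
    if restore_if_changed(live_path, good_path):
        print(f"{live_path} had changed; restored from {good_path}",
              file=sys.stderr)
```

For our case the live file would be /etc/postfix/main.cf; where the known good copy lives is up to you.  A real job would probably also need to tell postfix to re-read its configuration (for example with `postfix reload`) after a restore, and might want to save the mangled file somewhere first so the culprit can be diagnosed.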

            Once we got postfix running, we moved on to the next mail issue: mailman.  We had a custom install of mailman before, and wanted to migrate it to the stock 10.3 mailman.  Apple seemed to tuck away most of the mailman directories into /var/mailman and /usr/share/mailman, with a couple of things in /usr/share/httpd.  They also helpfully provided an /etc/httpd/httpd_mailman.conf file to assist with the web portion of the configuration.  Once we were able to find all the pieces of mailman spread out over the system, we began trying to migrate things to it.  First, we had been using a documented mailman feature: a wrapper program, run from /etc/aliases, that delivers mail to the appropriate mailing list.  This wrapper program didn't exist on OS X.  So, we tried bypassing the wrapper program and invoking mailman directly from /etc/aliases.  This generated errors about a gid mismatch.  Postfix was running mailman with gid nobody, and mailman wanted to be run with gid mailman.  Mailman told us to rebuild mailman to expect the gid nobody instead.  It was either that, or have every mail delivery command be run with gid mailman.  Since it appeared the default install of mailman just wasn't going to work, we went back to having a custom install.  We downloaded the source, built it, and installed it in less time than it took to track down all the pieces of the default mailman.  We also found a note shipped with the default 10.3 mailman that said: "Mailman should run on MacOSX, although I have not personally had time to try it yet."

            After mail seemed fairly well taken care of, we turned to Apache to get our web services configured.  The only problem we encountered here was trying to understand why Apache wouldn't run with our httpd.conf.  We discovered that OS X Server comes with a cache that injects itself between port 80 and httpd.  We needed to figure out where to disable this "feature" before Apache would run.

            The perl transition to 5.8.1 was painful as well.  OpenDarwin makes extensive use of perl modules, particularly for bugzilla, our bug tracking system.  Bugzilla requires about a dozen perl modules, and none of these could be brought forward to 10.3, due to binary incompatibility between perl 5.6 and 5.8.1.  The binary incompatibility wasn't all.  Cvsweb, a cgi script that provides a web interface to cvs repositories, had issues with 5.8.1's changes to the handling of tainted data.  The temporary workaround is to run the script without perl's taint checking, which is an insecure way to run a publicly accessible perl script.

            There were two other big migration headaches we encountered.  The bind8 to bind9 transition mostly worked, but was completely undocumented by Apple.  Our named.conf worked well enough for bind9 to answer queries, but wouldn't allow zone transfers.  As a result, none of our slave servers were answering queries, because they couldn't update their zone information.  The problem was that our config file wasn't fully compatible with bind9.  We just needed to update a few entries, and everything worked.  A migration document from Apple would have helped immensely.  However, the only reference to bind that we could find on Apple's site was that it was included in OS X.

            The last issue was the migration of snmpd.  Apparently, the snmpd configuration file moved from /etc/snmpd.conf to /usr/share/snmp/snmpd.conf.  There wasn't any documentation giving us a heads up on this.  Also, the new snmpd defaulted to using SNMPv3, whereas the old version of snmpd didn't even support SNMPv3.  This was at least documented in the snmpd man pages.

            Now that the transition from 10.2 to 10.3 is complete, the system seems to be up and fairly stable.  Right now, though, we don't really notice any gain for all the pain we went through.  10.3 does include a case sensitive version of HFS, which we will benefit from, but after all the disturbance caused by the rest of the upgrade, we need a break before we rearrange our disks to accommodate a new filesystem for cvs.  For a server, where you don't need all the GUI enhancements, and where upgrades of things like mailman, postfix, or named can easily be done on your own, one really needs to question the value provided by 10.3.  It was a very painful transition, with no guidance from Apple, and we can't identify any clear benefits at this point.

            If you do decide to go with Panther Server, here's my advice: 1) Buy an Apple display.  Even if you're going to have a headless server, buy an Apple display and lug it up to the machine every time you want to play with it.  2) Don't expect any help from Apple during the upgrade.  There are zero migration documents available on Apple's site.  Many of your migration headaches were already experienced by RedHat users a year ago, and they have many good migration documents.  Those documents don't apply 100%, but at least it's something.