Experiences Administering a Mac OS X Machine
The purpose of this document is to share our experiences administering remote, headless Mac OS X machines. Even if you've administered other Unix machines, Mac OS X can present unique administrative challenges.
Our hardware consists of one Xserve (dual 1GHz G4s, 2GB RAM) in a remote data center, and a PowerMac (dual 1GHz, 1.5GB RAM) in a separate, more easily accessible hosting facility. Both machines are generously hosted by the Internet Software Consortium. The ISC has donated rack space, power, bandwidth, and excessive amounts of time to the project. Thanks for the reboots, guys! The Xserve is the primary host, providing more or less all of our services. It also has the most bandwidth available to it. The PowerMac is a secondary machine providing some backup services and shell access to project members.
For software selection, both machines are running OS X Client. There were a couple of reasons for choosing it over the OS X Server product. First, we just didn't need OS X Server's features. We're all Unix admins here, and we don't really need some fancy GUI to administer machines, right? GUI interfaces aren't the only feature that OS X Server offers over the Client installation, but they are a highly publicized one. The other features, such as OpenDirectory, the ability to host more user records, etc., were not needed. We're providing content here, and we only have two machines with a small number of users. We didn't need the added complexity of a sledgehammer-sized solution when a little tack hammer would suffice. There are other features of OS X Server that we do miss, such as its "watchdog". This feature uses the machine's Power Management Unit to automatically reboot the machine after a set period of time, should it panic. We thought, "Hey, the machine shouldn't go down all that often, right?" Boy, were we wrong. This is a sorely missed feature, although we were able to make up for it with other hardware (which cost less than the price difference between Client and Server, I might add).
The services the machines provide are a cvs repository with anoncvs pserver access and ssh for cvs commits, a web server running apache, a mail server running sendmail, SSL IMAP, DNS running BIND, and rsync access to the cvs repository and the provided downloads. The cvs repositories are hosted on a UFS volume, since the project needs case sensitivity for its source files. Overall, the Xserve hosts roughly 3GB of cvs data and about 9GB of files for http download, and averages about 5Mb/sec of outgoing traffic. It's a modest-sized project. The PowerMac's main role is to provide shell access and a development environment to the project's members. It has very modest bandwidth requirements, and beyond shell access it only provides secondary DNS.
Earlier, it was mentioned that OS X provided some unique administrative challenges. We're not going to talk about minor nits here, since every OS out there has its quirks on how to do certain things. This means no discussion of using NetInfo vs. flat files or anything of the kind. We're sticking to fundamental issues like simply being able to provide services.
First, the machine really wants to have a monitor connected to it at all times. For an Xserve in a data center, that's just not possible. To work around this, we found a company, Dr. Botts, which provides little SVGA pass-through devices that fool the machine into thinking it has a display connected to it. These are invaluable, since if the machine doesn't boot with a display, it's tough to connect one afterwards. In addition to these, a VNC server has been exceptionally useful. There is some configuration that really wants to be done from the GUI, and as was mentioned earlier, the machine is remote. Using VNC over ssh provides us with a reasonable level of security, much more than using something like Apple Remote Desktop, and we found VNC much more reliable as well. Be sure to disable auto-login, which is on by default; otherwise the screensaver will kick in and consume ~10-20% of the CPU. Also, disable screen blanking (there's no head connected anyway); otherwise, when the machine panics, you won't be able to get a backtrace for debugging.
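As a rough illustration of the VNC-over-ssh arrangement, the sketch below builds an ssh command that forwards a local port to the VNC server's port on the remote machine. The user name, host name, and port numbers are hypothetical; 5900 is only the conventional VNC port, and we are assuming the VNC server is reachable on the remote machine's loopback interface.

```python
import subprocess

def vnc_tunnel_command(user, host, local_port=5901, remote_port=5900):
    """Return an ssh argv that forwards local_port to the remote VNC port."""
    forward = f"{local_port}:localhost:{remote_port}"
    # -N: run no remote command, only forward ports.
    # -L: forward a local port to host:port as seen from the remote side.
    return ["ssh", "-N", "-L", forward, f"{user}@{host}"]

def open_tunnel(user, host):
    """Start the tunnel as a child process; the caller terminates it when done."""
    return subprocess.Popen(vnc_tunnel_command(user, host))
```

With the tunnel up, a VNC client pointed at localhost:5901 reaches the remote display through ssh, so the VNC port itself never has to be exposed to the network.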
Next, we found that most of the provided software we cared about needed to be rebuilt from source. The bundled cvs was a horribly outdated version with security holes that Apple has not issued updates for, and the bundled sendmail does not have SASL SMTP AUTH support. Apple has provided security patches for other software; however, the updates are at best 5 days behind the advisory. That's 5 days for the box to get rooted, so we have always built our own patched versions well before Apple has released updates. Beware of updates, though, since they will overwrite your updated binaries, and move aside or blow away your config files. We've developed scripts to run before and after applying updates to make sure our config files are still intact.
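Our actual scripts are not reproduced here, but the idea can be sketched roughly as follows: record a checksum of each file you care about before the update, record it again afterwards, and report anything that changed or disappeared. The function names and the watched paths in the usage note are hypothetical, chosen only for illustration.

```python
import hashlib
import os

def snapshot(paths):
    """Map each path to the SHA-256 of its contents, or None if it is missing."""
    result = {}
    for path in paths:
        if os.path.isfile(path):
            with open(path, "rb") as fh:
                result[path] = hashlib.sha256(fh.read()).hexdigest()
        else:
            result[path] = None
    return result

def changed_since(before, after):
    """Return the sorted paths whose checksum differs between two snapshots."""
    return sorted(p for p in before if before[p] != after.get(p))
```

A run would take `before = snapshot(watched)` with a list such as `["/etc/mail/sendmail.cf", "/etc/named.conf"]` (example paths only), apply the update, take a second snapshot, and then restore whatever `changed_since(before, after)` reports. The same approach works for rebuilt binaries, at the cost of hashing larger files.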
Once the machines were up and running, we ran into some serious reliability issues. Over the past 5 months, the two machines have each averaged an uptime of roughly 30 days. We've encountered a combination of panics and just plain wedging, but debugging is difficult since there are no other Macs on the local network to attach from, and debugging a "supported" OS is frequently the last thing on our minds when the machine is down and the project is inaccessible. So, roughly every 15 days we have to go reboot one of these machines. We're not using OS X Server with its little watchdog, so that means a trip to the data center each time. We're using some rather large UFS volumes for the CVS repositories, so each time a machine goes down, we have about 50 minutes of fsck time before the machine becomes fully functional again. This is pretty unreasonable, having (at a minimum) 2 hours of downtime every month, not including the trip to the data center to reboot the machine. To cut down on the trips to the data center (in admin time and gas, those trips get rather expensive), we've bought a remote power strip to be able to power cycle the machine remotely. These come with a little watchdog timer of their own, to automatically power cycle the machine if it stops responding to requests after a predetermined amount of time. However, this still hasn't solved the fsck problem. Our choices are to 1) not allow case collisions in filenames in the cvs repository, 2) accept the extra downtime as acceptable punishment for wanting case-sensitive file names like every other Unix, or 3) run a different OS. The last is being strongly considered, as the first two are not desirable.
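To get a feel for what the first option would involve, a check along these lines could list the names in a tree that differ only by letter case. This is a minimal sketch, not something we run in production: the function names are hypothetical, and `str.lower()` is only an approximation of how a case-insensitive filesystem compares names.

```python
import os
from collections import defaultdict

def colliding_names(names):
    """Group names that are identical once letter case is ignored."""
    groups = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)

def find_case_collisions(root):
    """Return (directory, colliding names) pairs for every directory under root."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for group in colliding_names(dirnames + filenames):
            found.append((dirpath, group))
    return found
```

For example, `colliding_names(["Makefile", "makefile", "README"])` reports the first two names as a collision. Running `find_case_collisions` over a checkout on the existing UFS volume would show how many files stand in the way of moving to a case-insensitive filesystem.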
When the machine isn't down or fsck'ing, we have noticed serious performance issues with our services. Since we are running the Client version of the OS, we did have to do a lot of manual performance tuning to get reasonable performance as a server platform. We set all the right sysctls, balanced the filesystems across multiple disks to get the best IO throughput, etc. However, cvs checkouts, cvsups, and rsyncs were all taking a really long time. It is widely known that UFS on OS X is slow, but this was ridiculous. Rsync takes on the order of 4-5 hours to build the list of files to transfer, and we only have roughly 270,000 files in the cvs repository. The same operation on a PII/400 Linux box with UDMA/33 ATA drives was taking on the order of 10-15 minutes. Cvs and cvsup were equally slow, even for remote home users. We did some investigation of the cvsup case, since many more users were using cvsup than rsync to access the repository. The vast majority (~81%) of cvsup's operations were stat(). This appeared to be a real-life manifestation of lmbench reporting obscenely slow stat() times on OS X. Most of the processes were doing small IOs, with stat() and getdirentries() being the two most common operations. Over 2 days of normal operations, the system averaged about 5-7KB per IO transaction, as reported by iostat. It appears that OS X really does not perform well with lots of little IO operations, and this kills us because that's all our services do. We have done some performance tuning of the kernel and some customizing of sources, but we're gaining small percentages, not orders of magnitude.
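The kind of workload described above, lots of directory reads and stat() calls with very little data moved, can be approximated with a small script like the one below. It is only a sketch for comparing machines, not the measurement method we used (our numbers came from the tools mentioned above), and the function name is hypothetical.

```python
import os
import time

def time_stat_calls(root):
    """Walk root, lstat every entry, and return (entries_statted, elapsed_seconds)."""
    count = 0
    start = time.perf_counter()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            count += 1
    return count, time.perf_counter() - start
```

The elapsed time includes the directory reads done by `os.walk` as well as the explicit `lstat` calls, which roughly matches the mix of operations the services generate. Running it over the same tree on two machines, with cold caches in both cases, gives a crude entries-per-second comparison.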
For backups, you really only have a few options:
We chose to cross our fingers and back up the running system. It's a horrible way to run a server, but the other three options just weren't practical. In practice, we've recreated the system on another machine from these backups and haven't noticed any serious problems. However, we've also not had to recover from a catastrophic failure yet, either.
So, aside from the poor uptime, the long fscks, the slow performance, updates nuking your changes, not having reliable backups, and a few other minor nits, we haven't had too many serious problems using OS X as a server. Our project is rather modest, and doesn't need to store or push too much data too quickly, so the performance issues are more of an annoyance, and are not currently limiting the project's growth. Our project's developers have been able to create scripts, hacks, or other solutions to the update problems and most of the other "minor nits" mentioned. The only real issues we have at the moment are the poor uptime and the inability to perform reliable backups.