Yesterday, the SSD startup drive in my OS X MacBook became extensively corrupted, such that the computer would no longer boot from it. The process of recovering and repairing the drive revealed a number of important lessons related to recovery preparedness.
Startup drive corruption.
Last night, as iTunes was downloading the iPhone iOS 4 update on my OS X MacBook, the disk utility Drive Genius (which continually monitors the health of my drives) began raising alerts related to my startup solid-state drive. Paraphrasing, the error messages said:
Drive Genius has discovered HFS errors on your startup drive. The unix utility fsck will be run on the next mount of this drive.
I immediately restarted the computer, in order to allow fsck to run, and was comforted to see its progress bar appear on the startup screen. But then, unfortunately, after proceeding just a bit, the progress bar would reset itself, and try again. After three failed repair attempts, fsck gave up, and the computer simply shut down.
Booting from my mirror backup.
My backup procedure involves, in addition to Time Machine, the nightly maintenance of a bootable external mirror drive, from which I can start up and run disk utilities precisely in situations like this.
As Murphy would have it, my mirror drive had been acting flaky over the past few days, and was actually two days out of sync. Although that in itself would be a bit of a problem should I have to recover data from that drive, my bigger worry was being unable to boot from it. (Should that happen, my lone remaining option would be to boot from a Snow Leopard disk, and rebuild my startup drive from the Time Machine — a very lengthy process.)
Fortunately, I was able to boot from the mirror drive, and I made an immediate note to replace it ASAP.
Subsequently, I spent probably 20 minutes acknowledging connection requests from Little Snitch, allowing the multitude of startup applications, menu items, and system preferences to phone home for various reasons (checking for updates, etc.). I was also reminded how excruciatingly slow a USB drive is compared to an SSD. I found myself really wishing I’d created a special account for such purposes, with no startup apps.
I then noticed Dropbox and Backblaze beginning to do some heavy processing — uh-oh. Given my backup drive was two days out of sync, these cloud-syncing utilities became confused at what they were seeing, as they compared my local filesystem to the sync’d versions in the cloud. This would eventually cause some small problems for me, as described later.
Repairing the startup drive.
The startup drive had invalid node problems so severe that neither Apple’s Disk Utility nor Drive Genius could repair them. In fact, Disk Utility went so far as to suggest immediate reformatting of the drive.
Fortunately, I also own a copy of Disk Warrior 4, and decided to give it a go. Its progress bar quickly progressed for a while, but then sadly appeared to stop — permanently frozen in position, for at least five minutes. I was actually about to quit the application, but then a Google search revealed that in case of severe disk problems, Disk Warrior can take a long time (up to 12 hours in some cases) to repair the drive. I also found several reports from people who were just about to give up at the frozen progress bar (like me), and then decided to continue waiting, and were rewarded for their patience.
So, I allowed Disk Warrior to carry on, and went off to watch a movie. A few hours later, I checked back, and sure enough, Disk Warrior had fixed the problems, and was waiting for me to confirm its rebuild of the startup drive’s directory.
Up and running again.
Once the drive was repaired, I rebooted from the SSD, quickly refreshed my mirror backup, created a second backup on another drive, and let Time Machine run. Confident my data was all safely backed up, I went to bed.
Outdated data problems.
This morning, as I began working with the computer, I started discovering various data problems:
- Things.app data was out of date.
- Yojimbo’s library was out of date.
These applications store their data in Dropbox, which synchronizes to the cloud. Obviously, when I booted from my mirror drive yesterday (which was two days out of date), Dropbox got confused and replaced current files in the cloud with out-of-date files.
I was able to recover from this situation, though, because Dropbox was smart enough to keep copies of the “conflicted” data. For example, just next to Things’s “Database.xml” file was “Database (Conflicted Copy from 2010-06-21).xml”. I switched to that file and was back on track in Things.app. (Same with Yojimbo.)
To identify similarly affected apps, I used File Buddy to search for all files with “conflicted” in their name. Turns out, though, only Things and Yojimbo had been affected.
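For anyone without File Buddy, the same kind of search can be roughly sketched in a few lines of Python. This is only an illustration, not what I actually ran: the function name find_conflicted is made up for the example, and the ~/Dropbox location is an assumption about where the synced folder lives.

```python
import os


def find_conflicted(root, needle="conflicted"):
    """Return sorted paths under root whose file name contains needle.

    The match is case-insensitive, so "Conflicted Copy" is found too.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if needle.lower() in name.lower():
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    # Assumed location of the Dropbox folder; adjust as needed.
    for path in find_conflicted(os.path.expanduser("~/Dropbox")):
        print(path)
```

Pointing it at the whole home folder instead of just Dropbox would also work, at the cost of a much slower scan.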
I learned several important lessons from this experience:
- Never procrastinate replacing a backup drive that’s beginning to fail. Remember that Murphy is watching us at all times. This time, I got lucky.
- In anticipation of having to boot from your mirror drive, it’s a good idea to maintain an account on your computer into which you’ll log in during recovery activity. Ideally, this account will have a minimum number of startup applications, and will have a minimum number of cloud-syncing applications (like Dropbox) running. (Pro tip: If you use Yojimbo to store your serial numbers, you’ll want to keep a fresh copy of its data in ~/Library/Application Support/Yojimbo in that account as well, so that you’ll have access to your apps’ registration numbers.)
- As commonly reported, Disk Warrior seems to be the disk repair utility to have around. And it’s important to remember that Disk Warrior can take a long time to fix a bad drive. I’ve heard that it’s very good about telling you when it decides to give up — so, as long as it’s still running, have faith.
- When your startup drive dies, it’s really nice to be able to boot and recover quickly from a mirror drive. Given how cheap disk space is, it’s probably worthwhile to maintain two bootable mirror drives, just in case one proves faulty at just the worst moment.