10.4.1 kernel panics, RAID and data integrity

A few days ago I had my first kernel panic since I bought this machine. I’m fairly obsessive about data integrity because the work I do using this machine is my sole source of income. That means using journalling on the filesystem (from the very first day it was available), storing all of my source in a Subversion repository, burning daily backups of my entire home directory to DVD, running a mirrored RAID configuration, having a UPS in between the computer and the mains outlet, and running Disk First Aid whenever I have an unclean shutdown (euphemism for "hard reset", "irrecoverable freeze", "total system crash" or "kernel panic").

Needless to say I wasn’t too happy when I saw the first kernel panic. It was the first of many. Yesterday I had two more panics, the day before that there was another, and in the wee small hours of this morning I had yet another. All of these panics have occurred since upgrading to Mac OS X 10.4.1. I haven’t yet established a clear pattern but all of them have occurred when using the Finder (for example, double-clicking to launch a file or right-clicking and inspecting the contextual menu).

If you add these kernel panics to the total number of hard resets I’ve been obliged to perform on this machine you get a MTTF (Mean Time To Failure) of about 10 days. (How do I know? Because I keep a log of involuntary reboots… like I said, I’m fairly obsessive.) This is a disappointingly low figure, especially when you take into account that I look after this machine like a baby; it’s a very clean install with the absolute minimum of third-party tools installed. The recent average has taken me back to the "good old days" of Mac OS 7, 8 and 9 where having your machine come crashing down like a meteorite taking all of your unsaved changes in any open application with it was a completely normal (and unstoppable) occurrence.

The troubleshooting saga, phase 1

I’ve barely slept a wink since last night as I’ve tried to troubleshoot and solve this problem. It comes at an inconvenient time because I am trying to prepare a new public preview release of Synergy Advance for release at the end of this month. I have backups so I haven’t lost any work, but right now the machine is unusable (more on this below) and so work is at a standstill.

After kernel panic number four last night I booted off the Tiger install DVD and ran Disk First Aid (DFA) on the RAID. It reported two alarming errors (both "No nombre", "No name" in English), attempted to repair them, automatically re-verified after the repair (something I’ve never seen before) and even though the verification seemed to work informed me that the disk could not be repaired. I ran DFA again and it said the disk was fine.

Upon reboot I got kernel panic number five as soon as I started using the Finder and decided to boot off the DVD again. No DFA errors, but in the RAID tab of the Disk Utility it said the RAID group was "corrupto" (corrupt). I use the word "group" here because that’s what it said in Spanish ("grupo"); I don’t know if in the English version of Disk Utility it employs the word "array". This is somewhat of a chicken and egg syndrome because I cannot know if the kernel panics led to the RAID corruption or the RAID corruption led to the kernel panics; but either way it’s obvious that there’s probably a nasty feedback loop at play wherein every kernel panic gets the RAID into a less consistent state which in turn increases the likelihood of future panics.

In the RAID details tab it said the group was "corrupto" and the first of the two disks in the array was marked as "pausa" (paused, whatever that means). The second disk had no special markings. All three items (the group as a whole and the two disks in the array) were marked with a padlock.

The RAID tab has a little button labelled "Reconstruir" (Rebuild) which supposedly is for use in this type of situation. It didn’t work. After a few seconds the Disk Utility changed the word "pausa" to "error". I tried removing the faulty disk from the array. The "Reconstruir" button was then removed from the user interface.

I tried formatting the faulty disk; it checked out fine. I tried using the "Restore" function to copy the contents from the working disk to the newly formatted disk; fine again, although it did take about two hours. I think the disk is most likely fine at the hardware level, but that it got corrupted at the filesystem level. Despite this, however, the RAID group itself continued to be marked as "corrupto". The Apple docs say that if you have a disk failure you can replace the disk, select one of the other disks in the array, and drag the new disk (on the left) into the list of disks in the RAID (on the right). Unfortunately, the Disk Utility wouldn’t let me perform that particular drag-and-drop; I suspect the padlock indicates that the RAID group is locked and I am not allowed to make any modifications to it. Along the way I tried exiting and relaunching the Disk Utility as well as rebooting from the DVD again but it made no difference.

At this point I was wondering what good the RAID was if there was no way of recovering from a failure of one of the disks.

So what do I do now? There’s a not-very-helpful help button in the Disk Utility user interface which has no effect when you click on it. If the existing RAID group was corrupt but one of the volumes in it was ok, I wanted to know if I could create a new RAID group and add the volume to it without any data loss. The Apple docs seem to indicate that deleting the group and then re-creating it is not an option because they imply that deleting the group deletes all the data on the volumes in the group. They say nothing about what happens when you add a volume to a new (empty) group, but I suspect the answer is that you lose the data on the disk.

So I’ve rebooted off the drive one last time, burnt one last backup of my home folder to DVD, and am in the middle of a 12-DVD backup of the contents of the rest of the drive using Impression. Impression is currently taking between 30 and 60 minutes per DVD. Four down, eight to go. Unfortunately, while this is going on I can’t continue with my work because any changes I make won’t be reflected on the DVDs.

My plan is to finish the back-up and then brave the undocumented waters of Disk Utility and see what happens when I try to add the known good volume to an newly created RAID group. If I lose all my data then I’ll be doing a low-level format and reinstalling absolutely everything from scratch from my backups. All up I estimate this little disaster will cost me well over 24 hours, 24 hours in which I’ll be entirely unproductive. I suppose a clean install will be good to blow away the cobwebs, so to speak, but it will also mean that I have to reinstall all of my software, hunt for install disks and license records, and painfully migrate my personal and work data and settings from the DVD backups back onto the drive. Could end up costing as much as 48 hours, and I don’t mean 48 calendar hours but 48 working hours…

This whole experience has left me questioning the value of RAID as implemented in Mac OS X. I purchased two disk drives and effectively traded half their total capacity for data security, redundancy and recoverability. Looks like I got the redundancy but not the recoverability. I may as well just have had two normal disks and done daily security backups from one to the other. I haven’t given up on Mac OS X RAID just yet, but I am disappointed with it. I just hope that this is doesn’t happen again in the future.

Update

Well, things didn’t work out like I hoped and I ended up spending a lot more time on recovery than I expected. I made notes as I went along and I include them below because perhaps some day they will help someone who finds themselves in a similar situation.

In a nutshell the most valuable lesson learned is the following: If your RAID and/or drives get into such a bad state that Apple’s own tools refuse to touch them then you may end up having to:

  1. Boot into single-user mode (but only if you’ve got a wired keyboard, not a wireless one).
  2. If necessary unplug the disks so that you are only working with one unit at a time.
  3. Use the "kextunload" command to get rid of the AppleRAID kernel extension.
  4. Use the "pdisk" command line utility to blast the partition map.
  5. Zero all disks (do not just delete them).
  6. Be prepared to wait a long time.
  7. Make sure you know your special start-up keys.
Notes

Well, with the backup finally finished (and earlier than expected because Impression is incapable of splitting large files over more than two DVDs; I had to handle the largest file by hand) it was time to try and get this RAID up and running again.

Another boot from the Tiger install DVD and more fruitless clicking in the Disk Utility UI. I finally tried the diskutil command line tool (see "man diskutil" for more info). According to the man page the following command will repair a mirror, assuming that the good disk is /dev/disk1s3 and the disk which is to be inserted into the array to replace the bad disk is /dev/disk0s3:

diskutil repairMirror /dev/disk1s3 /dev/disk0s3

This yielded the following messages on the screen:

Note:  Syncing data between mirror partitions can take a very long time.
Note:  The mirror should now be repairing itself  [sic] You can check it's [sic] status using 'diskutil checkRAID' [sic]

Running "diskutil checkRaid" reported that the RAID status was "degraded" and that the disk corresponding to /dev/disk0s3 was at "0% (rebuilding)". About 90 minutes later it hit 100% and the raid status changed from "degraded" to "online". I fired up Disk Utility and confirmed that it too showed the RAID as being online, and ran DFA again on the RAID and it reported no errors.

Time to reboot. 20 minutes later the Mac was still at the grey startup screen with the Apple logo and the fans were roaring like hair dryers. Looks like the RAID wasn’t doing so well after all. Time for another hard reset… I tried to reboot one more time and the result was the same. After stalling at the grey startup screen for some time (about eight minutes) the fans started to kick in and soon were roaring at full speed with no sign of letting up. I suppose I could have waited around for them to stop but chances are the machine was hung forever. Another hard reset (and another when I tried to reboot from the install DVD but evidently did keep the "C" key pressed for long enough, leading to another stall). Looked like I may have to do a clean install after all.

On rebooting from the install DVD I saw that Disk Utility now reported both volumes in the RAID as "sin conexión" (disconnected). I formatted both disks, called them "Good" and "Evil". Evil refused to format (claiming not enough free space) but I was able to create a partition that occupied the entire disk (thus wiping everything on it). I found that I could then format.

I then created a new RAID and set "RAID mirror AutoRebuild" as specified in the docs, something I didn’t do last time and I wonder if that has anything to do with the problems I’ve had (I also wonder why this box is not selected by default when you choose to create a mirror). It took a long time for Disk Utility hung at the "Bringing RAID partitions online" step. After nearly an hour of indeterminate progress bar and no disk noise at all I decided to abort and try again.

On re-launching the Disk Utility I saw three physical drives in the list when there are in fact only two in reality (System Profiler shows only two). A bad sign. Time for a reboot. It was beginning to look more and more like Apple does not want Wincent Colaiuta to ever use RAID again.

This time I not only wanted to format both disks but I was going to zero them as well, christening them "Upper" and "Lower". Turns out though that I got another endless barber pole and no disk access noises at all so once again decided to abort. The "Omitir" ("Omit") button didn’t work when clicked. Disk Utility started beachballing as soon as I tried clicking on the menu so the only remedy was to try to Force Quit (which naturally didn’t work) and then hard reset (the fourth for the day and the fifth involuntary reboot for the day if you also count the kernel panic).

I didn’t know whether the problem was limited to "Upper" only so I decided to try zeroing "Lower". I got a message: "Error en la escritura de ceros en el disco" ("Error writing zeros to disk"), "La escritura de ceros en el disco ha fallado debido al error: No se ha podido abrir el disco" ("The writing of zeros to the disk has failed due to the error: unable to open disk"). I then tried to create a blank partition on the disk and got "Bad file descriptor" (in English this time). I got the same error again when I tried to format the disk without zeroing. Running DFA on this disk was not an option because the "Repair disc" buttons were ghosted for both the physical drive and the volume on the disc itself.

I then tried formatting my old buddy "Upper"/"Evil" again but only got the barber pole and no disk activity. My old buddy, beachball, soon joined the party. Force Quit didn’t work. Hard reset number five for the day. I decided to forget about using Disk Utility for now and after rebooting from the install DVD again went straight to the command line.

I decided to get familiar with diskutil. I tried "diskutil list", "diskutil destroyRAID /dev/disk2" ("does not appear to be a RAID master"), "diskutil eraseDisk HFS+ Untitled1 disk0" (stalled at "Creating Partition Map"), "diskutil eraseDisk JournaledHFS+ Untitled1 disk0" (also stalled at "Creating Partition Map"), "diskutil eraseDisk JournaledHFS+ Untitled1 disk1" ("Partitioning encountered error Bad file descriptor (9) on disk disk1"), "diskutil partitionDisk disk0 1 JournaledHFS+ Untitled1 10GB" (stalled again), "diskutil partitionDisk disk1 1 JournaledHFS+ Untitled2 10GB", "diskutil repairVolume disk0" ("Error (-9958) encounterd attempting to verify/repair disk0", "diskutil repairVolume disk1" (same error), "diskutil repairVolume disk2" ("Can’t open /dev/rdisk2: Resource busy").

Next avenue: fsck. "fsck -yf /dev/disk0" (stalls forever, no output), "fsck -yf /dev/disk1" (reports "(NO WRITE) BAD SUPER BLOCK: MAGIC NUMBER WRONG. LOOK FOR ALTERNATE SUPERBLOCKS? no"), "fsck -yf /dev/disk2" ("Resource busy").

With "fsck_hfs -yf" I got almost identical error messages and hangs. Time for an attempted reboot, which yielded stall at spinning progress indicator, hard reset number six.

Moving to an even lower-level tool I next fired up "pdisk" (see "man pdisk"). Trying to edit disk 0 warned me, "The map is not writable". Trying with disk 3 hung forever after emitting a "Resource Busy" error message. Trying to reinitialize the partition map returned, "can’t open for writing, resource busy".

Shutdown (hard reset 7) trying to boot in single user mode… funnily, holding Command-S allowed me to boot into single user mode but I then couldn’t type anything. Why did the Bluetooth keyboard work at power on but not after getting to the command prompt?

"You can use a wireless keyboard to start up your Mac in Single-User mode or in Safe Boot mode. Once the computer is started up in these modes, however, the Bluetooth software controller is de-installed and will no longer work. To continue in Single-User mode or in Safe Boot mode you will need to use a wired keyboard and mouse."

I’ve heard the Apple explanation but I don’t buy it. One thing I did end up buying, however, was an Apple corded (USB) keyboard. Hard reset number 8. Having to buy the keyboard just to use single-user mode was annoying because Apple charges full price when you buy a wireless keyboard and mouse from them, yet you don’t get the wired keyboard and mouse that’s included in the cost of the machine; they keep it. (In other words, they remove your corded keyboard and mouse from the box the machine comes in, keep it without offering you a reduction in price, and then they insert the wireless keyboard and mouse in the box, charging you full retail price for it.)

29€ later and I have a new Apple USB keyboard in front of me. Single user mode take two. Typing is a little difficult as the US keyboard map is used by default in single-user mode and this is a Spanish keyboard. Things like "/", "-" and "_" aren’t in the same place. I see a message like this:

AppleRAID::completeRAIDRequest - error 0xe00002ca detected for set "Toledo" (1B18C9E1 … 8046C2534682), Member 79681D5C … 8861ADB48C50, set byte offset = 249925070848

Much to my surprise when I run pdisk again and try to initialize the partition map for disk0 I get the "Resource busy" message again. I try to edit disk1 and the program hangs, I do a Control-z and kill -9 of the pdisk pid but it has no effect. "shutdown -h now" hangs forever. In subsequent tries I must resort to using "halt -nq" or "reboot -nq", which are barely better than yanking the power cord. Hard reset number 9.

I reboot into single user mode again, trying pdisk on disk2 and getting the "Resource Busy" message. I get a hang on trying with disk0. Time for a quick "reboot -nq".

Another reboot, another hard reset (number 10). I now tried disconnecting the drives one at a time and seeing what I could achieve in single-user mode. When I tried disconnecting drive that’s in the upper bay I didn’t see the completeRAIDRequest warning but after a few seconds at the prompt i see: "AppleRAID::degradeSet - starting the set "Toledo" (1B18…)". I tried "diskutil destroyRAID /dev/disk0" (hangs). In fact, no diskutil command works on any disk, all hang, so I issued a "halt -nq".

Plugged upper back in, unplugged lower, rebooted into single user mode. No "completeRAIDRequest" warning shown during boot, but after getting to the prompt saw the "degradeSet" warning and then the "completeRAIDRequest" warning. "diskutil info disk0" hangs. "fsck_hfs -fy /dev/disk0" returns "AppleRAID: terminating set "Toledo" (1B18…)" and "Volume check failed". After a while, saw the degradeSet and the completeRAIDRequest warnings again.

Finally I tried "kextunload -c AppleRAID" and it responded that "terminate instances for class AppleRAID succeeded". At that point pdisk let me overwrite the partition table! Unplug, plug in the other, kextunload and overwrite partition table for that disk. Shutdown, plug in again, reboot from DVD.

Things are starting to look good. I’ve finally managed to drag Apple’s tools, kicking and screaming, to the point where they’ll actually let me erase my disks. But the good times won’t last for long. I try to create RAID and it tells me "La creación del grupo RAID ha fallado debido al error: No se ha podido crear una nueva fracción" ("The creation of the RAID group failed due to error: Unable to create new fragment"). I try to format the disk and write zeros to it but get "No se ha podido abrir el disco" ("Unable to open disk"). I try to zero the second disk and it seems to work (although it takes a long time).

I was still getting assorted errors on first disk ("Bad file descriptor" and the like) so I went back to the command line, "diskutil eraseDisk" (works), "diskutil zeroDisk" (works). I was finally able to format both disks, create a new RAID, and DFA reports they are both fine, rebooted and DFA still reports that they are fine.

But Apple had one more surprise in store for me. I rebooted from DVD and learned that the copy of Tiger that I received in return for my annual $500 membership fee is not a full install disk but is actually an "upgrade" DVD. So Apple wants me to install Panther first. But how to eject the DVD? It won’t let me eject the DVD because it is the startup disk, and it won’t let me shut down the computer either; for some reason all the menus offer me is "Restart".

I couldn’t remember the key to hold pressed to get a startup menu and neither could I look it up on the net using my unbootable machine so I rebooted as "NetBoot" client, thinking that when it failed to boot it would hopefully show me a menu or at least let me eject the DVD. I held the eject button, the disk popped out and while I was looking for another boot disk the tray shut.

This was Apple’s one last joke at my expense. Now I was at a grey screen with a slash symbol on it, the fans started to roar like fucking crazy. The eject key does not work. I will have to hard reset. Again. Number 11 and the first for my brand spanking new RAID.

At this point Command-Period does not open the tray and the fans are getting louder and louder. I really should have started up by holding the Option key. What do I do now? Need I fear corruption of the new RAID due to the fact that I have yet another hard reset? Do I take a chance or do I reformat all over again? Zero all over again? Slash my wrists? Apple provides no way to open the tray that I know of and no way to get the system to restart other than hard resetting.

I don’t want to take any chances. I don’t want this to happen again in the future and I’ll forever wonder if it was because I allowed the corruption to take place at the very beginning. Hard reset. Startup with option key. Documented feature (Command-Period) to open tray does not work. Miraculously, the eject button does work. Boot from DVD. Terminal.

diskutil destroyRAID disk2
diskutil zeroDisk disk0
diskutil zeroDisk disk1
diskutil partitionDisk disk0 1 JournaledHFS+ Virgen0 10GB
diskutil partitionDisk disk1 1 JournaledHFS+ Virgen1 10GB

Run DFA on both, create RAID "Toledo", run DFA on RAID, reboot, run DFA on RAID, install Tiger from the retail CD instead of the ADC one. Reinstall all apps. Restore files from backups. This whole thing has now cost me about 48 hours. I really hope it doesn’t happen again.

More fun

Monday 30 May 2005, 2:10 AM: The fun isn’t over yet.

Well, looks like all my efforts were for naught.

After spending several hours restoring from backups, reconfiguring and reinstalling I have tried to reboot the computer because some of the installed software required it and as I write this it is at the blue screen with the spinning progress indicator (it’s now been at that stage for about 30 minutes). So it looks like I have a hardware problem after all, although I don’t know which disk is at fault. I am now into my 72nd hour of this ordeal.

I did do a small amount of coding today which I could have lost, and I could also lose much of the email that I received while I was trying to restore the computer today. Evidently that’s something that I’d like to avoid if possible.

The UI provides no feedback at this point but my guess is that the RAID is corrupt and the system is trying to repair it or at least operate in a degraded state. I will leave the computer running – thankfully for once the fan is not roaring like a blow heater – all night if need be. If it doesn’t recover eventually I will have to do another hard reset and try booting with one of the drives unplugged. [Later on I realized it wasn’t a RAID problem but a software problem, possibly caused by Virtual PC.]

If I can boot successfully with one drive but not the other I will know which one is faulty and remove it from the machine permanently. I hope that I can recover the files that were modified today. [I even went so far as to start looking up hard drive prices, but now it once again looks like a software issue.]

After waiting three hours at the spinning progress indicator with no audible disk access I decided to hard reset.

Power on with option key held down. Command-Period, the Eject button and left mouse all fail to open tray. Tried plugging in corded keyboard, same result. Hard reset #2.

Power on with option key held down. Both disks in RAID set appear in the boot menu (as "Toledo 1" and "Toledo 2"). This time seeing as the corded keyboard was plugged in the whole time the Eject button did work (question: why does Apple allow the Bluetooth keyboard’s option key to work at power on but not the eject key?)

Boot from Tiger install DVD. Run Disk Utility. RAID array is marked as being "Online" and no warnings shown. Run DFA on RAID (buttons are ghosted for the individual drives so can only run it on the RAID as a whole): DFA reports no errors. Also ran it on the virtual RAID volume: no errors reported.

Fire up Terminal. Look at "/Volumes/Toledo/private/var/log/system.log". Some interesting entries from the the just before the failed boot attempt:

21:31:37 kernel[0]: jnl: flushing fs disk bugger returned 0xd
21:31:37 kernel[0]: disk7s2: device is write locked.
21:31:37 kernel[0]: jnl: do_jnl_io: strategy err 0xd
21:31:37 kernel[0]: jnl: write_journal_header: error writing journal header!

I hope this isn’t indicative of grave problems with the disk.

21:41:04 shutdown: reboot by wincent:
21:41:04 SystemStarter[19381]: "/Library/StartupItems" failed security check: permissions
21:41:04 SystemStarter[19381]: "/Library/StartupItems" failed security check: not owned by GID 0
21:41:05 SystemStarter[19381]: authentication service (19389) did not complete successfully
21:41:07 SystemStarter[19381]: The following StartupItems failed to properly start:
21:41:07 SystemStarter[19381]: /System/Library/StartupItems/AuthServer

The "start" above is spurious because it was actually shutting down at the time I think…

Open "Startup Disk". Use "Shutdown" menu item (so there is a way to shutdown rather than restart using the UI). Power on, hold Shift key after startup chime to force booting in "safe mode". Boots. Warnings about permissions on Virtual PC items installed in /Library/StartupItems/ (thanks Microsoft).

After correcting permissions attempt reboot (not in safe mode). Boots. Gets to Finder. Autolaunched apps start to bounce in Dock. I try to drag a Terminal window. Kernel panic #1.

Boot from Tiger DVD. Disk Utility. RAID shows as "online". DFA on RAID and on virtual volume. No errors reported.

Reboot with Shift key held (safe mode).

What is causing the problem? Virtual PC? USB Overdrive? AlchemyTV DVR? (all seemed to work fine on 10.4)

Upgrade Virtual PC from 7.0 to 7.0.1. Uninstall USB Overdrive. Reboot. More warnings produced by Microsoft Virtual PC about inappropriate permissions (repair). Remove items added automatically to login items by installers without permission.

Reboot. Fingers crossed that there will be no more kernel panics or freezes. I think for the next few days I’ll be rebooting off the install DVD at least once per day and checking that the disks are error-free.

3:33 AM. The fun never ends. Another kernel panic on clicking on the Interface Builder icon in the Dock (attempting to launch it). On reboot a second kernel panic occurred (and the third for the evening) before I could even start to use the Finder. Ironically all this occurred while I was midway through a backup of my home directory. This truly stopped being funny a while ago; I’m doing my very best to repair this problem but it keeps striking the machine down along with my efforts to not only work but protect that work via backups. Not even OS 9 was ever this bad.

Rebooted from the Tiger install DVD. Trying to use DFA to verify the RAID yields this in the log:

"Nombre no válido" (in red, twice)
"Error: La tarea subyacente ha registrado un fallo a la salida" (red)
"El volumen requiere reparación"

And a sheet drops down which says:

Error de Primera Ayuda Utilidad de Discos has detenido verificando "Toledo" porque se ha producido el siguiente error: La tarea subyacente ha registrado un fallo a la salida

Tried again, this time a repair (not a verify). Same "Nombre no válido" errors, repairs, automatically verifies afterwards, then says, "No se ha podido reparar un volumen".

Try repair again. Said, "No se requerían reparaciones". RAID continues to be marked as "Online".

Just to be safe run repair on virtual volume as well. No errors reported.

Reboot in safe mode (Shift key held down). Works.

Reboot in normal mode. Works. Spotlight informs me that it will be unavailable for "approximately three hours" because it claims that it needs to reindex my drives.

This really does take me back to the OS 9 days in more ways than one. We’re not only talking about constant crashes but perhaps even more importantly, living in constant fear of crashes; so this means that you don’t run multiple apps at a time for fear of conflicts and crashes, you don’t leave things running unless they are absolutely necessary, and you hesitate to even touch your data for fear that the machine will crash will you are trying to work on it.

Well, it’s now 4 AM and I am rather tired of all this. Start back up (again) and hope it actually completes before the next kernel panic. Bed time.