Recovering from a Boot Disk Failure with Solstice DiskSuite

Introduction

Recovering from a failed boot disk with Solstice DiskSuite is not difficult when the system was properly set up and documented from the start. A recovery of this kind is a good example of the power of DiskSuite, even in a simple two-disk mirror. For this demonstration, we will use the same setup presented in Mirroring Disks with Solstice DiskSuite.

Recovery

Booting From the Secondary Disk

The first step is to identify which piece of hardware failed. In this example it is the boot disk, /dev/dsk/c0t0d0. Once the failed disk has been identified, it is important to boot the system from the second half of the mirror before the failed device is replaced. This can be simple or complicated, depending on how the system was configured initially. If an alternate boot device was defined with nvalias, simply boot from that device:

ok boot altdisk

If the alternate boot alias was not created, it may be possible to boot from one of the built-in alternate device aliases provided by the OpenBoot PROM. These are numbered disk0 through disk6 and generally only apply to disks on the system's internal SCSI controller. If all else fails, use probe-scsi-all to determine the device path to the secondary disk, then create an alias for that path with nvalias and boot from it.

Deleting Stale Database Replicas

When the system comes up, it will complain about stale database replicas and will only allow you to boot into single-user mode. DiskSuite requires more than half of the state database replicas to be available before it will boot to multi-user mode. Since we have lost an entire disk holding half of the replicas, that requirement cannot be met. We must log in to the system in single-user mode and delete the database replicas that were on the failed disk. Run the metadb command without any arguments to list the replicas and see which have failed, then delete the stale replicas using metadb -d, as follows:

# metadb
# metadb -d /dev/dsk/c0t0d0s3
# metadb -d /dev/dsk/c0t0d0s5
# metadb -d /dev/dsk/c0t0d0s6
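
On a system with many replicas, it can help to generate the delete commands from the metadb listing rather than typing each slice by hand. The following is a minimal sketch and not part of the original procedure: the function names replicas_on_disk and delete_commands are made up for illustration, and the sketch assumes that each replica line of metadb output ends with the /dev/dsk path of its slice. It only prints the commands, so they can be reviewed before anything is deleted.

```sh
# Hypothetical helpers, not DiskSuite commands.

# Read `metadb` output on stdin and list the replica slices that live on
# the disk named in $1 (for example c0t0d0).
replicas_on_disk() {
    grep "/dev/dsk/${1}s[0-9]" | awk '{ print $NF }' | sort -u
}

# Print one `metadb -d` command per replica slice on that disk.
# Nothing is executed; review the output before running it.
delete_commands() {
    replicas_on_disk "$1" | while read slice; do
        echo "metadb -d $slice"
    done
}
```

With these functions defined, running metadb | delete_commands c0t0d0 on the example system should print the three metadb -d lines shown above.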

The system can now be shut down and rebooted with the new disk installed. Remember that you will still need to boot from the second half of the mirror.

Partitioning the Replacement Disk

The first task once the system is back up is to partition the replacement disk. This disk must be partitioned identically to its mirror. While there are several ways of doing this, the simplest by far is to use prtvtoc to print the volume table of contents (VTOC) of the good disk to a file, and then pass that file to the fmthard command to write the same table to the new disk. An example is below:

# prtvtoc /dev/rdsk/c0t1d0s2 > /tmp/format.out
# fmthard -s /tmp/format.out /dev/rdsk/c0t0d0s2
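
After copying the label, it is worth confirming that the two disks really have the same layout. The following is a minimal sketch of one way to check, assuming that prtvtoc marks its header and comment lines with a leading asterisk and that each partition row starts with six columns (partition, tag, flags, first sector, sector count, last sector). The helper name vtoc_rows is hypothetical. It strips everything else so that two listings can be compared directly.

```sh
# Hypothetical helper: reduce `prtvtoc` output (stdin) to its partition rows.
# Lines starting with '*' are comments (they include the device name, which
# differs between the two disks). Only the first six columns are kept, so an
# optional mount directory column does not affect the comparison.
vtoc_rows() {
    grep -v '^\*' | awk 'NF > 0 { print $1, $2, $3, $4, $5, $6 }'
}
```

One way to use it is to save prtvtoc /dev/rdsk/c0t1d0s2 | vtoc_rows and prtvtoc /dev/rdsk/c0t0d0s2 | vtoc_rows to two files and run diff on them; no output from diff means the partition tables match.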

Recreate Database Replicas

Once the new disk has been partitioned, the state database replicas deleted earlier must be recreated, in exactly the same way they were created originally:

# metadb -a /dev/dsk/c0t0d0s3
# metadb -a /dev/dsk/c0t0d0s5
# metadb -a /dev/dsk/c0t0d0s6
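
Before moving on, it is worth checking that the replicas are healthy again. As a rough sketch, the hypothetical helper below reads metadb output and counts the replicas and how many of them carry an error flag. It assumes the usual metadb listing, in which status flags are lowercase letters, error flags are uppercase letters, and each replica line ends with two block fields and the /dev/dsk path. It also reports whether the healthy replicas form a majority (more than half of the total), which is what DiskSuite needs in order to boot to multi-user mode.

```sh
# Hypothetical helper: summarize `metadb` output read from stdin.
# Prints: <total replicas> <replicas with an error flag> <majority: yes|no>
replica_summary() {
    LC_ALL=C awk '
        $NF ~ /^\/dev\/dsk\// {
            total++
            # Everything before the last three fields is the flags column.
            flags = ""
            for (i = 1; i <= NF - 3; i++) flags = flags $i
            # Uppercase letters (such as W, M, D, F, S, R) mark an error.
            if (flags ~ /[A-Z]/) bad++
        }
        END {
            good = total - bad
            if (good * 2 > total) ok = "yes"; else ok = "no"
            print total + 0, bad + 0, ok
        }'
}
```

On the example system, metadb | replica_summary should print 6 0 yes once all six replicas are back. A nonzero second number, or no in the last column, means the replicas need attention before the next reboot.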

Detach and Clear Failed Submirrors

The failed submirrors must now be detached from the mirror. Detaching the failed submirrors stops any kind of read or write operations to that half of the mirror when activity occurs on the metadevice. Since the submirrors to be detached are reported in an error state, the detach must be forced, as follows:

# metadetach -f d0 d10
# metadetach -f d1 d11
# metadetach -f d4 d14
# metadetach -f d7 d17

Once the failed submirrors have been detached they need to be cleared. This will allow us to later reinitialize the submirrors and reattach them to the metadevice.

# metaclear d10
# metaclear d11
# metaclear d14
# metaclear d17

Initialize New Submirrors

Now that the failed submirrors have been detached and the metadevices cleared, it is possible to recreate the submirrors and reattach them to the metadevices, causing an immediate resynchronization of the submirrors. Recreating and reattaching the submirrors uses exactly the same steps as creating them in the first place. In our example, we will do the following:

# metainit d10 1 1 c0t0d0s0
# metainit d11 1 1 c0t0d0s1
# metainit d14 1 1 c0t0d0s4
# metainit d17 1 1 c0t0d0s7

# metattach d0 d10
# metattach d1 d11
# metattach d4 d14
# metattach d7 d17
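
The detach, clear, initialize, and attach commands follow one pattern per slice. As a sketch, assuming the naming used in this example (mirror dN, failed submirror d1N, slice sN on the replaced disk), the hypothetical function below prints the full sequence for a list of slice numbers. It prints commands only and does not run them, so the output can be reviewed, kept with the system documentation, or adapted to a different layout before a recovery.

```sh
# Hypothetical helper: print (do not run) the rebuild commands for the
# failed half of each mirror.
# Assumes this article's naming: mirror dN, failed submirror d1N, slice sN.
# Usage: rebuild_commands DISK SLICE...
rebuild_commands() {
    disk=$1
    shift
    # Force-detach the errored submirrors, then clear them.
    for n in "$@"; do echo "metadetach -f d$n d1$n"; done
    for n in "$@"; do echo "metaclear d1$n"; done
    # Recreate each submirror as a one-way concat on the new disk,
    # then attach it so the resync starts.
    for n in "$@"; do echo "metainit d1$n 1 1 ${disk}s$n"; done
    for n in "$@"; do echo "metattach d$n d1$n"; done
}
```

For this example, rebuild_commands c0t0d0 0 1 4 7 prints the sixteen commands used in this and the previous section, in the same order.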

The status of the resynchronization process can be monitored using the metastat command. Resynchronizing the submirrors can take a long time, depending on the size of the partitions. Once resynchronization is complete, the system can be rebooted. Ensure that the system now boots properly from its primary boot disk and that everything is operating normally.

# metastat | more
# init 6
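
Rather than rerunning metastat by hand, the wait for the resync can be scripted. This is only a sketch: the helper name resync_pending is hypothetical, and it assumes that metastat reports an active resync with text containing the word Resync (for example a Resyncing state or a Resync in progress line). Check the output of your own release and adjust the pattern if it differs.

```sh
# Hypothetical helper: exit 0 while `metastat` output on stdin still shows
# a resync, and nonzero once no resync is reported.
# The matched text is an assumption about the metastat report format.
resync_pending() {
    grep 'Resync' >/dev/null
}
```

A loop such as while metastat | resync_pending; do sleep 60; done then returns only once no resync is reported, at which point the reboot can go ahead.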

Conclusion

DiskSuite provides robust tools for mirroring disks, including the critical boot disk, which usually includes the / (root), /usr, and swap partitions necessary for the proper operation of the system. Working with metadevices can be a daunting task, especially if you do not work with DiskSuite on a daily basis. Many administrators set up mirroring on a server, do not touch it once it is running, and then have to deal with a recovery many months down the road. Worse still is having to recover a system that you did not configure yourself. The key to a successful recovery is having proper system documentation and outlining an exact procedure on paper before beginning the recovery process. The example presented above should give most administrators a framework that can be quickly adapted to their own situation.