LCOD – 7.7.10 – Rebuilding and checking a Linux software RAID array


Oftentimes you need to recover a disk manually, as the automatic disk check (fsck) is not willing to risk data deletion. I have yet to find an instance where answering no to its many requests to fix things has helped in getting data back, or in getting a server back up and running.

With that in mind, I pretty much always run fsck -y on a filesystem when it’s unwilling to mount due to an unclean unmount (hard reboot, power failure, etc.). I have always done this with ext3 or reiserfs journaling file systems, and I’ve yet to notice any data loss, although the risk always exists.

As with any tip online, you proceed at your own risk.

So the standard procedure to recover a disk that won’t fsck on its own is simple. During the boot process, the system detects that the drive was unmounted uncleanly and starts running fsck on it. (This is the default: every ext2 or ext3 drive is marked unclean when it is mounted, and only marked clean when it is unmounted, the theory being that if it’s not unmounted then there is potentially lost or corrupt data.) It will usually give you a progress indicator, and if fsck finds any problems it drops you to a shell, where you can manually repair the disk with fsck -y /dev/XXX.
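
The clean/unclean flag described above can be read straight from the superblock. Here is a minimal sketch, assuming the e2fsprogs tool tune2fs is available; fs_state is a hypothetical helper name, and /dev/XXX remains a placeholder. On a journaled filesystem the state field alone may not tell the whole story, so treat the result as a hint rather than a verdict.

```shell
# Hypothetical helper: print the value of the "Filesystem state" field
# from `tune2fs -l` output supplied on standard input.
fs_state() {
    sed -n 's/^Filesystem state:[[:space:]]*//p'
}

# Usage on a real system (as root; /dev/XXX is a placeholder):
#   tune2fs -l /dev/XXX | fs_state
```

The helper only filters text, so it can be tried on saved tune2fs output without touching a disk.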

However, it seems that if the drive is a Linux software RAID partition, the system often simply reboots instead. Trying to boot into single user mode, or with the drive in read-only mode, fails as well, and the system either loops a failing fsck, or reboots, or both, ad nauseam.

To the rescue comes any Linux live CD. Lately I prefer an Ubuntu install CD in live mode, but before that I really liked Knoppix. Boot it up, get to a command line, and follow these steps.

Overview:

  1. Boot up and get a root command line (sudo su / su / single user mode)
  2. Scan the RAID devices and build mdadm.conf file
  3. Assemble each RAID device
  4. fsck -y each RAID device
  5. Reboot into normal / non-live CD mode
  6. Success!

Most newer Linux distributions store mdadm.conf in /etc/mdadm/mdadm.conf, some have it in /etc/mdadm.conf, and some simply won’t have one.

Use ls to find which one your live CD is using, and make sure to redirect the output there. The commands below assume it’s /etc/mdadm/mdadm.conf.
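
Checking the two locations can also be scripted. This is a rough sketch and nothing more: find_mdadm_conf is a hypothetical helper name, and the optional root prefix argument is an assumption added so the function can be tried against a scratch directory.

```shell
# Hypothetical helper: print the first mdadm.conf location that exists.
# $1 is an optional root prefix (an assumption, handy for testing);
# it defaults to empty, meaning the real filesystem.
find_mdadm_conf() {
    root="${1:-}"
    for candidate in /etc/mdadm/mdadm.conf /etc/mdadm.conf; do
        if [ -f "${root}${candidate}" ]; then
            printf '%s\n' "${candidate}"
            return 0
        fi
    done
    # Neither file exists: print nothing and signal failure.
    return 1
}
```

If the function returns non-zero, the live CD has no mdadm.conf yet, and you can pick either path for the redirect.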

Once booted, at the command prompt, as root, type the following (assuming you’re using either RAID-0 or RAID-1, and you have three RAID partitions, md0 through md2):

modprobe md
modprobe raid1
modprobe raid0
mdadm --examine --scan >> /etc/mdadm/mdadm.conf
mdadm --assemble /dev/md2
mdadm --assemble /dev/md1
mdadm --assemble /dev/md0
cat /proc/mdstat
# verify that every RAID array has all of its disks (should show UU)
# now check each one
fsck -y /dev/md0
fsck -y /dev/md1
fsck -y /dev/md2
# now reboot and cross your fingers that it all comes up good
reboot
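
The visual check of /proc/mdstat in the commands above can be sketched as a small filter. This is only an illustration: degraded_arrays is a hypothetical helper name, and it assumes the usual mdstat layout where an array line such as "md0 : active raid1 ..." is followed by a line carrying a status field like [UU] or [U_]. Arrays whose lines carry no such field are simply ignored.

```shell
# Hypothetical helper: read /proc/mdstat text on standard input and print
# the name of every array whose status field contains "_" (a missing disk).
degraded_arrays() {
    awk '
        # Remember the most recent array name, e.g. "md0".
        /^md[^ ]* :/ { name = $1 }
        {
            # Look for a field made only of U and _ inside brackets.
            for (i = 1; i <= NF; i++)
                if ($i ~ /^\[[U_]+\]$/ && index($i, "_") > 0)
                    print name
        }
    '
}

# Usage on a real system:
#   degraded_arrays < /proc/mdstat
```

Empty output means no array reported a missing member; any name printed is worth a closer look before running fsck on it.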

Some notes:
If the disk isn’t heavily used, or you are using ext3 or reiserfs, then you stand a decent chance of not losing any data with a fsck -y of the file system.

If the power was lost, or the machine locked up, you may lose the last little bit of data, even on a journaled file system.

RAID-1 is awesome, and this can be done with a software RAID-1 even if one of the drives has failed. You can also just mount one of the disks without RAID and operate as if it had never been in an array.

You can do other maintenance on the drives, such as mount them (mount /dev/md2 /mnt/md2) and modify/copy/backup data/etc instead of, or in addition to, fsck’ing them.
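
As one possible shape for the backup case, here is a hedged sketch. The read-only mount option is my own assumption (it avoids writing to a filesystem that has not been checked yet), the function names run and backup_array are hypothetical, and /backup is a made-up destination. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1, else execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: mount an array read-only, copy its contents to a
# destination directory, then unmount it. Stops at the first failure.
backup_array() {
    dev="$1"; mnt="$2"; dest="$3"
    run mkdir -p "$mnt" &&
    run mount -o ro "$dev" "$mnt" &&
    run cp -a "$mnt/." "$dest/" &&
    run umount "$mnt"
}

# Usage (as root), previewing first:
#   DRY_RUN=1; backup_array /dev/md2 /mnt/md2 /backup
```

Previewing with DRY_RUN=1 is a cheap way to confirm the device and paths before anything touches the disks.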

I’ve had the best luck with ext3 and reiserfs. I’ve read bad things about XFS, JFS, and other file systems.

I would love for Linux to support ZFS, as I’ve played with it on OpenSolaris and on Sun’s VM of their storage appliance, and it seems nice.

Enjoy!
