Once it’s working, wouldn’t it be nice if it stayed working?
The RAID array fell over again. At first, I thought Samba might be causing the network access problem, but restarting it didn’t achieve anything. Then I ran top on the server box to see if anything had spun out of control, but everything was calm: no process was using more than 0.2% of CPU. Then I thought I’d check the syslog, which is where I started to find the problems.
Nov 5 03:09:57 chino kernel: ata4: command 0x35 timeout, stat 0x50 host_stat 0x24
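For anyone wanting to check their own logs for the same symptom, here is a minimal sketch that counts ATA command timeouts in a log file. The function name count_ata_timeouts is made up for illustration, and the default path /var/log/syslog is an assumption (some distributions log kernel messages elsewhere).

```shell
#!/bin/sh
# Hypothetical helper: count kernel ATA command timeouts in a log file.
# The default path is an assumption; pass a different file as the first argument.
count_ata_timeouts() {
    log_file="${1:-/var/log/syslog}"
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c -E 'kernel: ata[0-9]+: command 0x[0-9a-f]+ timeout' "$log_file"
}
```

Running count_ata_timeouts /var/log/syslog prints a single number; anything above zero is worth a closer look at the matching lines.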
I looked up the syntax of some mdadm commands, and ran this, which I hoped would give more info:
mdadm --detail /dev/md0
That one froze the terminal, and so did this:
cat /proc/mdstat
and so did this:
mdadm --stop /dev/md0
At this point I’d given up on a nice easy fix, and I tried to reboot the machine. It wouldn’t restart, though. Something was preventing it from shutting down. I’m not a big fan of the power switch reboot, so I tried to kill the relevant md processes. However, one process:
root 17110 6 0 Sep12 ? 00:24:39 [md0_raid5]
wouldn’t die, even when hit with a kill -9, and shutdown -h wouldn’t work. So power switch it was.
After restart, I looked through the dmesg output, and found this:
SCSI device sda: 490234752 512-byte hdwr sectors (251000 MB)
SCSI device sda: drive cache: write back
SCSI device sda: 490234752 512-byte hdwr sectors (251000 MB)
SCSI device sda: drive cache: write back
 sda: sda1
sd 0:0:0:0: Attached scsi disk sda
SCSI device sdb: 488397168 512-byte hdwr sectors (250059 MB)
SCSI device sdb: drive cache: write through
SCSI device sdb: 488397168 512-byte hdwr sectors (250059 MB)
SCSI device sdb: drive cache: write through
 sdb: sdb1
sd 1:0:0:0: Attached scsi disk sdb
SCSI device sdc: 490234752 512-byte hdwr sectors (251000 MB)
SCSI device sdc: drive cache: write back
SCSI device sdc: 490234752 512-byte hdwr sectors (251000 MB)
SCSI device sdc: drive cache: write back
 sdc: sdc1
sd 2:0:0:0: Attached scsi disk sdc
SCSI device sdd: 490234752 512-byte hdwr sectors (251000 MB)
SCSI device sdd: drive cache: write back
SCSI device sdd: 490234752 512-byte hdwr sectors (251000 MB)
SCSI device sdd: drive cache: write back
 sdd: sdd1
sd 3:0:0:0: Attached scsi disk sdd
md: md0 stopped.
md: bind<sdc1>
md: bind<sda1>
md: bind<sdb1>
md: bind<sdd1>
md: kicking non-fresh sdc1 from array!
md: unbind<sdc1>
md: export_rdev(sdc1)
raid5: device sdd1 operational as raid disk 1
raid5: device sdb1 operational as raid disk 3
raid5: device sda1 operational as raid disk 2
raid5: allocated 4202kB for md0
raid5: raid level 5 set md0 active with 3 out of 4 devices, algorithm 2
RAID5 conf printout:
 --- rd:4 wd:3 fd:1
 disk 1, o:1, dev:sdd1
 disk 2, o:1, dev:sda1
 disk 3, o:1, dev:sdb1
kjournald starting. Commit interval 5 seconds
One line in particular stood out:
md: kicking non-fresh sdc1 from array!
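Rather than reading the whole boot log by eye, the kicked device can be pulled out with a short filter. This is only a sketch: kicked_devices is a made-up name, and it assumes the kernel message is worded exactly as in the line above.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print the name of
# every device that md dropped as "non-fresh", one per line.
kicked_devices() {
    # -n suppresses normal output; the trailing p prints only lines where
    # the substitution matched, reduced to the captured device name.
    sed -n 's/.*md: kicking non-fresh \([^ ]*\) from array!.*/\1/p'
}
```

It would be used as dmesg | kicked_devices, and prints nothing if no device was kicked.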
A quick Google search revealed that this same message came up during my last RAID failure. It looks like a hardware problem. For now, I’ve just checked the filesystem and added the disk back in so that the array rebuilds. First, a read-only filesystem check:
e2fsck -n -f -v /dev/md0
followed by:
mdadm -a /dev/md0 /dev/sdc1
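Once the disk is back in, /proc/mdstat shows whether the array is still running short of a member: a degraded array has an underscore in its status marker (for example [_UUU] instead of [UUUU]). The sketch below lists degraded arrays from mdstat-style text on stdin; degraded_arrays is a made-up name, and it assumes arrays are named md followed by a number, as md0 is here.

```shell
#!/bin/sh
# Hypothetical helper: read /proc/mdstat-style text on stdin and print the
# name of each array whose member status marker contains an underscore.
degraded_arrays() {
    awk '
        # Remember the array name from lines like "md0 : active raid5 ..."
        /^md[0-9]+ :/ { name = $1 }
        # A marker such as [_UUU] means at least one member is missing
        /\[[U_]*_[U_]*\]/ { if (name != "") print name }
    '
}
```

It would be used as degraded_arrays < /proc/mdstat, and prints nothing once the rebuild has finished and every member is up.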
In fairness, this is exactly what RAID is designed for. I’ve had hardware failure, but my data is safe because of the redundancy of RAID5. On the other hand, I will have to get that drive replaced. It’s less than a year old, so I’d expected it to last a bit longer.
I recently heard that the massive density of modern drives makes them less reliable. Apparently smaller drives of 150GB or so are more reliable.
Comment by Iain — November 6, 2006 @ 3:30 pm
Well, the more data you put on a single drive, the less reliable it is per megabyte, ceteris paribus.
Another factor favouring smaller disks is performance. In a RAID system with a high throughput requirement, larger disks mean fewer disk heads per megabyte, which means lower throughput.
Comment by ealing — November 8, 2006 @ 1:34 pm