[volunteergrid2-l] Solving hardware problems with software
Shawn Willden
shawn at willden.org
Mon Feb 14 10:02:13 PST 2011
On Mon, Feb 14, 2011 at 10:28 AM, Steve Dodson <steve.dodson at gmail.com> wrote:
> Just now catching up on some email; Thanks Shawn for posting this.
Very welcome. Actually, I should have posted a followup... I'll take
this reminder to do that now.
> Had never heard of ATA over Ethernet.
What a concept, huh?
> Fascinating server setup; what do you use to monitor your drives / RAID?
Just the regular mdadm tools. I get an e-mail if a drive fails and an
array becomes degraded.
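
For anyone curious what "degraded" looks like on disk: the e-mail alerts come from the mdadm tools themselves, but the same information is visible in /proc/mdstat. The following is only a rough sketch of how one might spot degraded arrays by parsing that file. The function name degraded_arrays and the SAMPLE text are made up for illustration, and this is not how mdadm does it internally.

```python
import re

# Start of an array stanza, e.g. "md1 : active raid6 sda2[0] ..."
ARRAY_RE = re.compile(r"^(md\d+)\s*:")
# Member summary, e.g. "[5/4] [UUUU_]" (expected devices / working devices)
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return names of arrays with fewer working members than expected."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, working = int(status.group(1)), int(status.group(2))
            if working < expected or "_" in status.group(3):
                degraded.append(current)
            current = None  # only the first summary line belongs to this array
    return degraded


# Hypothetical /proc/mdstat contents: md0 is healthy, md1 has lost a member.
SAMPLE = """\
Personalities : [raid1] [raid6]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid6 sde1[4](F) sdd1[3] sdc1[2] sdb2[1] sda2[0]
      1465151808 blocks level 6, 64k chunk, algorithm 2 [5/4] [UUUU_]
"""

if __name__ == "__main__":
    # On a real machine you would read open("/proc/mdstat").read() instead.
    print(degraded_arrays(SAMPLE))
```
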
So, here's my followup story, which should be entitled "How user
stupidity can defeat even the coolest software".
So, after I got my new drive set up, accessible over AOE (ATA Over
Ethernet), I integrated its partitions into my RAID arrays and got
everything up and running just the way I wanted it. Great! Time to
move the new 2 TB drive from the desktop machine, where it's being
used via AOE, and into the server, replacing a 500 GB drive which has
been removed from the RAID arrays.
So, I shut down the computers. Both of them. In the WRONG ORDER.
Yes, I shut down the desktop machine first. The AOE-accessible drive
disappeared, the file server hung for a while and then decided that
all the AOE-accessible partitions had failed, and every array went
degraded simultaneously -- exactly the situation I wanted to avoid in
the first place!
If I'd been willing to do that, I could have just skipped the whole
AOE process, powered down the server, swapped out the drives and then
brought it up with degraded arrays, which I'd then proceed to repair,
adding the new drive's partitions in. Doh! As soon as it hit me what
I'd done, I felt like pounding my head against the wall. Stupid,
stupid!
Even worse, after I swapped out the drives, the server would no longer
boot. That terrified me for a little while, until I realized that it
just so happened that the new drive had been picked by the BIOS as the
boot drive. I actually had installed the grub MBR on it, so it
started to boot, but then couldn't find the next stage. Anyway, after
a couple of fearful minutes I told grub to use a different drive as
its root and the machine started up fine. (I actually boot from a
RAID-0 partition, in which the new drive is a participant).
So, no harm done overall, and my most important data, which lives on
RAID-6 arrays, was never in significant danger, but I ended up having
all of the arrays in degraded state simultaneously, which is what I
wanted to avoid with the whole AOE experiment.
So: If you use AOE, shut down the machine using the remote drives
BEFORE you shut down the machine hosting them. That should be
blindingly obvious. I guess the real lesson is "Think before acting".
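
With two machines the order is easy to state (and, evidently, easy to get wrong). With more machines exporting disks to each other it might be worth computing. A minimal sketch, assuming you can write down which machine imports disks from which; the hostnames "fileserver" and "desktop" and the function shutdown_order are hypothetical, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def shutdown_order(uses):
    """Return a safe shutdown order.

    `uses` maps each machine to the set of machines whose disks it
    imports over AOE. Every importing machine is listed before the
    machines that host its remote drives.
    """
    sorter = TopologicalSorter()
    for consumer, hosts in uses.items():
        sorter.add(consumer)
        for host in hosts:
            # The host may only go down after its consumer has gone down.
            sorter.add(host, consumer)
    return list(sorter.static_order())


if __name__ == "__main__":
    # The file server uses a drive that physically lives in the desktop.
    print(shutdown_order({"fileserver": {"desktop"}}))
```

For the setup described here this lists the file server first and the desktop second, which is the reverse of what actually happened.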
I also learned that AOE is cool and works very well. If I wanted to
build a cheap, moderate-performance high-capacity SAN on old hardware
it would be the way to go, because you could scale it out to many
disks in many machines. Plenty of redundancy would be crucial,
though, because every additional machine and network link in the chain
is another point of failure.
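
To put a rough number on that: if each component fails independently with probability p over some period, the chance that at least one of n components fails is 1 - (1 - p)^n. The sketch below just evaluates that formula. The 5% figure is an arbitrary assumption for illustration, not a measured failure rate, and real failures are rarely fully independent.

```python
def any_failure_probability(p_single, n_components):
    """Probability that at least one of n independent components fails."""
    return 1.0 - (1.0 - p_single) ** n_components


if __name__ == "__main__":
    # Assumed per-component failure probability of 5% (illustrative only).
    for n in (1, 2, 5, 10):
        print(n, round(any_failure_probability(0.05, n), 3))
```

With that assumed 5%, ten components already give roughly a 40% chance that something in the chain has failed, which is why the arrays on top need enough redundancy to ride it out.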
--
Shawn.