[volunteergrid2-l] Gratch going down, briefly

Shawn Willden shawn at willden.org
Mon Sep 12 07:56:59 PDT 2011


Sigh.

Gratch is down again.  One of the disks in my newly-expanded RAID array
crapped out before the sync was complete, taking the array offline
completely and taking out the logical volume containing my VG2 Tahoe node.
It appears to have been a transient error, but it's the same model of disk
as the last one that failed, so I'll replace it after I get things running
again.  Really, though, it's my fault for adding the array to the volume
group and extending the logical volume before the sync was complete.
RAID-5 is intended to add reliability, but an unsynced array is in danger
of failing entirely if any one of its disks has even a transient error, a
fact of which I'm well aware but which I stupidly ignored.
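
(For anyone watching along at home: the resync state is easy to check
before trusting an array.  The array name below is just illustrative.)

  # show all arrays and any resync progress
  cat /proc/mdstat
  # or, for one array, look at the State / Rebuild Status lines
  mdadm --detail /dev/md1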

I think I can fix it by forcing some things, but I have to get the array
offline first, which requires taking the volume group offline, which
requires unmounting all the logical volumes... and the system won't let me
unmount the problem child.  I tried to reboot, but it's hung up trying to
unmount, so I'm going to have to hit the Big Red Switch... when I get home.

In case there's someone here with more knowledge of LVM, MD and XFS than I
have, here's what I'm thinking:

First, here's the way it's set up:

There are a few RAID-5 arrays which are all marked as physical volumes and
added to one volume group.  There are a few logical volumes in that volume
group, none that are essential to system operation.  One is the Tahoe
volume, which is an XFS file system.
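
In rough commands, it was built up something like this (all names here
are invented for illustration):

  mdadm --create /dev/md1 --level=5 --raid-devices=3 /dev/sd[bcd]1
  pvcreate /dev/md1                          # mark the array as a PV
  vgextend vg_storage /dev/md1               # add it to the one volume group
  lvextend -L +900G /dev/vg_storage/tahoe    # grow the Tahoe LV onto it
  xfs_growfs /srv/tahoe                      # grow XFS into the new space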

My thought is that the XFS file system should be fine after running
xfs_repair.  The new storage was just added last night and most likely
hasn't been used to store any files, so nothing should be lost.  To be able
to repair it, I have to get the logical volume back in operational shape,
but I'm thinking it doesn't matter so much what is on the damaged portion of
the LV.
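
When it comes to that step, I'll do a dry run with xfs_repair's no-modify
flag before letting it write anything (LV path invented, as above):

  xfs_repair -n /dev/vg_storage/tahoe   # report damage, change nothing
  xfs_repair /dev/vg_storage/tahoe      # the real thing, if that looks sane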

So, what I think I'm going to do is forcibly re-create the RAID array with
mdadm's "--assume-clean" option, listing the disks in the same order as
before (that's important!).  This should mean that the portion of the array
that was already synced will be back like it was before the failure.  The
unsynced portion of the array will contain random garbage, but LVM won't
care in the slightest, because the PV label should be present and correct.
Once the logical volume is back up, I can run xfs_repair.  The random
garbage will annoy xfs_repair, but not fatally, I think.
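
Concretely, something like the following, with the caveat that the disk
order, chunk size, and metadata version all have to match the original
creation exactly, and that all the names here are illustrative:

  # re-create the array in place; --assume-clean skips the resync and
  # tells md to treat the members as already in sync
  mdadm --create /dev/md1 --level=5 --raid-devices=3 --assume-clean \
        /dev/sdb1 /dev/sdc1 /dev/sdd1
  vgchange -ay vg_storage    # reactivate the volume group
  # ...then the xfs_repair pass described above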

Any comments/suggestions are welcome.

Worst case, of course, is that the shares I was holding are lost and y'all
are going to have to run a Tahoe repair.  But I'm hopeful it won't come to
that.
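
(If it does come to that, the incantation would be something along the
lines of the following, run against your own aliases/root caps:)

  # check and repair everything reachable from the default alias
  tahoe deep-check --repair --add-lease tahoe: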


On Sun, Sep 11, 2011 at 7:26 PM, Shawn Willden <shawn at willden.org> wrote:

> And now gratch's Tahoe storage is up to 1 TB of RAID-5 storage.
>
> Speaking of storage, my survey idea was apparently a complete non-starter,
> since I didn't get a single response.
>
>
> On Sun, Sep 11, 2011 at 6:38 PM, Shawn Willden <shawn at willden.org> wrote:
>
>> Gratch is back up.  Was actually up 15 minutes ago.
>>
>>
>> On Sun, Sep 11, 2011 at 6:09 PM, Shawn Willden <shawn at willden.org> wrote:
>>
>>> Had a drive fail a couple weeks ago and I'm finally getting around to
>>> installing the replacement.  Shouldn't be down more than a few minutes.
>>>
>>> --
>>> Shawn
>>>
>>
>>
>>
>> --
>> Shawn
>>
>
>
>
> --
> Shawn
>



-- 
Shawn