[volunteergrid2-l] Gratch going down, briefly
Shawn Willden
shawn at willden.org
Tue Sep 13 05:37:22 PDT 2011
The saga continues.
(Note: People using Gmail may want to hit "m", which will mute this
conversation, meaning you won't see additional messages. Users of other
mail systems/clients will have to figure out how to ignore me on their own.)
It turns out that the error is NOT transient. I should have dug a little
deeper. There is a set of unreadable blocks on the drive, so every time the
sync process reached them, that drive dropped out of the array. Handily
enough, the "medium sense error" recorded in the syslog even told me which
block failed. Using sg_verify, I confirmed that the block was in fact bad,
and then used dd to overwrite it, which should cause the disk to reallocate
that sector. Indeed, that sector is now readable. I found some other bad
blocks, too, and forced reallocation.
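For the record, the sequence looked roughly like this (the device name and
LBA below are placeholders, not the actual values from my syslog):

    # check whether the suspect sector is readable (sg_verify is part of sg3_utils)
    sg_verify --lba=123456789 /dev/sdc

    # overwrite just that one sector so the drive remaps it from its spare pool
    dd if=/dev/zero of=/dev/sdc bs=512 count=1 seek=123456789 oflag=direct

    # confirm it reads cleanly now
    sg_verify --lba=123456789 /dev/sdc

(bs=512 assumes 512-byte logical sectors; a 4K-sector drive would need
bs=4096 and the matching LBA.)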
Clearly, I need to do a full scan of the disk to decide whether it's even
worth continuing to use. But first, I need to get the array, PV and LV back
up, mount the file system, and copy all the Tahoe data elsewhere. I mirrored
a couple of unused 500 GB partitions, created an XFS filesystem on them, and
am copying now.
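The temporary copy is roughly this (device names and paths are placeholders):

    # mirror the two spare 500 GB partitions
    mdadm --create /dev/md9 --level=1 --raid-devices=2 /dev/sdd1 /dev/sde1

    # fresh XFS filesystem on the mirror, mounted as a staging area
    mkfs.xfs /dev/md9
    mount /dev/md9 /mnt/tahoe-copy

    # copy the Tahoe node's storage over
    rsync -aHx /srv/tahoe/ /mnt/tahoe-copy/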
I've also ordered a replacement drive, though if testing proves that the
disk just needed a few forced reallocations, I'll probably keep the new
drive as a spare.
I think I'm also going to run full surface scans and SMART self-tests on ALL
of my drives. Hmmm... didn't I see some tool for making that happen on a
regular schedule?
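Answering my own question: I believe smartd (part of smartmontools) can
schedule the self-tests via /etc/smartd.conf. Something like this, if I'm
remembering the syntax right (substitute your own devices):

    # monitor everything (-a), short self-test daily at 2am,
    # long self-test every Saturday at 3am
    /dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03)
    /dev/sdb -a -o on -S on -s (S/../.././02|L/../../6/03)

For the surface scans themselves, a read-only pass with "badblocks -sv
/dev/sdX" or a long SMART self-test ("smartctl -t long /dev/sdX") should do
the job.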
On Mon, Sep 12, 2011 at 7:53 PM, Shawn Willden <shawn at willden.org> wrote:
> Okay, it's back up.
>
> It turned out to be even smoother than I'd thought. LVM even started the
> volume group with the bad PV, it just didn't start the LV with the bad PV.
> And then xfs_repair didn't find anything wrong.
>
> Also, I notice that the first of slush's two nodes is now on-line! One of
> Caner's is off-line, though (not that I have any room to complain).
>
> On Mon, Sep 12, 2011 at 8:56 AM, Shawn Willden <shawn at willden.org> wrote:
>
>> Sigh.
>>
>> Gratch is down again. One of the disks in my newly-expanded RAID array
>> crapped out before the sync was complete, taking the array offline
>> completely and taking out the logical volume containing my VG2 Tahoe node.
>> It appears that it was a transient error, but it's the same model of disk
>> as the last one that failed. I'll replace it after I get things running
>> again. Really, though, it's my fault for adding the array to the volume
>> group and extending the logical volume before the sync was complete. RAID-5
>> is intended to add reliability, but an unsynced array is in danger of
>> failing entirely if any one of the disks has even a transient error, a fact
>> of which I'm well aware but stupidly ignored.
>>
>> I think I can fix it by forcing some things, but I have to get the array
>> offline first, which requires taking the volume group offline, which
>> requires unmounting all logical volumes... and the system won't let me
>> unmount the problem child. I tried to reboot, but it's hung up trying to
>> unmount so I'm going to have to hit the Big Red Switch... when I get home.
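>>
>> For anyone following along, the teardown I'm trying to force is essentially
>> this (the mount point, VG and array names are placeholders for my real ones):
>>
>>   umount /srv/tahoe          # this is the step that hangs
>>   vgchange -an vg_storage    # deactivate the volume group
>>   mdadm --stop /dev/md3      # stop the broken array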
>>
>> In case there's someone here with more knowledge of LVM, MD and XFS than I
>> have, here's what I'm thinking:
>>
>> First, here's the way it's set up:
>>
>> There are a few RAID-5 arrays which are all marked as physical volumes and
>> added to one volume group. There are a few logical volumes in that volume
>> group, none that are essential to system operation. One is the Tahoe
>> volume, which is an XFS file system.
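>>
>> In other words, each layer was built roughly like this (device and volume
>> names are placeholders; the real arrays have their own sizes and member
>> counts):
>>
>>   mdadm --create /dev/md2 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
>>   pvcreate /dev/md2                       # each array is an LVM physical volume
>>   vgextend vg_storage /dev/md2            # all PVs go into the one volume group
>>   lvcreate -L 500G -n tahoe vg_storage    # the Tahoe logical volume
>>   mkfs.xfs /dev/vg_storage/tahoe          # with XFS on top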
>>
>> My thought is that the XFS file system should be fine after running
>> xfs_repair. The new storage was just added last night and most likely
>> hasn't been used to store any files, so nothing should be lost. To be able
>> to repair it, I have to get the logical volume back in operational shape,
>> but I'm thinking it doesn't matter so much what is on the damaged portion of
>> the LV.
>>
>> So, what I think I'm going to do is forcibly re-create the RAID array
>> with "--assume-clean" and the disks in the same order (that's important!).
>> This should mean that the portion of the array that was synced will be back
>> like it was before the failure. The unsynced portion of the array will
>> contain random garbage, but LVM won't care in the slightest, because the PV
>> label should be present and correct. Once the logical volume is back up, I
>> can run xfs_repair. The random garbage will annoy xfs_repair, but not
>> fatally, I think.
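>>
>> Concretely, something like the following, assuming I can recover the
>> original creation parameters (level, chunk size, metadata version) and keep
>> the member order identical; the names are placeholders:
>>
>>   mdadm --stop /dev/md3
>>   mdadm --create /dev/md3 --assume-clean --level=5 --raid-devices=4 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
>>   vgchange -ay vg_storage             # reactivate the VG and its LVs
>>   xfs_repair /dev/vg_storage/tahoe    # then let xfs_repair clean up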
>>
>> Any comments/suggestions are welcome.
>>
>> Worst case, of course, is that the shares I was holding are lost and y'all
>> are going to have to run a Tahoe repair. But I'm hopeful it won't come to
>> that.
>>
>>
>> On Sun, Sep 11, 2011 at 7:26 PM, Shawn Willden <shawn at willden.org> wrote:
>>
>>> And now gratch's Tahoe storage is up to 1 TB of RAID-5 storage.
>>>
>>> Speaking of storage, my survey idea was apparently a complete
>>> non-starter, since I didn't get a single response.
>>>
>>>
>>> On Sun, Sep 11, 2011 at 6:38 PM, Shawn Willden <shawn at willden.org> wrote:
>>>
>>>> Gratch is back up. Was actually up 15 minutes ago.
>>>>
>>>>
>>>> On Sun, Sep 11, 2011 at 6:09 PM, Shawn Willden <shawn at willden.org> wrote:
>>>>
>>>>> Had a drive fail a couple weeks ago and I'm finally getting around to
>>>>> installing the replacement. Shouldn't be down more than a few minutes.
>>>>>
>>>>> --
>>>>> Shawn
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Shawn
>>>>
>>>
>>>
>>>
>>> --
>>> Shawn
>>>
>>
>>
>>
>> --
>> Shawn
>>
>
>
>
> --
> Shawn
>
--
Shawn