[volunteergrid2-l] Failure Analysis

Jody Harris jharris at harrisdev.com
Wed Feb 2 17:31:33 PST 2011


Thanks, Shawn. Good stuff. I had thought of some of those when I composed
that in my head a couple hours earlier, but they went missing when I got to
a keyboard.

I'm going to start working on a wiki page for this to make it easier to keep
it up to date. I do want to keep the discussion live, though. It helps me
think through these things when I get direct critical feedback like yours.
You'll see some of your ideas broken out specifically on the page. (I'm not
going to bother with attribution on the wiki page, this time, though.)

All failure analysis should live in a single document per node, with
"overlapping failure" links between nodes. Once we have a handle on the
grid and its failure modes, we'll be able to see when too many nodes sit
under a single failure point.
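
Something like this is the rough shape I have in mind for each node's
record on the wiki. It's only a sketch -- every name and value below is a
placeholder, not real node data:

    # Sketch of one node's failure-analysis record. All names and
    # values are placeholders, not real node data.
    node_record = {
        "node": "example-node-1",
        "storage": "2-drive LVM, no redundancy (fragile)",
        "failure_modes": [
            "node crash", "computer crash", "ISP outage",
            "power failure", "ISP upstream outage",
            "administrative error", "router failure",
            "localized catastrophe", "large-scale catastrophe",
        ],
        # "Overlapping failure" links: nodes that share a single
        # failure point with this one (same ISP, same region, ...).
        "overlapping_failures": {
            "same ISP": ["example-node-2"],
            "same region": ["example-node-2", "example-node-3"],
        },
    }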

As you pointed out with drive failure, RAID systems are less subject to
catastrophic data loss from a single drive, but my 500 GB LVM built out of
a pair of old drives would be considered "fragile." This kind of
information will be helpful when we get ready to start retiring nodes. We
will also be able to "recommend" or "strongly recommend" system or
subsystem upgrades based on node characteristics. At some point, pulling
those two drives and replacing them with a single larger drive becomes
cost effective -- but we're not there, yet.
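
To make "fragile" concrete, here's a back-of-the-envelope comparison. The
5% annual failure rate is a number I'm pulling out of the air for
illustration, not a measured one:

    # Assumed per-drive annual failure probability (placeholder).
    p = 0.05

    # 2-drive LVM stripe, no redundancy: losing EITHER drive loses
    # the data, so the data survives only if both drives survive.
    lvm_two_drives = 1 - (1 - p) ** 2   # ~9.8% per year

    # Single larger drive: data is lost only if that one drive fails.
    single_drive = p                    # 5.0% per year

    print("2-drive LVM stripe: %.1f%%" % (lvm_two_drives * 100))
    print("single drive:       %.1f%%" % (single_drive * 100))

Old drives will have a higher failure rate than that placeholder, which
only widens the gap.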

It's been a few years since I did full-bore failure analysis. This will be
fun.

j
----
- Think carefully.


On Wed, Feb 2, 2011 at 10:14 AM, Shawn Willden <shawn at willden.org> wrote:

> On Wed, Feb 2, 2011 at 9:26 AM, Jody Harris <jharris at harrisdev.com> wrote:
>
>> What things can take a node offline?
>>
>>    - Node crash (Tahoe falls down)
>>    - Computer crash
>>    - ISP outage
>>    - Power failure - house, neighborhood, city, regional
>>    - ISP upstream outage (my biggest off-line cause)
>>
>>
> Some others:
>
>    - Administrative error (e.g. rm -rf)
>    - Router failure
>    - Localized catastrophe (e.g. building burns down)
>    - Large-scale catastrophe (e.g. major earthquake)
>
> You could argue that the last two are just forms of ISP/power outages, and
> perhaps router failure could be lumped in with ISP outage.  For that matter,
> "Computer crash" can be broken down into failure sub-modes, primarily
> failures of different components.
>
> Disk failures are sufficiently common that it might be useful to break them
> out, especially for systems with storage architectures that make data loss
> more or less likely.  For example, my Tahoe node storage currently resides
> on a RAID-5 array, but I'm planning to migrate it to a non-redundant LVM
> pool (similar to RAID-0), so I'll be going from a storage architecture where
> data loss requires near-simultaneous failures of two of six disks to an
> architecture where data will be lost if any one of four disks fails.
>
> If you really want to model the failure modes of the volunteergrid2, I
> think taking into account each node's storage architecture is important.
>
> Also, there should probably be two models, one that focuses on permanent
> failures that cause data loss, and one that focuses on transient failures
> that affect data availability (and perhaps another that focuses on write
> availability).
>
> If everyone wants to pitch in and help with defining the structure and
> content of the models, I've worked out some nice mathematical tools for
> translating those models into comprehensive probability estimates.
>
> --
> Shawn.
>