[tahoe-dev] erasure coding makes files more fragile, not less

Shawn Willden shawn at willden.org
Wed Mar 28 13:00:56 UTC 2012


Although your statement tries so hard to be provocative that it fails to
be true, there's clearly a kernel of truth in it -- and there's no arguing
that Tahoe needs better monitoring and more transparency.

The arguments you make are the basis for the approach I (successfully)
pushed when we started the VG2 grid:  We demand high uptime from individual
servers, because the math of erasure coding works against you when the
individual nodes are unreliable, and we ban co-located servers and prefer
to minimize the number of servers owned and administered by a single
person, in order to ensure greater independence.
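
To make the "math works against you" point concrete, here is a minimal
sketch of the survival arithmetic (not Tahoe code; it assumes independent
failures and a single per-server availability p over one repair interval):

    from math import comb

    def file_survival(k, n, p):
        """P(at least k of n independently-failing shares survive)."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(k, n + 1))

    print(file_survival(3, 7, 0.95))  # ~0.999994 -- roughly five nines
    print(file_survival(3, 7, 0.70))  # ~0.971    -- ~3% chance of loss

The same encoding that gives roughly five nines on 95%-available servers
gives only about 97% -- a few percent chance of loss per repair interval --
on 70%-available ones.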

How has that worked out?  Well, it's definitely constrained the growth rate
of the grid.  We're two years in and still haven't reached 20 nodes.  And
although our nodes have relatively high reliability, I'm not sure we've
actually reached the 95% uptime target -- my node, for example, was down
for over a month while I moved, and we recently had a couple of outages
caused by security breaches.

However, we do now have 15 solid, high-capacity, relatively available (90%,
at least) nodes that are widely dispersed geographically (one in Russia,
six spread across four countries in Europe, seven across six states in the
US; I'm not sure about the remaining one).  So it's pretty good -- though
we do need more nodes.

I can see two things that would make it an order of magnitude better:
 monitoring and dynamic adjustment of erasure-coding parameters.

Monitoring is needed both to identify files that need to be repaired
before the loss of shares becomes a problem, and to provide the node
reliability data required to dynamically determine erasure coding
parameters.

Dynamic calculation of erasure coding parameters is necessary both to
improve transparency and to provide more reliability.  The simple default
3-of-7 parameters (shares.total is meaningless; shares.happy is what
matters) do not automatically provide high reliability, even if server
failures are independent -- and there is no simple, direct relationship
between individual server reliability and K/N; it's more complicated than
that.
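
To illustrate that last point with toy numbers (same helper as above,
repeated so the snippet stands alone; independent failures and a uniform
80% per-server availability assumed), encodings with identical K/N
expansion ratios can differ widely in reliability:

    from math import comb

    def file_survival(k, n, p):
        # same helper as above, repeated so this snippet stands alone
        return sum(comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(k, n + 1))

    p = 0.80
    print(file_survival(3, 7, p))    # ~0.995
    print(file_survival(6, 14, p))   # ~0.9996
    print(file_survival(12, 28, p))  # ~0.999997

Same expansion factor and same servers, yet the failure probability drops
by an order of magnitude or more at each step up in scale -- so a fixed
K/N rule of thumb tells you very little on its own.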

The only way erasure coding parameters can be appropriately selected is by
doing some calculations based on the number of available storage nodes and
their individual reliabilities.  Since these factors change over time, the
only way to know what the parameters should be at the moment of upload is
to calculate them dynamically.  Specifically, N and H should both be set to
the number of storage nodes currently accepting shares, and K should be
computed to meet a user-specified per-file reliability probability over a
user-specified timeframe (the repair interval).
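
Concretely, the selection could look something like this sketch (one share
per accepting server, so N = H = number of servers; failures assumed
independent; per-server survival estimates over one repair interval; the
names are invented for illustration, not Tahoe APIs):

    def p_at_least_k(probs, k):
        # Poisson-binomial tail: P(at least k of these servers are up),
        # failures independent but each server gets its own estimate.
        dist = [1.0]                      # dist[j] = P(exactly j up so far)
        for p in probs:
            new = [0.0] * (len(dist) + 1)
            for j, q in enumerate(dist):
                new[j] += q * (1 - p)
                new[j + 1] += q * p
            dist = new
        return sum(dist[k:])

    def choose_k(server_probs, target):
        # N = H = len(server_probs): one share per accepting server.
        # Pick the largest K (least expansion) that still meets the target;
        # None means even K=1 can't meet it, so the upload should fail.
        for k in range(len(server_probs), 0, -1):
            if p_at_least_k(server_probs, k) >= target:
                return k
        return None

    # e.g. 15 servers each estimated at 90% reliability over the repair
    # interval, with a per-file target of "five nines" per interval:
    print(choose_k([0.90] * 15, 0.99999))   # -> 7, i.e. a 7-of-15 encoding

With those toy numbers the expansion factor comes out a bit over 2, and it
would shrink automatically as servers prove themselves more reliable or as
the grid grows.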

Not only would this approach make it easier for users to specify their
reliability goals (at the expense of less-predictable expansion), it would
also make Tahoe inherently more robust, particularly if it actually
observed and measured individual node reliabilities over time, with
conservative initial assumptions.  It would likely reduce failure-to-upload
errors, because rather than just giving up when there aren't "enough"
storage nodes available, it would increase redundancy instead.  At the same
time, it would be able to properly fail uploads when it is simply
impossible to meet the desired reliability goals.

It would also simplify repair and monitoring, at least from a conceptual
perspective.  The goal of a "reliability monitor" would be to check
whether, under current estimates of the reliability of the nodes holding a
file's shares, that file's estimated reliability still meets the stated
user requirement (assuming independence of node failures -- interdependence
could actually be factored into the calculations as well, but configuring
it would be a bear and it would require lots of ad-hoc estimates of
hard-to-measure probabilities).  It wouldn't even be difficult to include
path-based considerations in the reliability estimates.
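
For illustration, the per-file check could be as simple as this sketch
(same independence assumption, invented names, not Tahoe's API; the
per-server estimates would come from observed uptime history):

    def needs_repair(share_server_probs, k, target):
        # share_server_probs: current reliability estimate for each server
        # that still holds a distinct share of this k-of-N file.
        # True if the file's estimated survival probability over the next
        # repair interval has dropped below the user's target.
        dist = [1.0]                 # same Poisson-binomial tail as before
        for p in share_server_probs:
            new = [0.0] * (len(dist) + 1)
            for j, q in enumerate(dist):
                new[j] += q * (1 - p)
                new[j + 1] += q * p
            dist = new
        return sum(dist[k:]) < target

    # A healthy 3-of-7 file, and one whose shares sit on shaky servers:
    print(needs_repair([0.95] * 7, 3, 0.99999))                            # False
    print(needs_repair([0.95, 0.9, 0.9, 0.8, 0.6, 0.5, 0.5], 3, 0.99999))  # True

Anything that comes back True goes on the repair queue.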

The biggest downside of this approach, I think, would be that it would
still be hard to understand how the specified reliability relates to the
*actual* file reliability, because it would be neither an upper bound nor a
lower bound but an estimate with unknown deviation.

-- 
Shawn