[tahoe-dev] Estimating reliability
Shawn Willden
shawn-tahoe at willden.org
Tue Jan 13 10:00:13 PST 2009
On Tuesday 13 January 2009 12:33:48 am Brian Warner wrote:
> Incidentally, the very first peer-selection design we started with
> ("Tahoe1", as opposed to the current "Tahoe2" algorithm, or the "Tahoe3"
> that we backed away from about a year ago) assigned "reliability points" to
> each server (based upon a hypothetical long-term measurement of each
> server), and assigned shares in permuted order until enough points had been
> accumulated. One of the reasons we gave up on it was that I didn't believe
> we could actually measure server reliability in any meaningful+accurate
> way.
I agree that measuring server reliability is hard, especially for servers that
are highly reliable. I do think, however, that this sort of estimation is
both useful and possible in many interesting cases, and I think that it's
much easier to hypothesize and verify than to measure directly.
The hard thing about measuring the incidence of rare events is that they're
rare -- UNLESS you get a sufficiently large population. Given a large
network, I think you'd get very good results by tracking some details about
each server (location type, system type, storage technology, operating
system, user experience level, time in network, etc.) to give you a sort
of "server type" space, and then applying a clustering algorithm to find
reliability clusters in "server type" x "reliability" space. Each cluster
gives you an estimated reliability figure for servers of a similar type.
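
As a very rough illustration of pooling observations by server type, here is
a minimal sketch. It is NOT a real clustering algorithm: it just buckets
servers whose type descriptions match exactly and estimates a survival
probability per bucket. The function name, the (server_type, survived) input
format and the add-one smoothing are all assumptions made for the example.

```python
from collections import defaultdict

def pooled_reliability(observations):
    """Estimate a per-type survival probability from pooled observations.

    observations: iterable of (server_type, survived) pairs, where
    server_type is any hashable description of the server (hypothetical,
    e.g. a tuple of attributes) and survived is True if that server made
    it through the observation period without a failure.
    """
    counts = defaultdict(lambda: [0, 0])  # server_type -> [survived, total]
    for server_type, survived in observations:
        counts[server_type][1] += 1
        if survived:
            counts[server_type][0] += 1
    # Add-one (Laplace) smoothing, so a type with no observed failures
    # is not assigned a survival probability of exactly 1.0.
    return {t: (s + 1) / (n + 2) for t, (s, n) in counts.items()}
```

A real implementation would presumably cluster nearby types together rather
than requiring exact matches, which is what makes a large population useful.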
For small, professionally-operated networks, I think you can also do something
useful. In that context, you should be able to come up with an
order-of-magnitude reliability estimate just by asking the administrators.
Over time you shouldn't see very many failures, but you can use those
failures to test your initial estimates. If the observed failures are not
consistent with the null hypothesis (that the initial estimate is right),
then you need to revise your estimates.
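
One way to make that test concrete is a one-sided binomial test, sketched
below. It assumes each server-period is an independent trial with the same
estimated failure probability, which is a simplification. The function names
and the 0.05 threshold are choices made for the example.

```python
from math import comb

def prob_at_least(failures, trials, p_fail):
    """P(X >= failures) for X ~ Binomial(trials, p_fail)."""
    return sum(comb(trials, i) * p_fail**i * (1 - p_fail)**(trials - i)
               for i in range(failures, trials + 1))

def estimate_is_plausible(failures, trials, p_fail, alpha=0.05):
    """Keep the estimate unless seeing this many failures (or more) would
    happen with probability below alpha if p_fail were correct."""
    return prob_at_least(failures, trials, p_fail) >= alpha
```

For example, with an estimated failure probability of 0.01 per server-period
and 100 server-periods observed, one failure is unremarkable, while five
failures would suggest the estimate is too optimistic. Note that this only
checks for too many failures, not too few.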
In both cases, the key is that reliability measurement and reliability-based
parameter tuning (k,N,A,R as functions of t and p_i) provide a feedback cycle
that keeps your real reliability in line with the best available information.
Ideally, as soon as a failure occurs, the related reliability estimates
should be automatically revised, and the repairer started to ensure that all
files are at their desired estimated reliability levels, based on the latest
information.
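
To illustrate the tuning half of that cycle, here is a minimal sketch that
picks N for a given k once a reliability estimate has been revised. It
assumes a single survival probability p shared by all servers (rather than a
per-server p_i), one share per server, and independent failures. The function
names and the n_max cap are hypothetical.

```python
from math import comb

def file_survival(k, n, p):
    """Probability that at least k of n shares survive, assuming each
    share survives independently with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def shares_needed(k, p, target, n_max=256):
    """Smallest n >= k whose file survival probability meets the target,
    or None if no n up to n_max is enough."""
    for n in range(k, n_max + 1):
        if file_survival(k, n, p) >= target:
            return n
    return None
```

If a failure causes the estimate of p to drop, calling shares_needed again
with the new p gives the N the repairer would have to restore to keep the
file at its target reliability.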
> What I meant was that the post-repair chain matrix is derived from the
> previous "decay" chain matrix by identifying all of the states that would
> provoke a file-repair operation, and moving those probabilities over into
> the post-repair N-shares cell.
Gotcha. That makes perfect sense.
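
For concreteness, here is a minimal sketch of how I read that construction.
The binomial decay model, the function names, and the repair trigger (at
least k shares left, so the file is still recoverable, but fewer than a
threshold r) are my assumptions, not something taken from the Tahoe code.

```python
from math import comb

def decay_matrix(n, p):
    """D[i][j] = P(j shares remain after one period | i shares before),
    assuming each share survives independently with probability p."""
    return [[comb(i, j) * p**j * (1 - p)**(i - j) if j <= i else 0.0
             for j in range(n + 1)]
            for i in range(n + 1)]

def post_repair_matrix(decay, k, r):
    """Move the probability of every repair-triggering state into the
    n-shares column. Assumed trigger: k <= shares remaining < r."""
    n = len(decay) - 1
    repaired = []
    for row in decay:
        new_row = list(row)
        for j in range(k, min(r, n)):
            new_row[n] += new_row[j]  # repair brings the file back to n shares
            new_row[j] = 0.0
        repaired.append(new_row)
    return repaired
```

States with fewer than k shares are left alone, since a file in one of those
states can no longer be repaired.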
Shawn.