[tahoe-dev] Estimating reliability

Shawn Willden shawn-tahoe at willden.org
Tue Jan 13 10:00:13 PST 2009


On Tuesday 13 January 2009 12:33:48 am Brian Warner wrote:
> Incidentally, the very first peer-selection design we started with
> ("Tahoe1", as opposed to the current "Tahoe2" algorithm, or the "Tahoe3"
> that we backed away from about a year ago) assigned "reliability points" to
> each server (based upon a hypothetical long-term measurement of each
> server), and assigned shares in permuted order until enough points had been
> accumulated. One of the reasons we gave up on it was that I didn't believe
> we could actually measure server reliability in any meaningful+accurate
> way.

I agree that measuring server reliability is hard, especially for servers 
that are highly reliable. I do think, however, that this sort of estimation 
is both useful and possible in many interesting cases, and that it's much 
easier to hypothesize and then verify than to measure directly.

The hard thing about measuring the incidence of rare events is that they're 
rare -- unless you have a sufficiently large population.  Given a large 
network, I think you'd get very good results by tracking some details about 
each server (location type, system type, storage technology, operating 
system, user experience level, time in network, etc.) to build a sort 
of "server type" space, and then applying a clustering algorithm to find 
reliability clusters in "server type" x "reliability" space.  Each cluster 
gives you an estimated reliability figure for the servers in it, and for new 
servers that resemble them.
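
To make that concrete, here's a rough sketch of the idea -- not Tahoe code; 
the attribute encoding, the cluster count, and all the numbers are made up:

import random

def kmeans(points, k, iterations=20):
    """Naive k-means over feature vectors; returns a cluster index per point."""
    centers = [list(c) for c in random.sample(points, k)]
    for _ in range(iterations):
        # assign each point to its nearest center
        labels = [min(range(k),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(pt, centers[c])))
                  for pt in points]
        # move each center to the mean of its assigned points
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(dim) / float(len(members))
                              for dim in zip(*members)]
    return labels

# hypothetical per-server records: (encoded attributes, uptime-days, failures)
servers = [([1.0, 0.0, 3.0], 400, 1), ([1.0, 0.0, 2.5], 380, 0),
           ([0.0, 1.0, 0.5],  90, 4), ([0.0, 1.0, 1.0], 120, 5)]

labels = kmeans([s[0] for s in servers], k=2)
for c in set(labels):
    members = [s for s, lab in zip(servers, labels) if lab == c]
    days = sum(m[1] for m in members)
    fails = sum(m[2] for m in members)
    print("cluster %d: ~%.4f failures per server-day" % (c, float(fails) / days))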

For small, professionally-operated networks, I think you can also do 
something useful.  In that context, you should be able to come up with an 
order-of-magnitude reliability estimate just by asking the administrators.  
Over time you shouldn't see very many failures, but you can use the ones you 
do see to test the initial estimates: if the observed failure count is 
implausible under an estimate (i.e. the null hypothesis is rejected), you 
revise that estimate.
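
As a sketch of what that test might look like (the numbers here are 
invented), a simple Poisson tail probability is enough to flag an estimate 
that the observed failures contradict:

import math

def poisson_tail(expected, observed):
    """P(X >= observed) for X ~ Poisson(expected)."""
    p_below = sum(math.exp(-expected) * expected ** i / math.factorial(i)
                  for i in range(observed))
    return 1.0 - p_below

# estimate: 1% annual failure rate, 50 servers, observed for 2 years
expected_failures = 0.01 * 50 * 2    # 1.0 expected failure
observed_failures = 5

p = poisson_tail(expected_failures, observed_failures)
if p < 0.05:
    print("seeing %d failures has p=%.4f under the estimate: revise it upward"
          % (observed_failures, p))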

In both cases, the key is that reliability measurement and reliability-based 
parameter tuning (k, N, A, R as functions of t and p_i) form a feedback 
cycle that keeps your real reliability in line with the best available 
information.  Ideally, as soon as a failure occurs, the related reliability 
estimates should be revised automatically and the repairer started, so that 
all files are brought back to their desired estimated reliability levels 
based on the latest information.
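
For illustration, the core of that feedback computation might look something 
like this -- my own sketch, not the repairer's actual interface; N, k, R and 
the p_i values are arbitrary:

def prob_at_least_k(probs, k):
    """P(at least k shares survive), treating shares as independent with
    per-share survival probabilities probs; O(N^2) dynamic program."""
    dist = [1.0]  # dist[j] = P(exactly j survivors so far)
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for j, d in enumerate(dist):
            new[j] += d * (1.0 - p)   # this share is lost
            new[j + 1] += d * p       # this share survives
        dist = new
    return sum(dist[k:])

# revised estimates after a failure: one server's p_i drops sharply
share_survival = [0.9] * 9 + [0.1]    # N = 10 shares
k, R = 3, 0.999999

r = prob_at_least_k(share_survival, k)
if r < R:
    print("file reliability %.8f below target %.6f: start repair" % (r, R))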

> What I meant was that the post-repair chain matrix is derived from the
> previous "decay" chain matrix by identifying all of the states that would
> provoke a file-repair operation, and moving those probabilities over into
> the post-repair N-shares cell.

Gotcha.  That makes perfect sense.
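
Just to check my understanding, here's a toy version of that derivation (my 
own notation; I'm assuming repair triggers in the recoverable states with 
k..A shares remaining):

def post_repair(decay, n, k, a):
    """Fold the repair-triggering columns of the decay matrix into the
    N-shares column: any state with at least k but no more than A shares
    provokes a repair that restores the file to n shares."""
    repaired = [row[:] for row in decay]
    for row in repaired:
        for s in range(k, a + 1):
            row[n] += row[s]
            row[s] = 0.0
    return repaired

# one row of a toy decay matrix for N=4, k=2, A=3 (current state: 4 shares)
row = [0.01, 0.04, 0.10, 0.20, 0.65]         # P(0..4 shares next epoch)
print(post_repair([row], n=4, k=2, a=3)[0])  # -> [0.01, 0.04, 0.0, 0.0, 0.95]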

	Shawn.



