[tahoe-dev] Fwd: On the value of "proofs"...

Shawn Willden shawn-tahoe at willden.org
Wed Jan 28 23:02:33 PST 2009


On Wednesday 28 January 2009 08:22:40 pm zooko wrote:
> The chance that we made a significant mistake in our priors or in our
> math is much greater than 10^-9.

Priors, yes.  Math, no.  This isn't the sort of math particle physicists 
do to evaluate the safety factors of the LHC.  This math is far, far 
simpler; it is extremely well understood, and it is in fact widely used by 
reliability engineers in exactly this way.  The fact that we may plug in 
higher individual reliability estimates than are common, and that those 
estimates combine in ways specifically designed to increase reliability, in 
no way changes the fundamental validity of the computations.

There *is* a possibility that accumulated rounding error creates problems, but 
that's a separate matter, and one that's easy to address using standard 
techniques that are taught in any first-semester numerical analysis course.

Another solution, given the simplicity of the calculations, is to use 
arbitrary-precision numerical libraries and simply eliminate that 
possibility.
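
To make that concrete, here's a minimal sketch of the exact-arithmetic 
approach in Python, assuming the simplest k-of-N model with independent, 
identical per-share survival probabilities (the 3-of-10 encoding and the 
0.9 figure are placeholders, not claims about real servers):

    from fractions import Fraction
    from math import comb

    def p_file_survives(n, k, p):
        # Probability that at least k of n shares survive, each share
        # surviving independently with probability p.  Using Fraction
        # keeps every intermediate value exact, so rounding error is
        # eliminated rather than merely bounded.
        p = Fraction(p)
        q = 1 - p
        return sum(Fraction(comb(n, i)) * p**i * q**(n - i)
                   for i in range(k, n + 1))

    # Placeholder numbers: 3-of-10 encoding, 0.9 per-share survival.
    exact = p_file_survives(10, 3, Fraction(9, 10))
    print(exact, float(exact))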

> Therefore, if our estimate tells us 
> that there is less than a 10^-9 chance of accidentally losing a file,
> we should not rely on that estimate.  It can be considered only an
> upper-bound on the reliability.

It depends on what you mean by "rely on".  If you mean "just figure that we're 
safe forever", then you're absolutely right.  The estimates should be 
continually refined by validating the input failure probabilities (at the 
per-server level there are enough failures to test the assumptions).  As data 
is accumulated over time, the whole-system reliability estimates will 
fluctuate.  When they fluctuate downward, the system should compensate by 
tuning the various parameters that affect reliability.

For large networks (assuming that actually happens), you can push on it from 
the other direction as well.  Given a 10^-9 probability of a file being lost, 
and 100 million files spread over enough servers that we can reasonably call 
the file failure modes independent, the math says there's about a 10% 
probability that at least one file will be lost -- so now you're back in the 
realm of measurable failures, which can be used to validate the assumptions.
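
For the record, the arithmetic behind that 10% figure (independence 
assumed, so it's only a sanity check):

    # 10^8 files, each lost independently with probability 10^-9;
    # the result is essentially 1 - exp(-n_files * p_loss).
    p_loss = 1e-9
    n_files = 10**8
    print(1 - (1 - p_loss)**n_files)   # ~0.095, i.e. roughly 10%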

> The other basic argument is that failure probability of tahoe servers
> are not independent of each other.

Of course not.  That's the reason I switched from using a straight binomial 
distribution calculation to composing PMFs via convolution.  With the 
approach described in my paper you can DIRECTLY model the common failure 
modes.  Of course, you're still left with estimating the probability that 
Brian goes berserk in the machine room, and I don't know how you do that, but 
assuming you can come up with a reasonable number, the math can factor it in 
correctly.

More generally, you make a list of individual and group failure modes, and an 
estimate for each, and I can combine them all to give you an overall estimate 
that is exactly as accurate as the inputs, no more and no less.
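
For anyone who hasn't read the paper, here's a stripped-down illustration of 
the convolution idea in Python -- not the actual machinery from the paper, 
and the per-share numbers are made up -- just to show how unequal share 
reliabilities compose into a distribution over surviving shares:

    def convolve(pmf_a, pmf_b):
        # Convolve two PMFs, each given as a list indexed by count.
        out = [0.0] * (len(pmf_a) + len(pmf_b) - 1)
        for i, pa in enumerate(pmf_a):
            for j, pb in enumerate(pmf_b):
                out[i + j] += pa * pb
        return out

    def p_file_lost(share_survival_probs, k):
        # Build the PMF of "number of surviving shares" by convolving
        # one two-point PMF per share, then sum the mass below k.
        pmf = [1.0]
        for p in share_survival_probs:
            pmf = convolve(pmf, [1.0 - p, p])
        return sum(pmf[:k])

    # Made-up per-share survival estimates; they need not be equal,
    # which is the point of going beyond the plain binomial.
    probs = [0.95, 0.95, 0.9, 0.9, 0.9, 0.9, 0.85, 0.85, 0.8, 0.8]
    print(p_file_lost(probs, k=3))

A common failure mode can then be folded in by conditioning: compute the 
loss probability with and without the event, and weight the two results by 
the event's estimated probability.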

Clearly, then, the trick is figuring out how to obtain and verify good 
estimates for the various failure modes.  I think that can be done, though.  
And where there's doubt, the answer is to be pessimistic.  If your 
pessimistic estimates still result in adequate overall reliability, then 
fine.  If not, then you need to adjust k, N, A and L until you do get 
acceptable outputs with pessimistic inputs.
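
As a toy example of that last step, here's one way to sweep N for a fixed k 
and a deliberately pessimistic per-share survival estimate until a target 
per-file reliability is met (this sketch varies only N; A and L aren't 
modeled, and the 0.85 and 10^-9 figures are placeholders):

    from math import comb

    def p_loss(n, k, p):
        # Probability that fewer than k of n shares survive.
        return sum(comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(k))

    k, p_pessimistic, target = 3, 0.85, 1e-9
    n = k
    while p_loss(n, k, p_pessimistic) > target:
        n += 1
    print(n)   # smallest N meeting the target under pessimistic inputs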

	Shawn.

