[tahoe-dev] Fwd: On the value of "proofs"...
Shawn Willden
shawn-tahoe at willden.org
Wed Jan 28 23:02:33 PST 2009
On Wednesday 28 January 2009 08:22:40 pm zooko wrote:
> The chance that we made a significant mistake in our priors or in our
> math is much greater than 10^-9.
Priors, yes. Math, no. This isn't the same sort of math particle physicists
are doing to evaluate safety factors of the LHC. This math is far, far
simpler, and extremely well-understood and in fact widely used by reliability
engineers in just this way. The fact that we may plug in higher individual
reliability estimates than are common, and that those estimates combine in ways
specifically designed to increase reliability, in no way changes the
fundamental validity of the computations.
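To make that concrete, here's a rough sketch of the kind of calculation I
mean (in Python; the 95% figure and the 3-of-10 parameters are made up, and
the model in my paper is more involved than a plain binomial):

    from math import factorial

    def choose(n, i):
        return factorial(n) // (factorial(i) * factorial(n - i))

    def file_survival(p, k, n):
        # Probability that at least k of n shares survive, given that each
        # share survives independently with probability p: a plain binomial
        # tail sum, the bread and butter of reliability engineering.
        return sum(choose(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(k, n + 1))

    # e.g. 3-of-10 encoding on servers we guess are 95% reliable
    print(file_survival(0.95, 3, 10))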
There *is* a possibility that accumulated rounding error creates problems, but
that's a separate matter, and one that's easy to address using standard
techniques that are taught in any first-semester numerical analysis course.
Another solution, given the simplicity of the calculations, is to use
arbitrary-precision numerical libraries and simply eliminate that
possibility.
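For example (just a sketch, and assuming the inputs are given as exact
rationals), Python's fractions module makes the combination step exact, so
whatever error remains lives entirely in the estimates, not the arithmetic:

    from fractions import Fraction
    from math import factorial

    def choose(n, i):
        return factorial(n) // (factorial(i) * factorial(n - i))

    def exact_survival(p, k, n):
        # Same binomial tail sum as before, but in exact rational
        # arithmetic, so no rounding error accumulates while combining.
        return sum(choose(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(k, n + 1))

    # per-server reliability expressed as an exact rational, e.g. 19/20
    print(exact_survival(Fraction(19, 20), 3, 10))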
> Therefore, if our estimate tells us
> that there is less than a 10^-9 chance of accidentally losing a file,
> we should not rely on that estimate. It can be considered only an
> upper-bound on the reliability.
It depends on what you mean by "rely on". If you mean "just figure that we're
safe forever", then you're absolutely right that we shouldn't. The estimates should be
continually refined by validating the input failure probabilities (at the
per-server level there are enough failures to test the assumptions). As data
is accumulated over time, the whole-system reliability estimates will
fluctuate. When they fluctuate downward, the system should compensate by
tuning the various parameters that affect reliability.
For large networks (assuming that actually happens), you can push on it from
the other direction as well. Given a 10^-9 probability of a file being lost,
and 100 million files spread over enough servers that we can reasonably call the
file failure modes independent, the math says there's a 10% probability that
at least one file will be lost -- so now you're back in the realm of
measurable failures which can be used to validate the assumptions.
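For anyone who wants to check that figure, the arithmetic is a one-liner
(the log1p/expm1 form just avoids the cancellation you'd otherwise have to
worry about):

    from math import log1p, expm1

    n_files = 10**8          # 100 million files
    p_loss = 1e-9            # per-file loss probability

    # P(at least one file lost) = 1 - (1 - p_loss)^n_files
    p_any = -expm1(n_files * log1p(-p_loss))
    print(p_any)             # ~0.095, i.e. roughly 10%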
> The other basic argument is that failure probability of tahoe servers
> are not independent of each other.
Of course not. That's the reason I switched from using a straight binomial
distribution calculation to composing PMFs via convolution. With the
approach described in my paper you can DIRECTLY model the common failure
modes. Of course, you're still left with estimating the probability that
Brian goes berserk in the machine room, and I don't know how you do that, but
assuming you can come up with a reasonable number, the math can factor it in
correctly.
More generally, you make a list of individual and group failure modes, and an
estimate for each, and I can combine them all to give you an overall estimate
that is exactly as accurate as the inputs, no more and no less.
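To sketch what I mean by composing PMFs (this is simplified from the paper,
and the per-share numbers and the "rack" grouping below are made up purely
for illustration): build a PMF over "number of surviving shares" for each
group, fold the group's common failure mode into it, and convolve the groups
together.

    def convolve(pmf_a, pmf_b):
        # Convolve two PMFs over "number of surviving shares";
        # pmf[i] is the probability that exactly i shares survive.
        out = [0.0] * (len(pmf_a) + len(pmf_b) - 1)
        for i, pa in enumerate(pmf_a):
            for j, pb in enumerate(pmf_b):
                out[i + j] += pa * pb
        return out

    def share_pmf(p):
        # A single share survives with probability p.
        return [1.0 - p, p]

    def group_pmf(p_group_up, share_probs):
        # A group of shares with a common failure mode: with probability
        # (1 - p_group_up) the whole group is lost; otherwise the shares
        # fail independently.
        pmf = [1.0]
        for p in share_probs:
            pmf = convolve(pmf, share_pmf(p))
        mixed = [p_group_up * x for x in pmf]
        mixed[0] += 1.0 - p_group_up
        return mixed

    # two racks, each with its own common failure mode
    rack1 = group_pmf(0.999, [0.95, 0.95, 0.95])
    rack2 = group_pmf(0.999, [0.90, 0.95])
    total = convolve(rack1, rack2)

    k = 3
    print(sum(total[k:]))    # probability that at least k shares survive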
Clearly, then, the trick is figuring out how to obtain and verify good
estimates for the various failure modes. I think that can be done, though.
And where there's doubt, the answer is to be pessimistic. If your
pessimistic estimates still result in adequate overall reliability, then
fine. If not, then you need to adjust k, N, A and L until you do get
acceptable outputs with pessimistic inputs.
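As a trivial illustration of that last step (ignoring A and L, and using the
independent-server binomial model only because it keeps the example short):
pick a deliberately pessimistic per-server estimate and walk N up until the
loss probability meets the target.

    from math import factorial

    def choose(n, i):
        return factorial(n) // (factorial(i) * factorial(n - i))

    def loss_probability(p, k, n):
        # Probability that fewer than k of n shares survive, with each
        # share independently surviving with probability p.
        return sum(choose(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(k))

    p_pessimistic = 0.90     # deliberately low per-server estimate
    target = 1e-9            # acceptable per-file loss probability
    k = 3
    for n in range(k, 31):
        if loss_probability(p_pessimistic, k, n) <= target:
            print("k=%d, N=%d meets the target with pessimistic inputs"
                  % (k, n))
            break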
Shawn.