[tahoe-dev] Estimating reliability
Shawn Willden
shawn-tahoe at willden.org
Fri Jan 9 13:07:28 PST 2009
On Thursday 08 January 2009 07:56:17 pm Brian Warner wrote:
> The basic operation is "UTBF", upper-tail-binomial-distribution, sum([P("i
> shares remain") for i in range(k,N+1)])
That makes perfect sense to me, assuming servers are independent -- which is a
reasonable assumption for the case I care about: a backup network of
geographically-distributed home computers.
> I came up with some approximations, because dealing with numbers that are
> so close to 1.0 wasn't working too well numerically.
That's easy to fix. Instead of working with numbers that are close to 1,
which causes significant digits to fall off the end of the mantissa,
restate the problem so the numbers are close to zero and you can use the
full precision of the mantissa.
So what you want to look at is the lower tail of the PMF, not the upper.
That is, you calculate the probability that not enough servers survive and
subtract from 1 to find the probability of survival. Specifically, if K is
the random variable representing the number of servers that survive,
calculate:
1 - sum(Pr[K=i] for 0 <= i < k)
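A minimal sketch of that in Python (my own illustration, not anything in
Tahoe; assumes independent servers that each survive the period with
probability p):

    from math import comb  # binomial coefficient

    def p_file_loss(k, N, p):
        # Lower-tail binomial sum: probability that fewer than k of
        # the N servers survive, i.e. the probability of file loss.
        return sum(comb(N, i) * p**i * (1 - p)**(N - i) for i in range(k))

    def p_file_survival(k, N, p):
        return 1 - p_file_loss(k, N, p)

The terms being summed are all tiny, so double precision keeps its
significant digits instead of losing them off the end of numbers near 1.0.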
If you want to be sure you're not losing precision, though, there's another
way: Use an arbitrary-precision math library like mpmath. I did some quick
testing with Mathematica, using both machine and infinite precision, and I
found that the error from using machine precision (double) was clear out in
the 14th decimal place. Not an issue. BTW, here are numbers for k=3, N=10
and a few values of p (where p is the probability of server survival in
period A):
For p = .9, P[file loss in A] = 3E-7
For p = .99, P[file loss in A] = 4E-15
For p = .999, P[file loss in A] = 4E-23
For p = .999999, P[file loss in A] = 5E-47
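For anyone who wants to reproduce or extend those figures without
Mathematica, here's a quick mpmath sketch (my own; the numbers above came
from Mathematica):

    from mpmath import mp, binomial, mpf

    mp.dps = 50  # work with 50 decimal digits

    def p_file_loss_mp(k, N, p):
        # Same lower-tail binomial sum, at arbitrary precision.
        p = mpf(p)
        return sum(binomial(N, i) * p**i * (1 - p)**(N - i)
                   for i in range(k))

    for p in ('0.9', '0.99', '0.999', '0.999999'):
        print(p, p_file_loss_mp(3, 10, p))

Passing p as a string keeps mpmath from inheriting double-precision
rounding in the inputs.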
Extending to m periods is straightforward:
(1 - sum(Pr[K=i] for 0 <= i <= k-1))^m
Extending to more than one file is not so straightforward, unless there are
only N peers in the network, in which case all files live or die together,
or unless there are a very large number of peers in the network, such that
every file has a disjoint set of servers. In that case, if n is the number
of files, it's:
(1 - sum(Pr[K=i] for 0 <= i <= k-1))^(nm)
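Both extensions together, as a sketch (again my own illustration; the
per-file independence only holds in that disjoint-sets case):

    from math import comb

    def p_all_files_survive(k, N, p, m, n):
        # n files on disjoint k-of-N server sets, over m periods,
        # each server surviving a period with probability p.
        p_loss = sum(comb(N, i) * p**i * (1 - p)**(N - i)
                     for i in range(k))
        return (1 - p_loss) ** (m * n)

    # e.g. 100 files over 12 periods, k=3, N=10, p=0.99:
    print(p_all_files_survive(3, 10, 0.99, 12, 100))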
> I used decibels, from
> the EE world, with "10 dBA" (decibels of Availability) meaning a 10% chance
> of failure / 90% chance of success, "20 dBA" meaning a 1% failure / 99%
> success, 30 dBA = 0.1% failure / 99.9% success, etc.
Hmm. The logarithm function is monotonic, but if you start summing logs,
you're doing something very different. I'd have to think about it, but I
don't think this gives you what you think it gives you.
In any case, I think it shouldn't be an issue to calculate the cumulative
LTBF to k-1 with floating point numbers, and that gives you the probability
of file loss.
Based on this, I think I have everything I need to calculate a reasonable
estimate of reliability/availability (the distinction isn't relevant for
small networks of people who can call each other to put servers back
online) for backup networks of geographically-distributed home computers.
Thanks!
Shawn.