[tahoe-dev] Estimating reliability

Brian Warner warner-tahoe at allmydata.com
Thu Jan 8 18:56:17 PST 2009


> Has anyone done any work on code to estimate reliability, given assumptions
> for k, N, peer network size and peer availability? More to the point, has
> anyone written any code to compute recommended values for k and N, to
> achieve specified reliability for a network of a given size and an
> assumption about peer availability?

I've put a couple of days into this, but not recently. My PyCon paper
(http://allmydata.org/~warner/pycon-tahoe.html) has a section near the
end. Tahoe's built-in 'provisioning page' (reachable from the web root)
is one of the artifacts. docs/architecture.txt in the tahoe source tree
(towards the end) contains a brief overview. Finally,
http://allmydata.org/pipermail/tahoe-dev/2007-September/000178.html
contains some brief discussion of the issue. I don't think I've managed to
write up my work online yet... there are a lot of diagrams in my notebook,
though.

The rough idea is that we model servers as independent, being online and
available some high percentage of the time, and having some multi-year
lifetime before the hard drive fails permanently. I've been using 90%, 95%,
99%, 99.9%, 99.99%, and 99.999% as a range in my tables for the former. For
the latter, nobody really knows what sort of model to use, so I've played
with a uniform distribution (random.uniform(0, 10*YEAR)) and an exponential
with a 5yr half-life.
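
For concreteness, a minimal sketch of those two lifetime models (the YEAR
constant and helper names are just illustrative, not anything in the tahoe
code):

    import math
    import random

    YEAR = 365 * 24 * 3600  # seconds

    def sample_lifetime_uniform():
        # uniform model: the disk dies at some point in the next 10 years
        return random.uniform(0, 10 * YEAR)

    def sample_lifetime_exponential(half_life=5 * YEAR):
        # exponential model: constant hazard rate, half the disks dead
        # by the half-life
        return random.expovariate(math.log(2) / half_life)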

Availability is the question of which servers have shares right now.
Reliability is the question of which servers will have shares in the future.
We can combine the two by phrasing the question as: starting T seconds from
now, I will try to download the file, waiting up to N seconds to gather the
shares I need; what is the chance that I will succeed? You can then think of
"availability" as the number we get as N goes to zero, and "reliability" as
the number we get as N goes to infinity.

We also model a file-check operation occurring every A seconds (probably once
a month: this is a bandwidth vs reliability tradeoff), and a "repair
threshold" R (perhaps set to 7 when using 3-of-10 encoding: too low and we
risk losing files, too high and we spend a lot on unnecessary repair). If the
file-check sees the file is below the repair threshold, it performs a repair.
We assume that repair happens instantly, even though really there will be a
queue (with the least healthy files going first). So the interesting thing to
look at is the probability that the file goes from R shares to fewer than k
shares during an A-second window of time. You can think of the number of
shares as decaying exponentially, starting at N at the moment of
upload/repair, falling as individual servers/disks fail, and rising when a
repair operation takes place (which happens when a check finds the number of
shares to be below R).
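
A rough sketch of that check/repair cycle, under the same simplifications
(independent failures, instantaneous repair); the per-window failure
probability is a made-up parameter:

    import random

    def simulate_one_file(N=10, k=3, R=7, windows=120, p_fail_per_window=0.01):
        # Track one file across repeated A-second check windows.
        # Returns the window at which the file became unrecoverable, or None.
        shares = N  # freshly uploaded
        for w in range(windows):
            # each surviving share is lost independently during this window
            shares = sum(1 for _ in range(shares)
                         if random.random() > p_fail_per_window)
            if shares < k:
                return w      # fewer than k shares left: the file is gone
            if shares < R:
                shares = N    # the checker sees health below R and repairs
        return None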

The basic operation is "UTBF", the upper tail of the binomial distribution:
sum([P("i shares remain") for i in range(k, N+1)]). The binomial CDF has an S
shape (roughly 0 at 0, 50% around P("one server fails")*numservers, 100% at
numservers), but when we're dealing with very reliable servers it is heavily
skewed (the distribution only looks like a symmetric bell curve at p=0.5, and
we're talking p=0.999).
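
Written out exactly, with p being the per-server availability (just a sketch;
math.comb supplies the binomial coefficient):

    from math import comb

    def utbf(p, k, N):
        # upper tail of the binomial distribution: probability that at
        # least k of N independent servers (each up with probability p)
        # are holding their share right now
        return sum(comb(N, i) * p**i * (1 - p)**(N - i)
                   for i in range(k, N + 1))

    # e.g. 3-of-10 with 90%-available servers: utbf(0.9, 3, 10) is about 0.9999996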

I came up with some approximations, because dealing with numbers that are so
close to 1.0 wasn't working too well numerically. I used decibels, from the
EE world, with "10 dBA" (decibels of Availability) meaning a 10% chance of
failure / 90% chance of success, "20 dBA" meaning a 1% failure / 99% success,
30 dBA = 0.1% failure / 99.9% success, etc.
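
In other words, dBA is just -10*log10 of the failure probability; a couple of
helper functions, purely for illustration:

    import math

    def dBA(p_success):
        # decibels of Availability: 90% -> 10 dBA, 99% -> 20 dBA, 99.9% -> 30 dBA
        return -10 * math.log10(1 - p_success)

    def from_dBA(d):
        # inverse: 20 dBA -> 0.99
        return 1 - 10 ** (-d / 10)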

Then I approximated the UTBF by noticing that the failure probability (one
minus the UTBF) appears to be dominated by the i=k-1 component: if you have 10
shares and need 3 to recover the file, you win if you have "i in
(3,4,5,6,7,8,9,10)" shares left, you lose if you have "i in (0,1,2)" shares
left, and the chance of ending up with i=2 shares is drastically larger than
the chance of ending up with i=1. So I discarded all the [i in range(0, k-1)]
terms and looked only at the i=k-1 term. P(i) is basically
(p**i) * ((1-p)**(N-i)) * (N-choose-i), and in the cases I examined, p**i was
effectively 1. When I looked at everything in terms of dBA (i.e. take the log
of both sides), this resulted in a formula in which the dBA of the file
overall looked like X+Y*Z, where the X comes from the N-choose-i term, the Y
is the per-server availability in dBA, and the Z comes from the N-i term.
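
A sketch of that approximation next to the exact sum, just to show the X+Y*Z
shape (the function names are mine, and the 3-of-10/99% numbers in the comment
are only an example):

    import math
    from math import comb

    def dBA(p_success):
        return -10 * math.log10(1 - p_success)

    def file_dBA_exact(p, k, N):
        p_fail = sum(comb(N, i) * p**i * (1 - p)**(N - i) for i in range(0, k))
        return -10 * math.log10(p_fail)

    def file_dBA_approx(p, k, N):
        # keep only the dominant i=k-1 term and treat p**(k-1) as 1:
        #   dBA(file) ~= X + Z * dBA(server)
        X = -10 * math.log10(comb(N, k - 1))  # from the N-choose-i term
        Z = N - k + 1                         # from the (1-p)**(N-i) term
        return X + Z * dBA(p)

    # 3-of-10 with 99%-available servers: both come out around 143.5 dBA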

The end result was that the availability of the file grows exponentially with
the availability of a given server, and the exponent is basically equal to
(N-k+1), i.e. the number of shares you have to lose before you lose the file.
FEC drastically magnifies the availability of an individual server. (Note
that the opposite is sort of true: if we had crummy servers with p<0.5, the
availability would be horrible.)

I haven't figured out how to translate this per-file availability analysis
into long-term reliability numbers, but my plan was to build some simulations
and maybe come up with a table like "out of 10000 runs, starting with R
shares and simulating for A seconds, 73 had fewer than k shares at the end of
the run", which would tell us that worst-case (i.e. all files were right at
the repair threshold at the last check), we'd lose .73% of all files each A
seconds. Really we'd have a distribution of healths at the file-check time,
so things would be a bit better.
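
Something along these lines (the disk half-life and the exponential-failure
assumption are just placeholders): every run starts at the repair threshold R
and we count how many drop below k within one A-second window.

    import random

    YEAR = 365 * 24 * 3600
    MONTH = 30 * 24 * 3600

    def worst_case_losses(runs=10000, R=7, k=3, A=MONTH, disk_half_life=5 * YEAR):
        # worst case: every file sits at exactly R shares when the window starts
        # chance that one disk (exponential lifetime) dies within A seconds:
        p_die = 1 - 2 ** (-A / disk_half_life)
        lost = 0
        for _ in range(runs):
            survivors = sum(1 for _ in range(R) if random.random() > p_die)
            if survivors < k:
                lost += 1
        return lost  # divide by runs for the worst-case loss fraction per A seconds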

The overall idea is that all files have a half-life (if you model them with
the ubiquitous exponential distribution), or equivalently a certain chance of
surviving one more year. A single copy (k=1,N=1) has a half-life equal to
that of a single disk. An expanded copy without repair (e.g.
k=3,N=10,A=infinity) has a longer half-life. A maintained copy
(k=3,N=10,A=1month) has a much longer (but still not infinite) half-life.
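
Under that exponential assumption, a per-window loss rate translates directly
into a half-life; a tiny example with placeholder numbers:

    import math

    def half_life(p_loss_per_window, window_seconds):
        # if the file independently survives each window with probability
        # (1 - p_loss), the half-life is where survival probability hits 50%
        windows = math.log(0.5) / math.log(1 - p_loss_per_window)
        return windows * window_seconds

    MONTH = 30 * 24 * 3600
    # e.g. losing 0.73% of files per month: half_life(0.0073, MONTH) is
    # about 95 months, i.e. roughly 8 years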

N/k (the expansion factor) controls disk usage. A and R control network
bandwidth (for checking and repairing). All four contribute to
reliability/half-life.


This could be a great academic paper. I've been looking for a grad student to
feed (I hear free lunch can get you a lot of paper-writing time :).

If you find yourself interested in doing this sort of analysis, I'd love to
sit down with you for a while. I'll buy the beer and supply the whiteboard.

cheers,
 -Brian
