[tahoe-dev] Estimating reliability
Shawn Willden
shawn-tahoe at willden.org
Mon Jan 12 01:09:28 PST 2009
On Thursday 08 January 2009 07:56:17 pm Brian Warner wrote:
> This could be a great academic paper. I've been looking for a grad student
> to feed (I hear free lunch can get you a lot of paper-writing time :).
>
> If you find yourself interested in doing this sort of analysis, I'd love to
> sit down with you for a while. I'll buy the beer and supply the whiteboard.
Okay, I'm not a grad student (never went to grad school), don't drink beer and
am probably a thousand miles from your whiteboard (and a free lunch), but I
think I may write that paper.
Getting the simple case was easy enough, per my previous message, but I've
been noodling on it and I think I have a handle on the harder cases,
including one important bit of complexity that you didn't mention: Not all
servers have the same reliability. We actually have a generalized binomial
distribution, where p is really p_i, and may be different for each "trial"
(share).
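
This is what's sometimes called a Poisson binomial distribution. Here's a
minimal sketch of computing its PMF directly, folding in one two-point PMF per
share (the helper name and the probabilities are mine, purely for illustration):

    # Survivor-count PMF when each share i survives independently with its
    # own probability p_i (the "generalized binomial" case).
    def survivor_pmf(probs):
        pmf = [1.0]                         # PMF of "zero shares considered so far"
        for p in probs:
            new = [0.0] * (len(pmf) + 1)
            for x, q in enumerate(pmf):
                new[x]     += q * (1 - p)   # this share dies
                new[x + 1] += q * p         # this share survives
            pmf = new
        return pmf                          # pmf[x] == Pr(exactly x survivors)

    # e.g. five shares with differing reliabilities (illustrative numbers):
    print(survivor_pmf([0.9, 0.9, 0.8, 0.95, 0.7]))
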
Suppose random variable X represents the number of survivors from some set of
shares over some period A. Note that this set need not be the entire set of
shares for a file; it's just some non-empty set of size somewhere between 1
and N. Pr(X=x) is the probability that exactly x of those shares survive the
period.
Now suppose r.v. Y represents the number of surviving shares from another set.
Further suppose that those shares are from the same file as the shares in the
set associated with X, but the two sets are disjoint.
The r.v. Z obtained by summing X and Y is therefore the number of surviving
shares from the union of the two sets. If we have random variables
representing various disjoint sets of shares of the file, we can add them all
up to get a random variable that represents the whole set.
How does that help?
Well, given a set of shares that all have the same reliability p (and are all
independent), the number of survivors follows a standard binomial
distribution, for which we can easily calculate the PMF.
So, if we break the shares for a file into sets of similar shares (shares with
identical reliability), then for each set we can calculate the PMF of the
random variable for that set.
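
For one such set of n shares, each surviving independently with probability p,
that PMF is just the binomial formula; a quick sketch (binom_pmf is my own
throwaway helper):

    from math import comb

    def binom_pmf(n, p):
        # Pr(exactly x of n independent shares survive), for x = 0..n
        return [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

    print(binom_pmf(3, 0.9))   # roughly [0.001, 0.027, 0.243, 0.729]
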
If we can then compute a PMF for the sum of all of those random variables,
then we have the PMF needed to find the probability that at least k shares
survive. The trick here is something called the discrete convolution
theorem, which says that given two independent random variables X and Y, with PMFs given
by functions f(x) and g(y), the PMF of Z=X+Y is given by (f*g)(z), where "*"
is the convolution operator. Convolution is commutative and associative, so
given a group of PMFs representing disjoint subsets of a file's shares, we
can repeatedly convolve them to generate a full share set PMF.
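
A sketch of that step (names are mine, nothing Tahoe-specific):

    from functools import reduce

    def convolve(f, g):
        # PMF of Z = X + Y for independent X and Y with PMFs f and g
        h = [0.0] * (len(f) + len(g) - 1)
        for x, fx in enumerate(f):
            for y, gy in enumerate(g):
                h[x + y] += fx * gy
        return h

    # full_pmf = reduce(convolve, [pmf_of_set_1, pmf_of_set_2, ...])
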
This provides a way to deal with non-independent shares, as well. Suppose
that two shares live on the same server, so they live or die together. If
that server has a reliability of 0.9, then the PMF for those two shares is:
x = 0: 0.1
x = 1: 0
x = 2: 0.9
There's a 10% chance that the server fails, which means neither share
survives, and a 90% chance the server survives, with both shares. There is
no possible way to get a single share surviving on that server (under the
assumption that they live or die together).
If the file has five shares total, and the other three live on separate
servers, each with reliability .9, then the binomial distribution gives us
the following PMF for that share set:
y = 0: 0.001
y = 1: 0.027
y = 2: 0.243
y = 3: 0.729
Applying the convolution operator to those two PMFs gives the combined PMF:
z = 0: .0001
z = 1: .0027
z = 2: .0252
z = 3: .0972
z = 4: .2187
z = 5: .6561
Assuming k=2, we sum the z=0 and z=1 cases in the PMF to get a probability
of .0028 that not enough shares survive to recover the file, and therefore a
probability of .9972 that the file survives.
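
Plugging the example into the convolve sketch above reproduces these numbers:

    same_server = [0.1, 0.0, 0.9]              # two shares, live or die together
    separate = [0.001, 0.027, 0.243, 0.729]    # three independent shares, p = 0.9
    z = convolve(same_server, separate)
    # z == [0.0001, 0.0027, 0.0252, 0.0972, 0.2187, 0.6561]
    k = 2
    p_loss = sum(z[:k])                        # Pr(fewer than k shares survive)
    print(p_loss, 1 - p_loss)                  # ~0.0028, ~0.9972
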
This approach should also provide a way to deal with multiple causes of server
failure, some of which are independent and some of which are not. For
example, suppose we can build the PMF for a set of servers that all live in
the same co-lo. This PMF must take into account both independent failures
(e.g. hardware, administrative screwups, etc.) and site failures (e.g. fire).
Given that PMF, we can then convolve it with the PMFs for the other servers
with shares. I haven't worked out how to construct a PMF from multiple
failure factors (and I'm too tired to think it through right now), but my
intuition says it's easy.
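
For what it's worth, here's my first guess at that construction (just a sketch,
with illustrative numbers): condition on whether the site itself survives. If
the site fails, every share there dies; if it survives, the servers fail
independently as before.

    from math import comb

    def colo_pmf(p_site, n, p_server):
        # PMF for n single-share servers in one co-lo: a site failure (fire,
        # etc.) kills all of them; otherwise each fails independently.
        inside = [comb(n, x) * p_server**x * (1 - p_server)**(n - x)
                  for x in range(n + 1)]
        pmf = [p_site * q for q in inside]
        pmf[0] += 1 - p_site                   # site loss => zero survivors
        return pmf

    # e.g. three servers, 99% site survival, 90% per-server survival:
    # colo_pmf(0.99, 3, 0.9)
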
If I can work that out, then we can devise a taxonomy of failure modes and
estimate each of them for each server and set of servers, and combine it all
to generate a full share set PMF that takes all of the issues into account.
Using two sets of PMFs, one that includes intermittent failures (perhaps
duration weighted) and one that includes only permanent failures, we can
analyze availability vs. reliability.
At that point, the only other issues to address are related to repair, and how
to include A and R. A affects all of the probabilities that go into
constructing the PMFs. I think I'll assume that the survivability
probabilities are annual and de-compound them to get monthly (or whatever)
probabilities. R is an interesting case.
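
To be concrete about the de-compounding step (constant failure rate over the
year is my assumption here): an annual survival probability p becomes a
per-period probability p**(1/m) for m periods per year.

    def decompound(p_annual, periods_per_year):
        # Annual survival probability -> per-period survival probability,
        # assuming the failure rate is constant across the year.
        return p_annual ** (1.0 / periods_per_year)

    print(decompound(0.9, 12))   # ~0.9913 monthly survival for 0.9 annual
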
If R = N, then at the end of each repair cycle the share set is complete and
the failure probability for the next period is given by the PMF of the
current share set (which may have changed after the repair).
If R < N, then some repair cycles will make no repairs, even though there are
up to N-R failed shares. A Markov chain is then needed to account for the
changed probabilities at each cycle, and the computations may become
intractable. Not sure.
However, I propose that we should handle this issue a different way, and
eliminate R from the model altogether. Instead, we should replace it by t,
which represents the acceptable failure tolerance for the next cycle. Then
the repair agent's job is to calculate, for each file, the probability that the
file will become unreconstructable by the end of the next period, and if that
probability exceeds t, then the repair agent should deploy additional shares
to bring it below t.
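
In code, the check might look something like this (a sketch; the PMF is the
whole-file PMF built as above):

    def needs_repair(pmf, k, t):
        # pmf[x] = Pr(exactly x shares survive the next period)
        p_loss = sum(pmf[:k])      # Pr(fewer than k survive => unrecoverable)
        return p_loss > t

    # if needs_repair(full_pmf, k, t): deploy more shares, recompute, and
    # repeat until the predicted loss probability is back under t.
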
That approach has the advantage that, for every repair cycle, we can assume
that the loss probability is no greater than t. Extrapolating from that to
loss probability over a series of repair periods is then trivial. Even
better, we can work backwards. You tell me what probability of loss is
acceptable over, say, a 10-year period, and I can tell you what t should be,
and what k should be for some assumed share reliability numbers.
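
The backwards calculation is the same de-compounding idea: if P is the
acceptable loss probability over n repair periods, and each period loses the
file with probability at most t independently, then t = 1 - (1 - P)**(1/n).

    def per_period_tolerance(p_loss_total, n_periods):
        # Largest per-period loss tolerance t such that the chance of losing
        # the file over n_periods periods stays below p_loss_total, treating
        # the periods as independent.
        return 1 - (1 - p_loss_total) ** (1.0 / n_periods)

    # 1% acceptable loss over 10 years of monthly repair cycles:
    print(per_period_tolerance(0.01, 120))   # ~8.4e-05
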
Based on t and the file share set PMF, I can also calculate, on average, how
much reconstruction work will be needed, and therefore the bandwidth
required. Want to alter A? No problem, just recalculate t and the bandwidth
requirements.
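
A crude version of that estimate (my sketch, assuming repair brings the file
all the way back to N shares each cycle): the expected number of shares
regenerated per cycle is E[N - Z].

    def expected_shares_repaired(pmf):
        # Expected number of missing shares at the end of a cycle; multiply
        # by the share size to estimate per-cycle repair bandwidth.  This
        # assumes every missing share gets regenerated, which overstates the
        # work if the agent only repairs when the loss probability exceeds t.
        n = len(pmf) - 1
        return sum((n - z) * pr for z, pr in enumerate(pmf))
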
I need to sleep so I'll stop here.
Shawn.