[tahoe-dev] Estimating reliability

Shawn Willden shawn-tahoe at willden.org
Mon Jan 12 01:09:28 PST 2009


On Thursday 08 January 2009 07:56:17 pm Brian Warner wrote:
> This could be a great academic paper. I've been looking for a grad student
> to feed (I hear free lunch can get you a lot of paper-writing time :).
>
> If you find yourself interested in doing this sort of analysis, I'd love to
> sit down with you for a while. I'll buy the beer and supply the whiteboard.

Okay, I'm not a grad student (never went to grad school), don't drink beer and 
am probably a thousand miles from your whiteboard (and a free lunch), but I 
think I may write that paper.

Getting the simple case was easy enough, per my previous message, but I've 
been noodling on it and I think I have a handle on the harder cases, 
including one important bit of complexity that you didn't mention:  Not all 
servers have the same reliability.  We actually have a generalized binomial 
distribution, in which the single parameter p becomes a per-share p_i that may 
differ for each "trial" (share).

Suppose random variable X represents the number of survivors from some set of 
shares over some period A.  Note that this set need not be the entire set of 
shares for a file; it's just some non-empty set of size somewhere between 1 
and N.  Pr(X=x) is the probability that exactly x of those shares survive the 
period.

Now suppose r.v. Y represents the number of surviving shares from another set.  
Further suppose that those shares are from the same file as the shares in the 
set associated with X, but the two sets are disjoint.

The r.v. Z obtained by summing X and Y is therefore the number of surviving 
shares from the union of the two sets.  If we have random variables 
representing various disjoint sets of shares of the file, we can add them all 
up to get a random variable that represents the whole set.

How does that help?

Well, given a set of shares that all have the same reliability p (and are all 
independent), the number of survivors follows a standard binomial 
distribution, for which we can easily calculate the PMF.

So, if we break the shares for a file into sets of similar shares (shares with 
identical reliability), then for each set we can calculate the PMF of the 
random variable for that set.
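
To make that concrete, here is a little Python sketch of the per-set PMF
(nothing Tahoe-specific; the function and parameter names are just
illustrative):

    from math import comb

    def binomial_pmf(n, p):
        """PMF of the number of survivors among n independent shares,
        each with survival probability p over the period."""
        return [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

    # e.g. three shares, each on its own server with reliability 0.9
    print(binomial_pmf(3, 0.9))   # ~[0.001, 0.027, 0.243, 0.729]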

If we can then compute a PMF for the sum of all of those random variables, 
then we have the PMF needed to find the probability that at least k shares 
survive.  The trick here is something called the discrete convolution 
theorem, which says that given two random variables X and Y, with PMFs given 
by functions f(x) and g(y), the PMF of Z=X+Y is given by (f*g)(z), where "*" 
is the convolution operator.  Convolution is commutative and associative, so 
given a group of PMFs representing disjoint subsets of a file's shares, we 
can repeatedly convolve them to generate a full share set PMF.
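
In code that's just a nested loop plus a reduce.  An illustrative Python
sketch, where each PMF is a plain list indexed by the number of surviving
shares:

    from functools import reduce

    def convolve(f, g):
        """PMF of X + Y, given the PMFs of independent X and Y."""
        h = [0.0] * (len(f) + len(g) - 1)
        for x, fx in enumerate(f):
            for y, gy in enumerate(g):
                h[x + y] += fx * gy
        return h

    def full_share_set_pmf(pmfs):
        """Repeatedly convolve the PMFs of disjoint subsets of shares."""
        return reduce(convolve, pmfs)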

This provides a way to deal with non-independent shares, as well.  Suppose 
that two shares live on the same server, so they live or die together.  If 
that server has a reliability of 0.9, then the PMF for those two shares is:

x = 0: 0.1
x = 1: 0
x = 2: 0.9

There's a 10% chance that the server fails, which means neither share 
survives, and a 90% chance the server survives, with both shares.  There is 
no possible way to get a single share surviving on that server (under the 
assumption that they live or die together).

If the file has five shares total, and the other three live on separate 
servers, each with reliability .9, then the binomial distribution gives us 
the following PMF for that share set:

y = 0: 0.001
y = 1: 0.027
y = 2: 0.243
y = 3: 0.729

Applying the convolution operator to those two PMFs gives the combined PMF

z = 0: .0001
z = 1: .0027
z = 2: .0252
z = 3: .0972
z = 4: .2187
z = 5: .6561

Assuming k=2, we sum the z=0 and z=1 cases in the PMF to get a probability 
of .0028 that not enough shares survive to recover the file, and therefore a 
probability of .9972 that the file survives.
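
The whole worked example, as a quick sanity check (numpy.convolve computes the
same discrete convolution; the particular library is incidental):

    import numpy as np

    pair = [0.1, 0.0, 0.9]                  # two co-located shares, server reliability 0.9
    singles = [0.001, 0.027, 0.243, 0.729]  # three independent shares, Binomial(3, 0.9)

    combined = np.convolve(pair, singles)
    # -> [0.0001, 0.0027, 0.0252, 0.0972, 0.2187, 0.6561]

    k = 2
    p_loss = combined[:k].sum()   # probability that fewer than k shares survive
    print(p_loss, 1 - p_loss)     # ~0.0028, ~0.9972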

This approach should also provide a way to deal with multiple causes of server 
failure, some of which are independent and some of which are not.  For 
example, suppose we can build the PMF for a set of servers that all live in 
the same co-lo.  This PMF must take into account both independent failures 
(e.g. hardware, administrative screwups, etc.) and site failures (e.g. fire).  
Given that PMF, we can then convolve it with the PMFs for the other servers 
with shares.  I haven't worked out how to construct a PMF from multiple 
failure factors (and I'm too tired to think it through right now), but my 
intuition says it's easy.
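
For the simplest case, a single site-wide failure mode that takes out every
server in the co-lo plus independent per-server failures, my guess is it would
look something like the following (just a sketch of the shape, with made-up
names, not a worked-out model):

    from math import comb

    def colo_pmf(n, p_server, p_site):
        """Rough guess: n shares on n distinct servers in one co-lo.  A
        site-wide failure (probability 1 - p_site) destroys all of them;
        otherwise servers fail independently with survival prob p_server."""
        pmf = [p_site * comb(n, x) * p_server**x * (1 - p_server)**(n - x)
               for x in range(n + 1)]
        pmf[0] += 1 - p_site   # site failure adds to the "zero survivors" case
        return pmf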

If I can work that out, then we can devise a taxonomy of failure modes and 
estimate each of them for each server and set of servers, and combine it all 
to generate a full share set PMF that takes all of the issues into account.  
Using two sets of PMFs, one that includes intermittent failures (perhaps 
duration weighted) and one that includes only permanent failures, we can 
analyze availability vs. reliability.

At that point, the only other issues to address are related to repair, and how 
to include A and R.  A affects all of the probabilities that go into 
constructing the PMFs.  I think I'll assume that the survivability 
probabilities are annual and de-compound them to get monthly (or whatever) 
probabilities.  R is an interesting case.
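
The de-compounding itself is simple if we assume (and this is an assumption)
that failures are spread evenly across the year, i.e. a constant failure rate:

    def per_period_survival(p_annual, periods_per_year=12):
        """Per-sub-period survival probability, assuming a constant
        failure rate across the year."""
        return p_annual ** (1.0 / periods_per_year)

    print(per_period_survival(0.9))   # ~0.9913 monthly, for 0.9 annual survival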

If R = N, then at the end of each repair cycle the share set is complete and 
the failure probability for the next period is given by the PMF of the 
current share set (which may have changed after the repair).

If R < N, then some repair cycles will make no repairs, even though there are 
up to N-R failed shares.  A Markov chain is then needed to account for the 
changed probabilities at each cycle, and the computations may become 
intractable.  Not sure.

However, I propose that we should handle this issue a different way, and 
eliminate R from the model altogether.  Instead, we should replace it by t, 
which represents the acceptable failure tolerance for the next cycle.  Then 
the repair agent's job is to calculate for each file the probability that 
file will become unreconstructable by the end of the next period, and if that 
probability exceeds t, then the repair agent should deploy additional shares 
to bring it below t.

That approach has the advantage that, for every repair cycle, we can assume 
that the loss probability is no greater than t.  Extrapolating from that to 
loss probability over a series of repair periods is then trivial.  Even 
better, we can work backwards.  You tell me what probability of loss is 
acceptable over, say, a 10-year period, and I can tell you what t should be, 
and what k should be for some assumed share reliability numbers.
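
For example, assuming losses in different cycles are independent, the
per-cycle tolerance falls out of the acceptable long-term loss probability
like so (the 1% over 10 years and the monthly cycle are just illustrative
numbers):

    def per_cycle_tolerance(p_loss_total, years, cycles_per_year=12):
        """Per-repair-cycle loss tolerance t such that keeping every cycle's
        loss probability below t keeps the total loss probability over the
        whole horizon below p_loss_total (assumes independent cycles)."""
        n = years * cycles_per_year
        return 1 - (1 - p_loss_total) ** (1.0 / n)

    print(per_cycle_tolerance(0.01, 10))   # ~8.4e-05 per monthly cycle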

Based on t and the file share set PMF, I can also calculate, on average, how 
much reconstruction work will be needed, and therefore the bandwidth 
required.  Want to alter A?  No problem, just recalculate t and the bandwidth 
requirements.

I need to sleep so I'll stop here.

	Shawn.

