[tahoe-dev] Kragen Sitaker, Brian Warner, Josh Wilcox, Bram Cohen, Russ Roberts, and Nassim Taleb on erasure coding probabilities

Arno Waschk hamamatsu at gmx.de
Wed Sep 26 13:00:01 PDT 2007


Is it possible that Bram means something like "<0.95 means that human
beings are using the power switch, so you would (a) need to analyze human
psychology and (b) might see correlated failures when many people are
switching off during the same baseball game?"

arno

On 26.09.2007 at 19:56, zooko <zooko at zooko.com> wrote:

> Folks:
>
> So I chatted with Kragen Sitaker on IRC last week, and asked him if
> he knew of a way to get error bounds on
>
> SUM for i=K..N ( choose(N, i) * P^i * (1-P)^(N-i) )
>
> This is the formula for the reliability of a file, if the file is
> encoded with K-out-of-N erasure coding, and each share has a
> reliability of P.
>
> Brian Warner has already calculated some approximations of this
> formula for some K,N,P that we care about (see below), but I was
> uncomfortable with those approximations because I didn't know their
> error bounds.  (Note that Josh Wilcox has already posted [1] a
> recursive relation which might prove useful for computing this value
> precisely using rationals, but I was curious if Kragen knew of
> another technique.)
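>
> For concreteness, here is a minimal sketch of the exact-rationals
> approach (assuming Python with its fractions module; the function
> name is just illustrative), computing the sum directly rather than
> via the recurrence:
>
>     from fractions import Fraction
>     from math import comb
>
>     def file_reliability(k, n, p):
>         """Probability that at least k of n shares survive, each
>         share surviving independently with probability p (pass a
>         Fraction to get an exact answer)."""
>         p = Fraction(p)
>         return sum(comb(n, i) * p**i * (1 - p)**(n - i)
>                    for i in range(k, n + 1))
>
>     # e.g. 3-out-of-10 encoding with share reliability 0.99:
>     r = file_reliability(3, 10, Fraction(99, 100))
>     print(r, float(r))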
>
> Kragen said something along the lines of: "This is assuming that the
> reliabilities of the N servers are independent, right?  I'm not sure
> that this is a fair assumption."
>
> I replied that this is a good point, and I mentioned how I rather
> agreed with Bram Cohen's quip that "Engineering for more than six
> nines is nothing but intellectual masturbation" [2].  (Although
> personally I'm more comfortable with five nines.  I can engineer for
> one-in-a-hundred-thousand events more confidently than for one-in-a-
> million.)
>
> Please see Bram's post [2] and Kragen's follow-up [3].  Bram's reply
> to Kragen is incomplete: Bram doesn't explain why servers with less
> than 0.95 reliability can't be confidently analyzed, or why they are
> more likely to have correlated failure than servers with 0.99
> reliability, and those two assertions are not intuitively obvious to me.
>
> Kragen said to me on IRC last week that Nassim Taleb has recently
> written a good book arguing that this kind of modelling is wrong.  I
> haven't read that book yet, and I didn't understand Kragen's point
> about this.  Maybe Kragen wants to follow up and explain?  Also,
> maybe Bram could follow up and explain his intriguing assertions
> from his LiveJournal.
>
> Here is a podcast in which economist Russ Roberts interviews Nassim
> Taleb about his notion of "black swans" and "mediocristan vs.
> extremistan".  It's very interesting, and parts of it are quite
> relevant to the current issue of how to model reliability: [4].
>
> Currently I believe that we can reasonably use the above formula to
> compute things with up to five nines.  We can answer questions such as:
>
>   * If K=3 and N=10, then what is the server reliability P above
> which the file reliability will be >= 0.99999?
>
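> A rough sketch of answering that question numerically, reusing the
> file_reliability function sketched above (a simple binary search on
> P works because the file reliability is monotonic in P):
>
>     def min_share_reliability(k, n, target, iterations=60):
>         """Smallest share reliability P for which the k-out-of-n
>         file reliability is at least `target`."""
>         lo, hi = 0.0, 1.0
>         for _ in range(iterations):
>             mid = (lo + hi) / 2
>             if file_reliability(k, n, Fraction(mid)) >= target:
>                 hi = mid
>             else:
>                 lo = mid
>         return hi
>
>     print(min_share_reliability(3, 10, 0.99999))
>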
> I do not think we can correctly use this formula to answer some other
> questions that we really want answered, and that we are currently
> answering with it anyway.  See the otherwise excellent provisioning
> tool that comes with Tahoe, thanks to Brian Warner -- it's on the
> welcome page -- press the button labelled "Compute" and see the
> output labelled "availability of ... all files in grid".  That value
> is an attempt to answer the question:
>
>   * If K=3 and N=10, and P=0.99, and there are 50,000,000 files, then
> what is the probability that every one of the 50,000,000 files will
> be available?
>
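> To see what that computation involves, here is a rough sketch (again
> reusing file_reliability from above, and again assuming independence
> between files, which is exactly the assumption I question below);
> note how the result hinges entirely on a per-file failure probability
> tiny enough to produce the fourteen nines mentioned further down:
>
>     from math import exp, log1p
>
>     per_file_failure = 1 - file_reliability(3, 10, Fraction(99, 100))
>     num_files = 50_000_000
>     # (1 - f)^num_files, computed in log space so the tiny f is not
>     # swallowed by rounding:
>     all_files = exp(num_files * log1p(-float(per_file_failure)))
>     print(float(per_file_failure), all_files)
>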
> There are three reasons why the above formula does not apply to the
> latter question.  First, server failures are not independent.
> Second, file failures are not independent.  Third, the accuracy of
> the computed per-file reliability is unknown, and it is possible
> that the error in that number gives rise to a critical error in the
> computation of the for-all-files reliability.
>
> So I currently believe that the answers that we have produced to the
> second question (in the provisioning tool) are not justified and
> should not be relied upon or offered as fact to customers/users/
> investors/etc.
>
> Of course, using the formula to answer the first question also runs
> afoul of the fact that server failures are not independent.  We can
> resolve this contradiction by either:
>
> 1.  Asserting that the chance of correlated failures such as arson at
> a co-lo or a critical bug in the server software is so much less than
> 1 in 100,000 that it doesn't invalidate our answer to the first
> question, or
>
> 2.  Explicitly setting out those correlated failures in our answer,
> e.g.: "Assuming that the only kind of failure is a server failure --
> e.g. an IDE hard drive crash -- which is uncorrelated with the
> failures of other servers, then the answer is BLAH."  This means that
> the answer BLAH is not valid in the event of fire, flood, war,
> terrorism, act of God, penetration by a hacker, or mistake on the
> part of the Tahoe programmers.  There is a good reason that insurance
> contracts come with that sort of language.
>
> Explicitly setting out those causes is a legitimate approach to both
> of the questions listed above.  Asserting that the cumulative
> probability of such causes is insignificant is not a legitimate
> approach to the second question listed above, because to compute an
> answer to the second question in the way that we currently do it
> requires some intermediate values which have astronomical precision
> in them, such as the "availability of ... arbitrary file" field in
> the provisioning tool, which has fourteen nines!
>
>
> By the way, this analysis explains one of my operational quirks --
> for a production environment I prefer for the software running on the
> servers to be heterogeneous rather than homogeneous.  This is of
> course the opposite of what actual operations guys want, because it
> makes it harder for them to mentally model and to manage the whole
> farm of servers, but hopefully this discussion explains the appeal of
> it: it reduces the correlation between software bugs present in
> different servers.  I think that the chance of critical bugs in
> software is much higher than the chance of arson at the co-lo, and I
> suspect that the chance of critical bugs in software may be high
> enough that it invalidates our five nines.
>
>
> Note: this is not to denigrate the quality of the Tahoe software.
> The quality is actually much higher than the average for modern
> software, in large part due to the extensive unit tests and the
> attendant programming practices.  Unfortunately, while I'm confident
> in saying that the quality of Tahoe is much higher than the average
> for modern software, I'm still not comfortable saying that there is
> less than a 1 in 100,000 chance of a critical bug.  With further
> quality assurance processes such as code reviews, even more tests,
> operational experience, and "static" software (i.e., software that
> hasn't been changed in a long time or after a long series of quality
> assurance tests/reviews), then I could become more confident in that
> number.  Certainly the clean, well-documented, well-tested, well-
> structured source code that currently makes up Tahoe is a much better
> candidate for that kind of process than most codebases.
>
>
> Regards,
>
> Zooko
>
> [1] http://allmydata.org/pipermail/tahoe-dev/2007-September/000133.html
> [2] http://bramcohen.livejournal.com/1416.html
> [3] http://bramcohen.livejournal.com/1416.html?thread=6536#t6536
> [4] http://www.econtalk.org/archives/2007/04/taleb_on_black.html
> _______________________________________________
> tahoe-dev mailing list
> tahoe-dev at allmydata.org
> http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev



-- 
http://www.arnowaschk.de
http://arnowaschk.blogspot.com
http://www.xing.com/go/invite/7018920.9c6816

