[tahoe-dev] [tahoe-lafs] #1252: use different encoding parameters for dirnodes than for files

tahoe-lafs trac at tahoe-lafs.org
Thu Jan 6 22:09:53 UTC 2011


#1252: use different encoding parameters for dirnodes than for files
-------------------------------+--------------------------------------------
     Reporter:  davidsarah     |       Owner:  davidsarah                                        
         Type:  defect         |      Status:  assigned                                          
     Priority:  major          |   Milestone:  1.9.0                                             
    Component:  code-frontend  |     Version:  1.8.0                                             
   Resolution:                 |    Keywords:  preservation availability dirnodes anti-censorship
Launchpad Bug:                 |  
-------------------------------+--------------------------------------------

Comment (by swillden):

 Replying to [comment:5 warner]:
 > I'll also throw in a +0 for Zooko's deeper message, which perhaps he
 > didn't state explicitly this particular time, which is that our
 > P(recover_node) probability is already above the
 > it-makes-sense-to-think-about-it-further threshold: the notion that
 > unmodeled real-world failures are way more likely than the
 > nice-clean-(artificial) modeled
 > all-servers-randomly-independently-fail-simultaneously failures.  Once
 > your P(failure) drops below 10^-5^ or something, any further modeling
 > is just an act of self-indulgent mathematics.

 I have to disagree with this, both with Zooko's more generic message and
 your formulation of it.

 Tahoe-LAFS files do NOT have reliabilities above the it-makes-sense-to-
 think-about-it level.  In fact, for some deployment models, the Tahoe-LAFS
 default encoding parameters provide insufficient reliability for practical
 real-world needs, even ignoring extra-model events.

 This fact was amply demonstrated by the problems observed at
 Allmydata.com.  Individual file reliabilities may appear astronomical, but
 it isn't individual file reliabilities that matter.  We're going to be
 unhappy if ANY files are lost.

 When the number of shares N is much smaller than the number of servers in
 the grid (as was the case at allmydata.com), the failure of a relatively
 tiny number of servers will destroy files with shares on all of those
 servers.  Given a large enough server set, and enough files, it becomes
 reasonable to treat each file's survival as independent and multiply the
 probabilities to compute the probability of acceptable file system
 performance -- which means that the probability of the user perceiving no
 failure isn't just p^d^ (d being the number of dirnodes), it's (roughly)
 p^t^, where t is the total number of files the user has stored.  An x^10^
 factor is one thing, but allmydata.com was facing a factor more like
 x^1,000^ or x^10,000^ on a per-user basis, and an exponent of many
 millions (billions?) for the whole system.
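
 To make those exponents concrete, here's a quick Python sketch (the
 per-file survival probability p is an assumed illustrative figure, not a
 measurement):

 {{{
 #!python
 # If each of t files independently survives with probability p, the
 # user perceives no failure with probability p**t.
 p = 1 - 1e-9   # assumed per-file survival probability (illustrative)

 for t in (10, 1000, 10000, 100000000):
     print(f"t = {t:>11}: P(some file lost) = {1 - p**t:.3g}")
 }}}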

 Given a grid of 10 servers, what is the probability that 8 of them will be
 down at one time?  What about a grid of 200 servers?  This is the factor
 that kicked allmydata.com's butt, and it wasn't any sort of black swan.
 I'm not arguing that black swans don't happen; I'm arguing that the models
 say grids like allmydata.com's have inadequate reliability using 3-of-10
 encoding.  Then you can toss black swans on top of that.

 In fact, I think for large grids you can calculate the probability of any
 file being lost with, say, eight servers out of action as the number of
 ways to choose the eight dead boxes divided by the number of ways to
 choose 10 storage servers for a file.  Assuming 200 total servers, that
 calculation says that with 8 of them down, one out of every 400 files
 would be unavailable.  That ignores the additional unreachability caused
 when some of those unavailable files are dircaps, AND it assumes uniform
 share distribution; in practice I'd expect older servers to have more
 shares, and also to be more likely to fail.
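
 For what it's worth, that estimate is easy to reproduce.  This just
 restates the ratio described above, and inherits the same
 uniform-placement assumption:

 {{{
 #!python
 import math

 # Ways to choose the 8 dead servers, divided by ways to choose the 10
 # share-holding servers for a file, out of 200 servers total.
 servers, dead, shares = 200, 8, 10
 ratio = math.comb(servers, dead) / math.comb(servers, shares)
 print("about 1 file in %.0f" % (1 / ratio))   # about 1 in 400
 }}}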

 To achieve acceptable reliability in large grids, N must be increased
 significantly.

 The simplest way to think about and model it is to set N equal to the
 number of storage servers.  In that scenario, assuming uniform share
 distribution and the same K for all files, the entire contents of the grid
 lives or dies together and the simple single-file reliability calculation
 works just fine, so if you can get it up to 1-10^-5^ (with realistic
 assumptions) there's really no need to bother further, and there's
 certainly no need to provide different encoding parameters for dirnodes.
 There's little point in making sure the directories survive if all the
 files are gone.
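
 As a sanity check on that scenario, the simple binomial calculation looks
 like this (independent server failures assumed; the 200-server grid and
 K=60 are made-up figures for illustration):

 {{{
 #!python
 import math

 def p_failure(k, n, p_server):
     # P(fewer than k of the n shares sit on reachable servers), with
     # each server up independently with probability p_server.
     return sum(math.comb(n, i) * p_server**i * (1 - p_server)**(n - i)
                for i in range(k))

 # N equal to the grid size, K picked to keep roughly the familiar 3x
 # expansion factor, conservative 90% per-server availability:
 print(p_failure(60, 200, 0.9))   # astronomically small in this model
 }}}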

 If you don't want to set N that large for large grids, the other option is
 to accept that you have an exponent in the millions, and choose encoding
 parameters such that you still have acceptable predicted reliability.  If
 you want to store 100M files, and have an aggregate survival probability
 of 1-10^-5^, you ''need'' an individual survival probability on the order
 of 1-10^-13^, minimum.  Even for a thousand files you need an individual p
 in the neighborhood of 1-10^-8^.
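
 The arithmetic behind those numbers is just the small-epsilon
 approximation 1 - (1 - eps)^t^ ~= t*eps:

 {{{
 #!python
 # Per-file failure budget for an aggregate failure target of 10^-5.
 aggregate_failure = 1e-5
 for t in (1000, 100000000):
     print("%9d files -> per-file failure <= %.0e"
           % (t, aggregate_failure / t))
 }}}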

 Oh, and when calculating those probabilities it's very important not to
 overestimate storage server reliability.  The point of erasure coding is
 to reduce the server reliability requirements, which means we tend to
 choose less-reliable hardware configurations for storage servers -- old
 boxes, cheap blades, etc.  Assuming 99.9% availability on such hardware is
 foolish.  I think 95% is realistic, and I'd choose 90% to be conservative.

 Luckily, in a large grid it is not necessary to increase redundancy in
 order to get better survival probabilities.  Scaling up both K and N in
 equal proportions increases reliability fairly rapidly.  9-of-30 encoding
 produces a per-file reliability of 1-10^-16^, for example (assuming the
 conservative 90% server availability above).
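
 Using the same binomial sketch as before (same independence caveats, 90%
 per-server availability):

 {{{
 #!python
 import math

 def p_failure(k, n, p_server):
     # P(fewer than k of n independently-placed shares are reachable)
     return sum(math.comb(n, i) * p_server**i * (1 - p_server)**(n - i)
                for i in range(k))

 print(p_failure(3, 10, 0.9))   # ~3.7e-07 for 3-of-10
 print(p_failure(9, 30, 0.9))   # ~2.6e-16 for 9-of-30, same 3.3x expansion
 }}}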

 Bringing this line of thought to bear on the question at hand:  I don't
 think it makes much sense to change the encoding parameters for dirnodes.
 Assuming we choose encoding parameters such that p^t^ is acceptable, an
 additional factor of p^d^ won't make much difference, since t >> d.

-- 
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1252#comment:6>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage


More information about the tahoe-dev mailing list