#1252 assigned defect

use different encoding parameters for dirnodes than for files

Reported by: davidsarah
Owned by: davidsarah
Priority: major
Milestone: undecided
Component: code-frontend
Version: 1.8.0
Keywords: preservation availability dirnodes anti-censorship
Cc: srl@…
Launchpad Bug:

Description (last modified by daira)

As pointed out in this tahoe-dev subthread, if you only know how to reach a given file via a sequence of dirnodes from some root, then loss of any of those dirnodes will effectively make the file unavailable.

This implies that you might want to choose the encoding parameters and happiness threshold to provide greater redundancy for dirnodes.

Change History (9)

comment:1 Changed at 2010-11-08T15:15:26Z by davidsarah

  • Description modified (diff)

comment:2 Changed at 2010-12-16T01:02:21Z by davidsarah

  • Keywords anti-censorship added

comment:3 Changed at 2011-01-06T02:00:55Z by davidsarah

  • Owner set to davidsarah
  • Status changed from new to assigned

This is too intrusive for 1.8.2, but should be a high priority for 1.9.0.

comment:4 Changed at 2011-01-06T06:45:10Z by zooko

I don't agree with the whole idea. We don't know that a given directory is more important than the files it references -- I often reference files from a directory which I also have directly linked from other directories or from my bookmarks or my personal html files or what have you. And we don't know that directories are smaller than files -- sometimes people use small files and sometimes people use large directories. I think if you have any non-negligible chance at all of losing an object then you have a problem, and the solution to that problem is not automated configuration in which some objects get more robust encoding than others. That's because the default fixed encoding for all objects should be robust enough to make probabilistic loss due to "oops I didn't have enough shares" never happen. If you're still having object loss then you have a deeper problem than "Oh I needed a more robust encoding.". I also don't like adding to the cognitive complexity for users, who already struggle mightily to understand what erasure coding means, what the default settings are, how the shares are distributed to servers, what happens when you re-upload, what happens if you change the encoding configuration, etc., etc. Having different types of objects encoded differently is only going to make it harder for them to understand and manage their grid and address their real risks of data loss.

comment:5 follow-up: Changed at 2011-01-06T19:17:32Z by warner

I guess I'm +0 on the general idea of making dirnodes more robust than the default, and -0 about the implementation/configuration complexity involved. If you have a deep directory tree, and the only path from a rootcap to a filenode is through 10 subdirectories, then your chance of recovering the file is P(recover_dirnode)^10 * P(recover_filenode). We provision things to make sure that P(recover_node) is extremely high, but that exponent of 10 is a big factor, so making P(recover_dirnode) even higher isn't a bad idea.
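To put rough numbers on that effect, here is a small illustration (a sketch only; the recovery probabilities are made-up assumptions, not measurements, and none of this is existing Tahoe-LAFS code):

{{{#!python
# Illustrative only: how a deep path of dirnodes multiplies down the
# chance of reaching a file, assuming independent recovery of each node.

def p_reach(p_dirnode, p_filenode, depth):
    """P(file reachable) when the only path from the rootcap passes
    through `depth` dirnodes."""
    return (p_dirnode ** depth) * p_filenode

# Every node at 99.99% recoverability, 10 dirnodes deep:
print(p_reach(0.9999, 0.9999, 10))    # ~0.9989 -- the exponent of 10 dominates
# Same filenode, but dirnodes encoded more robustly (99.9999%):
print(p_reach(0.999999, 0.9999, 10))  # ~0.99989
}}}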

But I agree that it's a pretty vague heuristic, and it'd be nicer to have something less uncertain, or at least some data to work from. I'd bet that most people retain a small number of rootcaps and use them to access a much larger number of files, and that making dirnodes more reliable (at the cost of more storage space) would be a good thing for 95% of the use cases. (note that folks who keep track of individual filecaps directly, like a big database or something, would not see more storage space consumed by this change).

On the "data to work from" front, it might be interesting if tahoe deep-stats built a histogram of node-depth (i.e. number of dirnodes traversed, from the root, for each file). With the exception of multiply-linked nodes and additional external rootcaps, this might give us a better notion of how much dirnode reliability affects filenode reachability.

I'll also throw in a +0 for Zooko's deeper message, which perhaps he didn't state explicitly this particular time, which is that our P(recover_node) probability is already above the it-makes-sense-to-think-about-it-further threshold: the notion that unmodeled real-world failures are way more likely than the nice-clean-(artificial) modeled all-servers-randomly-independently-fail-simultaneously failures. Once your P(failure) drops below 10^-5 or something, any further modeling is just an act of self-indulgent mathematics.

I go back and forth on this: it feels like a good exercise to do the math and build a system with a theoretical failure probability low enough that we don't need to worry about it, and to keep paying attention to that theoretical number when we make design changes (e.g. the reason we use segmentation instead of chunking is because the math says that chunking is highly likely to fail). It's nice to be able to say that, if you have 20 servers with Poisson failure rates X and repair with frequency Y then your files will have Poisson durability Z (where Z is really good). But it's also important to remind the listener that you'll never really achieve Z because something outside the model will happen first: somebody will pour coffee into your only copy of ~/.tahoe/private/aliases, put a backhoe into the DSL line that connects you to the whole grid, or introduce a software bug into all your storage servers at the same time.

(incidentally, this is one of the big reasons I'd like to move us to a simpler storage protocol: it would allow multiple implementations of the storage server, in different languages, improving diversity and reducing the chance of simultaneous non-independent failures).

So anyways, yeah, I still think reinforcing dirnodes might be a good idea, but I have no idea how good, or how much extra expansion is appropriate, so I'm content to put it off for a while yet. Maybe 1.9.0, but I'd prioritize it lower than most of the other 1.9.0-milestone projects I can think of.

comment:6 in reply to: ↑ 5 Changed at 2011-01-06T22:09:53Z by swillden

Replying to warner:

I'll also throw in a +0 for Zooko's deeper message, which perhaps he didn't state explicitly this particular time, which is that our P(recover_node) probability is already above the it-makes-sense-to-think-about-it-further threshold: the notion that unmodeled real-world failures are way more likely than the nice-clean-(artificial) modeled all-servers-randomly-independently-fail-simultaneously failures. Once your P(failure) drops below 10^-5 or something, any further modeling is just an act of self-indulgent mathematics.

I have to disagree with this, both with Zooko's more generic message and your formulation of it.

Tahoe-LAFS files do NOT have reliabilities above the it-makes-sense-to-think-about-it level. In fact, for some deployment models, Tahoe-LAFS default encoding parameters provide insufficient reliability for practical real-world needs, even ignoring extra-model events.

This fact was amply demonstrated by the problems observed at Allmydata.com. Individual file reliabilities may appear astronomical, but it isn't individual file reliabilities that matter. We're going to be unhappy if ANY files are lost.

When the number of shares N is much smaller than the number of servers in the grid (as was the case at allmydata.com), then failure of a relatively tiny number of servers will destroy files with shares on all of those servers. Given a large enough server set, and enough files, it becomes reasonable to treat each file's survivability as independent and multiply them all to compute the probability of acceptable file system performance -- which means that the probability of the user perceiving no failure isn't just p^d, it's (roughly) p^t, where t is the total number of files the user has stored. An exponent of 10 is one thing, but allmydata.com was facing exponents more like 1,000 or 10,000 on a per-user basis, and an exponent in the millions (billions?) for the whole system.
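To illustrate the exponent with made-up numbers (not allmydata.com figures): even a per-file survival probability that looks excellent becomes unimpressive once it is raised to the power of the number of files stored.

{{{#!python
# Illustrative numbers only: "six nines" per file is not enough once
# the exponent t (files stored) gets large.
p_file = 1 - 1e-6                     # per-file survival probability
for t in (10, 1_000, 1_000_000, 100_000_000):
    print(t, p_file ** t)
# 10          -> ~0.99999
# 1,000       -> ~0.999
# 1,000,000   -> ~0.37     (roughly e**-1)
# 100,000,000 -> ~4e-44
}}}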

Given a grid of 10 servers, what is the probability that 8 of them will be down at one time? What about a grid of 200 servers? This is the factor that kicked allmydata.com's butt, and it wasn't any sort of black swan. I'm not arguing that black swans don't happen; I'm arguing that the model says that grids like allmydata.com's have inadequate reliability using 3-of-10 encoding. Then you can toss black swans on top of that.

In fact, I think for large grids you can calculate the probability of any file being lost with, say, eight servers out of action as the number of ways to choose the eight dead boxes divided by the number of ways to choose 10 storage servers for a file. Assuming 200 total servers, that calculation says that with 8 of them down, one out of every 400 files would be unavailable. That ignores the additional unreachability caused by the portion of those unavailable files that are dircaps, and it assumes uniform share distribution, whereas in practice I'd expect older servers to have more shares and also to be more likely to fail.

To achieve acceptable reliability in large grids N must be increased significantly.

The simplest way to think about and model it is to set N equal to the number of storage servers. In that scenario, assuming uniform share distribution and the same K for all files, the entire contents of the grid lives or dies together and the simple single-file reliability calculation works just fine, so if you can get it up to 1 - 10^-5 (with realistic assumptions) there's really no need to bother further, and there's certainly no need to provide different encoding parameters for dirnodes. There's little point in making sure the directories survive if all the files are gone.

If you don't want to set N that large for large grids, the other option is to accept that you have an exponent in the millions, and choose encoding parameters such that you still have acceptable predicted reliability. If you want to store 100M files, and have an aggregate survival probability of 1 - 10^-5, you need an individual survival probability on the order of 1 - 10^-13, minimum. Even for a thousand files you need an individual p in the neighborhood of 1 - 10^-9.
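The per-file requirement follows from a simple union bound: the probability of losing any of t files is at most t times the per-file loss probability, so the per-file loss probability must be no more than the aggregate target divided by t. A back-of-the-envelope check of the 100M-file figure above:

{{{#!python
# Union bound: P(any of t files lost) <= t * P(one file lost)
aggregate_target = 1e-5        # acceptable P(losing anything at all)
t = 100_000_000                # number of files stored
print(aggregate_target / t)    # 1e-13, i.e. per-file survival ~ 1 - 10^-13
}}}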

Oh, and when calculating those probabilities it's very important not to overestimate storage server reliability. The point of erasure coding is to reduce the server reliability requirements, which means we tend to choose less-reliable hardware configurations for storage servers -- old boxes, cheap blades, etc. Assuming 99.9% availability on such hardware is foolish. I think 95% is realistic, and I'd choose 90% to be conservative.

Luckily, in a large grid it is not necessary to increase redundancy in order to get better survival probabilities. Scaling up both K and N in equal proportions increases reliability fairly rapidly. 9-of-30 encoding produces a per-file reliability of 1 - 10^-16, for example.
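Figures like these can be checked with a short binomial calculation, under the simplifying assumptions of one share per server and independent 90% server availability (per the previous paragraph):

{{{#!python
from math import comb

def p_file_survives(k, n, p_server=0.90):
    """P(at least k of the n shares are on reachable servers), assuming
    one share per server and independent server availability."""
    return sum(comb(n, i) * p_server**i * (1 - p_server)**(n - i)
               for i in range(k, n + 1))

print(1 - p_file_survives(3, 10))   # ~3.7e-07  for 3-of-10
print(1 - p_file_survives(9, 30))   # ~2.6e-16  for 9-of-30, i.e. ~1 - 10^-16
}}}

Under those assumptions, 3-of-10 comes out around 1 - 4x10^-7 per file, far short of the 1 - 10^-13 figure above, while 9-of-30 clears it comfortably.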

Bringing this line of thought to bear on the question at hand: I don't think it makes much sense to change the encoding parameters for dirnodes. Assuming we choose encoding parameters such that p^t is acceptable, an additional factor of p^d won't make much difference, since t >> d.

Last edited at 2011-01-06T22:10:33Z by swillden

comment:7 Changed at 2011-05-28T19:30:09Z by davidsarah

  • Milestone changed from 1.9.0 to undecided

comment:8 Changed at 2013-07-23T19:25:20Z by daira

  • Description modified (diff)

I suspect part of the difference between Zooko's and my opinion on this issue is that I already see the complexity of potentially having different encoding parameters for different objects as a sunk cost. And I agree completely with "Tahoe-LAFS files do NOT have reliabilities above the it-makes-sense-to-think-about-it level."

comment:9 Changed at 2013-07-23T19:42:10Z by srl

  • Cc srl@… added

Cognitive complexity is a real issue; however, I think erasure coding parameters are already complex enough that having 3 more optional parameters on top of the existing 3 seems only slightly more complex. I would definitely use this: I'd set dircaps to (say) 1-of-4 while other files are 2-of-4 or 3-of-4.
