<div class="gmail_quote">On Fri, Sep 24, 2010 at 1:36 AM, Zooko O'Whielacronx <span dir="ltr"><<a href="mailto:zooko@zooko.com">zooko@zooko.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
> On Mon, Sep 20, 2010 at 10:14 AM, Shawn Willden <shawn@willden.org> wrote:
> See the problem? If you scale up the size of your grid in terms of
> servers *and* the size of your filesystem in terms of how many
> directories you have to traverse through in order to find something
the things that you want are unreachable all or most of the time.</blockquote><div><br>I recognized this problem when designing my (still-not-very-usable) backup system. I addressed it by not using trees of tahoe dirnodes to represent the structure of the stored files. Instead, the tree structure is flattened into a file per backup. Of course, the difference-based approach I use for storing changes over time re-creates the problem a different way. I addressed that by limiting the length of forward difference chains before I store a "full" copy of the current version. I calculated the allowable chain length by computing the probability of loss.<br>

> On the bright side, writing this letter has shown me a solution! Set M
> >= the number of servers on your grid (while keeping K/M the same as
> it was before). So if you have 100 servers on your grid, set K=30,
> H=70, M=100 instead of K=3, H=7, M=10! Then there is no small set of
> servers which can fail and cause any file or directory to fail.

Interesting.

One of the things I discovered in my loss-modeling effort was that when you're trying to optimize reliability against expansion, it makes sense to set M = N (the number of servers) and then compute a value for K. Although the resulting larger K means that more servers are required to reassemble the file, it also means that a larger number of servers can fail without losing the file. Assuming independent failures, that provides higher reliability for a given expansion factor, or lower expansion for a given reliability level.

For example, assuming a per-server failure probability of 5% (over some time frame), you achieve very close to the same per-file reliability level with K=3, H=M=10 as with K=79, H=M=100. K=30, H=M=100 gives you 60 orders of magnitude more reliability than K=3, H=M=10 (at which point we're deep into the territory where only Black Swan events that take out lots of infrastructure at once, invalidating the assumption of server failure independence, can cause losses).
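
A minimal sketch of that comparison, under the same assumptions (independent 5% per-server failures, one share per server, and H = M so all shares are actually placed); the function name is just for illustration. A file is lost exactly when fewer than K of its shares survive, i.e. the lower tail of a binomial distribution:

    from math import comb

    def p_file_loss(k, h, p_fail):
        """Probability that fewer than k of the h placed shares survive,
        with one share per server and independent per-server failure
        probability p_fail."""
        q = 1.0 - p_fail  # per-share survival probability
        return sum(comb(h, i) * q**i * p_fail**(h - i) for i in range(k))

    print(p_file_loss(3, 10, 0.05))    # ~1.6e-09
    print(p_file_loss(79, 100, 0.05))  # ~4e-09, about the same reliability
    print(p_file_loss(30, 100, 0.05))  # ~1e-68, ~60 orders of magnitude better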

What you're doing by using K=30, H=70, M=100 (actually, the M value is almost irrelevant; it's H that matters, because we want to analyze worst-case reliability, not best-case) is tremendously boosting the survival probability of a single file by hugely increasing the number of servers that must fail (H - K + 1 of them) for it to become unavailable -- from 5 to 41. Again assuming a 5% server failure probability, this decreases the probability of loss of a specific file from 6e-6 to 4e-35.

Of course, even 6e-6 is small, but that's for a specific file. As you multiply the number of files that must be available, it quickly becomes not just possible for stuff to be inaccessible, but guaranteed that lots of it will be unavailable. But when you increase the per-file reliability to apparently-ludicrous levels, you ensure that even long chains of dirnodes will be available -- because making them unavailable requires large numbers of servers to fail, which is very unlikely as long as server failures are independent.
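
To make that concrete (the one-million-object count is just an assumed example, not a measurement of any real grid): if N objects must all be reachable and each is lost independently with probability p_loss, the chance that everything is reachable is about (1 - p_loss)^N.

    # Assumed example: one million stored objects (files plus dirnodes),
    # using the per-file loss figures discussed above.
    n_objects = 1_000_000

    for p_loss in (6e-6, 4e-35):
        p_all_reachable = (1.0 - p_loss) ** n_objects
        expected_lost = p_loss * n_objects
        print(f"p_loss={p_loss:g}: P(all reachable) ~ {p_all_reachable:.3g}, "
              f"expected unavailable ~ {expected_lost:g}")

    # With p_loss=6e-6 the chance that every object is reachable is only
    # about 0.25%; with the higher per-file reliability it is effectively 1.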

> With that tweak in place, Tahoe-LAFS scales up to about 256 separate
> storage server processes, each of which could have at least 2 TB (a
> single SATA drive) or an arbitrarily large filesystem if you give it
> something fancier like RAID, ZFS, SAN, etc. That's pretty scalable!

If Tahoe can handle 100+ nodes well, then I propose that we start aggressively evangelizing the volunteer grid, trying to grow it to that level, with the recommendation that users of the grid set M to a substantial fraction of N and scale K and H appropriately. Of course, we don't want to get so aggressive that we encourage the addition of nodes whose owners aren't committed to keeping them available.

I also think we should specify that volunteer grid nodes must provide at least 200 GB of storage to be allowed to join the grid. But that's because I want it to be a "high capacity" grid. Others may have different goals. Even if we specify such a minimum capacity, I don't think there's any need to eject existing nodes that are smaller.

Hmm. I guess the last two paragraphs should really be posted to the volunteer grid list.

--
Shawn