[tahoe-dev] How Tahoe-LAFS fails to scale up and how to fix it (Re: Starvation amidst plenty)

Sun Sep 26 06:10:14 UTC 2010

On Fri, Sep 24, 2010 at 1:36 AM, Zooko O'Whielacronx <zooko at zooko.com>wrote:

> On Mon, Sep 20, 2010 at 10:14 AM, Shawn Willden <shawn at willden.org> wrote:
> See the problem? If you scale up the size of your grid in terms of
> servers *and* the size of your filesystem in terms of how many
> directories you have to traverse through in order to find something
> you want then you will eventually reach a scale where all or most of
> the things that you want are unreachable all or most of the time.

I recognized this problem when designing my (still-not-very-usable) backup
system.  I addressed it by not using trees of tahoe dirnodes to represent
the structure of the stored files.  Instead, the tree structure is flattened
into a file per backup.  Of course, the difference-based approach I use for
storing changes over time re-creates the problem a different way.  I
addressed that by limiting the length of forward difference chains before I
store a "full" copy of the current version.  I calculated the allowable
chain length by computing the probability of loss.

On the bright side, writing this letter has shown me a solution! Set M
> >= the number of servers on your grid (while keeping K/M the same as
> it was before). So if you have 100 servers on your grid, set K=30,
> H=70, M=100 instead of K=3, H=7, M=10! Then there is no small set of
> servers which can fail and cause any file or directory to fail.
>

Interesting.

One of the things I discovered in my loss modeling effort was that when
you're trying to optimize reliability vs expansion, it makes sense to set
M=N, then compute a value for K.  Although the larger value of K produced
means that more servers are required to reassemble the file, it also means
that a larger number of servers can fail without losing the file.  Assuming
independent failures, that provides higher reliability for a given expansion
factor, or lower expansion for a given reliability level.

For example, assuming a per-server failure probability of 5% (over some time
frame), you achieve very close to the same per-file reliability level with
K=3, H=M=10 as with K=79, H=M=100.  K=30,H=M=100 gives you 60 orders of
magnitude more reliability than K=3,M=H=10 (at which point we're deep into
the territory where only Black Swan events that take out lots of
infrastructure at once, invalidating the assumption of server failure
independence, can cause losses).

What you're doing by using K=30, H=70, M=100 (actually, the M value is
almost irrelevant, it's H that matters because we want to analyze worst-case
reliability, not best) is tremendously boosting the survival probability of
a single file by hugely increasing the number of servers that must fail for
it to become unavailable -- from 5 to 41.  Again assuming a 5% server
failure probability, this decreases the probability of loss of a specific
file from 6e-6 to 4e-45.

Of course, even 6e-6 is small, but that's for a specific file.  As you
multiple the number of files that must be available, it quickly becomes not
just possible for stuff to be inaccessible, but guaranteed that lots will be
unavailable.  But when you increase the per-file reliability to
apparently-ludicrous levels, then you assure that even long chains of nodes
will be available -- because making them unavailable requires large numbers
of servers to fail, which is very unlikely as long as server failures are
independent.

> With that tweak in place, Tahoe-LAFS scales up to about 256 separate
> storage server processes, each of which could have at least 2 TB (a
> single SATA drive) or an arbitrarily large filesystem if you give it
> something fancier like RAID, ZFS, SAN, etc.. That's pretty scalable!
>

If Tahoe can handle 100+ nodes well, then I propose that we start
aggressively evangelizing the volunteer grid, trying to grow it to that
level, with the recommendation that users of the grid set M to a substantial
fraction of N, and scale K and H appropriately.  Of course, we don't want to
get so aggressive that we encourage the addition of nodes whose owners
aren't committed to keeping them available.

I also think we should specify that volunteer grid nodes must provide at
least 200 GB of storage to be allowed to join the grid.  But that's because
I want it to be a "high capacity" grid.  Others may have different goals.
Even if we specify such a minimum capacity, I don't think there's any need
to eject existing nodes that are smaller.

Hmm.  I guess the last two paragraphs should really be posted to the
volunteer grid list.

-- 
Shawn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://tahoe-lafs.org/pipermail/tahoe-dev/attachments/20100926/08b191c9/attachment-0001.html>