[tahoe-dev] How Tahoe-LAFS fails to scale up and how to fix it (Re: Starvation amidst plenty)

Zooko O'Whielacronx zooko at zooko.com
Fri Sep 24 07:36:33 UTC 2010


On Mon, Sep 20, 2010 at 10:14 AM, Shawn Willden <shawn at willden.org> wrote:
>
> But Tahoe isn't really optimized for large grids.  I'm not
> sure how big the grid has to get before the overhead of all of the
> additional queries to place/find shares begin to cause significant
> slowdowns, but based on Zooko's reluctance to invite a lot more people
> into the volunteer grid (at least, I've perceived such a reluctance),
> I suspect that he doesn't want too many more than the couple of dozen
> nodes we have now.

The largest Tahoe-LAFS grid that has ever existed as far as I know was
the allmydata.com grid. It had about 200 storage servers, where each
one was a userspace process which had exclusive access to one spinning
disk. Each disk was 1.0 TB except that a few were 1.5 TB, so the total
raw capacity of the grid was a little more than 200 TB. It used K=3,
M=10 erasure coding, so the cooked (post-expansion) capacity was around
60 TB.

Most of the machines were 1U servers with four disks, a few were 2U
servers with six, eight, or twelve disks, and one was a Sun Thumper
with thirty-six disks. (Sun *gave* us that Thumper, which was only
slightly out of date at the time, just because we were a cool open
source storage startup. Those were the days.)

Based on that experience, I'm sure that Tahoe-LAFS scales up to at
least 200 nodes in the performance of queries to place/find shares.
(Actually, those queries have been significantly optimized since then
by Kevan and Brian for both upload and download of mutable files, so
modern Tahoe-LAFS should perform even better in that setting.)

However, I'm also sure that Tahoe-LAFS *failed* to scale up in a
different way, and that other failure is why I jealously guard the
secret entrance to the Volunteer Grid from passersby.

The way that it failed to scale up was like this: suppose you use K=3,
H=7, M=10 erasure coding. Then the more nodes there are in the system,
the more likely it is to incur a simultaneous outage of 5 different
nodes (H-K+1 = 5), which *might* render some files and directories
unavailable, because some files or directories might be stored on only
H=7 different nodes. The failure of 8 nodes (M-K+1 = 8) will
*definitely* render some files or directories unavailable. In the
allmydata.com case, this could happen due to the simultaneous outage
of any two of the 1U servers, or any one of the bigger servers, for
example [*]. In a friendnet such as the Volunteer Grid, it would
happen once we had enough servers that their normal, occasional
unavailability sometimes coincided on five or more of them at once.
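
To make those thresholds concrete, here is a little Python sketch of
the arithmetic (my own illustration, not code from Tahoe-LAFS; the 0.9
per-server availability is just an assumed number):

    from math import comb

    K, H, M = 3, 7, 10   # erasure-coding parameters from the example above
    p = 0.9              # assumed probability that any one server is reachable

    print("outages that *might* make a file unavailable:    ", H - K + 1)
    print("outages that *definitely* make a file unavailable:", M - K + 1)

    # A file whose shares landed on exactly H distinct servers is
    # unavailable when fewer than K of those H servers are reachable.
    p_unavailable = sum(comb(H, up) * p**up * (1 - p)**(H - up)
                        for up in range(K))
    print("P(one such file is unavailable) ~= %.2e" % p_unavailable)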

Okay, that doesn't sound too good, but it isn't that bad. You could
say to yourself that at least the rate of unavailable or even
destroyed files, expressed as a fraction of the total number of files
that your grid is serving, should be low. *But* there is another
design decision that mixes with this one to make things really bad.
That is: a lot of maintenance operations like renewing leases and
checking-and-repairing files, not to mention retrieving your files for
download, work by traversing through directories stored in the
Tahoe-LAFS filesystem. Each Tahoe-LAFS directory (which is stored in a
Tahoe-LAFS file) is independently randomly assigned to servers.

See the problem? If you scale up the size of your grid in terms of
servers *and* the size of your filesystem in terms of how many
directories you have to traverse through in order to find something
you want, then you will eventually reach a scale where all or most of
the things that you want are unreachable all or most of the time. This
is what happened to allmydata.com: when its load grew and its
financial and operational capacity shrank, it could no longer replace
dead hard drives, add capacity, or run deep-check-and-repair and
deep-add-lease.
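
Here is a rough sketch of why traversal makes this so much worse
(again my own illustration, with assumed numbers): to reach a file you
must also reach every directory on the path to it, and each of those
directories is an independently placed Tahoe-LAFS object, so
per-object unavailability compounds with path depth:

    from math import comb

    K, H = 3, 7
    p = 0.75   # assumed per-server availability on a flaky friendnet

    # P(a single object is retrievable) = P(at least K of its H servers up)
    p_obj = sum(comb(H, up) * p**up * (1 - p)**(H - up)
                for up in range(K, H + 1))

    for depth in (1, 5, 10, 20):
        # the target file plus `depth` ancestor directories must all be up
        print("path depth %2d: P(reachable) ~= %.4f"
              % (depth, p_obj ** (depth + 1)))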

I want to emphasize the scalability aspect of this problem. Sure,
anyone who has tight finances and high load can have problems, but in
this case the problem got worse the larger the scale.

All of the data that allmydata.com's customers entrusted to it is
still in existence on those 200 disks, but accessing *almost any* of
that data requires getting *almost all* of those disks spinning at the
same time, which is beyond the company's financial and operational
capacity right now.

So, I have to face the fact that we designed a system that is
fundamentally non-scalable in this way. Hard to swallow.

On the bright side, writing this letter has shown me a solution! Set M
>= the number of servers on your grid (while keeping K/M the same as
it was before). So if you have 100 servers on your grid, set K=30,
H=70, M=100 instead of K=3, H=7, M=10! Then there is no small set of
servers which can fail and cause any file or directory to fail.
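
A minimal sketch of what that parameter choice looks like as a rule
(the helper name is mine, not a Tahoe-LAFS API; it just keeps the K/M
and H/M ratios from the old 3-of-10 configuration):

    def params_for_grid(num_servers, k_ratio=0.3, h_ratio=0.7):
        """Scale K, H, M with the grid instead of fixing them at 3/7/10."""
        M = num_servers
        K = max(1, round(M * k_ratio))
        H = max(K, round(M * h_ratio))
        return K, H, M

    for n in (10, 50, 100, 200):
        K, H, M = params_for_grid(n)
        print("%3d servers -> K=%d, H=%d, M=%d; need %d simultaneous "
              "failures to make a fully-placed file unrecoverable"
              % (n, K, H, M, M - K + 1))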

With that tweak in place, Tahoe-LAFS scales up to about 256 separate
storage server processes (256 being the maximum number of shares that
the zfec erasure coding supports), each of which could have at least
2 TB (a single SATA drive) or an arbitrarily large filesystem if you
give it something fancier like RAID, ZFS, SAN, etc. That's pretty
scalable!

Please learn from our mistake: we were originally thinking of the
beautiful combinatorial math which shows that with K=3, M=10 and a
"probability of success of one server" of 0.9, you get some wonderful
"many 9's" fault-tolerance. This is what Nassim Nicholas Taleb in "The
Black Swan" calls The Ludic Fallacy. The Ludic Fallacy is to think
that what you really care about can be predicted with a nice simple
mathematical model.
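
For reference, this is the kind of calculation that paragraph is
warning about, using the example numbers from above (a sketch of the
naive model, not a claim about real-world reliability):

    from math import comb

    K, M, p = 3, 10, 0.9   # p is the assumed per-server success probability

    # Naive model: a file is recoverable iff at least K of its M shares
    # sit on reachable servers, and servers fail independently.
    p_ok = sum(comb(M, i) * p**i * (1 - p)**(M - i) for i in range(K, M + 1))
    print("P(file recoverable) = %.9f" % p_ok)   # roughly 0.999999626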

Brian, with a little help from Shawn, developed some slightly more
complex mathematical models, such as this one that comes with
Tahoe-LAFS:

http://pubgrid.tahoe-lafs.org/reliability/

(Of course mathematical models are a great help for understanding. It
isn't a fallacy to use them; it is a fallacy to think that they are a
sufficient guide to action.)

It would be a good exercise to understand how the allmydata.com
experience fits into that "/reliability/" model or doesn't fit into
it. I'm not sure, but I think that model must fail to account for the
issues raised in this letter, because that model is "scale free":
there is no input to the model specifying how many hard drives or how
many files you have (except indirectly in check_period, which has to
grow as the number of files grows). But my argument above persuades me
that the architecture is not scale-free: if you add enough hard drives
(exceeding your K, H, M parameters) and enough files, then it is
guaranteed to fail.
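
One way to see the scale dependence that the "/reliability/" model
leaves out (my own back-of-the-envelope, with assumed numbers): even
if the per-object unavailability stays constant, the expected number
of unreachable objects grows linearly with the number of objects, so a
big enough filesystem is effectively guaranteed to have some of them
unreachable at any given moment:

    from math import comb

    K, H, p = 3, 7, 0.9   # p is an assumed per-server availability

    p_obj_down = sum(comb(H, up) * p**up * (1 - p)**(H - up)
                     for up in range(K))

    for num_objects in (10**3, 10**6, 10**9):
        print("%10d files+dirs -> expect ~%.1f of them unavailable right now"
              % (num_objects, num_objects * p_obj_down))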

Okay, I have more to say in response to Shawn's comments about grid
management, but I think this letter is in danger of failing due to
having scaled up during the course of this evening. ;-) So I will save
some of my response for another letter.

Regards,

Zooko

http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1199 # document known scaling issues

[*] Historical detail: the allmydata.com grid predates Servers Of
Happiness, so they had it even worse: a file might have had all of its
shares bunched up onto a single disk.

