[tahoe-dev] How Tahoe-LAFS fails to scale up and how to fix it (Re: Starvation amidst plenty)

Brian Warner warner at lothar.com
Fri Mar 25 09:22:33 PDT 2011


On 9/24/10 12:36 AM, Zooko O'Whielacronx wrote:
> 
> The largest Tahoe-LAFS grid that has ever existed as far as I know was
> the allmydata.com grid. It had about 200 storage servers, where each
> one was a userspace process which had exclusive access to one spinning
> disk. Each disk was 1.0 TB except a few were 1.5 TB, so the total raw
> capacity of the grid was a little more than 200 TB. It used K=3, M=10
> erasure coding so the cooked capacity was around 70 TB.

Actually, I think the Thumper was filled with 500GB disks.. the total
raw capacity peaked at something like 140TB.

(BTW, we use "N" to mean number-of-total-shares, right?)

> The way that it failed to scale up was like this: suppose you use K=3,
> H=7, M=10 erasure-coding. Then the more nodes in the system the more
> likely it is to incur a simultaneous outage of 5 different nodes
> (H-K+1), which *might* render some files and directories unavailable.

I think the main mistake *I* made was in giving priority to upload
availability, which meant performing the upload even if there weren't
very many servers available. My assumption in the reliability math was
that a well-run grid would see each share go to a separate server, so
N-K is the number of failures you can tolerate.

We picked N=10 because that allowed N-K to be comfortably large, and
because it didn't add too much overhead (ext2 and other local
filesystems round each sharefile up to whole disk blocks, which
frequently means 2KB per share, so small files suffer a large O(N)
overhead). And
we made the share-placement algorithm tolerate small grids by wrapping
around, resulting in multiple shares per server when necessary.
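To put a rough number on that O(N) overhead (a sketch only, ignoring
Tahoe's per-share metadata and using the 2KB block figure above):

    import math

    def stored_bytes(file_size, k=3, n=10, block=2048):
        # n shares of roughly file_size/k bytes each, each rounded up
        # to a whole filesystem block
        share = math.ceil(file_size / k)
        return n * max(block, math.ceil(share / block) * block)

    print(stored_bytes(100))      # 20480 bytes on disk for a 100-byte file
    print(stored_bytes(1000000))  # ~3.3MB for a 1MB file, the usual N/K expansion

so a tiny file pays the full N-blocks price no matter how small it is.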

We never came up with a good way to tell a Tahoe client something like
"you see 5 servers now, but you're supposed to be seeing 10 servers, so
don't let anybody upload anything until you see at least 9 of them".
(It's one of those explicit-NAK versus implicit-timeout kinds of
things.) Perhaps if we'd started with explicit server lists and added
the Introducer second, this sort of thing might have gone differently.
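Something like the missing rule could have been as simple as the sketch
below. To be clear, this is hypothetical: connected_servers and
expected_servers aren't real Tahoe APIs, just stand-ins for "what the
client can see right now" and "the full roster it was told to expect":

    import math

    class NotEnoughServersError(Exception):
        pass

    def check_upload_allowed(connected_servers, expected_servers,
                             min_fraction=0.9):
        # refuse to upload until we can see most of the expected roster,
        # e.g. at least 9 of an expected 10 servers
        needed = math.ceil(min_fraction * len(expected_servers))
        if len(connected_servers) < needed:
            raise NotEnoughServersError(
                "only %d of %d expected servers visible; need at least %d"
                % (len(connected_servers), len(expected_servers), needed))

The hard part, of course, was never the check itself but knowing what
expected_servers should be without an explicit server list.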

> (That's because some files or directories might be on only H=7
> different nodes. The failure of 8 nodes (M-K+1) will *definitely*
> render some files or directories unavailable.) In the allmydata.com

Our original design assumption was always one-share-per-server, and I
believe that this was always the case on the AMD grid. We spun up more
capacity when the number of remaining non-full servers started dropping
close to the N=10 threshold, and I think we managed to stay ahead of
that curve until the very end.

> See the problem? If you scale up the size of your grid in terms of
> servers *and* the size of your filesystem in terms of how many
> directories you have to traverse through in order to find something
> you want then you will eventually reach a scale where all or most of
> the things that you want are unreachable all or most of the time. This
> is what happened to allmydata.com, when their load grew and their
> financial and operational capacity shrank so that they couldn't
> replace dead hard drives, add capacity, and run deep-check-and-repair
> and deep-add-lease.

I don't think I'd characterize it quite that way:

 * load was relatively constant: we weren't exactly facing throngs of
   new customers
 * as much as the reliability-math would like it, we weren't seeing a
   slow consistent Poisson-like series of hard drive deaths. We had a
   total of four dead drives over the two-ish years of the tahoe
   prodgrid, all of which were in the same chassis, plus I think one or
   two motherboards that didn't boot for about a week and then were
   fixed or replaced.
 * the "financial capacity" shrank in an abrupt non-linear fashion. When
   we moved out of colo and transferred the drives to Peter's
   Undisclosed Location, I think he only had enough money/power to run
   maybe 10% of them.. it was a lot worse than just maybe 20% or 30%.

> All of the data that its customers entrusted to it is still in
> existence on those 200 disks, but to access *almost any* of that data
> requires getting *almost all* of those disks spinning at the same
> time, which is beyond allmydata.com's financial and operational
> capacity right now.

There's an interesting fairness question in there. Would it have been
better to allocate each customer to a specific shard (perhaps a set of
20 drives, spread as far across hosts and chassis as possible), in
anticipation of a downsizing that let us retain the data but not the
ability to make it all available at the same time? If we'd done that, we
could arrange to spin up one shard each day, and tell Alice that she can
get her data on Monday, Bob gets his on Tuesday, etc.
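As a toy illustration of that kind of static customer-to-shard pinning
(purely hypothetical; nothing like this existed on the prodgrid), the
assignment itself is easy:

    import hashlib

    def shard_for_customer(customer_id, num_shards=10):
        # deterministically pin each customer to one shard of ~20 drives
        digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % num_shards

    # spin up one shard per day: a customer's data is reachable on
    # "their" shard's day of the rotation
    print(shard_for_customer("alice"))

The obvious tradeoff is share diversity: each file would then depend
only on the ~20 drives in its owner's shard.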

 (we've also talked about whether it's better to spread damage uniformly
 across lots of files, or to allow it to be concentrated on specific
 files; a closely related question is how much correlation to allow
 between the share-placement of a file and its parent directory chain.)

> So, I have to face the fact that we designed a system that is
> fundamentally non-scalable in this way. Hard to swallow.

Could you define the kind of scaling property you're talking about a bit
more precisely? Maybe something like "it should be easy to remove
servers from a full grid (and thus reduce reliability) without losing
data"? We could call it the "anticipate corporate downsizing" property
:-).

> On the bright side, writing this letter has shown me a solution! Set M
> >= the number of servers on your grid (while keeping K/M the same as
> it was before). So if you have 100 servers on your grid, set K=30,
> H=70, M=100 instead of K=3, H=7, M=10! Then there is no small set of
> servers which can fail and cause any file or directory to fail.

I see two problems with that approach:

 * repair: with every file spread uniformly across all servers, all
   files degrade in lock step, so the what-needs-to-be-repaired
   algorithm needs to be probabilistic instead of having a sharp cutoff,
   otherwise you'll get a thundering herd of repair work. Each repair is
   considerably more expensive too: for k=30/N=100, replacing just one
   share requires the involvement of 31 servers, so for a given amount
   of bandwidth/CPU allocated to repair, you can repair 7.75x fewer
   files than you could with k=3/N=10, which only involves 4 servers
   (the arithmetic is sketched after this list).

 * provisioning: we couldn't have afforded all of those servers up
   front, or convinced an investor to pre-purchase 140TB of disk for a
   service that, in the first few months, was only using 10TB of backend
   space. We started with 20TB in 5 machines, and every couple of
   months, as those got full, we bought a new batch (another 20TB in 5
   1U machines, four 1TB drives each). At no point in time did we ever
   have more than 40 non-full servers.

   As we discussed at the time, what we kind of wanted was a scheme in
   which constant rebalancing would fill those new empty drives with an
   appropriate fraction of the existing data: the second batch of
   servers should have been filled to 50%, the third batch to 67%, etc.
   The goal would have been to never let any one server get full, so as
   to maximize the number of non-full servers, to maximize share
   diversity. If we'd built that and managed to get it working fast
   enough, then the share-placement at any given time would be
   memory-free: there'd be no evidence that we'd ever had fewer servers
   in the past. If we'd also predicted that the grid would eventually
   grow to hundreds of machines, we could have started with 30-of-100,
   and doubled (er, 10ubled) up shares for the first few years. It's not
   clear what value of N would have been "correct" though.. should we
   have jumped directly to the 256 limit that zfec gives us?
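Both back-of-the-envelope numbers above (the 7.75x repair cost and the
50%/67% batch fill levels) fall out of a couple of lines of arithmetic.
Here's a sketch, under the same assumptions as the text: k+1 servers
touched per repaired share, equal-sized batches added only once the
existing ones are full, and perfect rebalancing:

    def repair_servers(k):
        # regenerating one share touches k readers plus 1 writer
        return k + 1

    print(repair_servers(30) / repair_servers(3))   # 31/4 = 7.75x the work per share

    def batch_fill(batches):
        # fill fraction of every server after `batches` equal-sized
        # batches have been deployed and the data rebalanced evenly
        return (batches - 1) / batches

    for b in (2, 3, 4):
        print(b, round(batch_fill(b), 2))   # 0.5, 0.67, 0.75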

> Please learn from our mistake: we were originally thinking of the
> beautiful combinatorial math that shows you that with K=3, M=10, with
> a "probability of success of one server" being 0.9, you get some
> wonderful "many 9's" fault-tolerance.

While I'm still eagerly looking forward to finishing your copy of The
Black Swan, I don't think that's the lesson to be learned. We never
claimed that erasure-coding would protect user data against the
highly-correlated simultaneous failure of all storage servers due to an
Act Of Investor. A lot of our engineering decisions were predicated on
having enough time (=$) to build the missing pieces before the volume
became a problem, and that predicate eventually evaluated false.

Reliable data storage is a *process*. For a given amount of money spent
on new hard drives and check/repair bandwidth+CPU, you can reduce the
probability of failure-per-unit-time due to naïvely-modeled individual
drive failure down to some rate. If you have enough machines to invoke
the Law Of Large Numbers, you might even observe such failures in
practice. But 200 drives is not large enough, and two years of
operations (using drives that claim an MTBF of 5 years or more) is not
long enough. All of the failures we observed were outside the model,
especially the last one (ENOMONEY), and we never observed an
unrecoverable failure that we had been planning to be able to recover
from.
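For what it's worth, the "beautiful combinatorial math" quoted above is
just a binomial tail, and it models exactly the kind of uncorrelated
per-server failure that, as described here, was not what actually killed
the grid:

    from math import comb

    def availability(p_server, k, n):
        # P(at least k of n independent servers are up),
        # assuming one share per server
        return sum(comb(n, i) * p_server**i * (1 - p_server)**(n - i)
                   for i in range(k, n + 1))

    print(availability(0.9, 3, 10))   # ~0.9999996: about six 9's, on paper

None of those 9's say anything about the probability that all n servers
get unplugged on the same day.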

> [*] Historical detail: allmydata.com was before Servers Of Happiness
> so they had it even worse—a file might have had all of its shares
> bunched up onto a single disk.

Yeah. In retrospect, I should have been less eager to provide
write-availability and put more energy into managing
minimum-number-of-server thresholds. A simple "don't upload unless at
least X non-full servers are present" rule might have been enough (the
new servers-of-happiness rule is a reasonable replacement for this,
albeit somewhat complex). But I don't believe we ever had much of this
problem in practice: AMD servers were all professionally hosted with
public IP addresses and clean power, etc, so I think an AMD client
writing to fewer than 10 distinct servers only ever happened during a
sub-second window at client startup as server connections raced to
complete negotiation.


cheers,
 -Brian

