#287 new defect

download: tolerate lost or missing servers

Reported by: warner Owned by:
Priority: major Milestone: eventually
Component: code-encoding Version: 0.7.0
Keywords: download availability performance test hang anti-censorship Cc:
Launchpad Bug:

Description (last modified by warner)

I don't have a failing unit test to prove it, but I'm fairly sure that the current code will abort a download if one of the servers we're using is lost during the download. This is a problem.

A related problem is that downloads will run at the rate of the slowest used peer, and we may be able to get significantly faster downloads by using one of the other N-k available servers. For example, if you have most of your servers in colo, but one or two are distant, then a helper which is also in colo might prefer to pull shares entirely from in-colo machines.

The necessary change should be to keep a couple of extra servers in reserve, such that used_peers is a list (sorted by preference/speed) with some extra members, rather than a minimal set of exactly 'k' undifferentiated peers.

If a block request hasn't completed within some "reasonable" amount of time (say, 2x the time of the other requests?), we should move the slow server to the bottom of the list and make a new query for that block (using a server that's not currently in use but which appears at a higher priority than the slowpoke). If the server was actually missing (and it just takes TCP a while to decide that it's gone), it will eventually go away (and the query will fail with a DeadReferenceError), in which case we'll remove it from the list altogether (which is what the current code does, modulo the newly-reopened #17 bug).
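A minimal sketch of the reserve-list idea described above, in Python. The class name ServerPreferenceList and its methods are hypothetical and are not taken from the Tahoe codebase; the sketch only illustrates keeping more than 'k' servers in preference order, demoting a slow one, and dropping one that is known to be gone.

```python
class ServerPreferenceList:
    """Hypothetical preference-ordered server list (not Tahoe's real API).

    The first k entries are the servers currently in use; the rest are
    reserves that can take over when an active server is demoted or lost.
    """

    def __init__(self, servers, k):
        if len(servers) < k:
            raise ValueError("need at least k servers")
        self.servers = list(servers)  # best (fastest) first
        self.k = k

    def active(self):
        """Return the k servers we would currently fetch blocks from."""
        return self.servers[:self.k]

    def demote(self, server):
        """Move a slow server to the bottom; the best reserve becomes active."""
        self.servers.remove(server)
        self.servers.append(server)

    def drop(self, server):
        """Remove a server that is gone (e.g. after a DeadReferenceError)."""
        self.servers.remove(server)

    def has_enough(self):
        """True while at least k servers remain to download from."""
        return len(self.servers) >= self.k
```

With five servers and k=3, demoting the second server promotes the fourth into the active set, and the slow one stays available as a last resort.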

Without this, many of the client downloads in progress when we bounce a storage server will fail, which would be pretty annoying for the clients.

(#798 is Brian's downloader rewrite)

Attachments (12)

p1.diff.txt (9.7 KB) - added by zooko at 2010-02-01T01:32:55Z.
p2.diff.txt (9.6 KB) - added by zooko at 2010-02-01T01:56:15Z.
p3.diff.txt (9.7 KB) - added by zooko at 2010-02-01T03:15:37Z.
p4.diff.txt (12.4 KB) - added by davidsarah at 2010-02-01T03:35:39Z.
p4a.diff.txt (12.3 KB) - added by davidsarah at 2010-02-01T03:40:44Z.
accept-late-buckets.darcspatch.txt (58.8 KB) - added by zooko at 2010-02-01T03:59:43Z.
accept-late-buckets2.darcspatch.txt (59.4 KB) - added by zooko at 2010-02-01T04:07:58Z.
That patch wasn't pyflakes-clean (unused variables in the test code), here is one that is:
davidsarah-current-tree.diff.txt (6.2 KB) - added by davidsarah at 2010-02-01T05:48:52Z.
accept-late-buckets3.darcspatch.txt (60.6 KB) - added by zooko at 2010-02-01T06:11:31Z.
There were a couple of bugs in that one. Here's one with no bugs in it!
accept-late-buckets4.darcspatch.2.txt (59.7 KB) - added by zooko at 2010-02-01T06:25:27Z.
The precondition checks that we added while debugging cause the code to fail under some tests because in that case the object is a fake ReadBucketProxy not a real one, so the precondition rejects it. This patch is just like accept-late-buckets3.darcspatch.txt except without those two checks.
fast-servers-first-0.darcspatch.txt (74.7 KB) - added by zooko at 2010-02-01T06:41:27Z.
Here is a patch which adds a new feature: remember the order servers answered and use the first servers first. Tests by David-Sarah.
fast-servers-first-1.darcspatch.txt (61.1 KB) - added by zooko at 2010-02-01T07:00:22Z.

Change History (43)

comment:1 Changed at 2008-03-08T02:54:11Z by zooko

  • Milestone changed from 0.9.0 (Allmydata 3.0 final) to undecided

comment:2 Changed at 2008-08-27T01:33:34Z by warner

  • Summary changed from download needs to be tolerant of lost peers to download: tolerate lost/missing peers

I think I've identified three main problems:

  • if any servers are silently partitioned (e.g. a laptop that's been suspended, any connection that TCP hasn't realized is gone yet) when a download starts, peer-selection will not complete until that connection is finally abandoned
  • if an active server is silently partitioned during a download, the download will stall until TCP gives up on it. At that point, the download ought to resume, using some other server. (If a server is actively lost during a download, such that TCP gives us a connectionLost, then the download should immediately switch to a different server; however, I need to test this more carefully.)
  • the download will be as slow as the slowest active server.

The task is to fix download to:

  • start downloading segments as soon as peer-selection finds 'k' shares. get_buckets responses that arrive after segment download has begun should be added to the alternates list.
  • if any response takes more than, say, three times as long as the longest response for that segment, move the slow server to the bottom of the alternates list and start fetching a different share.

Basically the download process must turn into a state machine. Each known share has a state (which hashes have been fetched, which block queries are outstanding). The initial peer-selection process causes shares to be added to the known list.
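The "three times as long as the longest response" rule from the task list above could be sketched like this. The function name find_overdue, its arguments, and the default factor are assumptions made for illustration; the ticket does not settle on a heuristic.

```python
def find_overdue(completed, outstanding, now, factor=3.0):
    """Return servers whose block request for this segment looks overdue.

    completed:   dict of server -> response time in seconds (this segment)
    outstanding: dict of server -> time at which the request was sent
    now:         current time, on the same clock as the values in outstanding
    factor:      assumed multiplier; the comment above suggests "say, three"
    """
    if not completed:
        # No finished request to compare against yet, so judge nothing.
        return []
    threshold = factor * max(completed.values())
    return sorted(
        server
        for server, sent_at in outstanding.items()
        if now - sent_at > threshold
    )
```

A caller would move each returned server to the bottom of the alternates list and start fetching a different share, as the second task item describes.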

comment:3 Changed at 2008-08-27T01:34:46Z by warner

#193 and #253 are probably related to this one

comment:4 Changed at 2008-09-22T23:04:58Z by zooko

The Allmydata.com production grid experienced this problem today, when the storage server "prodtahoe7" failed in such a way that the other nodes kept waiting indefinitely for answers to their foolscap queries to that server. At least, we think that is why the downloads hung until we turned off prodtahoe7. However, I don't understand why the downloads continued to hang after the prodtahoe7 machine was powered off, until the clients that were using prodtahoe7 (in this case the webapi nodes) were restarted.

Shouldn't the absence of prodtahoe7 at the IP level have triggered the TCP connections to break the next time the clients tried to send packets, which should have triggered the foolscap connection to break, which should have triggered the download to abort?

Ah! But then even if that happened and that download were aborted, would the next download try to use prodtahoe7 storage server nodes again, and if it did, would it wait for a long time for a TCP connection attempt?

Anyway, we need to investigate in the logs of today's events to see exactly why the webapi nodes had to be restarted, after prodtahoe7 was gone, before they would start working again.

comment:5 Changed at 2008-09-24T03:39:16Z by warner

It looks like prodtahoe7 had a RAID controller failure, or possibly several simultaneous disk failures, and got stuck in a weird way: TCP connections probably remained alive, but the Tahoe storage nodes were not responding to queries. This is pretty close to the "silent connection loss" case, but worse: TCP keepalives wouldn't tell you the connection was dead, because the prodtahoe7 kernel was still running and responding with ACKs. So the fix described above should improve application behavior in yesterday's prodtahoe7 problem, as well as in the more common close-the-laptop-and-walk-away problem.

For download, this fix means a tradeoff between the setup work (i.e. hash tree fetching) needed to start using a new share, against how long we want to wait to distinguish between a slow server and a stuck one. I don't know what sort of heuristic we should use for this: we must take into account slow links and large segments, and remember that parallel segment requests will be competing with each other.

For upload, this is another time-vs-work tradeoff, but slightly trickier. If we give up on the server early, during peer selection, then the consequences are minor: we may put the share on a non-ideal server, such that the eventual downloading client will have to search further around the ring to find the share. If we are forced to give up on the server late, we must either give up on that share (i.e. the file is now unhealthy, with perhaps 9 shares instead of 10), or restart the upload from the beginning, or spend memory (or disk) on holding all shares so that we have something to give to the replacement server. Of these choices, I think I prefer giving up on the share (and scheduling a re-upload, or a repair if the original data is not available in a non-streaming place).

comment:6 Changed at 2008-09-24T13:26:33Z by zooko

See also #193 and #253 and #521.

comment:7 Changed at 2009-11-22T15:59:52Z by davidsarah

  • Keywords reliability added

comment:8 Changed at 2009-12-04T05:45:44Z by davidsarah

  • Keywords availability added; reliability removed

comment:9 Changed at 2009-12-12T04:45:15Z by zooko

  • Summary changed from download: tolerate lost/missing peers to download: tolerate lost/missing servers

comment:10 Changed at 2009-12-25T20:51:05Z by zooko

I've observed this happening quite a lot on the allmydata.com prod grid. I haven't yet figured out exactly which server is responding strangely or what that server is doing wrong, but exactly one of the (currently) 89 servers on the prod grid fails to respond to the do-you-have-shares query and causes downloads to hang. Restarting the gateway node causes it to start downloading correctly, which means that whichever server it is that is behaving badly either doesn't connect to the gateway after the gateway restarts, or it behaves better after it has reconnected to the gateway.

comment:11 Changed at 2009-12-27T00:57:30Z by davidsarah

  • Keywords upload test added
  • Summary changed from download: tolerate lost/missing servers to download/upload: tolerate lost or missing servers

This probably also affects upload, as mentioned in comment:5, but there seems to be no separate ticket for that. (#782 is possibly relevant but not confirmed to have the same cause.)

We should probably have a test that simulates a hanging server, and/or a server that disconnects.

comment:12 Changed at 2009-12-27T04:52:06Z by warner

  • Keywords upload removed
  • Summary changed from download/upload: tolerate lost or missing servers to download: tolerate lost or missing servers

I just created #873 for the upload case. Both are important, but I'd like to leave this ticket specific for the download case: the code paths and necessary implementation details are completely different.

comment:13 Changed at 2009-12-27T16:34:33Z by davidsarah

  • Keywords performance added

comment:14 Changed at 2009-12-29T19:11:17Z by davidsarah

  • Keywords hang added

comment:15 Changed at 2010-01-27T20:49:17Z by zooko

Many of the problems that I've observed which I thought were a case of this ticket have actually turned out to be a case of #928 (start downloading as soon as you know where to get K shares). That is: it was not the case that a server failed and got into a hung state during a download. (I never could understand how this problem could be so common if it required this particular timing!) Instead it was the case that if a server failed and got into a hung state then all subsequent downloads would hang. This was happening quite a lot on the allmydata.com prod grid recently because servers were experiencing MemoryError and then going into this state.

comment:16 Changed at 2010-02-01T01:31:49Z by zooko

I think that the original post was slightly imprecise. I think that download would correctly fail-over if a server disconnected during download (or if the server returned an error or if it dropped the TCP connection), but it would hang if the server stayed connected but didn't answer the requests at all. In fact, until the fix for #928 was committed, downloads would hang if there was such a stuck server on the grid at all, even if that server had been in its stuck state since before the download began and even if that server didn't have any of the shares that the download needed!

Okay, so the fix for #928 has been committed to trunk, which means that downloads now proceed even if there is a stuck server on the grid. But with the current version (ea3954372a06a36c), the download proceeds without knowing about all the shares that are out there, and the downloader currently ignores information about shares which arrives late.

Here is a patch in unified diff form which fixes this -- making download accept and use information that arrives after "stage 4" of download has begun, and also has incomplete changes to the unit tests to deterministically exercise this case.

Changed at 2010-02-01T01:32:55Z by zooko

comment:17 Changed at 2010-02-01T01:51:56Z by zooko

Here is a version of my patch in which there is a new test named test_failover_during_stage_4. The intent of this test is:

  1. Set servers 3 through 9 to the hung state.
  2. Start download.
  3. As soon as stage 4 of download is reached, which means that the client got responses to get_buckets from servers 0, 1, and 2, unhang server 3 and cause server 2 to have a corrupted share.
  4. Assert that download completes successfully.

Oh, writing that makes me realize that server 2 might as well have the share corrupted before the download starts! I'm not sure if the current implementation of the test will unhang server 3 before the downloader finishes downloading all the shares from servers 0, 1, and 2. My intent is to test the case that the downloader does hear back from a new server, after stage 4 has begun but before stage 4 has ended. I definitely do not want to add a delay to the downloader once it runs out of buckets in the hopes that another bucket will come in. Brian is considering such tricky tactics for his post-1.6 downloader rewrite, but that's out of scope for this. David-Sarah is currently implementing a method used in this patch named _corrupt_share_in.

Changed at 2010-02-01T01:56:15Z by zooko

comment:18 Changed at 2010-02-01T03:15:21Z by zooko

Okay here's a version of the tests which I think is correct except that it doesn't have "corrupt a share" method yet (David-Sarah is contributing that).

Changed at 2010-02-01T03:15:37Z by zooko

Changed at 2010-02-01T03:35:39Z by davidsarah

Changed at 2010-02-01T03:40:44Z by davidsarah

comment:19 Changed at 2010-02-01T03:59:02Z by zooko

  • Keywords review-needed added

Okay here is a complete version including tests. Thanks to David-Sarah for helping with the tests. Please review! (It is okay for David-Sarah to be the reviewer even though they helped with the tests.)

Changed at 2010-02-01T04:07:58Z by zooko

That patch wasn't pyflakes-clean (unused variables in the test code), here is one that is:

Changed at 2010-02-01T05:48:52Z by davidsarah

Changed at 2010-02-01T06:11:31Z by zooko

There were a couple of bugs in that one. Here's one with no bugs in it!

Changed at 2010-02-01T06:25:27Z by zooko

The precondition checks that we added while debugging cause the code to fail under some tests because in that case the object is a fake ReadBucketProxy not a real one, so the precondition rejects it. This patch is just like accept-late-buckets3.darcspatch.txt except without those two checks.

Changed at 2010-02-01T06:41:27Z by zooko

Here is a patch which adds a new feature: remember the order servers answered and use the first servers first. Tests by David-Sarah.
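As a rough illustration of "remember the order servers answered and use the first servers first", here is a small Python sketch. It is not the code in the attached patch; the class name ResponseOrder and its methods are hypothetical.

```python
class ResponseOrder:
    """Hypothetical helper: rank servers by the order in which they replied."""

    def __init__(self):
        self._rank = {}  # server -> position of its first reply

    def record(self, server):
        """Note that a server has answered; only the first answer counts."""
        self._rank.setdefault(server, len(self._rank))

    def sort(self, servers):
        """Return servers with the earliest responders first.

        Servers that never answered sort last; sorted() is stable, so they
        keep their original relative order.
        """
        never = len(self._rank)
        return sorted(servers, key=lambda s: self._rank.get(s, never))
```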

comment:20 Changed at 2010-02-01T06:53:42Z by zooko

Committed 3e4342ecb3625899 which makes it so that downloaders accept late-arriving shares and use them. Thanks to David-Sarah for help especially with the test!

comment:21 Changed at 2010-02-01T07:00:46Z by zooko

fast-servers-first-1.darcspatch.txt doesn't pass the new test that David-Sarah wrote for it: allmydata.test.test_hung_server.HungServerDownloadTest.test_use_first_servers_to_reply, and also it causes this test to go from pass to fail:

allmydata.test.test_mutable
  Problems
    test_publish_all_servers_bad ... Traceback (most recent call last):
  File "/Users/wonwinmcbrootles/playground/allmydata/tahoe/trunk/new-preserve-order/src/allmydata/test/common_util.py", line 71, in done
    (which, expected_failure, res))
twisted.trial.unittest.FailTest: test_publish_all_servers_bad was supposed to raise <class 'allmydata.mutable.common.NotEnoughServersError'>, not get '<MutableFileNode 33b7b20 RW 2qvahvls>'
[FAIL]

===============================================================================
[FAIL]: allmydata.test.test_mutable.Problems.test_publish_all_servers_bad

Traceback (most recent call last):
  File "/Users/wonwinmcbrootles/playground/allmydata/tahoe/trunk/new-preserve-order/src/allmydata/test/common_util.py", line 71, in done
    (which, expected_failure, res))
twisted.trial.unittest.FailTest: test_publish_all_servers_bad was supposed to raise <class 'allmydata.mutable.common.NotEnoughServersError'>, not get '<MutableFileNode 33b7b20 RW 2qvahvls>'

comment:22 Changed at 2010-02-01T16:23:35Z by zooko

Okay, I plan to release v1.6 without further work on this "use the fastest servers first" patch. Brian is going to completely rewrite downloader after v1.6 -- hopefully this patch will inform his rewrite or serve as a benchmark to run against his new downloader.

comment:23 Changed at 2010-02-02T04:28:42Z by zooko

  • Keywords review-needed removed

comment:24 Changed at 2010-02-27T06:41:39Z by zooko

  • Milestone changed from eventually to 1.7.0

comment:25 Changed at 2010-04-26T09:52:20Z by warner

  • Description modified (diff)

FYI, #798 is the new downloader. It's coming along nicely. Almost passes a test or two.

comment:26 Changed at 2010-05-08T20:21:10Z by zooko

If you like this ticket, you might also like the "Brian's New Downloader" bundle of tickets: #605 (two-hour delay to connect to a grid from Win32, if there are many storage servers unreachable), #800 (improve alacrity by downloading only the part of the Merkle Tree that you need), #809 (Measure how segment size affects upload/download speed.), #798 (improve random-access download to retrieve/decrypt less data), and #448 (download: speak to as few servers as possible).

comment:27 Changed at 2010-05-08T22:49:00Z by zooko

  • Milestone changed from 1.7.0 to 1.8.0

Brian's New Downloader is now planned for v1.8.0.

comment:28 Changed at 2010-08-10T03:46:40Z by davidsarah

New Downloader is in 1.8, but I'm unclear to what extent it addresses this ticket. I think it's a partial fix for immutable downloads, is that right?

comment:29 Changed at 2010-08-10T05:07:29Z by warner

The #798 new downloader (at least in the form that will probably appear in tahoe-1.8.0) addresses some but not all of this ticket.

  • servers which disconnect during download: these ought to be handled perfectly: new servers will be located and spun up, necessary hashes will be retrieved, and the download should continue without a hitch
  • servers which are in a stuck state (e.g. a silent disconnect) before the download begins will be tolerated: DYHB requests to them will stall, but other servers will be queried, and the download proper will begin as soon as enough shares are located. There is a hard-coded 10 second timeout, and DYHB queries which are not answered within this time will be replaced with a new query. The downloader will allow 10 non-overdue queries to be outstanding at any given time.
  • servers which enter a stuck state after the DYHB query has been answered are not yet handled well. There is code to react to an "OVERDUE" state (by switching to new shares), but there is not yet any code to actually declare an OVERDUE state (I couldn't settle on a reasonable heuristic to distinguish between a stuck server and one that is merely slow).
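The DYHB bookkeeping described in the second item above (a 10 second timeout and at most 10 non-overdue queries outstanding) can be sketched as follows. This is only an illustration of the counting rule, not the new downloader's actual code; the function name queries_to_send and the constant names are assumptions.

```python
DYHB_TIMEOUT = 10.0    # seconds; the hard-coded timeout mentioned above
MAX_NON_OVERDUE = 10   # outstanding non-overdue queries allowed at once


def queries_to_send(sent_times, now, remaining_servers):
    """Return how many new DYHB queries may be sent right now.

    sent_times:        send times of queries that are still unanswered
    now:               current time, on the same clock as sent_times
    remaining_servers: number of servers that have not been asked yet
    """
    # Queries older than the timeout are overdue and no longer count
    # against the limit, so each one frees a slot for a replacement.
    non_overdue = sum(1 for t in sent_times if now - t <= DYHB_TIMEOUT)
    free_slots = MAX_NON_OVERDUE - non_overdue
    return max(0, min(free_slots, remaining_servers))
```

The third item's missing piece is the analogous rule for block requests: deciding when an already-located share's server should be declared OVERDUE.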

The goals described in this ticket's description are still desirable:

  • have a list of peers, sorted by "goodness" (probably speed)
  • when a server hasn't responded in a while, move it to the bottom of the list
  • keep a couple of extra shares in reserve, to quickly fill in for a server that gets stuck

So we should at least keep this ticket open until the new downloader is capable of declaring an OVERDUE state and thus becomes tolerant to servers that get stuck after the DYHB queries. And probably the criteria for closing it should be the implementation of the scheme where we have a list of shares sorted by responsiveness.

comment:30 Changed at 2010-08-15T06:18:04Z by zooko

  • Milestone changed from 1.8.0 to eventually

comment:31 Changed at 2010-12-16T00:49:13Z by davidsarah

  • Keywords anti-censorship added