Opened at 2008-01-26T00:25:30Z
Last modified at 2010-12-16T00:49:13Z
#287 new defect
download needs to be tolerant of lost peers — at Initial Version
Reported by: | warner | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | eventually |
Component: | code-encoding | Version: | 0.7.0 |
Keywords: | download availability performance test hang anti-censorship | Cc: | |
Launchpad Bug: |
Description
I don't have a failing unit test to prove it, but I'm fairly sure that the current code will abort a download if one of the servers we're using is lost during the download. This is a problem.
A related problem is that downloads run at the rate of the slowest peer in use, and we may be able to get significantly faster downloads by switching to one of the other N-k available servers. For example, if most of your servers are in colo but one or two are distant, then a helper which is also in colo might prefer to pull shares entirely from in-colo machines.
The necessary change should be to keep a couple of extra servers in reserve, such that used_peers is a list (sorted by preference/speed) with some extra members, rather than a minimal set of exactly 'k' undifferentiated peers.
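A minimal sketch of that selection step (all names here are hypothetical, not the actual download-code API): rank the known servers by estimated speed, take the first 'k' as the active set, and hold a couple of spares in reserve rather than discarding everything beyond 'k'.

```python
def select_peers(peers_by_speed, k, spares=2):
    """Pick an active set of k peers plus a reserve of spares.

    peers_by_speed: list of (peer_id, estimated_latency) tuples.
    Returns (active, reserve), both ordered by preference (fastest
    first), so a failing or slow active peer can be replaced by
    reserve[0] without re-running peer selection from scratch.
    """
    ranked = sorted(peers_by_speed, key=lambda p: p[1])
    active = [pid for pid, _ in ranked[:k]]
    reserve = [pid for pid, _ in ranked[k:k + spares]]
    return active, reserve
```

With k=3-of-10 encoding this would leave up to two warm standbys, so losing one active server costs a single block re-request instead of aborting the whole download.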
If a block request hasn't completed within some "reasonable" amount of time (say, 2x the time of the other requests?), we should move the slow server to the bottom of the list and re-issue the query for that block to an idle server that ranks above the slowpoke. If the server was actually gone (and it just took TCP a while to notice), the connection will eventually drop and the outstanding query will fail with a DeadReferenceError, in which case we remove the server from the list altogether (which is what the current code does, modulo the newly-reopened #17 bug).
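The "2x the time of the other requests" heuristic could be sketched like this (again a hypothetical helper, not the existing code; the choice of median as the baseline is an assumption):

```python
def find_slowpokes(outstanding, completed_times, factor=2.0):
    """Flag block requests that look stuck.

    outstanding: dict mapping peer_id -> seconds its current block
    request has been pending. completed_times: latencies (seconds) of
    requests that already finished. A request counts as slow once it
    exceeds factor * the median completed latency. With no completed
    requests yet, there is no baseline, so nothing is flagged.
    """
    if not completed_times:
        return []
    ranked = sorted(completed_times)
    median = ranked[len(ranked) // 2]
    return [peer for peer, t in outstanding.items() if t > factor * median]

def demote(priority_list, slow_peer):
    """Move slow_peer to the bottom of the preference-sorted list."""
    rest = [p for p in priority_list if p != slow_peer]
    return rest + [slow_peer]
```

Note that the slowpoke is only demoted, not removed: if TCP later reports it dead (DeadReferenceError), it gets dropped entirely, and until then it remains a last-resort fallback.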
Without this, many of the client downloads in progress when we bounce a storage server will fail, which would be pretty annoying for the clients.