#2108 new defect

uploader should keep trying other servers if its initially-chosen servers fail during the "scan" phase — at Initial Version

Reported by: zooko
Owned by: daira
Priority: normal
Milestone: soon
Component: code-peerselection
Version: 1.10.0
Keywords: brians-opinion-needed regression upload servers-of-happiness
Cc: markberger
Launchpad Bug:

Description

In v1.10 upload.py, during the uploader's "scan" phase (asking storage servers if they already have, or would be willing to accept upload of, shares of this file), if the uploader's first chosen servers answer "no can do" or fail, then it will keep asking more and more servers, until it either succeeds at uploading or runs out of candidates.

In 1382-rewrite-2 upload.py (which is hopefully going to be merged into trunk soon, and released in the upcoming Tahoe-LAFS v1.11), it instead chooses a few servers that it is going to ask, and if all of them fail then it gives up on the upload.

(I have a vague memory of discussing this on a conference call with the other Google Summer of Code Mentors and Mark Berger, and telling him to go ahead and do it this way, as it is simpler to implement. That might be a false memory.)

Anyway, I'd like to revisit this issue. For some situations, this would be a regression from 1.10 to 1.11, i.e. 1.10 would successfully upload and 1.11 would say that the upload failed. Therefore I'm adding the keywords "regression" and "blocks-release" to this ticket.

The reason to do it this way with a finite "scan" phase is that by first establishing which servers either already-have or are-willing-to-accept shares, we can then use our upload-strategy-of-happiness computation to plan which servers we want to upload to. Mixing planning with action is confusing, and the old 1.10 algorithm was hard to understand and has some undesirable behaviors. I suspect this is why we instructed Mark to go ahead with the simpler "phased" approach in #1382.

However, now that I've seen the 1382-rewrite-2 branch up close, I think I'm starting to see how a variant of it wouldn't be too complicated, would have the property of "always achieves Happiness if it is possible to do so" and would relieve it of this regression.

The idea would be that instead of:

  1. Pick a few servers (e.g. 2*N).
  2. Query them all about their state (whether they already have any shares and whether they would be willing to hold a share).
  3. Wait until all queries resolve (by a response, a failure or network disconnect, or a timeout).
  4. Run the "calculate-a-plan-for-happiness" algorithm.
    1. If it is possible to achieve happiness, go on to the next phase of attempting to implement that plan (i.e. the uploading-shares phase).
    2. If it is not possible, give up on the upload.
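To make the failure mode of the phased approach concrete, here is a minimal sketch of it. It is not the code from the 1382-rewrite-2 branch: `query`, `is_happy` and `phased_scan` are hypothetical names, queries are simulated with a dict instead of the network, and the happiness check is reduced to "at least `happy` distinct willing servers", which is much cruder than the real computation.

```python
# Sketch only: all names here are hypothetical, not taken from upload.py.

def query(server, responses):
    """Stand-in for a network query.

    Returns True if the server already has a share or would accept one,
    False if it refuses, fails, or times out.
    """
    return responses.get(server, False)


def is_happy(willing, happy):
    """Simplified happiness check (an assumption, not the real algorithm):
    happy if at least `happy` distinct servers are willing."""
    return len(willing) >= happy


def phased_scan(candidates, responses, n, happy):
    chosen = candidates[:2 * n]                            # step 1: pick 2*N servers
    willing = {s for s in chosen if query(s, responses)}   # steps 2-3: query, wait for all
    if is_happy(willing, happy):                           # step 4: compute a plan
        return willing                                     # step 4.1: go on to uploading
    return None                                            # otherwise give up on the upload


candidates = ["s%d" % i for i in range(10)]
# Simulated grid: the first six servers fail, the last four are willing.
responses = {s: i >= 6 for i, s in enumerate(candidates)}

# With N=3 only s0..s5 are ever asked, so the scan gives up even though
# four willing servers exist further down the list.
result = phased_scan(candidates, responses, n=3, happy=2)
```

Here `result` is `None`: the upload is abandoned although s6 through s9 would have accepted shares, which is the kind of case where 1.10 would have kept going.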

We would instead have a state machine which does something like this:

  1. Let R be the set of servers that have responded to our queries by indicating that they either already have shares or would be willing to hold a share. At the beginning of the state machine, R is the empty set. Let A be the set of all servers that we have heard about.
  2. Run the "calculate-a-plan-for-happiness" algorithm on R.
    1. If it is possible to achieve happiness, go on to the next phase of attempting to implement the plan (i.e. the uploading-shares phase).
    2. If it is not possible, then pop the next server out of A and send it a query. When the query resolves (by a response, a failure, or a timeout), go to step 1. If A is empty, the upload fails.
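The state machine above could be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed patch: `incremental_scan` is a hypothetical name, queries are simulated with a dict and issued one at a time, the happiness check is reduced to "at least `happy` willing servers in R", and giving up once A is empty is an assumption about what should happen when the candidates run out.

```python
from collections import deque

# Sketch only: all names here are hypothetical, not taken from upload.py.

def incremental_scan(all_servers, responses, happy):
    """Query servers one at a time until happiness is achievable.

    Returns (R, number_of_queries_sent); R is None if every server was
    asked and happiness is still out of reach.
    """
    a = deque(all_servers)   # A: servers we have heard about, not yet asked
    r = set()                # R: servers that have or would accept a share
    queried = 0
    while True:
        # Step 2: run the (here heavily simplified) happiness calculation on R.
        if len(r) >= happy:
            return r, queried          # step 2.1: go on to the upload phase
        if not a:
            return None, queried       # assumption: no candidates left, so fail
        server = a.popleft()           # step 2.2: pop the next server and ask it
        queried += 1
        if responses.get(server, False):   # simulated query result
            r.add(server)
        # Loop back to step 1 with the updated R.


candidates = ["s%d" % i for i in range(10)]
# Simulated grid: the first six servers fail, the last four are willing.
responses = {s: i >= 6 for i, s in enumerate(candidates)}

plan, queries_sent = incremental_scan(candidates, responses, happy=2)
```

With this simulated grid the loop walks past the six failing servers and stops after eight queries with R = {s6, s7}, so it never gives up while an unasked candidate remains. A real implementation would presumably keep several queries in flight rather than asking strictly one server at a time.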

