#2108 new defect

uploader should keep trying other servers if its initially-chosen servers fail during the "scan" phase

Reported by: zooko
Owned by: daira
Priority: normal
Milestone: soon
Component: code-peerselection
Version: 1.10.0
Keywords: brians-opinion-needed regression upload servers-of-happiness
Cc: markberger
Launchpad Bug:

Description (last modified by zooko)

In v1.10 upload.py, during the uploader's "scan" phase (asking storage servers if they already have, or would be willing to accept upload of, shares of this file), if the uploader's first chosen servers answer "no can do" or fail, then it will keep asking more and more servers, until it either succeeds at uploading or runs out of candidates.

In 1382-rewrite-2 upload.py (which will hopefully be merged into trunk soon and released in the upcoming Tahoe-LAFS v1.11), it instead chooses a fixed set of servers to ask, and if all of them fail then it gives up on the upload.

(I have a vague memory of discussing this on a conference call with the other Google Summer of Code Mentors and Mark Berger, and telling him to go ahead and do it this way, as it is simpler to implement. That might be a false memory.)

Anyway, I'd like to revisit this issue. In some situations this would be a regression from 1.10 to 1.11, i.e. 1.10 would successfully upload and 1.11 would report that the upload failed. Therefore I'm adding the keywords "regression" and "blocks-release" to this ticket.

The reason to use a finite "scan" phase is that by first establishing which servers either already have or are willing to accept shares, we can then use our upload-strategy-of-happiness computation to plan which servers we want to upload to. Mixing planning with action is confusing, and the old 1.10 algorithm was hard to understand and had some undesirable behaviors. I suspect this is why we instructed Mark to go ahead with the simpler "phased" approach in #1382.
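
As a point of reference, the "happiness" that this planning computation targets is (per the servers-of-happiness design) the size of a maximum matching between servers and shares. Here is a minimal sketch of that calculation, assuming a hypothetical placements dict that maps each responding server to the share numbers it already has or is willing to accept; the real upload.py code may compute this differently:

{{{
#!python
def happiness(placements):
    """Size of a maximum matching between servers and shares.

    placements: dict mapping server -> set of share numbers that the server
    already has or is willing to accept (i.e. the result of the scan phase).
    Plain augmenting-path matching; names and structure are illustrative only.
    """
    share_to_server = {}

    def augment(server, seen):
        # Try to find a share for `server`, re-routing an earlier
        # assignment along an augmenting path if necessary.
        for share in placements.get(server, ()):
            if share in seen:
                continue
            seen.add(share)
            current = share_to_server.get(share)
            if current is None or augment(current, seen):
                share_to_server[share] = server
                return True
        return False

    return sum(1 for server in placements if augment(server, set()))
}}}

For example, if two servers can each hold only share 1, only one of them can be matched, so together they contribute 1 (not 2) to happiness.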

However, now that I've seen the 1382-rewrite-2 branch up close, I'm starting to see how a variant of it wouldn't be too complicated, would have the property of "always achieves happiness if it is possible to do so", and would avoid this regression.

The idea would be that instead of:

  1. Pick a few servers (e.g. 2*N).
  2. Query them all about their state (whether they already have any shares and whether they would be willing to hold a share).
  3. Wait until all queries resolve (with a response, a failure or network disconnect, or a timeout).
  4. Run the "calculate-a-plan-for-happiness" algorithm.
    1. If it is possible to achieve happiness, go on to the next phase of attempting to implement that plan (i.e. the uploading-shares phase).
    2. If it is not possible, give up and report the upload as failed.

We would instead have a state machine which does something like this (a rough code sketch follows the list):

  1. Let R be the set of servers that have responded to our queries by indicating that they either already have shares or would be willing to hold a share. At the beginning of the state machine, R is {} (the empty set). Let A be the set of all servers that we have heard about.
  2. Run the "calculate-a-plan-for-happiness" algorithm on R.
    1. If it is possible to achieve happiness, go on to the next phase of attempting to implement the plan (i.e. the uploading-shares phase).
    2. If it is not possible, then pop the next server out of A (if A is empty, the upload fails) and send it a query. When the query comes back, go to step 2.
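
To make that concrete, here is a minimal sketch of the proposed loop. It is written synchronously for readability (the real uploader would drive it with Deferreds), and the names incremental_scan, query_server, and calculate_happiness_plan are placeholders rather than the actual upload.py API:

{{{
#!python
def incremental_scan(all_servers, query_server, calculate_happiness_plan):
    """Sketch of the proposed state machine (all names are placeholders).

    all_servers: every server we have heard about (the set A).
    query_server(server): ask one server which shares it already has or
        would be willing to hold; returns a set of share numbers (possibly
        empty), or raises on failure, disconnect, or timeout.
    calculate_happiness_plan(responses): the plan-for-happiness computation;
        returns an upload plan if happiness is achievable, else None.
    """
    candidates = list(all_servers)  # A: servers we have not yet asked
    responses = {}                  # R: server -> shares it has or will hold

    while True:
        plan = calculate_happiness_plan(responses)
        if plan is not None:
            return plan             # happiness is achievable: go upload
        if not candidates:
            return None             # ran out of candidates: the upload fails
        server = candidates.pop(0)  # pop the next server out of A
        try:
            shares = query_server(server)
        except Exception:
            continue                # failure or timeout: try the next server
        if shares:
            responses[server] = shares  # add to R, then re-run the planner
}}}

The difference from the current 1382-rewrite-2 behavior is that the plan is recomputed after every answer, and the upload only fails once A is exhausted, which restores the 1.10 "keep trying" property while still keeping planning separate from uploading.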

Change History (10)

comment:1 Changed at 2013-11-15T21:36:33Z by zooko

  • Keywords brians-opinion-needed added

comment:2 Changed at 2013-11-15T21:37:32Z by zooko

  • Description modified (diff)

comment:3 Changed at 2013-12-12T04:58:47Z by zooko

  • Milestone changed from 1.11.0 to soon

Daira said, at some point, I think, that this is not an important regression for Tahoe-LAFS v1.11, because the only cases where this would occur are when the grid has a high rate of churn (servers coming and going), and in those cases, Tahoe-LAFS v1.10 immutable upload probably has other problems. I think that's what she said. Anyway, it sounded right to me at the time and I agreed with it, but apparently we forgot to write it down on this ticket. Assigning to Daira to confirm and moving this ticket out of v1.11.

comment:4 Changed at 2013-12-12T15:56:02Z by daira

I said I didn't think it should be a blocker for 1.11 (not that it wasn't important). Zooko accurately described my reasoning.

comment:5 Changed at 2013-12-12T15:56:20Z by daira

  • Milestone changed from soon to 1.12.0

comment:6 Changed at 2013-12-12T15:57:57Z by daira

  • Component changed from unknown to code-peerselection
  • Keywords blocks-release removed

comment:7 Changed at 2016-03-22T05:02:25Z by warner

  • Milestone changed from 1.12.0 to 1.13.0

Milestone renamed

comment:8 Changed at 2016-06-28T18:17:14Z by warner

  • Milestone changed from 1.13.0 to 1.14.0

renaming milestone

comment:9 Changed at 2020-06-30T14:45:13Z by exarkun

  • Milestone changed from 1.14.0 to 1.15.0

Moving open issues out of closed milestones.

comment:10 Changed at 2021-03-30T18:40:19Z by meejah

  • Milestone changed from 1.15.0 to soon

Ticket retargeted after milestone closed
