[tahoe-lafs-trac-stream] [tahoe-lafs] #2108: uploader should keep trying other servers if its initially-chosen servers fail during the "scan" phase

tahoe-lafs trac at tahoe-lafs.org
Fri Nov 15 21:37:32 UTC 2013


#2108: uploader should keep trying other servers if its initially-chosen servers
fail during the "scan" phase
-------------------------+-------------------------------------------------
     Reporter:  zooko    |      Owner:  daira
         Type:  defect   |     Status:  new
     Priority:  normal   |  Milestone:  1.11.0
    Component:  unknown  |    Version:  1.10.0
   Resolution:           |   Keywords:  brians-opinion-needed regression
Launchpad Bug:           |  upload servers-of-happiness blocks-release
-------------------------+-------------------------------------------------
Description changed by zooko:

Old description:

> In
> [source:trunk/src/allmydata/immutable/upload.py?annotate=blame&rev=196bd583b6c4959c60d3f73cdcefc9edda6a38ae#L390 v1.10 upload.py],
> during the uploader's "scan" phase (asking storage servers whether they
> already have, or would be willing to accept upload of, shares of this
> file), if the uploader's first chosen servers answer "no can do" or fail,
> it will keep asking more and more servers until it either succeeds at
> uploading or runs out of candidates.
>
> In
> [https://github.com/markberger/tahoe-lafs/blob/fa6f3cfb1e258e4427f35a7aada05c7ad2c9dd13/src/allmydata/immutable/upload.py#L284 1382-rewrite-2 upload.py]
> (which will hopefully be merged into trunk soon and released in the
> upcoming Tahoe-LAFS v1.11), the uploader instead chooses a few servers to
> ask, and if all of them fail it gives up on the upload.
>
> (I have a vague memory of discussing this on a conference call with the
> other Google Summer of Code Mentors and Mark Berger, and telling him to
> go ahead and do it this way, as it is simpler to implement. That might be
> a false memory.)
>
> Anyway, I'd like to revisit this issue. For some situations, this would
> be a regression from 1.10 to 1.11, i.e. 1.10 would successfully upload
> and 1.11 would say that the upload failed. Therefore I'm adding the
> keywords "regression" and "blocks-release" to this ticket.
>
> The reason to do it this way with a finite "scan" phase is that by first
> establishing which servers either already-have or are-willing-to-accept
> shares, we can then use our upload-strategy-of-happiness computation to
> plan which servers we want to upload to. Mixing planning with action is
> confusing, and the old 1.10 algorithm was hard to understand and had some
> undesirable behaviors. I suspect this is why we instructed Mark to go
> ahead with the simpler "phased" approach in #1382.
>
> However, now that I've seen the 1382-rewrite-2 branch up close, I think
> I'm starting to see how a variant of it wouldn't be ''too'' complicated,
> would have the property of "always achieves Happiness if it is possible
> to do so", and would avoid this regression.
>
> The idea would be that instead of:
>
> 1. Pick a few servers (e.g. 2*N).
> 2. Query them all about their state (whether they already have any shares
> and whether they would be willing to hold a share).
> 3. Wait until all queries resolve (by a response, a failure or network
> disconnect, or a timeout).
> 4. Run the "calculate-a-plan-for-happiness" algorithm.
>   a. If it is possible to achieve happiness, go on to the next phase of
> attempting to implement that plan (i.e. the uploading-shares phase).
>
> We would instead have a state machine which does something like this:
>
> 1. Let {{{R}}} be the set of servers who have responded to our queries by
> indicating that they either already have shares or would be willing to
> hold a share. At the beginning of the state machine, {{{R}}} is {{{∅}}}
> (the empty set). Let {{{A}}} be the set of all servers that we have heard
> about.
> 2. Run the "calculate-a-plan-for-happiness" algorithm on {{{R}}}.
>   a. If it is possible to achieve happiness, go on to the next phase of
> attempting to implement the plan (i.e. the uploading-shares phase).
>   b. If it is not possible, then pop the next server out of {{{A}}} and
> send it a query. When the query comes back, go to step 1.

New description:

 In
 [source:trunk/src/allmydata/immutable/upload.py?annotate=blame&rev=196bd583b6c4959c60d3f73cdcefc9edda6a38ae#L390 v1.10 upload.py],
 during the uploader's "scan" phase (asking storage servers whether they
 already have, or would be willing to accept upload of, shares of this
 file), if the uploader's first chosen servers answer "no can do" or fail,
 it will keep asking more and more servers until it either succeeds at
 uploading or runs out of candidates.
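
 To make that concrete, here is a minimal sketch of the 1.10-style "keep
 trying" loop. All names are illustrative, and the {{{query}}} callable
 is a stand-in the caller would supply; the real logic in upload.py is
 asynchronous and considerably more involved.

 {{{
 #!python
 def keep_trying_scan(candidates, homeless_shares, query):
     """Ask servers one by one until every share is placed or we run out.

     `candidates` is an iterable of servers, `homeless_shares` a set of
     share numbers still needing a home, and `query(server, shares)`
     returns the share numbers that server has or will accept (it may
     raise on server failure). All hypothetical, for illustration.
     """
     placements = {}  # sharenum -> server that has or will hold it
     for server in candidates:
         if not homeless_shares:
             break  # every share has a home: the scan succeeded
         try:
             accepted = query(server, homeless_shares)
         except Exception:
             continue  # a failed server just means we try the next one
         for sharenum in accepted:  # empty if the answer was "no can do"
             placements[sharenum] = server
             homeless_shares.discard(sharenum)
     if homeless_shares:
         raise RuntimeError("ran out of candidate servers")
     return placements
 }}}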

 In
 [https://github.com/markberger/tahoe-lafs/blob/fa6f3cfb1e258e4427f35a7aada05c7ad2c9dd13/src/allmydata/immutable/upload.py#L284 1382-rewrite-2 upload.py]
 (which will hopefully be merged into trunk soon and released in the
 upcoming Tahoe-LAFS v1.11), the uploader instead chooses a few servers to
 ask, and if all of them fail it gives up on the upload.

 (I have a vague memory of discussing this on a conference call with the
 other Google Summer of Code Mentors and Mark Berger, and telling him to go
 ahead and do it this way, as it is simpler to implement. That might be a
 false memory.)

 Anyway, I'd like to revisit this issue. For some situations, this would be
 a regression from 1.10 to 1.11, i.e. 1.10 would successfully upload and
 1.11 would say that the upload failed. Therefore I'm adding the keywords
 "regression" and "blocks-release" to this ticket.

 The reason to do it this way with a finite "scan" phase is that by first
 establishing which servers either already-have or are-willing-to-accept
 shares, we can then use our upload-strategy-of-happiness computation to
 plan which servers we want to upload to. Mixing planning with action is
 confusing, and the old 1.10 algorithm was hard to understand and had some
 undesirable behaviors. I suspect this is why we instructed Mark to go
 ahead with the simpler "phased" approach in #1382.

 However, now that I've seen the 1382-rewrite-2 branch up close, I think
 I'm starting to see how a variant of it wouldn't be ''too'' complicated,
 would have the property of "always achieves Happiness if it is possible to
 do so" and would relieve it of this regression.

 The idea would be that instead of the following (sketched in code after
 the list):

 1. Pick a few servers (e.g. 2*N).
 2. Query them all about their state (whether they already have any shares
 and whether they would be willing to hold a share).
 3. Wait until all queries resolve (by a response, a failure or network
 disconnect, or a timeout).
 4. Run the "calculate-a-plan-for-happiness" algorithm.
   a. If it is possible to achieve happiness, go on to the next phase of
 attempting to implement that plan (i.e. the uploading-shares phase).
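
 Roughly, in code (all names here are illustrative, with {{{query}}} and
 {{{calculate_plan}}} supplied by the caller; the real implementation in
 the 1382-rewrite-2 branch is asynchronous and Twisted-based):

 {{{
 #!python
 def phased_scan(all_servers, N, query, calculate_plan):
     # Step 1: pick a fixed batch of candidates up front (e.g. 2*N).
     batch = all_servers[:2 * N]

     # Steps 2-3: query every server in the batch and wait for all
     # answers to resolve; response, failure, and timeout are collapsed
     # here into a query() callable returning each server's state.
     responses = dict((server, query(server)) for server in batch)

     # Step 4: run the calculate-a-plan-for-happiness algorithm over
     # exactly this batch. No further servers are ever consulted, which
     # is the regression this ticket is about.
     plan = calculate_plan(responses)
     if plan is None:
         raise RuntimeError("give up, even if other servers would help")
     return plan  # 4a: go on to the uploading-shares phase
 }}}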

 We would instead have a state machine that does something like this (a
 sketch in code follows the list):

 1. Let {{{R}}} be the set of servers who have responded to our queries by
 indicating that they either already have shares or would be willing to
 hold a share. At the beginning of the state machine, {{{R}}} is {{{∅}}}
 (the empty set). Let {{{A}}} be the set of all servers that we have heard
 about.
 2. Run the "calculate-a-plan-for-happiness" algorithm on {{{R}}}.
   a. If it is possible to achieve happiness, go on to the next phase of
 attempting to implement the plan (i.e. the uploading-shares phase).
   b. If it is not possible, then pop the next server out of {{{A}}} and
 send it a query. When the query resolves (adding the server to {{{R}}} if
 it said yes), go to step 2. If {{{A}}} is empty, the upload fails.
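
 A sketch of that state machine (again, {{{query}}} and
 {{{calculate_plan}}} are illustrative stand-ins supplied by the caller):

 {{{
 #!python
 def incremental_scan(all_servers, query, calculate_plan):
     R = set()              # responders that have, or will accept, a share
     A = list(all_servers)  # every server we have heard about
     while True:
         # Step 2: run calculate-a-plan-for-happiness over R.
         plan = calculate_plan(R)
         if plan is not None:
             return plan    # 2a: happiness is achievable; upload shares
         # 2b: not achievable yet; query the next candidate, if any.
         if not A:
             raise RuntimeError("out of servers before reaching happiness")
         server = A.pop(0)
         if query(server):  # True if it has, or will accept, a share
             R.add(server)
         # ...then loop back to step 2 with the (possibly) larger R
 }}}

 Like the 1.10 loop, this only gives up once {{{A}}} is exhausted, but the
 planning step (calculate-a-plan-for-happiness) stays cleanly separated
 from the uploading step.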

--

-- 
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/2108#comment:2>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage

