[tahoe-lafs-trac-stream] [tahoe-lafs] #546: mutable-file surprise shares raise inappropriate UCWE
tahoe-lafs
trac at tahoe-lafs.org
Tue Nov 13 23:27:36 UTC 2012
#546: mutable-file surprise shares raise inappropriate UCWE
------------------------------+--------------------------------------
Reporter: warner | Owner:
Type: defect | Status: new
Priority: normal | Milestone: soon
Component: code-mutable | Version: 1.2.0
Resolution: | Keywords: availability upload ucwe
Launchpad Bug: |
------------------------------+--------------------------------------
Changes (by zooko):
* priority: critical => normal
New description:
We added a "Thumper" to the allmydata.com prodnet today: 47 new nodes,
bringing the grid to a total of 111 nodes. Shortly afterwards, our
"trunk-prodnet" automated grid-tester failed:
http://allmydata.org/buildbot/builders/gutsy/builds/1195 . There were a
couple of different problems here, at least three tickets' worth. This
ticket is about a problem that is revealed by the addition of a large
number of nodes.

First, a quick summary of the problems:

* problem 1: mapupdate(MODE_WRITE) triggers on "1000" when it should use
  "1000$" (to avoid triggering on "10001") (ticket #547)

* problem 2: the mapupdate can hit a false boundary because the inserted
  gap was too large. It might be a good idea to increase epsilon for
  MODE_WRITE to reduce the chance of this. (ticket #549)

* problem 3: when mapupdate hits a false boundary (because of either of
  the above problems), the subsequent publish may send shares to servers
  which actually already have them. The code looks for these "surprise"
  shares and raises UCWE if it sees them. The UCWE is probably
  inappropriate; instead, a new writev should be sent to update the old
  share. (this ticket, #546)

* problem 4: a delete() which hits UCWE (because of problem 3, or other
  issues) will retry, but will fail the second time because
  must_exist=True is the default (ticket #550)

Now, some background. When a directory or mutable file is created, it
gets a public key, and from that it gets a storage index. The storage
index defines a permutation of storage servers: a sequence of serverids.
A permutation function is used which accepts as inputs a set of serverids
and the storage index, and it produces an ordered list of serverids (the
same size as the input set). The relative ordering of the serverids is
always the same for any given storage index. This means that adding new
servers (i.e. re-running the function with a larger set) will insert new
serverids into the output list at various places, but it will not change
the relative ordering of the existing ones.

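The order-preserving property described above holds for any sort keyed on
a hash of (storage index, serverid). The following sketch illustrates
that property; it is not Tahoe's actual permutation function, and the
names and hash choice here are illustrative assumptions:

```python
import hashlib

def permuted_order(serverids, storage_index):
    # Illustrative stand-in for the permutation function: sort serverids
    # by a hash keyed on the storage index. For a given storage index the
    # relative order of any two servers never changes, no matter which
    # other servers are present in the input set.
    def key(serverid):
        return hashlib.sha256(storage_index + serverid).digest()
    return sorted(serverids, key=key)
```

Re-running the sort with extra servers inserts the newcomers at
hash-determined positions, but a server that sorted before another keeps
doing so, which is exactly why "A B C D E" can become "A B new C D E".
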
A new mutable file's shares are placed on the first N (=10) writeable
servers in permuted order. Later, when the file is read or written, a
"share map update" operation is performed, to take the then-current list
of servers and build up a partial map of which shares are on which
servers. This mapupdate operation has several different modes, depending
upon what needs to be done with the servermap: MODE_READ is for
download-only, MODE_WRITE is for read-modify-write, and MODE_CHECK is for
file-checking and repair operations.

The mapupdate operation is designed to balance efficiency, speed, and
safety. Efficiency is served by querying as few servers as possible.
Speed is served mainly by minimizing round trips. Safety is a question of
improving consistency: there may be shares of various versions present on
the servers, and we desire to have a pretty good chance of seeing the
"current" version, and we want to update "most" shares when we modify the
file.

In MODE_READ, the mapupdate operation queries servers, in permuted order,
until it has received evidence of k (=3) plus "epsilon" (=k) shares of
the highest version, and has not seen evidence of any competing versions
(those with a higher sequence number). This is enough to retrieve the
file and also to reduce the possibility of retrieving an old version
(which, although annoying, would not really threaten data loss, because
MODE_READ is not used for directory modifications). Without the epsilon,
a small number of old shares would be enough to trick the operation into
not looking for a newer version (equivalently, a small group of colluding
servers could accomplish a rollback attack).

In MODE_WRITE, the mapupdate operation needs to be more thorough, for two
reasons. The first is that MODE_WRITE is for modification, which means
somebody is going to take the version that is returned and build a new
version on top of it (by modifying the contents). When that new version
is published, it will (ideally) completely replace any old versions. So
if the version that gets retrieved is not the current one, we'll lose the
contents of the current version, which will be a regression in the state
of the file. The second reason is that the eventual publish needs to
update all shares, to reduce the chance of leaving around shares of old
versions (and thus improving the safety of later operations).

So MODE_WRITE must try harder to locate all shares: it tries to strike a
different balance between efficiency, speed, and safety. In this mode,
the code is supposed to query servers in permuted order until:

1. it can recover some version of the file
2. it sees no evidence of older versions of the file
3. it has seen enough contiguous "I have no share for you" answers to
   conclude that it has searched beyond the active set of servers.
   Specifically, the code looks for a range of servers in which the
   first server has a share and the following epsilon (=k) servers do
   not; when it finds such a boundary, it declares the search complete.
4. or, it runs out of servers to query

There is a bug in the current implementation of MODE_WRITE, in which the
last condition (a "boundary" between the active set and the remaining
unused servers) is incorrectly triggered (the pattern-match looks for
"1000" but it should really look for "1000$", to avoid matching on
"10001"). But that is in a different ticket (#547).

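The "1000" vs "1000$" distinction can be made concrete by treating the
sequence of query answers as a string of 1s (server reported a share) and
0s (server reported none). This is a sketch of the pattern, not the
actual servermap code:

```python
import re

def boundary_found_buggy(answers, epsilon=3):
    # Buggy check: a share followed by epsilon empty answers ANYWHERE,
    # so it also fires on "10001", where a share lies beyond the gap.
    return re.search(r"10{%d}" % epsilon, answers) is not None

def boundary_found_fixed(answers, epsilon=3):
    # Fixed check: the gap must sit at the END of the answers seen so
    # far ("1000$"), so a later share cancels the boundary.
    return re.search(r"10{%d}$" % epsilon, answers) is not None
```

With answers "10001", the buggy check declares a boundary even though the
trailing "1" proves the active set continues; the anchored check does not.
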
This ticket is about the fact that the publish process doesn't keep track
of which shares it has sent to whom, so if it needs to make a second pass
(to place shares that were not accepted by their first destination), it
will get confused and signal an inappropriate UCWE.

When a single new server is added, it is effectively inserted at a random
location into the permuted peer list. The list that used to be "A B C D
E" becomes "A B new C D E". The mapupdate operation should tolerate this.
Ideally, some kind of rebalancing or repair operation should move the
shares up in the server list, continually moving the shares to their
"ideal" location (the one that improves the efficiency, speed, and safety
of publish/retrieve operations).

The publish operation will leave shares in place whenever possible. (It
currently has no mechanism to remove extra shares, and is obligated to
modify all shares in place, so it won't intentionally duplicate a share.)
So if we pretend that N=5 and the file was first created on "A1 B2 C3 D4
E5" (i.e. server A has share 1, server B has share 2, etc), then when
"new" is inserted, the map will look like "A1 B2 new(none) C3 D4 E5", and
no share will be placed on "new".

If lots of new servers are inserted, such that the map looks like "A B C
new new new D E", then the MODE_WRITE mapupdate may conclude that the
active set ends with server C: unless it spends the time+bandwidth to
query more servers, it won't see the shares on D and E. (If a gap of
three servers isn't convincing, imagine that the gap had 100 servers in
it.) In this case, the mapupdate will stop after the sixth query, and
report shares at "A1 B2 C3 new(none) new(none) new(none)". When the
subsequent publish takes place, it needs to put shares 4 and 5 somewhere,
so it will put them on the first two new servers, and the visible map
will wind up as "A1 B2 C3 new4 new5 new(none)". In fact, the full map
will be "A1 B2 C3 new4 new5 new(none) D(old4) E(old5)", since D and E
would not have been updated.

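The N=5 walk-through above can be simulated with a miniature model.
Everything here (function names, the share-placement rule) is a
hypothetical sketch for illustration, not the real ServermapUpdater or
Publish code:

```python
def mapupdate_mode_write(permuted, actual_shares, epsilon=3):
    # Query servers in permuted order; stop once a server with a share
    # is followed by epsilon consecutive share-less answers (the false
    # boundary discussed above). Returns the partial map of answers.
    seen = {}
    empties_after_share = 0
    for server in permuted:
        if server in actual_shares:
            seen[server] = actual_shares[server]
            empties_after_share = 0
        else:
            seen[server] = None
            if any(v is not None for v in seen.values()):
                empties_after_share += 1
                if empties_after_share >= epsilon:
                    break
    return seen

def publish(permuted, known_map, total_shares):
    # Place each unhoused share on the first queried server that
    # reported having none. Unqueried servers (like D and E) are never
    # touched, so their stale shares survive.
    placed = {s: sh for s, sh in known_map.items() if sh is not None}
    missing = [sh for sh in range(1, total_shares + 1)
               if sh not in placed.values()]
    for server in permuted:
        if not missing:
            break
        if known_map.get(server, "unqueried") is None:
            placed[server] = missing.pop(0)
    return placed
```

Running this with servers "A B C n1 n2 n3 D E" and shares 1-5 on A, B, C,
D, E reproduces the story: the mapupdate stops after the three empty
answers, and the publish houses shares 4 and 5 on the first two new
servers while D and E silently keep their old copies.
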
In effect, the shares have been migrated closer to the start of the
permuted list, and a few old shares will be left over towards the end of
the list. As long as the new servers stick around, the old shares will
never be seen. Some day, we'll have a repairer or a share-migrator
service that will spot these and delete them, but for now they're fairly
harmless. (If enough new servers disappear, then the old shares could be
dangerous, since they might cause rollback.) This process may happen
multiple times, as new batches of servers are added.

Now, if a medium number of new servers are added, the "unhoused shares"
(4 and 5 in the above example) might be assigned to servers which have
not been queried. If we use N=7 and start with "A1 B2 C3 new new new D4
E5 F6 G7", then the mapupdate code would stop at "A1 B2 C3 new new new",
and the publish code would need to find homes for shares 4-7. The three
new servers would get shares 4, 5, and 6, and share 7 would be assigned
to the pre-existing (but unqueried) server D. The publish code makes the
dubious assumption that if server D wasn't queried, then server D does
not have any shares.

As a result, a readv-and-testv-and-writev request is sent to server D
that asks it to accept the new version of share 7 (but only if it did not
already have a copy of share 7). The readv portion of the request reveals
that server D has a (now old) copy of share 4. The publish code responds
to this "surprise share" by raising UncoordinatedWriteError, since
normally (i.e. when a server that was queried and reported having no
share then reports having one in the tv-a-rv-a-wv response) this
indicates that someone else has written a share in the interim.

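The surprise-share check amounts to comparing the shares a server's readv
reveals against the shares we expected it to hold. This sketch uses
hypothetical names, not the real publish.py API:

```python
class UncoordinatedWriteError(Exception):
    pass

def check_write_response(expected_share, readv_result):
    # readv_result maps share numbers the server actually holds to their
    # data. Any share outside the expected set is a "surprise", and the
    # current publish code treats it as evidence of an uncoordinated
    # write, even when it is really just a stale share the mapupdate
    # never saw.
    surprises = set(readv_result) - {expected_share}
    if surprises:
        raise UncoordinatedWriteError(
            "server holds unexpected shares: %s" % sorted(surprises))
```

In the scenario above, sending share 7 to server D while D still holds
the old share 4 trips this check, even though no second writer exists.
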
This UCWE is "problem 3" above. If the mapupdate had queried this server,
it would have learned about share 4, and it would have done two things
differently: it would update D with the new version of share 4, and it
would not have sent share 4 to one of the new servers.

The trick is how to give it the opportunity to learn about this share
while still achieving the desired speed and efficiency. Maybe the rule
should be that no tv-a-rv-a-wv requests shall ever be sent to a server
that hasn't already been queried (i.e. require two round trips, the readv
of which might either be done in the mapupdate or in the publish).
Another possibility is to do it reactively: instead of throwing UCWE when
we see the surprise share, just send out another query to replace it.

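The reactive alternative can be sketched as follows: on a surprise share,
queue a follow-up writev that updates each stale share in place instead
of raising UCWE. `resend_writev` is a hypothetical callback standing in
for the real remote call:

```python
def handle_surprise(server, expected_share, readv_result, resend_writev):
    # Reactive recovery: each share the readv revealed that we did not
    # expect gets a follow-up writev bringing it up to the new version,
    # rather than aborting the whole publish with UCWE.
    surprises = sorted(set(readv_result) - {expected_share})
    for sharenum in surprises:
        resend_writev(server, sharenum)
    return surprises
```

In the example above, server D's surprise copy of share 4 would simply be
rewritten to the new version, which is also what a fully-informed
mapupdate would have arranged in the first place.
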
Reducing the likelihood of this situation can help, but won't remove the
issue entirely. Even if the MODE_WRITE mapupdate used an epsilon equal to
N (so that we were sure we'd have answers from N servers, in the hopes of
not having to walk beyond the end of the set we've already queried), some
of those servers might not be able to accept the write, forcing us off
into uncharted territory.
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/546#comment:12>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage
More information about the tahoe-lafs-trac-stream
mailing list