[tahoe-lafs-trac-stream] [tahoe-lafs] #546: mutable-file surprise shares raise inappropriate UCWE
tahoe-lafs
trac at tahoe-lafs.org
Tue Nov 13 23:27:36 UTC 2012
#546: mutable-file surprise shares raise inappropriate UCWE
------------------------------+--------------------------------------
Reporter: warner | Owner:
Type: defect | Status: new
Priority: normal | Milestone: soon
Component: code-mutable | Version: 1.2.0
Resolution: | Keywords: availability upload ucwe
Launchpad Bug: |
------------------------------+--------------------------------------
Changes (by zooko):
* priority: critical => normal
New description:
We added a "Thumper" to the allmydata.com prodnet today: 47 new nodes,
bringing the grid to a total of 111 nodes. Shortly afterwards, our
"trunk-prodnet" automated grid-tester failed:
http://allmydata.org/buildbot/builders/gutsy/builds/1195 . There were a
couple of different problems here, at least three tickets' worth. This
ticket is about a problem that is revealed by the addition of a large
number of nodes.

First, a quick summary of the problems:

* problem 1: mapupdate(MODE_WRITE) triggers on "1000" when it should use
  "1000$" (to avoid triggering on "10001") (ticket #547)

* problem 2: the mapupdate can hit a false boundary because the inserted
  gap was too large. It might be a good idea to increase epsilon for
  MODE_WRITE to reduce the chance of this. (ticket #549)

* problem 3: when mapupdate hits a false boundary (because of either of
  the above problems), the subsequent publish may send shares to servers
  which actually already have them. The code looks for these "surprise"
  shares and raises UCWE if it sees them. The UCWE is probably
  inappropriate; instead, a new writev should be sent to update the old
  share. (this ticket, #546)

* problem 4: a delete() which hits UCWE (because of problem 3, or other
  issues) will retry, but will fail the second time because
  must_exist=True is the default (ticket #550)

Now, some background. When a directory or mutable file is created, it
gets a public key, and from that it gets a storage index. The storage
index defines a permutation of storage servers: a sequence of serverids.
A permutation function is used which accepts as inputs a set of serverids
and the storage index, and it produces an ordered list of serverids (the
same size as the input set). The relative ordering of the serverids is
always the same for any given storage index. This means that adding new
servers (i.e. re-running the function with a larger set) will insert new
serverids into the output list at various places, but it will not change
the relative ordering of the existing ones.

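The order-preserving property described above holds for any sort keyed on
a hash of (storage index, serverid). The following sketch illustrates
that property; it is not Tahoe's actual permutation function, and the
names and hash choice here are illustrative assumptions:

```python
import hashlib

def permuted_order(serverids, storage_index):
    # Illustrative stand-in for the permutation function: sort serverids
    # by a hash keyed on the storage index. For a given storage index the
    # relative order of any two servers never changes, no matter which
    # other servers are present in the input set.
    def key(serverid):
        return hashlib.sha256(storage_index + serverid).digest()
    return sorted(serverids, key=key)
```

Re-running the sort with extra servers inserts the newcomers at
hash-determined positions, but a server that sorted before another keeps
doing so, which is exactly why "A B C D E" can become "A B new C D E".
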
A new mutable file's shares are placed on the first N (=10) writeable
servers in permuted order. Later, when the file is read or written, a
"share map update" operation is performed, to take the then-current list
of servers and build up a partial map of which shares are on which
servers. This mapupdate operation has several different modes, depending
upon what needs to be done with the servermap: MODE_READ is for
download-only, MODE_WRITE is for read-modify-write, and MODE_CHECK is for
file-checking and repair operations.

The mapupdate operation is designed to balance efficiency, speed, and
safety. Efficiency is served by querying as few servers as possible.
Speed is served mainly by minimizing round trips. Safety is a question of
improving consistency: there may be shares of various versions present on
the servers, and we desire to have a pretty good chance of seeing the
"current" version, and we want to update "most" shares when we modify the
file.

In MODE_READ, the mapupdate operation queries servers, in permuted order,
until it has received evidence of k (=3) plus "epsilon" (=k) shares of
the highest version, and has not seen evidence of any competing versions
(those with a higher sequence number). This is enough to retrieve the
file and also to reduce the possibility of retrieving an old version
(which, although annoying, would not really threaten data loss, because
MODE_READ is not used for directory modifications). Without the epsilon,
a small number of old shares would be enough to trick the operation into
not looking for a newer version (equivalently, a small group of colluding
servers could accomplish a rollback attack).

In MODE_WRITE, the mapupdate operation needs to be more thorough, for two
reasons. The first is that MODE_WRITE is for modification, which means
somebody is going to take the version that is returned and build a new
version on top of it (by modifying the contents). When that new version
is published, it will (ideally) completely replace any old versions. So
if the version that gets retrieved is not the current one, we'll lose the
contents of the current version, which will be a regression in the state
of the file. The second reason is that the eventual publish needs to
update all shares, to reduce the chance of leaving around shares of old
versions (and thus improving the safety of later operations).

So MODE_WRITE must try harder to locate all shares: it tries to strike a
different balance between efficiency, speed, and safety. In this mode,
the code is supposed to query servers in permuted order until:

1. it can recover some version of the file
2. it sees no evidence of older versions of the file
3. it has seen enough contiguous "I have no share for you" answers to
   conclude that it has searched beyond the active set of servers.
   Specifically, the code looks for a range of servers in which the
   first server has a share and the following epsilon (=k) servers do
   not; when it finds such a boundary, it declares the search complete.
4. or, it runs out of servers to query

There is a bug in the current implementation of MODE_WRITE, in which the
last condition (a "boundary" between the active set and the remaining
unused servers) is incorrectly triggered (the pattern-match looks for
"1000" but it should really look for "1000$", to avoid matching on
"10001"). But that is in a different ticket (#547).

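The "1000" vs "1000$" distinction can be made concrete by treating the
sequence of query answers as a string of 1s (server reported a share) and
0s (server reported none). This is a sketch of the pattern, not the
actual servermap code:

```python
import re

def boundary_found_buggy(answers, epsilon=3):
    # Buggy check: a share followed by epsilon empty answers ANYWHERE,
    # so it also fires on "10001", where a share lies beyond the gap.
    return re.search(r"10{%d}" % epsilon, answers) is not None

def boundary_found_fixed(answers, epsilon=3):
    # Fixed check: the gap must sit at the END of the answers seen so
    # far ("1000$"), so a later share cancels the boundary.
    return re.search(r"10{%d}$" % epsilon, answers) is not None
```

With answers "10001", the buggy check declares a boundary even though the
trailing "1" proves the active set continues; the anchored check does not.
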
This ticket is about the fact that the publish process doesn't keep track
of which shares it has sent to whom, so if it needs to make a second pass
(to place shares that were not accepted by their first destination), it
will get confused and signal an inappropriate UCWE.

When a single new server is added, it is effectively inserted at a random
location into the permuted peer list. The list that used to be "A B C D
E" becomes "A B new C D E". The mapupdate operation should tolerate this.
Ideally, some kind of rebalancing or repair operation should move the
shares up in the server list, continually moving the shares to their
"ideal" location (the one that improves the efficiency, speed, and safety
of publish/retrieve operations).

The publish operation will leave shares in place whenever possible. (It
currently has no mechanism to remove extra shares, and is obligated to
modify all shares in place, so it won't intentionally duplicate a share.)
So if we pretend that N=5 and the file was first created on "A1 B2 C3 D4
E5" (i.e. server A has share 1, server B has share 2, etc), then when
"new" is inserted, the map will look like "A1 B2 new(none) C3 D4 E5", and
no share will be placed on "new".

If lots of new servers are inserted, such that the map looks like "A B C
new new new D E", then the MODE_WRITE mapupdate may conclude that the
active set ends with server C: unless it spends the time+bandwidth to
query more servers, it won't see the shares on D and E. (If a gap of
three servers isn't convincing, imagine that the gap had 100 servers in
it.) In this case, the mapupdate will stop after the sixth query, and
report shares at "A1 B2 C3 new(none) new(none) new(none)". When the
subsequent publish takes place, it needs to put shares 4 and 5 somewhere,
so it will put them on the first two new servers, and the visible map
will wind up as "A1 B2 C3 new4 new5 new(none)". In fact, the full map
will be "A1 B2 C3 new4 new5 new(none) D(old4) E(old5)", since D and E
would not have been updated.

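The N=5 walk-through above can be simulated with a miniature model.
Everything here (function names, the share-placement rule) is a
hypothetical sketch for illustration, not the real ServermapUpdater or
Publish code:

```python
def mapupdate_mode_write(permuted, actual_shares, epsilon=3):
    # Query servers in permuted order; stop once a server with a share
    # is followed by epsilon consecutive share-less answers (the false
    # boundary discussed above). Returns the partial map of answers.
    seen = {}
    empties_after_share = 0
    for server in permuted:
        if server in actual_shares:
            seen[server] = actual_shares[server]
            empties_after_share = 0
        else:
            seen[server] = None
            if any(v is not None for v in seen.values()):
                empties_after_share += 1
                if empties_after_share >= epsilon:
                    break
    return seen

def publish(permuted, known_map, total_shares):
    # Place each unhoused share on the first queried server that
    # reported having none. Unqueried servers (like D and E) are never
    # touched, so their stale shares survive.
    placed = {s: sh for s, sh in known_map.items() if sh is not None}
    missing = [sh for sh in range(1, total_shares + 1)
               if sh not in placed.values()]
    for server in permuted:
        if not missing:
            break
        if known_map.get(server, "unqueried") is None:
            placed[server] = missing.pop(0)
    return placed
```

Running this with servers "A B C n1 n2 n3 D E" and shares 1-5 on A, B, C,
D, E reproduces the story: the mapupdate stops after the three empty
answers, and the publish houses shares 4 and 5 on the first two new
servers while D and E silently keep their old copies.
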
In effect, the shares have been migrated closer to the start of the
permuted list, and a few old shares will be left over towards the end of
the list. As long as the new servers stick around, the old shares will
never be seen. Some day, we'll have a repairer or a share-migrator
service that will spot these and delete them, but for now they're fairly
harmless. (If enough new servers disappear, then the old shares could be
dangerous, since they might cause rollback.) This process may happen
multiple times, as new batches of servers are added.

Now, if a medium number of new servers are added, the "unhoused shares"
(4 and 5 in the above example) might be assigned to servers which have
not been queried. If we use N=7 and start with "A1 B2 C3 new new new D4
E5 F6 G7", then the mapupdate code would stop at "A1 B2 C3 new new new",
and the publish code would need to find homes for shares 4-7. The three
new servers would get shares 4, 5, and 6, and share 7 would be assigned
to the pre-existing (but unqueried) server D. The publish code makes the
dubious assumption that if server D wasn't queried, then server D does
not have any shares.

As a result, a readv-and-testv-and-writev request is sent to server D
that asks it to accept the new version of share 7 (but only if it did not
already have a copy of share 7). The readv portion of the request reveals
that server D has a (now old) copy of share 4. The publish code responds
to this "surprise share" by raising UncoordinatedWriteError, since
normally (i.e. when a server that was queried and reported having no
share then reports having one in the tv-a-rv-a-wv response) this
indicates that someone else has written a share in the interim.

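The surprise-share check amounts to comparing the shares a server's readv
reveals against the shares we expected it to hold. This sketch uses
hypothetical names, not the real publish.py API:

```python
class UncoordinatedWriteError(Exception):
    pass

def check_write_response(expected_share, readv_result):
    # readv_result maps share numbers the server actually holds to their
    # data. Any share outside the expected set is a "surprise", and the
    # current publish code treats it as evidence of an uncoordinated
    # write, even when it is really just a stale share the mapupdate
    # never saw.
    surprises = set(readv_result) - {expected_share}
    if surprises:
        raise UncoordinatedWriteError(
            "server holds unexpected shares: %s" % sorted(surprises))
```

In the scenario above, sending share 7 to server D while D still holds
the old share 4 trips this check, even though no second writer exists.
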
This UCWE is "problem 3" above. If the mapupdate had queried this server,
it would have learned about share 4, and it would have done two things
differently: it would update D with the new version of share 4, and it
would not have sent share 4 to one of the new servers.

The trick is how to give it the opportunity to learn about this share
while still achieving the desired speed and efficiency. Maybe the rule
should be that no tv-a-rv-a-wv requests shall ever be sent to a server
that hasn't already been queried (i.e. require two round trips, the readv
of which might either be done in the mapupdate or in the publish).
Another possibility is to do it reactively: instead of throwing UCWE when
we see the surprise share, just send out another query to replace it.

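The reactive alternative can be sketched as follows: on a surprise share,
queue a follow-up writev that updates each stale share in place instead
of raising UCWE. `resend_writev` is a hypothetical callback standing in
for the real remote call:

```python
def handle_surprise(server, expected_share, readv_result, resend_writev):
    # Reactive recovery: each share the readv revealed that we did not
    # expect gets a follow-up writev bringing it up to the new version,
    # rather than aborting the whole publish with UCWE.
    surprises = sorted(set(readv_result) - {expected_share})
    for sharenum in surprises:
        resend_writev(server, sharenum)
    return surprises
```

In the example above, server D's surprise copy of share 4 would simply be
rewritten to the new version, which is also what a fully-informed
mapupdate would have arranged in the first place.
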
Reducing the likelihood of this situation can help, but won't remove the
issue entirely. Even if the MODE_WRITE mapupdate used an epsilon equal to
N (so that we were sure we'd have answers from N servers, in the hopes of
not having to walk beyond the end of the set we've already queried), some
of those servers might not be able to accept the write, forcing us off
into uncharted territory.
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/546#comment:12>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage
More information about the tahoe-lafs-trac-stream
mailing list