[tahoe-lafs-trac-stream] [tahoe-lafs] #1641: fix regressions in convergent uncoordinated write detection
tahoe-lafs
trac at tahoe-lafs.org
Sun Dec 18 01:52:42 UTC 2011
#1641: fix regressions in convergent uncoordinated write detection
-------------------------+-----------------------
Reporter: kevan | Owner: nobody
Type: defect | Status: new
Priority: major | Milestone: undecided
Component: unknown | Version: 1.9.0
Resolution: | Keywords:
Launchpad Bug: |
-------------------------+-----------------------
Description changed by zooko:
New description:
[comment:4:ticket:546 Comment 4 in ticket 546] describes a fix made to the
mutable publisher to allow it to tolerate unknown shares in certain
circumstances without raising an !UncoordinatedWriteError. In the pre-1.9
publisher, this is implemented by comparing the checkstring (seqnum,
roothash, salt) of the unexpected share with the checkstring of the file
being published. If they're the same, then the unexpected share is assumed
to be either a share that the publish operation placed earlier or an
uncoordinated convergent write, and tolerated without an uncoordinated
write error. The 1.9 publisher changes this behavior in two ways.
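For reference, the pre-1.9 rule boils down to a single comparison; the
following is a minimal sketch with hypothetical names (the real logic lives
in the publisher's write-answer handling, not in a standalone function):

{{{
# Hypothetical sketch of the pre-1.9 tolerance rule from #546.  A surprise
# share is tolerated only when its checkstring matches the version being
# published; anything else is treated as an uncoordinated write.

def is_tolerable_surprise(surprise_checkstring, publish_checkstring):
    """Both arguments are (seqnum, roothash, salt) tuples."""
    # A match means the share is either one this publish placed earlier or
    # the work of a convergent uncoordinated writer, so no
    # UncoordinatedWriteError needs to be raised for it.
    return surprise_checkstring == publish_checkstring
}}}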
The first change is a bug. The checkstring that the check examines is set
on lines 296, 496, and 828 of publish.py:
{{{
self._checkstring = self.writers.values()[0].get_checkstring()
}}}
{{{self.writers.values()[0]}}} can be an instance of either
{{{MDMFSlotWriteProxy}}} or {{{SDMFSlotWriteProxy}}}.
{{{MDMFSlotWriteProxy}}} returns a different checkstring than
{{{SDMFSlotWriteProxy}}}; specifically, {{{MDMFSlotWriteProxy}}} returns
the checkstring associated with the file version we're writing, while
{{{SDMFSlotWriteProxy}}} returns the checkstring associated with the
existing share (if any). Only {{{MDMFSlotWriteProxy}}} returns the
checkstring associated with the current version of the mutable file, which
is necessary in order for the #546 check to behave the same as in the
pre-1.9 publisher. The fix for this issue is to change
{{{SDMFSlotWriteProxy}}} to return the same checkstring as
{{{MDMFSlotWriteProxy}}}.
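A rough sketch of the direction of that fix, with hypothetical attribute
names (the real proxy classes track considerably more state):

{{{
# Hypothetical sketch: have the SDMF write proxy report the checkstring of
# the version being written, as MDMFSlotWriteProxy already does, rather than
# the checkstring of whatever share already exists on the server.

class SDMFSlotWriteProxySketch:
    def __init__(self, new_seqnum, new_root_hash, new_salt):
        # Checkstring describing the version this publish is writing.
        self._new_checkstring = (new_seqnum, new_root_hash, new_salt)
        # Checkstring of an existing share, if one is found on the server;
        # kept around separately for callers that really want it.
        self._existing_checkstring = None

    def get_checkstring(self):
        # Pre-fix behavior effectively returned the existing share's
        # checkstring; returning the new version's checkstring instead lets
        # the #546 comparison in publish.py behave as it did before 1.9.
        return self._new_checkstring
}}}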
The second change is a design flaw. On line 987, I added the following:
{{{
# We need to remove from surprise_shares any shares that we are
# knowingly also writing to that server from other writers.

# TODO: Precompute this.
known_shnums = [x.shnum for x in self.writers.values()
                if x.server == server]
surprise_shares -= set(known_shnums)
self.log("found the following surprise shares: %s" %
         str(surprise_shares))
}}}
which essentially exempts from the #546 check any surprise share that we
know we're supposed to be writing during the publish operation. The 1.9
publisher offers no guarantees that all writes to a particular server will
return before {{{_got_write_answer}}} is called to handle the results for
a particular write. So a surprise share that is associated with a
convergent and concurrent write might have either the checkstring of the
current publish operation or the checkstring of the version associated
with the existing share. The #546 check only accepts the share in the
first case, which is probably why I added the exemption. It would be
better to modify the #546 check to be specific about the second case
instead of exempting all shares whose numbers match those we're writing.
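A sketch of what that more specific check might look like, with hypothetical
names for the two checkstrings involved:

{{{
# Hypothetical sketch of a #546 check that names both checkstrings a
# convergent, concurrent writer can leave behind, instead of exempting every
# share number we happen to be writing ourselves.

def surprise_is_tolerable(surprise_cs, publish_cs, superseded_cs):
    """Each argument is a (seqnum, roothash, salt) checkstring."""
    if surprise_cs == publish_cs:
        # The share already carries the version we are publishing: either we
        # placed it earlier in this operation or a convergent writer did.
        return True
    if surprise_cs == superseded_cs:
        # The share still carries the version we are superseding: a
        # convergent writer's write to it simply has not landed yet.
        return True
    # Any other checkstring is real evidence of an uncoordinated write.
    return False
}}}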
Alternatively, the #546 check could be retained as-is if we alter the
publisher's control flow so that {{{_got_write_answer}}} is only executed
for a response from a particular server after all writes to that server
have completed. Since the publisher is designed to follow the existing
share placement when placing a new version of a mutable file, it is likely
that uncoordinated writers would try to place the same shares in the same
places as one another. The exemption that is there now hurts the
publisher's ability to detect this situation.
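A rough sketch of that alternative control flow, assuming Twisted-style
deferreds and hypothetical helper and method names (the real publisher wires
its writers up differently):

{{{
# Hypothetical sketch: gather every write destined for one server and only
# run the answer handler after all of them have returned, so the surprise-
# share logic never examines a half-finished batch of our own writes.
from twisted.internet import defer

def publish_to_server(server, writers_for_server, got_all_answers):
    # writers_for_server: the slot-write proxies whose shares live on
    # `server`.  Each hypothetical writer.send() returns a Deferred that
    # fires with the server's answer for one share.
    dl = defer.DeferredList([w.send() for w in writers_for_server],
                            consumeErrors=True)
    # got_all_answers stands in for _got_write_answer, but is invoked once
    # per server, after every write to that server has completed.
    dl.addCallback(got_all_answers, server)
    return dl
}}}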
The practical impact of the first regression is that SDMF publish
operations are less able to figure out when they need to abort a publish
and try again after another map update. The practical impact of the second
regression is that the publisher might not detect uncoordinated writes
that it would have been able to detect before 1.9, and that it might take
longer to detect uncoordinated writes than before 1.9.
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1641#comment:2>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage