[tahoe-lafs-trac-stream] [tahoe-lafs] #540: inappropriate "uncoordinated write error" after handling a server failure

tahoe-lafs trac at tahoe-lafs.org
Tue Nov 13 23:27:23 UTC 2012


#540: inappropriate "uncoordinated write error" after handling a server failure
-----------------------------+---------------------------------------------
     Reporter:  warner       |      Owner:  kevan
         Type:  defect       |     Status:  new
     Priority:  normal       |  Milestone:  soon
    Component:  code-mutable |    Version:  1.2.0
   Resolution:               |   Keywords:  availability upload ucwe
Launchpad Bug:               |              test-needed
-----------------------------+---------------------------------------------
Changes (by zooko):

 * priority:  critical => normal


New description:

 I noticed that the automated "speedtest" has been failing with an
 unexpected Uncoordinated Write Error for the past few days. Several issues
 were involved, but the one for this ticket is as follows:

  * Mutable publish assigns shares to servers and sends out requests. Let's
 say that share 1 goes to server A, and share 2 goes to server B.
  * For whatever reason, server A returns an error.
  * The publish process must find a new server for share 1; say it picks B.
  * The publish process sends a readv-and-testv-and-writev for share 1 to
 server B.
   * '''But''' it uses the same test vector that it used for the first
 request to B (the one that wrote share 2), which includes a clause that
 says "the server should not have any unknown shares". This probably only
 happens when we're first creating the mutable file.
  * Server B receives the request for share 2, accepts it, and responds
 with success.
  * Server B then receives the request for share 1, looks at the test
 vector, and says "hey, but I already have a share (i.e. share 2)", so the
 test vector does not match and the write is rejected.
  * The publish process sees the rejected write, concludes that someone
 else must have written a share at the same time, and throws
 Uncoordinated Write Error.

 So really the sole publisher is colliding with itself.
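
 The sequence above can be reproduced with a toy model. This is only a
 sketch, not Tahoe-LAFS code: ToyServer, test_and_write and naive_publish
 are hypothetical names, the test vector is reduced to "the set of share
 numbers the server is expected to hold", and requests are handled
 synchronously.

```python
# Toy model of the self-collision. Not Tahoe-LAFS code; all names are
# hypothetical and the test vector is simplified to a set of share numbers.

class UncoordinatedWriteError(Exception):
    pass


class ToyServer:
    """Holds the shares of one mutable file, keyed by share number."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail      # simulate a server that returns an error
        self.shares = {}

    def test_and_write(self, expected_shnums, shnum, data):
        """Write the share only if the server holds exactly expected_shnums."""
        if self.fail:
            raise IOError("%s returned an error" % self.name)
        if set(self.shares) != set(expected_shnums):
            return False      # test vector did not match, write rejected
        self.shares[shnum] = data
        return True


def naive_publish(placements, fallback):
    """Send every request with the same 'no shares yet' test vector."""
    initial_testv = frozenset()   # "the server should not have any shares"
    retry = []
    for shnum, server in placements:
        try:
            accepted = server.test_and_write(initial_testv, shnum, b"data")
        except IOError:
            retry.append(shnum)   # need a new home for this share
            continue
        if not accepted:
            raise UncoordinatedWriteError("share %d rejected" % shnum)
    for shnum in retry:
        # The stale test vector is reused for the replacement server.
        if not fallback.test_and_write(initial_testv, shnum, b"data"):
            raise UncoordinatedWriteError("share %d rejected" % shnum)


server_a = ToyServer("A", fail=True)
server_b = ToyServer("B")
try:
    naive_publish([(1, server_a), (2, server_b)], fallback=server_b)
    outcome = "published"
except UncoordinatedWriteError:
    outcome = "UCWE"
print(outcome, sorted(server_b.shares))   # prints: UCWE [2]
```

 Nobody else wrote to server B here. The only share it holds is the one
 this publisher put there, yet the publish still ends in UCWE.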

 I think the fix would be to have the publisher keep track of which share
 requests it has sent, perhaps in the servermap (as "pending writes", or
 "proposed writes"). When the second writev request is generated, it should
 build a test vector based upon the pending write (so that it includes
 share 2).
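
 A minimal sketch of that idea, under the same toy assumptions as the
 scenario described above (hypothetical names, a test vector reduced to a
 set of share numbers, synchronous requests). PendingWrites stands in for
 whatever the servermap would record; it is not an existing class.

```python
# Sketch of the proposed fix: build each test vector from the writes this
# publisher has already made. Not Tahoe-LAFS code; all names are hypothetical.

class UncoordinatedWriteError(Exception):
    pass


class ToyServer:
    """Holds the shares of one mutable file, keyed by share number."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail
        self.shares = {}

    def test_and_write(self, expected_shnums, shnum, data):
        """Write the share only if the server holds exactly expected_shnums."""
        if self.fail:
            raise IOError("%s returned an error" % self.name)
        if set(self.shares) != set(expected_shnums):
            return False
        self.shares[shnum] = data
        return True


class PendingWrites:
    """Remembers which share numbers we have placed on each server."""

    def __init__(self):
        self._by_server = {}

    def expected_on(self, server):
        # Shares we expect the server to hold: the ones we put there.
        return set(self._by_server.get(server.name, ()))

    def record(self, server, shnum):
        self._by_server.setdefault(server.name, set()).add(shnum)


def publish_with_pending(placements, fallback):
    pending = PendingWrites()

    def send(server, shnum):
        testv = pending.expected_on(server)   # rebuilt for every request
        if not server.test_and_write(testv, shnum, b"data"):
            raise UncoordinatedWriteError("share %d rejected" % shnum)
        # Synchronous toy: record once accepted. With in-flight requests
        # the publisher would need to record at send time instead.
        pending.record(server, shnum)

    retry = []
    for shnum, server in placements:
        try:
            send(server, shnum)
        except IOError:
            retry.append(shnum)
    for shnum in retry:
        send(fallback, shnum)


server_a = ToyServer("A", fail=True)
server_b = ToyServer("B")
try:
    publish_with_pending([(1, server_a), (2, server_b)], fallback=server_b)
    outcome = "published"
except UncoordinatedWriteError:
    outcome = "UCWE"
print(outcome, sorted(server_b.shares))   # prints: published [1, 2]
```

 A genuinely uncoordinated writer would still be caught: if someone else
 had placed a share on B, the set B holds would differ from what the
 publisher recorded, and the test would fail.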

-- 
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/540#comment:14>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage

