Opened at 2008-11-25T20:04:28Z
Last modified at 2012-11-13T23:27:23Z
#540 new defect
inappropriate "uncoordinated write error" after handling a server failure
Reported by: | warner | Owned by: | kevan |
---|---|---|---|
Priority: | normal | Milestone: | soon |
Component: | code-mutable | Version: | 1.2.0 |
Keywords: | availability upload ucwe test-needed | Cc: | |
Launchpad Bug: |
Description
I noticed the automated "speedtest" failing with an unexpected Uncoordinated Write Error for the past few days. There were several issues involved, but the one for this ticket is as follows:
- mutable publish assigns shares to servers and sends out the write requests. Let's say that share 1 goes to server A, and share 2 goes to server B.
- for whatever reason, server A returns an error
- the publish process must find a new server for share 1, say it picks B
- the publish process sends a readv-and-testv-and-writev for share 1 to server B
- but it reuses the test vector from the first request (the one that wrote share 2), which includes a clause that says "the server should not have any unknown shares". This probably only hits when we're first creating the mutable file.
- server B receives the request for share 2, and accepts it, and responds with success
- server B then receives the request for share 1, looks at the test vector, and says "hey, but I already have a share (i.e. share 2)". The test vector does not match, so the write is rejected
- the publish process sees the rejected write and concludes that someone else must have written a share at the same time, so it throws Uncoordinated Write Error
So really the sole publisher is colliding with itself.
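The sequence above can be reproduced with a toy model. This is only a sketch: `ToyServer`, `test_and_write`, and `publish_with_stale_testv` are hypothetical names invented for illustration, and the "test vector" is reduced to the set of share numbers the writer believes the server holds. It is not the real storage-server API.

```python
class ToyServer:
    """Hypothetical stand-in for a storage server (not the Tahoe-LAFS API)."""

    def __init__(self):
        self.shares = {}  # share number -> share data

    def test_and_write(self, expected_shnums, shnum, data):
        # Simplified test vector: the write applies only if the server holds
        # exactly the share numbers the writer expects it to hold.
        if set(self.shares) != set(expected_shnums):
            return False
        self.shares[shnum] = data
        return True


def publish_with_stale_testv():
    server_b = ToyServer()
    # At file creation the publisher expects the server to hold no shares.
    stale_testv = frozenset()
    # Original assignment: share 2 goes to server B.
    ok_share2 = server_b.test_and_write(stale_testv, 2, b"share 2")
    # Server A failed, so share 1 is reassigned to B with the SAME test vector.
    ok_share1 = server_b.test_and_write(stale_testv, 1, b"share 1")
    return ok_share2, ok_share1
```

In this model the second write is rejected even though nobody else touched the server, which is the condition the publisher then misreports as an uncoordinated write.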
I think the fix would be to have the publisher keep track of which share requests it has sent, perhaps in the servermap (as "pending writes" or "proposed writes"). When the second writev request is generated, it should build a test vector based upon the pending write (so that it includes share 2).
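A minimal sketch of that idea, under stated assumptions: `PendingWriteTracker`, `ToyServer`, and `publish_tracking_pending` are hypothetical names, the test vector is simplified to a set of share numbers, and the toy server processes requests synchronously and in order. The real publisher and servermap are more involved.

```python
class ToyServer:
    """Hypothetical stand-in for a storage server (not the Tahoe-LAFS API)."""

    def __init__(self):
        self.shares = {}  # share number -> share data

    def test_and_write(self, expected_shnums, shnum, data):
        # Accept the write only if the held share numbers match expectations.
        if set(self.shares) != set(expected_shnums):
            return False
        self.shares[shnum] = data
        return True


class PendingWriteTracker:
    """Remembers which shares we have already sent to each server."""

    def __init__(self):
        self._sent = {}  # server -> set of share numbers already sent

    def testv_for(self, server):
        # Build the test vector from our own pending writes to this server.
        return frozenset(self._sent.get(server, ()))

    def record_sent(self, server, shnum):
        self._sent.setdefault(server, set()).add(shnum)


def publish_tracking_pending(server, shnums):
    tracker = PendingWriteTracker()
    results = []
    for shnum in shnums:
        testv = tracker.testv_for(server)  # includes earlier pending shares
        results.append(server.test_and_write(testv, shnum, b"data"))
        tracker.record_sent(server, shnum)
    return results
```

For example, `publish_tracking_pending(ToyServer(), [2, 1])` sends share 2 with an empty expectation and share 1 with an expectation of `{2}`, so both writes are accepted in this model.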
Change History (14)
comment:1 Changed at 2009-12-29T18:50:27Z by warner
comment:2 Changed at 2010-01-13T19:38:15Z by davidsarah
- Keywords availability added
comment:3 Changed at 2010-01-13T19:38:56Z by davidsarah
- Keywords upload added
comment:4 Changed at 2010-01-14T00:02:48Z by zooko
This might be related to #899, newly reported by Kyle Markley and Andrej Falout.
comment:5 Changed at 2010-03-24T22:43:35Z by davidsarah
- Keywords ucwe added
- Priority changed from major to critical
comment:6 Changed at 2010-05-26T14:42:23Z by zooko
- Milestone changed from undecided to 1.8.0
It's really bothering me that mutable file upload and download behavior is so finicky, buggy, inefficient, hard to understand, different from immutable file upload and download behavior, etc. So I'm putting a bunch of tickets into the "1.8" Milestone. I am not, however, at this time, volunteering to work on these tickets, so it might be a mistake to put them into the 1.8 Milestone, but I really hope that someone else will volunteer or that I will decide to do it myself. :-)
comment:7 Changed at 2010-05-28T02:37:18Z by kevan
- Owner set to kevan
I'm almost certain that I'll end up squashing this with MDMF, so I'll assign it to myself.
comment:8 Changed at 2010-08-10T03:37:42Z by davidsarah
- Milestone changed from 1.8.0 to 1.9.0
comment:9 Changed at 2010-08-10T04:24:38Z by zooko
If you like this ticket, you might like #546 (mutable-file surprise shares raise inappropriate UCWE).
comment:10 Changed at 2010-08-10T04:28:08Z by zooko
If you like this ticket, you might like #547 (mapupdate(MODE_WRITE) triggers on a false boundary).
comment:11 Changed at 2011-07-16T20:34:38Z by davidsarah
Kevan will look at whether his MDMF patches squash this.
comment:12 Changed at 2011-07-16T20:36:58Z by davidsarah
- Keywords test-needed added
comment:13 Changed at 2011-07-16T20:44:36Z by davidsarah
- Milestone changed from 1.9.0 to soon
comment:14 Changed at 2012-11-13T23:27:23Z by zooko
- Priority changed from critical to normal
I think the publisher can hit this for already-existing files too, where the first message says "I think you have sh1=ver1, here is sh1=ver2", and then (because of some other server having an error) it wants to add a second share to that same server, so it sends "I think you have sh1=ver1, here is sh2=ver2", and is surprised when the server says "actually I have sh1=ver2 you numbskull".
I think zooko's incident-2009-07-29-104230-vyc6byy.flog.bz2 in ticket #786 is related, but I haven't been able to figure it out exactly. It reports a surprise, but the log event says that their report matches our expectations, which makes me think that the code which logs the event is showing a different "expectation" than the one that was bundled in the testv portion of the share-write request. It feels like two messages being sent at the same time to the same server.
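The existing-file variant can be modelled the same way. Again this is only a sketch with hypothetical names (`VersionedToyServer`, `republish_existing_file`); the test vector is simplified to a mapping of share number to version, which is an assumption and not the real testv format.

```python
class VersionedToyServer:
    """Hypothetical server holding share number -> version (not the real API)."""

    def __init__(self, shares):
        self.shares = dict(shares)

    def test_and_write(self, expected, shnum, version):
        # Reject the write if our state differs from the writer's belief,
        # and report what we actually hold.
        if self.shares != expected:
            return False, dict(self.shares)
        self.shares[shnum] = version
        return True, dict(self.shares)


def republish_existing_file():
    server = VersionedToyServer({1: "ver1"})
    belief = {1: "ver1"}  # what the publisher thinks the server holds
    # "I think you have sh1=ver1, here is sh1=ver2"
    first = server.test_and_write(belief, 1, "ver2")
    # Another server failed, so sh2 is added here with the same stale belief:
    # "I think you have sh1=ver1, here is sh2=ver2"
    second = server.test_and_write(belief, 2, "ver2")
    return first, second
```

The second call fails in this model because the server now holds sh1=ver2, a state the publisher itself created with the first call.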