Opened at 2008-01-11T05:10:22Z
Last modified at 2008-05-14T20:21:51Z
#272 closed defect
mutable update can't recover from <k new shares — at Initial Version
Reported by: | warner | Owned by: | |
---|---|---|---|
Priority: | critical | Milestone: | 1.1.0 |
Component: | code-mutable | Version: | 0.7.0 |
Keywords: | mutable recovery | Cc: | |
Launchpad Bug: | | | |
Description
If a mutable slot has one or two shares at a newer version than the rest, our current SDMF update code is unable to recover: all attempts to write a newer version will fail with an UncoordinatedWriteError. The necessary fix is twofold. The first part is to implement mutable-slot recovery: replace all shares with a version that is newer than the loner, by promoting the older-but-more-popular-and-also-recoverable version to a later seqnum. The second part is to jump directly to recovery when the post-query, pre-write sharemap (in Publish._got_all_query_results) shows that someone has a newer version than the one we want to write, and then either retry the read-and-replace or defer to the application.
Oh how I want the thorough unit tests envisioned in #270! They would have caught this earlier.
This problem happened to Peter's directory because of the accidental nodeid-change described in #269. But it could occur without that problem with the right sort of network partition at the right time.
Ok, on to the bug:
Suppose that share1 is placed on peer1, share2 on peer2, etc., and that we're doing 3-of-10 encoding. Now suppose that an update is interrupted in such a way that peer9 is updated but peers 0 through 8 are not. Now most peers are at version 4, but that one loner peer is at version 5. (In the case of Peter's directory, most of the servers were upgraded incorrectly and changed their nodeids, thus changing the write_enablers, and the loner peer was the one that was either upgraded correctly or not upgraded at all.)
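For concreteness, here's a toy picture of that state. The `sharemap` dict and its layout below are just an illustration of the scenario, not the real data structures:

```python
# Toy illustration of the post-interruption state, assuming 3-of-10 encoding.
# Peers 0 through 8 still hold seqnum-4 shares; peer 9 holds the loner
# seqnum-5 share left behind by the interrupted update.
from collections import Counter

k, N = 3, 10  # 3-of-10: any k shares recover the file, N shares total

# hypothetical sharemap: peer index -> (seqnum, share number)
sharemap = {peer: (4, peer) for peer in range(9)}
sharemap[9] = (5, 9)

shares_per_version = Counter(seqnum for (seqnum, _shnum) in sharemap.values())
print(shares_per_version)  # Counter({4: 9, 5: 1}): version 4 is recoverable, 5 is not
```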
If we read this slot, we read 'k' plus epsilon shares (four or five, I think). We see everyone is at version 4 (rather, everyone we see is at version 4), so we conclude that 4 is the most recent version. This is fine, because in the face of uncoordinated writes, we're allowed to return any version that was ever written.
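A rough sketch of why the read pass settles on version 4, assuming it picks the highest version for which it collected at least k distinct shares. The names below are invented for illustration; this is not the actual mutable-retrieve code:

```python
# Rough sketch of the read pass's version choice: query only k-plus-epsilon
# peers, then pick the highest version for which at least k distinct shares
# were seen.
from collections import defaultdict

def choose_version(responses, k):
    """responses: list of (seqnum, share_number) from the peers we queried."""
    shares_by_version = defaultdict(set)
    for seqnum, shnum in responses:
        shares_by_version[seqnum].add(shnum)
    # only versions with at least k distinct shares are recoverable
    recoverable = [v for v, shnums in shares_by_version.items() if len(shnums) >= k]
    return max(recoverable) if recoverable else None

# querying five of the ten peers happens to miss peer 9, so only version 4 is visible
responses = [(4, 0), (4, 1), (4, 2), (4, 3), (4, 4)]
print(choose_version(responses, k=3))  # -> 4
```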
Now, the next time we want to update this slot, mutable.Publish gets control. This does read-before-replace, and it is written to assume that we can use a seqnum one larger than the value that was last read. So we prepare to send out shares at version "5".
We start by querying N+epsilon peers, to build up a sharemap of the seqnum+roothash held by every server we're thinking of sending a share to, so we can decide which shares to send to whom and build up the testv list. This sharemap is processed in Publish._got_all_query_results. That method has a check that raises UncoordinatedWriteError if one of the queries reports the existence of a share that is newer than the seqnum we already knew about. In this case, the response from peer9 shows seqnum=5, which is equal to or greater than the "5" that we wanted to send out. This is evidence of an uncoordinated write, because our read pass managed to extract version 4 from the grid, but our query pass shows evidence of a version 5. We can tolerate lots of 5s and one or two 4s (because then the read pass would have been unable to reconstruct version 4, would have kept searching, and would eventually have reconstructed version 5), but we can't tolerate lots of 4s and one or two 5s.
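The check is roughly shaped like this. This is a simplified stand-in for the real logic in Publish._got_all_query_results, not a copy of it:

```python
# Simplified stand-in for the check described above.
class UncoordinatedWriteError(Exception):
    pass

def check_query_results(query_results, intended_seqnum):
    """query_results: peer -> list of (seqnum, roothash) for shares held there."""
    for peer, shares in query_results.items():
        for seqnum, roothash in shares:
            if seqnum >= intended_seqnum:
                # someone already has a share at least as new as the one we
                # were about to write: our read pass must have missed it
                raise UncoordinatedWriteError(
                    "peer %s has seqnum %d >= our intended %d"
                    % (peer, seqnum, intended_seqnum))

query_results = {peer: [(4, "roothash4")] for peer in range(9)}
query_results[9] = [(5, "roothash5")]
try:
    check_query_results(query_results, intended_seqnum=5)
except UncoordinatedWriteError as e:
    print("spooked:", e)
```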
So this is the problem: we're spooked into an UncoordinatedWriteError by the loner new share, and since we don't yet have any recovery code, we can't fix the situation. If we tried to write out the new shares anyway, we could probably get a quorum of our new seqnum=5 shares, and if at the next update we managed to reconstruct version 5, then we'd push out seqnum=6 shares to everybody, and the problem would go away.
We need recovery to handle this. When the UncoordinatedWriteError is detected during the query phase, we should pass the sharemap to the recovery code, which should pick a version to reinforce (according to the rules we came up with in mutable.txt), and then send out shares as necessary to make that version the dominant one. (If I recall correctly, I think that means we would take the data from version 4, re-encode it into a version 6, then push 6 out to everyone).
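In sketch form, the recovery decision might look like the following. All names are hypothetical; the real version-selection rules are the ones in mutable.txt, and the real code would actually re-encode the data and push shares to the servers:

```python
# Hedged sketch of the recovery decision: reinforce the newest *recoverable*
# version, re-published at a seqnum strictly greater than anything seen anywhere.
from collections import defaultdict

def plan_recovery(sharemap, k):
    """sharemap: peer -> (seqnum, share_number), as gathered by the query phase."""
    shares_by_version = defaultdict(set)
    for seqnum, shnum in sharemap.values():
        shares_by_version[seqnum].add(shnum)
    recoverable = [v for v, shnums in shares_by_version.items() if len(shnums) >= k]
    version_to_reinforce = max(recoverable)   # version 4 in this scenario
    new_seqnum = max(shares_by_version) + 1   # 6: beats the loner's 5
    return version_to_reinforce, new_seqnum

sharemap = {peer: (4, peer) for peer in range(9)}
sharemap[9] = (5, 9)
print(plan_recovery(sharemap, k=3))  # -> (4, 6)
```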
Once recovery is done, the Publish attempt should still fail with an UncoordinatedWriteError, as a signal to the application that their write attempt might have failed. The application should perform a new read and possibly attempt to make its modification again.
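One possible shape of that application-level contract, just to show the control flow: treat UncoordinatedWriteError as "your write may or may not have landed", re-read, and re-apply the modification. update_with_retry, read_slot, write_slot, and modify are placeholders here, not real APIs:

```python
class UncoordinatedWriteError(Exception):
    pass

def update_with_retry(read_slot, write_slot, modify, max_attempts=3):
    for _attempt in range(max_attempts):
        current = read_slot()
        try:
            write_slot(modify(current))
            return
        except UncoordinatedWriteError:
            # recovery has (hopefully) repaired the slot by now; re-read and retry
            continue
    raise RuntimeError("gave up after %d attempts" % max_attempts)

# toy in-memory "slot", just to show the retry loop
slot = {"value": "v4-contents"}
update_with_retry(lambda: slot["value"],
                  lambda v: slot.update(value=v),
                  lambda v: v + "+edit")
print(slot["value"])  # -> "v4-contents+edit"
```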
The presence of the #269 bug would interact with recovery code in a bad way: it is likely that many of these shares are now immutable, and thus recovery would be unable to get a quorum of the new version. But since we could easily get into this state without #269 (by terminating a client just after it sends out the first share update, before it manages to get 'k' share-update messages out), this remains a serious problem.