Opened at 2008-01-11T05:10:22Z
Closed at 2008-05-14T20:21:51Z
#272 closed defect (fixed)
implement mutable-file recovery: update can't recover from <k new shares
| Reported by: | warner | Owned by: | warner |
|---|---|---|---|
| Priority: | critical | Milestone: | 1.1.0 |
| Component: | code-mutable | Version: | 0.7.0 |
| Keywords: | mutable recovery | Cc: | |
| Launchpad Bug: | | | |
Description (last modified by zooko)
If a mutable slot has one or two shares at a newer version than the rest, our current SDMF update code is unable to recover: all attempts to write a newer version will fail with an UncoordinatedWriteError. The necessary fix is twofold. The first part is to implement mutable-slot recovery: replace all shares with a version that is newer than the loner, by promoting the older-but-more-popular-and-also-recoverable version to a later seqnum. The second part is to jump directly to recovery when the post-query pre-write sharemap (in Publish._got_all_query_results) determines that someone has a newer version than the one we want to write, then try the read-and-replace again or defer to the application.
Oh how I want the thorough unit tests envisioned in #270! They would have caught this earlier.
This problem happened to Peter's directory because of the accidental nodeid-change described in #269. But it could occur without that problem with the right sort of network partition at the right time.
Ok, on to the bug:
Suppose that share1 is placed on peer1, share2 on peer2, etc., and that we're doing 3-of-10 encoding. Now suppose that an update is interrupted in such a way that peer9 is updated but peers 0 through 8 are not. Now most peers are at version 4, but that one loner peer is at version 5. (In the case of Peter's directory, most of the servers were upgraded incorrectly and changed their nodeids, thus changing the write_enablers, and the loner peer was the one who was either upgraded correctly or wasn't upgraded at all.)
If we read this slot, we read 'k' plus epsilon shares (four or five, I think). We see everyone is at version 4 (rather, everyone we see is at version 4), so we conclude that 4 is the most recent version. This is fine, because in the face of uncoordinated writes, we're allowed to return any version that was ever written.
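To make that rule concrete, here is a rough sketch of the read pass's version-selection step, under the assumption that we have tallied how many distinct shares we saw for each version. The data shapes are hypothetical and this is not the real retrieve code.

```python
# Rough sketch of "return any recoverable version"; the share_counts shape
# is an assumption made for illustration, not the real mutable-read code.

def choose_version(share_counts, k):
    """share_counts: {(seqnum, roothash): number of distinct shares seen}.
    Return the highest-seqnum version with at least k distinct shares,
    or None if nothing seen so far is recoverable (keep querying)."""
    recoverable = [version for version, count in share_counts.items()
                   if count >= k]
    if not recoverable:
        return None
    return max(recoverable, key=lambda version: version[0])

# With 3-of-10 encoding, seeing four seqnum=4 shares (and not the loner
# seqnum=5 share) makes version 4 recoverable, so the read legitimately
# concludes that 4 is the most recent version.
```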
Now, the next time we want to update this slot, mutable.Publish gets control. This does read-before-replace, and it is written to assume that we can use a seqnum one larger than the value that was last read. So we prepare to send out shares at version "5".
We start by querying N+epsilon peers to build up a sharemap of the seqnum+roothash reported by every server we're thinking of sending a share to, so we can decide which shares to send to whom and build up the testv list. This sharemap is processed in Publish._got_all_query_results, which has a check that raises UncoordinatedWriteError if one of the queries reports the existence of a share that is newer than the seqnum we already knew about. In this case, the response from peer9 shows seqnum=5, which is equal-or-greater than the "5" that we wanted to send out. This is evidence of an uncoordinated write, because our read pass managed to extract version 4 from the grid, but our query pass shows evidence of a version 5. We can tolerate lots of 5s and one or two 4s (because then the read pass would have been unable to reconstruct a version 4, would have kept searching, and would eventually have reconstructed a version 5), but we can't tolerate lots of 4s and one or two 5s.
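The check amounts to something like the following sketch; is_surprising is a hypothetical helper, and the real logic in Publish._got_all_query_results differs in detail.

```python
# Sketch of the asymmetric "surprise" test.  read_seqnum is the version the
# read pass reconstructed; the publish intends to write read_seqnum + 1.

def is_surprising(query_responses, read_seqnum):
    """query_responses: iterable of (peerid, shnum, seqnum) tuples from the
    post-query pre-write pass.  Any share at or above the seqnum we intend
    to write is evidence of a write we never saw."""
    intended_seqnum = read_seqnum + 1
    return any(seqnum >= intended_seqnum
               for (_peerid, _shnum, seqnum) in query_responses)

# Here read_seqnum == 4, so intended_seqnum == 5, and peer9's seqnum=5 share
# trips the check.  The mirror image (read pass reconstructed 5, a few stale
# seqnum=4 shares remain) does not, which is why lots of 5s plus a few 4s is
# tolerable but lots of 4s plus a few 5s is not.
```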
So this is the problem: we're spooked into an UncoordinatedWriteError by the loner new share, and since we don't yet have any recovery code, we can't fix the situation. If we tried to write out the new shares anyway, we could probably get a quorum of our new seqnum=5 shares, and if we then managed to reconstruct version 5 at the next update, we'd push out seqnum=6 shares to everybody and the problem would go away.
We need recovery to handle this. When the UncoordinatedWriteError is detected during the query phase, we should pass the sharemap to the recovery code, which should pick a version to reinforce (according to the rules we came up with in mutable.txt), and then send out shares as necessary to make that version the dominant one. (If I recall correctly, I think that means we would take the data from version 4, re-encode it into a version 6, then push 6 out to everyone).
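A minimal sketch of that recovery step, assuming we can re-download the chosen version and re-encode it; download_version, encode_shares, and push_share are hypothetical helpers, and the real version-selection rules in mutable.txt are more careful than the simple popularity contest shown here.

```python
# Minimal recovery sketch.  All helpers are hypothetical stand-ins for the
# real download/encode/push machinery.

def recover(sharemap, download_version, encode_shares, push_share):
    """sharemap: {peerid: set of (shnum, seqnum, roothash)} from the query
    phase.  Re-encode the most popular version at a seqnum higher than
    anything seen, then push the new shares to every peer."""
    # Tally how many shares exist for each (seqnum, roothash) version.
    counts = {}
    for shares in sharemap.values():
        for (_shnum, seqnum, roothash) in shares:
            counts[(seqnum, roothash)] = counts.get((seqnum, roothash), 0) + 1

    # Pick the version to reinforce (simplified: just the most popular one).
    winner = max(counts, key=lambda version: counts[version])
    highest_seqnum = max(seqnum for (seqnum, _roothash) in counts)

    # Re-encode the winner's data under a seqnum that beats everything seen:
    # version 4's data becomes version 6 when a loner version 5 exists.
    data = download_version(winner)
    new_seqnum = highest_seqnum + 1
    for peerid, share in encode_shares(data, new_seqnum):
        push_share(peerid, share)
```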
Once recovery is done, the Publish attempt should still fail with an UncoordinatedWriteError, as a signal to the application that their write attempt might have failed. The application should perform a new read and possibly attempt to make its modification again.
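From the application's side, that contract turns into a read-modify-retry loop, roughly as in this sketch; read_slot, publish_slot, and the bounded retry count are placeholders rather than the real API.

```python
# Application-side retry sketch.  The exception class is a placeholder for
# allmydata.mutable.UncoordinatedWriteError; the helpers are hypothetical.

import time

class UncoordinatedWriteError(Exception):
    pass

def modify_with_retry(read_slot, publish_slot, modify, max_tries=3):
    """Re-read and re-apply the modification whenever the publish reports an
    uncoordinated write (which recovery may already have repaired)."""
    for attempt in range(max_tries):
        current = read_slot()
        try:
            publish_slot(modify(current))
            return
        except UncoordinatedWriteError:
            # Our write may or may not have landed; back off briefly, then
            # read the (now-recovered) slot again and reapply the change.
            time.sleep(2 ** attempt)
    raise UncoordinatedWriteError("gave up after %d attempts" % max_tries)
```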
The presence of the #269 bug would interact with recovery code in a bad way: it is likely that many of these shares are now immutable, and thus recovery would be unable to get a quorum of the new version. But since we could easily get into this state without #269 (by terminating a client just after it sends out the first share update, before it manages to get 'k' share-update messages out), this remains a serious problem.
Change History (18)
comment:1 Changed at 2008-02-08T02:52:40Z by warner
- Summary changed from "mutable update can't recover from <k new shares" to "implement mutable-file recovery: update can't recover from <k new shares"
comment:2 Changed at 2008-03-04T19:45:46Z by zooko
- Description modified (diff)
- Milestone changed from undecided to 0.9.0 (Allmydata 3.0 final)
comment:3 Changed at 2008-03-04T20:15:26Z by zooko
comment:4 Changed at 2008-03-04T20:16:19Z by zooko
I mean #232 -- "Peer selection doesn't rebalance shares on overwrite of mutable file."
comment:5 Changed at 2008-03-10T19:39:53Z by zooko
- Owner set to zooko
- Status changed from new to assigned
comment:6 Changed at 2008-03-12T18:54:21Z by zooko
- Milestone changed from 0.9.0 (Allmydata 3.0 final) to 0.10.0
comment:7 Changed at 2008-03-20T01:43:25Z by zooko
Rob hit a bad case today which might have been caused by this issue. He was doing parallel, uncoordinated writes to directories (by accident), and he reported:
<robk-allmydata> allmydata.encode.NotEnoughPeersError: Unable to find a copy of the privkey [13:29]
<robk-allmydata> allmydata.mutable.UncoordinatedWriteError: I was surprised!
<robk-allmydata> allmydata.mutable.UncoordinatedWriteError: somebody has a newer sequence number than us
<robk-allmydata> allmydata.encode.NotEnoughPeersError: Unable to find a copy of the privkey
<robk-allmydata> allmydata.mutable.UncoordinatedWriteError: somebody has a newer sequence number than us
<robk-allmydata> allmydata.mutable.UncoordinatedWriteError: I was surprised!
<robk-allmydata> allmydata.mutable.UncoordinatedWriteError: somebody has a newer sequence number than us
comment:8 Changed at 2008-03-21T22:30:37Z by zooko
- Milestone changed from 0.10.0 to undecided
comment:9 Changed at 2008-03-21T22:30:55Z by zooko
- Milestone changed from undecided to 1.0
comment:10 Changed at 2008-03-21T23:27:23Z by zooko
For Milestone 1.0.0, let's see if we can figure out what happened with Rob's client there, and see whether it is likely to cause any problems for clients in Tahoe v1.0.0; if not, then bump this ticket.
comment:11 Changed at 2008-03-22T02:06:35Z by warner
that "I was surprised" is certainly an indication of detected collision. Not being able to find the encrypted privkey.. oh, I think I understand that one. The encrypted private key is stored at the end of the share, since it's kind of large, and the dominant use case (reading) doesn't need to retrieve it. On the first pass (the "reconnaissance phase"), we find out what the data size is, and therefore which offset the encprivkey is at. On the second pass, we use that offset to retrieve the encprivkey. But if someone sneaks in between our first and second passes (and writes new data, of a different size), then the key will have moved, and our attempt to read the encprivkey will grab the wrong data. I'm not sure what exactly will happen (i.e. how we validate the encprivkey), but somewhere in there it ought to throw a hash check exception and drop that alleged copy of the privkey, causing the code to try and find a better one. If this process discards all potential copies, the upload will fail for lack of a privkey.
So, if Rob could identify the exact time of the collision, we could probably look at the server logs and confirm the server-side collision sequence.
The rule remains: Don't Do That. :)
comment:12 Changed at 2008-03-22T22:56:24Z by zooko
Brian:
Thanks for the analysis. The collision sequence that you describe should result in this error message once, but then subsequent attempts to overwrite should succeed. But I thought Rob described a problem where the directory could not be written again. Perhaps I imagined that last part.
If it is a transient problem then I am satisfied with this for allmydata.org "Tahoe" 1.0.
The rule remains: Don't Do That. :)
What? No it doesn't! We've decided that it is okay for users of Tahoe to do this, as long as they don't do it with too many simultaneous writers at too fast a rate, and they don't mind all but one of their colliding writes disappearing without a trace.
Right?
comment:13 Changed at 2008-03-24T18:33:09Z by warner
You're right. The rule is: Don't Do That Very Much. :)
comment:14 Changed at 2008-03-24T22:25:47Z by zooko
- Milestone changed from 1.0.0 to 1.1.0
Okay, we talked about this and concluded that what Rob saw could be explained by the uncoordinated writes that his script accidentally generated. We updated the error messages to be clearer about what was surprising about the sequence numbers. Bumping this ticket out of Milestone 1.0.0.
comment:15 Changed at 2008-04-24T23:28:12Z by warner
- Component changed from code-encoding to code-mutable
comment:16 Changed at 2008-05-12T19:52:15Z by zooko
- Owner changed from zooko to warner
- Status changed from assigned to new
Brian: should this ticket be updated to include anything from your recent refactoring of the mutable file upload/download code?
comment:17 Changed at 2008-05-14T19:59:35Z by warner
It's almost closed. I believe the new servermap scheme should fix this case. I'm working on a test case to specifically exercise it now.
comment:18 Changed at 2008-05-14T20:21:51Z by warner
- Resolution set to fixed
- Status changed from new to closed
Closed by the test added in ff0b9e25499c7e5f.
See also the related issue #232, "Peer selection doesn't rebalance shares on overwrite of mutable file" (http://allmydata.org/trac/tahoe/ticket/232).