#347 closed defect (fixed)

reduce write-collision clobbering: use 'eq' instead of 'le' in test+set

Reported by: warner Owned by:
Priority: major Milestone: 1.1.0
Component: code-mutable Version: 0.8.0
Keywords: coordination Cc:
Launchpad Bug:

Description

Over lunch today, as we discussed uncoordinated writes in mutable files, Zooko astutely identified a significant problem: two clients who write to a mutable file at the same time have a high probability of losing one writer's changes. Specifically, writers Alice and Bob write at the same time, and much of the time Alice will believe that her changes have been successfully placed, Bob will see an UncoordinatedWriteError, and the file will wind up with Bob's changes but not Alice's. We think there is an equal probability that Bob will see the UCW but the file will have Alice's changes (but not Bob's), in which case the usual retry response to a UCW will result in a file with both changes.

The general problem is that our design goals have been shifting around. When we first designed the SMDF update algorithm, we assumed that the Prime Coordination Directive (i.e. "don't do that") was enough, and that bad behavior in the face of uncoordinated writes was excusable. Therefore our goal was to maximize the health of *some* version of the file (not necessarily the most recent one) when a UCW occurred. However, the past few months have shown that coordinating writes is a hassle, and developers of the next few layers above mutable files would be much happier if tahoe were to handle this requirement for then. In addition, we defined a number of fallback strategies, all of which serve to improve the behavior when UCW occurs.

At some point, we started thinking that these fallbacks were a suitable replacement for write coordination. This is wrong.

We designed the fallbacks to improve file health, not to obtain the consistency/atomic-update characteristics that developers would like to see in mutable files. Specifically it would be nice if there were a reliable test-and-set operation for mutable files, such that you could be sure that either 1) your version wins, or 2) your version lost. But our fallbacks were not designed to do this.

The specific problem is that our testv-and-reav-and-writev calls are being made with a test vector comparison operator that instructs the storage server to apply the write if the old checkstring is "le" (less than or equal to) the new one. The sequence number is at the MSB end of the checkstring, followed by the root hash. The most common case is for the new checkstring to have a seqnum that is one higher than the old one, and for the old roothash to be the same as the one that we saw on that server during our survey pass. This case is accepted by the test vector, so the write goes through. (note that we compare the old roothash after the testv+writev returns, rather than putting it into the test vector).

However, if some other writer Bob has raced ahead of us/Alice, they will have written a new version of the share into place: say they replaced the previous v0 with v1b, and now we're trying to write our v1a into the same place. The roothash that we retrieve will be different than what we surveyed (since it is v1b instead of v0), so we'll signal UCW when we're done. But 50% of the time, our v1a roothash will be higher than Bob's v1b roothash, and our write vectors will be applied despite the UCW. In this case, we will overwrite Bob's data, and Bob (who saw no errors and has retired the write) will have gone away, resulting in the loss of Bob's changes.

In the other 50% of the time, Bob's v1b roothash will be higher than our v1a roothash, which means that our write vectors will be rejected, preserving Bob's changes. Alice will see the UCW and retry, merging her changes with Bob's, and everybody will be happy.

So the net result is that overlapping writes have a 50% chance of losing one set of changes.

We're guessing that we can fix this by changing the test vector to be (eq oldseqnum oldroothash) instead of (le newseqnum newroothash). This would give the same UCW function (i.e. any combinations that would have resulted in a UCW in the 'le' case will also result in a UCW in the 'eq' case), but should refrain from doing the share writes in all of the UCW cases instead of only a subset. This would result in Alice's writes leaving Bob's alone. In the 50% change-losing case, Alice would see the UCW as before (and Bob would not), but v1b would still be in place, allowing Alice to retry and merge without losing the data.

The reason we didn't make this choice earlier is that, without recovery, *not* overwriting the shares will leave them in a less healthy state than overwriting them with a convergent version. Basically it is easier for everybody to agree on which version should be reinforced by having a strong total ordering on those versions, which the 'le' case provides. Without this, different clients will be requesting different things, and there will be a higher chance of the update phase finishing with a variety of shares (and in this case, more variety means less robust).

If we assume that recovery will be performed, then this loss of robustness isn't too serious, and switching to the 'eq' form seems like a better idea.

We need to analyze this carefully first. We spent several weeks going back and forth on this design when we first made it, so a change like this is deserving of more discussion. Also, we need to reevaluate the algorithm in light of our shifting design goals: if we want mutable files to give good update properties *without* coordination, then we must keep that in mind when we review the design.

Change History (2)

comment:1 Changed at 2008-03-25T22:50:33Z by warner

  • Milestone changed from undecided to 1.1.0

This issue is making us nervous, so we'd like to see it be a priority for a soon-after-1.0 release.

comment:2 Changed at 2008-04-24T23:30:16Z by warner

  • Component changed from code-encoding to code-mutable
  • Resolution set to fixed
  • Status changed from new to closed

The new mutable-file refactoring (moving to 'servermaps') now does eq instead of le, so this issue is resolved. The recovery code tracked in #272 may wish to have 'le' instead, but it seems better to avoid the confusing write in the first place and let recovery deal with the slightly reduced file health instead.

Note: See TracTickets for help on using tickets.