[tahoe-dev] mutable file and directory safety in allmydata.org "tahoe" 0.9

Brian Warner warner-tahoe at allmydata.com
Wed Mar 12 12:30:34 PDT 2008


> So, problem 1 with the "update" method that you described (in the  
> letter to which this is a reply) is that, after you read directory  
> version N, and then write back directory version N+1, and someone  
> else has also (previously -- not even at the same time as you!)  
> written up a directory version N+1, and if the "root hash" of your
> N+1 happens to sort higher than the root hash of their N+1
> lexicographically, then you will not get any indication that there  
> ever was their version N+1 -- instead your version N+1 will silently  
> overwrite theirs.

Zooko and I looked at the code, and I concluded that this isn't a problem. At
http://allmydata.org/trac/tahoe/browser/src/allmydata/mutable.py#L1559 (in
Publish._got_write_answer), the publish operation compares the retrieved
"checkstring" (which includes the seqnum and root hash that was just
replaced) against the version that we last saw on that server. The publish
operation is "surprised" (i.e. will return UncoordinatedWriteError) if the
checkstring differs. The write vector is applied if the testv conditions are
met: if the seqnum is higher, or if the seqnum is equal and the roothash is
higher (the condition Zooko identified above). However, the publisher can be
surprised even if the write vector was applied: that's why the remote
operation returns the values examined by the test vectors in addition to the
write-applied/not-applied decision. The method is called
testv_and_readv_and_writev instead of just testv_and_writev.
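
To make the interaction concrete, here is a minimal sketch (in Python, but
not the actual mutable.py code; the names and the (seqnum, roothash) tuple
layout are illustrative) of the server-side test vector and the client-side
surprise check described above:

  class UncoordinatedWriteError(Exception):
      pass  # placeholder standing in for tahoe's real exception

  def server_testv_and_readv_and_writev(current, proposed):
      # 'current' is the (seqnum, roothash) the server holds, 'proposed'
      # is the version the writer wants to install.
      old_checkstring = current
      # Test vector: accept a strictly newer seqnum, or break a seqnum
      # tie in favor of the lexicographically higher roothash.
      applied = (proposed[0] > current[0] or
                 (proposed[0] == current[0] and proposed[1] > current[1]))
      # The "readv" part: always report what was there before, whether
      # or not the write vector was applied.
      return applied, old_checkstring

  def client_check_answer(last_seen, applied, old_checkstring):
      # The publisher is "surprised" whenever the version it just
      # replaced (or failed to replace) is not the version it read
      # earlier from that server, independent of 'applied'.
      if old_checkstring != last_seen:
          raise UncoordinatedWriteError("server held %s, expected %s"
                                        % (old_checkstring, last_seen))
      return applied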

> Problem 2 is that when you do the read-back after detecting a UCW, if  
> the first 3 servers that you talk to all have your version N+1, then  
> you will treat that as the "current" version N+1, generate a version
> N+2, and then upload your N+2, which was made without knowledge of the
> other person's N+1.  Problem 2 is the one that I think would take a  
> few weeks to do right.  Ideally, your version N+2 should probably  
> come with a set of root hashes of predecessors, and the test-and-set  
> should say "This is the set of versions that I have already seen and  
> am intending to supersede -- if your current version is one of them  
> then please overwrite it with my new version.  If your current  
> version is not one of them then please do not overwrite it, and  
> return an error.".
> 
> You and I tried to solve this one before and did not yet come up with  
> a satisfactory solution.

Yeah, that one is trickier. LMDF (with a complete version tree, rather than a
somewhat linear version history) is probably the solution.
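
As an illustration only (nothing like this is implemented in tahoe today),
the predecessor-set test-and-set from the quoted proposal might look roughly
like this, with each new version carrying the set of root hashes it was
derived from:

  class VersionConflict(Exception):
      pass

  def conditional_overwrite(held, new_version, superseded_roothashes):
      # 'held' is the (seqnum, roothash) currently stored by the server.
      # Overwrite only if the writer has already seen the held version.
      if held[1] in superseded_roothashes:
          return new_version
      # Otherwise refuse, and tell the writer what it missed so it can
      # read that version and merge before trying again.
      raise VersionConflict(held)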

Restricting ourselves to SMDF, I think this is actually not that bad. First
off, #272 recovery should make all writers start by reinforcing a single
version, and the randomized backoff that we recommend in response to a UCW
should reduce the chance of a second collision happening. That should reduce
the chance that the shares are in an inconsistent state when you perform the
next read, and effectively give everybody a chance to apply their deltas one
at a time.
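
A sketch of that randomized backoff (reusing the UncoordinatedWriteError
placeholder from the earlier sketch), with made-up delay bounds and a
synchronous retry loop standing in for tahoe's Deferred-based code:

  import random, time

  def publish_with_backoff(attempt_write, max_attempts=5):
      # attempt_write() is assumed to re-read the current version,
      # re-apply the caller's delta, and publish the result.
      for attempt in range(max_attempts):
          try:
              return attempt_write()
          except UncoordinatedWriteError:
              # Back off for a random, growing interval so that two
              # writers who just collided are unlikely to collide again.
              time.sleep(random.uniform(0, 2 ** attempt))
      raise UncoordinatedWriteError("gave up after %d attempts"
                                    % max_attempts)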

Secondly, publish needs to replace all existing shares, so we have to talk to
at least N servers (not just k), and we must first read from each of those
servers to determine what their current seqnum+R (root hash) is. So even if
the shares
are in an inconsistent state, it would be hard to fail to see the N+1(b)
share. Of course, we need to give the next layer up more information to deal
with this properly: as long as we only return a single version (either N+1(a)
or N+1(b)), they don't have a chance to do any sort of merging, and one of
the variants will get lost.
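
For example (the function and the shape of 'answers' are hypothetical), the
publish/read machinery could hand the next layer every distinct version it
saw at the highest seqnum, instead of collapsing them to a single winner:

  from collections import defaultdict

  def concurrent_versions(answers):
      # answers: dict mapping server -> (seqnum, roothash) as read just
      # before publishing.
      by_seqnum = defaultdict(set)
      for seqnum, roothash in answers.values():
          by_seqnum[seqnum].add(roothash)
      top = max(by_seqnum)
      # Two or more roothashes at the top seqnum means uncoordinated
      # writers each produced a "version N+1"; return all of them so the
      # caller has a chance to merge rather than silently drop one.
      return top, sorted(by_seqnum[top])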

cheers,
 -Brian

