[tahoe-dev] Surprise shares

Brian Warner warner-tahoe at allmydata.com
Tue Dec 9 11:06:15 PST 2008


On Tue, 09 Dec 2008 11:40:38 +0100
Francois Deppierraz <francois at ctrlaltdel.ch> wrote:

> While playing with a Tahoe storage grid running 1.2.0 storage nodes, a
> trunk client and the blackmatch fuse implementation, I possibly did
> not follow the "Don't Do That" coordination directive on mutable
> files.
> 
> The result is that now each time I try to create a new dirnode, even
> an unlinked one with "tahoe mkdir", it fails with an
> UncoordinatedWriteError and the following message gets logged.
> 
>   WEIRD they had shares [0] that we didn't know about


Oh, neat. Nope, it wasn't your fault; it's a bug in Tahoe that's triggered by
one of your storage servers running out of disk space. I've added a note about
your experience to ticket #546. #546 is a different expression of the same
problem, triggered by some other mutable-file bugs, which hit us last week
when we added a large number of new nodes to our production grid.

Basically, we pick a home for all 10 shares and send out "write" messages for
all of them. Then one of your servers announced that its write had failed
because it was out of space:

   * 11:04:54.839 [538]: UNUSUAL error while writing shares [6] to peerid 3dszcosf FAILURE:
   exceptions.OSError: [Errno 122] Disk quota exceeded: '/home/tahoe/.tahoe/storage/shares/dt/dt2yapm26hivzo5rnb2wdmetsy'
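
For concreteness, here's a rough Python sketch of the publish step as
described above: pick a home for every share, then fan out one write per
share and note which writes failed. The names (assign_homes, publish,
send_write) are illustrative only, not the real mutable-publish code:

    from itertools import cycle

    def assign_homes(share_numbers, servers):
        # Map each share number to a server, round-robin over the server list.
        return {shnum: server
                for shnum, server in zip(share_numbers, cycle(servers))}

    def publish(share_numbers, servers, send_write):
        # Send one write per share; return the goal plus the shares whose
        # write failed (e.g. because a server reported "out of space").
        goal = assign_homes(share_numbers, servers)
        failed = [shnum for shnum, server in goal.items()
                  if not send_write(server, shnum)]
        return goal, failed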

To handle that, your node decided to send share#6 to a different server, and
picked "stcz":

    * 11:04:54.841 [542]: current goal: after update: , sh0 to [stczs25o], sh1 to [vcqcuxod], sh2 to [om7xqw7k], sh3 to [j7bojlfb], sh4 to [fv4p67ck], sh5 to [qjmzhhda], sh6 to [stczs25o], sh7 to [7vlsqics], sh8 to [cfy2dvty], sh9 to [dlbsgdt4]
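
Here's a hedged sketch of the re-homing step that produces that goal: when a
write fails, the failed share is simply handed to another writable server,
without checking whether that server already holds one of our other shares.
Again the names are made up for illustration:

    def pick_replacement(failed_shnum, goal, writable_servers):
        # BUG (as described next): nothing excludes servers that already
        # appear in `goal`, so the replacement for sh6 can be the very
        # server that was already given sh0.
        for server in writable_servers:
            if server != goal[failed_shnum]:   # only avoids the full server
                return server

    # Mirroring the log: sh0 went to stcz, sh6's first home (3dsz) is full.
    goal = {0: "stczs25o", 6: "3dszcosf"}
    writable = ["stczs25o", "vcqcuxod", "om7xqw7k"]
    goal[6] = pick_replacement(6, goal, writable)
    assert goal[6] == goal[0]   # the collision: two shares, one server

A safer picker would also skip any server already present in goal.values().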

But we'd already sent share#0 to node stcz, so by the time the second "write"
message arrived there, it had already accepted the first write. The response
to the second write tells the client that a share is present which wasn't
there when the earlier servermap-update step ran, so the client incorrectly
concludes that someone else (an uncoordinated simultaneous writer) must have
put it there.

Basically the client code is forgetting that it already used that server, and
gets surprised by its own share.
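
In code terms, the surprise check looks roughly like this (illustrative
names, not Tahoe's actual classes): after each write, the server reports
which shares it now holds, and anything the pre-write servermap didn't list
for that server is treated as evidence of another writer:

    class UncoordinatedWriteError(Exception):
        pass

    def check_for_surprise(peerid, reported_shnums, servermap):
        # servermap: {peerid: set of shnums known before this publish began}
        expected = servermap.get(peerid, set())
        surprise = set(reported_shnums) - expected
        # The bug: shares this same publish already sent to `peerid` earlier
        # (like sh0 on stcz) are not in `expected`, so we flag our own share.
        if surprise:
            raise UncoordinatedWriteError(
                "WEIRD they had shares %s that we didn't know about"
                % sorted(surprise))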

I only discovered this problem a few days ago, and I'm still trying to think
of a good solution. One fix would probably be to have the publishing client
tolerate "surprise" shares that match the roothash it was trying to write
anyway ("I don't know about this, but I like it"). A better one would be to
have the client remember which shares it sent to which servers, and tolerate
"surprise" shares that match that earlier write ("oh, right, I remember
doing this").
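
A rough sketch of how those two fixes could layer onto the check above (again
with made-up names and data shapes; `my_writes` records what this publish
already sent, and `roothash_of` looks up the roothash the server reported for
a share):

    def check_for_surprise_fixed(peerid, reported_shnums, servermap,
                                 my_writes, my_roothash, roothash_of):
        expected = servermap.get(peerid, set())
        surprise = set(reported_shnums) - expected

        # "oh, right, I remember doing this": ignore shares that this same
        # publish already sent to this server.
        surprise -= my_writes.get(peerid, set())

        # "I don't know about this, but I like it": ignore shares whose
        # roothash matches the version we're writing anyway.
        surprise = {sh for sh in surprise
                    if roothash_of(peerid, sh) != my_roothash}

        if surprise:
            raise UncoordinatedWriteError(   # as defined in the sketch above
                "shares %s look like a real uncoordinated write"
                % sorted(surprise))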

Incidentally, this is a specific instance of a more general problem: our
mutable-file publishing code still doesn't tolerate full-server errors very
well. On our allmydata.com production network, we've been carefully managing
the "full" servers to keep enough space free for (small) mutable files, to
avoid triggering bugs like this. On a 1TB drive, this means we mark the server
node as read-only while it still has 20GB of space left. (The problem is worse
for us because we have some very old clients connected to the grid, which are
even less tolerant of mutable-file write errors.)
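
For what it's worth, the read-only decision itself is simple enough to script
on the operator side; here's a hedged sketch (not a Tahoe feature, just a
statvfs check against the ~20GB reserve mentioned above):

    import os

    RESERVE_BYTES = 20 * 1024**3   # keep ~20GB free for small mutable writes

    def should_go_readonly(storage_dir, reserve=RESERVE_BYTES):
        st = os.statvfs(storage_dir)
        free_bytes = st.f_bavail * st.f_frsize
        return free_bytes < reserve

    if __name__ == "__main__":
        # path shape taken from the log excerpt above
        print(should_go_readonly("/home/tahoe/.tahoe/storage"))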

cheers,
 -Brian



#546: http://allmydata.org/trac/tahoe/ticket/546

