id summary reporter owner description type status priority milestone component version resolution keywords cc launchpad_bug
893 UCWE when mapupdate gives up too early, then server errors require replacement servers warner "The Incidents reported in #877 (by Zooko, referring to mutable write errors experienced by nejucomo) indicate a thorny problem that is distinct from both #877 (caused by a reentrancy error) and #540 (caused by a logic bug that affects small grids, where a publish wraps around the peer circle).

Here's the setup:

 * mapupdate(MODE_WRITE) wants to find all shares, so they can all be updated. Ideally, the shares are concentrated at the beginning of the permuted peerlist, on the first N servers.
 * To avoid traversing the whole grid, mapupdate(MODE_WRITE) has a heuristic: if we've seen enough shares, and we've also seen a span of contiguous servers (in permuted order) that tell us they do not have a share, then we stop the search. The size of this span was intended to be N+epsilon (where 'epsilon' is a tradeoff between performance and safety, and is set to k). Unfortunately 1.5.0 had a bug, and the span size was set to k+epsilon instead. (A simplified sketch of this stopping rule appears below.)

In the Incident:

 * mapupdate sent out 10 queries, but decided to finish before all the responses had returned, because it found a boundary early. I'm not entirely sure how the sharemap was shaped, but it looks like it stopped with a span of 3 servers (""found our boundary, 11000""), where k=3 and N=10, and returned a sharemap with 8 shares in it, some doubled up (I think there were 5 servers involved).
 * About 150ms after mapupdate finished, one more response came back (from {{{w6o6}}}, with two shares), but it was ignored.
 * Publish starts and sends out updates to 7 servers. Unfortunately, two of these (both owned by secorp) experienced ""Permission Denied"" errors when attempting to write out the new shares, suggesting a configuration error (maybe the tahoe node process is owned by the wrong user).
 * So Publish fails over to new servers. Some of the new servers it picks suffer from the same error, so the failover process repeats a few times.
 * Finally, it fails over to the server {{{w6o6}}} and sends it a new share, thinking that w6o6 has no shares (because the servermap was not updated to include w6o6's late response).
 * w6o6 then responds with a writev failure, because it contains shares that the test vector did not expect, causing a UCWE error.
 * Most of the shares were updated, so the write may have actually happened, even though UCWE was raised.

The biggest problem with this failure is that it is persistent. We don't record any information that would tell a subsequent operation to look further for existing shares, so exactly the same thing will happen the next time we try to modify or repair the directory. If secorp's servers weren't throwing errors, then I think the condition would eventually fix itself: new shares would be placed on his servers, bridging the span of servers without shares, and then later mapupdate calls would keep going until they'd really seen all of the shares.

Recently ([changeset:eb1868628465a243], 08-Dec-2009) I fixed mapupdate to use N+epsilon instead of k+epsilon. But the incident report suggests that it stopped with a span of only 3 no-share servers. Looking more closely at the code, I think it only waits for a span of epsilon (not k+epsilon or N+epsilon), and that [changeset:eb1868628465a243] changed something different. I don't know whether the thing that was changed would have prevented this issue.
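To make the stopping rule concrete, here is a minimal, hypothetical sketch of the heuristic described above. It is not the actual servermap-update logic in {{{allmydata/mutable/servermap.py}}} (which is considerably more involved); the function name, the {{{responses}}} representation, and the 'at least k shares seen' condition are invented for illustration.

{{{
#!python
def may_stop_search(responses, shares_seen, k, N, epsilon):
    """Illustrative stopping rule for a MODE_WRITE mapupdate.

    'responses' is the permuted server list: True means the server answered
    and reported at least one share, False means it answered and reported no
    shares, None means its answer has not arrived yet.  We stop only if we
    have seen enough shares *and* a contiguous run of N+epsilon servers that
    all answered 'no shares'.  The 1.5.0 bug amounted to using k+epsilon
    here, and the incident suggests the code effectively waited for only
    epsilon.
    """
    if shares_seen < k:         # stand-in for 'enough shares'; the real condition differs
        return False
    required_span = N + epsilon
    run = 0
    for has_share in responses:
        if has_share is False:  # answered: definitely no share here
            run += 1
            if run >= required_span:
                return True
        else:                   # holds a share, or has not answered yet
            run = 0
    return False
}}}

The point of the sketch is only the shape of the check: the run of definitely-empty servers has to be long enough, and servers that have not yet answered cannot count toward it.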
It's possible that this is a manifestation of #547 (mapupdate triggers on a false boundary), or #549, or one of the other problems described in #546. In general, we need to query more servers. But even if we increase the span size or epsilon or whatever, there will always be some weird situation that could be handled better if we queried more servers. We'd like something more adaptive: if the code hits UCWE because it didn't try hard enough, then it should try harder next time.

How should we deal with this? We need something to persist from one operation on a given mutable filenode to the next, some sort of hint that says ""Hey, last time we were surprised, so next time you should look further"". Or something that tells us that we learned about shares on servers X+Y+Z, so the next time we do a mapupdate, we shouldn't consider it complete until we've gotten responses from those servers (in addition to any others that we might decide to query).

The most natural place to keep this state would be on the mutable filenode instance. This would help with UCWE that occurs inside a modify() call, because the same filenode is used for each retry, but in general filenodes are pretty short-lived. We don't want to keep the mutable filenode around in RAM forever. Maybe an LRU cache that keeps filenodes around for a few minutes, so that users who experience UCWE and retry the operation can benefit from recent history (a sketch of this kind of cache is below).

A storage protocol that included ""where-are-the-other-shares"" hints (#599) would help: this would improve the reliability of mapupdate, since the persistent information would be kept on the storage servers, next to the shares. A publish process which rebalanced the shares (#232, or #661/#543/#864) might help by filling in the gaps, except that here the gap was caused by a batch of servers all suffering from the same configuration problem.

The right answer probably lies in having UCWE trigger an immediate repair, and having repair fill in the gaps. But it'd be nice if there were a way to stash some information on the shares before the gap that lets later operations know they should look past it.
" defect new critical soon code-mutable 1.5.0 availability preservation upload repair ucwe
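To illustrate the 'remember that we were surprised' idea from the description, here is a minimal, hypothetical sketch of a per-client hint cache. None of this is existing Tahoe-LAFS code; the class name, method names, and expiry policy are invented, and a real design would still need to decide who owns the cache and how its hints reach the mapupdate logic.

{{{
#!python
import time
from collections import OrderedDict

class ServermapHintCache:
    """Hypothetical LRU cache, keyed by storage index, recording servers that
    were recently seen holding shares (or that surprised us with a UCWE), so
    that a retry a few minutes later refuses to declare a MODE_WRITE
    mapupdate complete until those servers have answered."""

    def __init__(self, max_entries=100, lifetime=10*60):
        self._entries = OrderedDict()  # storage_index -> (timestamp, set of serverids)
        self._max_entries = max_entries
        self._lifetime = lifetime      # seconds before a hint is forgotten

    def add_servers(self, storage_index, serverids):
        now = time.time()
        _, known = self._entries.pop(storage_index, (None, set()))
        known.update(serverids)
        self._entries[storage_index] = (now, known)  # move to most-recently-used
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)        # evict least-recently-used

    def must_query(self, storage_index):
        """Servers a later mapupdate should hear from before it stops."""
        entry = self._entries.get(storage_index)
        if entry is None:
            return set()
        timestamp, known = entry
        if time.time() - timestamp > self._lifetime:
            del self._entries[storage_index]
            return set()
        return set(known)
}}}

In this sketch, a Publish that hits UCWE would call {{{add_servers()}}} with every server named in the surprise, and the next mapupdate for the same storage index would keep querying until everything returned by {{{must_query()}}} had responded, in addition to whatever the normal boundary heuristic requires.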