id summary reporter owner description type status priority milestone component version resolution keywords cc launchpad_bug
893 UCWE when mapupdate gives up too early, then server errors require replacement servers warner "The Incidents reported in #877 (by Zooko, referring to mutable write errors experienced by nejucomo) indicate a thorny problem that is distinct from both #877 (caused by a reentrancy error) and #540 (caused by a logic bug that affects small grids, where a publish wraps around the peer circle).

Here's the setup:

 * mapupdate(MODE_WRITE) wants to find all shares, so they can all be updated. Ideally, the shares are concentrated at the beginning of the permuted peerlist, on the first N servers.
 * To avoid traversing the whole grid, mapupdate(MODE_WRITE) has a heuristic: if we've seen enough shares, and we've also seen a span of contiguous servers (in permuted order) that tell us they do not have a share, then we stop the search. The size of this span was intended to be N+epsilon (where 'epsilon' is a tradeoff between performance and safety, and is set to k). Unfortunately 1.5.0 had a bug, and the span size was set to k+epsilon instead. (A simplified sketch of this stopping rule appears below.)

In the Incident:

 * mapupdate sent out 10 queries, but decided to finish before all the responses had returned, because it found a boundary early. I'm not entirely sure how the sharemap was shaped, but it looks like it stopped with a span of 3 servers (""found our boundary, 11000""), where k=3 and N=10, and returned a sharemap with 8 shares in it, some doubled up (I think there were 5 servers involved).
 * About 150ms after mapupdate finished, one more response came back (from {{{w6o6}}}, with two shares), but it was ignored.
 * Publish starts and sends out updates to 7 servers. Unfortunately, two of these (both owned by secorp) experienced ""Permission Denied"" errors when attempting to write out the new shares, suggesting a configuration error (maybe the tahoe node process is owned by the wrong user).
 * So Publish fails over to new servers. Some of the new servers it picks suffer from the same error, so the failover process repeats a few times.
 * Finally, it fails over to the server {{{w6o6}}} and sends it a new share, thinking that w6o6 has no shares (because the servermap was not updated to include w6o6's late response).
 * w6o6 then responds with a writev failure, because it contains shares that the test vector did not expect, causing a UCWE error.
 * Most of the shares were updated, so the write may have actually happened, even though UCWE was raised.

The biggest problem with this failure is that it is persistent. We don't record any information that would tell a subsequent operation to look further for existing shares, so exactly the same thing will happen the next time we try to modify or repair the directory. If secorp's servers weren't throwing errors, then I think the condition would eventually fix itself: new shares would be placed on his servers, bridging the span of servers without shares, and then later mapupdate calls would keep going until they'd really seen all of the shares.

Recently ([changeset:eb1868628465a243], 08-Dec-2009) I fixed mapupdate to use N+epsilon instead of k+epsilon. But the incident report suggests that it stopped with a span of only 3 no-share servers. Looking more closely at the code, I think it only waits for a span of epsilon (not k+epsilon or N+epsilon), and that [changeset:eb1868628465a243] changed something different. I don't know whether the thing that was changed would have prevented this issue.
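To make the stopping rule concrete, here is a minimal, hypothetical sketch of the heuristic described above. It is not the actual servermap-update logic in {{{allmydata/mutable/servermap.py}}} (which is considerably more involved); the function name, the {{{responses}}} representation, and the 'at least k shares seen' condition are invented for illustration.

{{{
#!python
def may_stop_search(responses, shares_seen, k, N, epsilon):
    """Illustrative stopping rule for a MODE_WRITE mapupdate.

    'responses' is the permuted server list: True means the server answered
    and reported at least one share, False means it answered and reported no
    shares, None means its answer has not arrived yet.  We stop only if we
    have seen enough shares *and* a contiguous run of N+epsilon servers that
    all answered 'no shares'.  The 1.5.0 bug amounted to using k+epsilon
    here, and the incident suggests the code effectively waited for only
    epsilon.
    """
    if shares_seen < k:         # stand-in for 'enough shares'; the real condition differs
        return False
    required_span = N + epsilon
    run = 0
    for has_share in responses:
        if has_share is False:  # answered: definitely no share here
            run += 1
            if run >= required_span:
                return True
        else:                   # holds a share, or has not answered yet
            run = 0
    return False
}}}

The point of the sketch is only the shape of the check: the run of definitely-empty servers has to be long enough, and servers that have not yet answered cannot count toward it.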
It's possible that this is a manifestation of #547 (mapupdate triggers on a false boundary), or #549, or one of the other problems described in #546. In general, we need to query more servers. But even if we increase the span size or epsilon or whatever, there will always be some weird situation that could be handled better if we queried more servers. We'd like something more adaptive: if the code hits UCWE because it didn't try hard enough, then it should try harder next time.

How should we deal with this? We need something to persist from one operation on a given mutable filenode to the next, some sort of hint that says ""Hey, last time we were surprised, so next time you should look further"". Or something that tells us that we learned about shares on servers X+Y+Z, so the next time we do a mapupdate, we shouldn't consider it complete until we've gotten responses from those servers (in addition to any others that we might decide to query).

The most natural place to keep this state would be on the mutable filenode instance. This would help with UCWE that occurs inside a modify() call, because the same filenode is used for each retry, but in general filenodes are pretty short-lived. We don't want to keep the mutable filenode around in RAM forever. Maybe an LRU cache that keeps filenodes around for a few minutes, so that users who experience UCWE and retry the operation can benefit from recent history (a sketch of this kind of cache is below).

A storage protocol that included ""where-are-the-other-shares"" hints (#599) would help: this would improve the reliability of mapupdate, since the persistent information would be kept on the storage servers, next to the shares. A publish process which rebalanced the shares (#232, or #661/#543/#864) might help by filling in the gaps, except that here the gap was caused by a batch of servers all suffering from the same configuration problem.

The right answer probably lies in having UCWE trigger an immediate repair, and having repair fill in the gaps. But it'd be nice if there were a way to stash some information on the shares before the gap that lets later operations know they should look past it.
" defect new critical soon code-mutable 1.5.0 availability preservation upload repair ucwe
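To illustrate the 'remember that we were surprised' idea from the description, here is a minimal, hypothetical sketch of a per-client hint cache. None of this is existing Tahoe-LAFS code; the class name, method names, and expiry policy are invented, and a real design would still need to decide who owns the cache and how its hints reach the mapupdate logic.

{{{
#!python
import time
from collections import OrderedDict

class ServermapHintCache:
    """Hypothetical LRU cache, keyed by storage index, recording servers that
    were recently seen holding shares (or that surprised us with a UCWE), so
    that a retry a few minutes later refuses to declare a MODE_WRITE
    mapupdate complete until those servers have answered."""

    def __init__(self, max_entries=100, lifetime=10*60):
        self._entries = OrderedDict()  # storage_index -> (timestamp, set of serverids)
        self._max_entries = max_entries
        self._lifetime = lifetime      # seconds before a hint is forgotten

    def add_servers(self, storage_index, serverids):
        now = time.time()
        _, known = self._entries.pop(storage_index, (None, set()))
        known.update(serverids)
        self._entries[storage_index] = (now, known)  # move to most-recently-used
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)        # evict least-recently-used

    def must_query(self, storage_index):
        """Servers a later mapupdate should hear from before it stops."""
        entry = self._entries.get(storage_index)
        if entry is None:
            return set()
        timestamp, known = entry
        if time.time() - timestamp > self._lifetime:
            del self._entries[storage_index]
            return set()
        return set(known)
}}}

In this sketch, a Publish that hits UCWE would call {{{add_servers()}}} with every server named in the surprise, and the next mapupdate for the same storage index would keep querying until everything returned by {{{must_query()}}} had responded, in addition to whatever the normal boundary heuristic requires.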