#1795 assigned defect

Incomplete ServerMap triggers UncoordinatedWriteError upon mutable Publish

Reported by: jean
Owned by: davidsarah
Priority: major
Milestone: soon
Component: code-mutable
Version: 1.9.2
Keywords: mutable ucwe servermap test-needed
Cc:
Launchpad Bug:

Description

This error has been seen in the wild while working on the Tamias system, which uses tahoe-lafs as a storage layer. It seems to show up much more often in our testing environment, where many client nodes (not storage nodes) join and leave the network at high frequency.

Before overwriting a mutable file, the client builds a servermap using MODE_WRITE. This mode does not query all servers; it stops querying once 'epsilon' consecutive servers have stated that they do not have a share. When this happens ("hit boundary" in the log), the servermap is considered done if all servers to the left of the boundary have answered.
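The stopping rule described above can be sketched roughly as follows. This is a simplified illustration and not the tahoe-lafs code: find_boundary, is_done, and the convention of a list holding True / False / None (no answer yet) per server are assumptions made for the example.

```python
from typing import List, Optional


def find_boundary(answers: List[Optional[bool]], epsilon: int) -> Optional[int]:
    """Return the index where the first run of `epsilon` consecutive
    'no share' answers begins, or None if there is no such run yet.

    answers[i] is True (has a share), False (no share) or None (no answer
    yet), listed in the order in which the servers are queried.
    """
    run = 0
    for i, answer in enumerate(answers):
        if answer is False:
            run += 1
            if run == epsilon:
                return i - epsilon + 1
        else:
            run = 0
    return None


def is_done(answers: List[Optional[bool]], epsilon: int) -> bool:
    """Done once a boundary exists and every server left of it has answered."""
    boundary = find_boundary(answers, epsilon)
    if boundary is None:
        return False
    return all(a is not None for a in answers[:boundary])
```

Note that in this sketch "answered" only means that a response arrived; it says nothing about whether the share contained in that response has been processed yet.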

In some corner cases, all of those servers have answered, but the timing is such that one server is marked as having a share before its share information has been processed. Because there are several concurrent calls to check_for_done, one of them might decide that the servermap update can stop running, which prevents the last share from being processed.
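A toy model of this ordering problem, for illustration only: the class and method names other than check_for_done are hypothetical, and the real update is driven by asynchronous callbacks rather than direct calls.

```python
class ToyServermapUpdate:
    """Toy model: a server counts as 'answered' before its share is processed."""

    def __init__(self, servers):
        self.servers = set(servers)
        self.answered = set()   # servers whose response has arrived
        self.servermap = {}     # server -> share info that was actually processed
        self.running = True

    def got_response(self, server):
        # Step 1: the server is marked as soon as its response arrives.
        self.answered.add(server)

    def process_share(self, server, share):
        # Step 2: runs later; the result is dropped if the update has stopped.
        if not self.running:
            return False
        self.servermap[server] = share
        return True

    def check_for_done(self):
        # May run between step 1 and step 2 of another server.
        if self.answered == self.servers:
            self.running = False
        return not self.running


def run_scenario():
    update = ToyServermapUpdate(["A", "B"])
    update.got_response("A")
    update.process_share("A", "share 0")
    update.got_response("B")                      # B is marked ...
    update.check_for_done()                       # ... the update stops here ...
    kept = update.process_share("B", "share 1")   # ... so B's share is discarded
    return update.servermap, kept
```

In run_scenario the resulting servermap contains only server "A", even though server "B" answered and holds a share.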

This results in a partial servermap. When the Publish operation starts, it might select the last server (the one missing from the servermap) as a candidate for the missing share. It then issues a testv that checks for the absence of a share. This testv fails because there is a share, and an UncoordinatedWriteError (UCW) is triggered.
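The effect of the partial servermap on the write can be sketched with a toy test-and-set model. Everything here is hypothetical (ToyStorageServer, testv_and_writev, publish_share, ToyUncoordinatedWriteError) and only mimics the idea of a test vector; it is not the storage protocol itself.

```python
class ToyUncoordinatedWriteError(Exception):
    """Stand-in for the UCW error raised when a test vector fails."""


class ToyStorageServer:
    def __init__(self, shares=None):
        self.shares = dict(shares or {})   # share number -> current content

    def testv_and_writev(self, shnum, expected, new):
        """Write `new` only if the current content equals `expected`.

        expected=None means 'this share must be absent'.
        """
        if self.shares.get(shnum) != expected:
            return False
        self.shares[shnum] = new
        return True


def publish_share(servermap, name, server, shnum, new):
    # The expectation comes from the servermap; a server that is missing
    # from the map is assumed to hold no share at all.
    expected = servermap.get((name, shnum))
    if not server.testv_and_writev(shnum, expected, new):
        raise ToyUncoordinatedWriteError(f"testv failed on {name} for share {shnum}")
```

With a complete servermap such as {("B", 0): "v1"} the write to a server holding "v1" succeeds. With an empty (partial) servermap the same write fails, although no other writer touched the share.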

This can be seen in the attached log starting from event 6750. The boundary is found at event 6898, and event 6899 stops the servermap update. Event 6900 has the partial servermap, and events 6903 and 6904 show the processing of the last share, which is filtered out because the servermap update has already been stopped. Events 6918 and 6920 show the servermap before (partial) and after (unfortunately choosing the 'hidden' server whose answer was discarded). This leads to the eventual UCW at event 6955, triggered by the failed testv at event 6953.

In our testing environment, we use the attached workaround, which moves the addition to the good_servers list to the very bottom of the DeferredList that is built per server. This is expected to cause problems when servers have multiple shares, but it is just a temporary fix anyway.
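The idea behind the workaround can be illustrated with a toy callback chain. This is not the attached diff: the function names are hypothetical, and the per-server deferred chain is reduced to a two-element list in which only the order of the callbacks changes.

```python
def run(mark_first):
    """Return (done_at_check, share_processed_at_check) for one server 'B'."""
    good_servers = set()
    processed = {}

    def mark_good():
        good_servers.add("B")

    def process_share():
        processed["B"] = "share data"

    # Original order: mark the server first. Workaround: mark it last.
    chain = [mark_good, process_share] if mark_first else [process_share, mark_good]

    chain[0]()
    # A concurrent check_for_done runs between the two callbacks.
    done_at_check = "B" in good_servers
    share_processed_at_check = "B" in processed
    chain[1]()
    return done_at_check, share_processed_at_check
```

With mark_first=True the check sees the server as done while its share is still unprocessed. With mark_first=False the check cannot see the server as done until its share is already in the map.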

Attachments (2)

changeset_r9db2f65ebb8eaa4f6094f2f99eff928ba285f5f5.diff (904 bytes) - added by jean at 2012-07-25T03:15:12Z.
workaround
ucw_text_transcript.log (14.5 KB) - added by jean at 2012-07-25T03:16:40Z.
Transcript of the incident report


Change History (4)

Changed at 2012-07-25T03:16:40Z by jean

Transcript of the incident report

comment:1 Changed at 2012-07-25T04:25:27Z by davidsarah

  • Keywords mutable ucwe servermap test-needed added
  • Milestone changed from undecided to 1.10.0
  • Owner set to davidsarah
  • Priority changed from normal to major
  • Status changed from new to assigned

comment:2 Changed at 2012-09-04T16:59:49Z by warner

  • Milestone changed from 1.10.0 to 1.11.0