[tahoe-lafs-trac-stream] [tahoe-lafs] #1795: Incomplete ServerMap triggers UncoordinatedWriteError upon mutable Publish

tahoe-lafs trac at tahoe-lafs.org
Wed Jul 25 03:04:46 UTC 2012


#1795: Incomplete ServerMap triggers UncoordinatedWriteError upon mutable Publish
--------------------------+---------------------------
 Reporter:  jean          |          Owner:
     Type:  defect        |         Status:  new
 Priority:  normal        |      Milestone:  undecided
Component:  code-mutable  |        Version:  1.9.2
 Keywords:                |  Launchpad Bug:
--------------------------+---------------------------
 This error has been seen in the wild while working on the Tamias system,
 which uses tahoe-lafs as a storage layer. It shows up much more often in
 our testing environment, where many clients (client nodes, not storage
 nodes) join and leave the network at high frequency.

 Before overwriting a mutable file, the client builds a servermap using
 MODE_WRITE. This mode does not query all servers; it stops querying once
 'epsilon' consecutive servers have stated that they do not have a share.
 When this happens ("hit boundary" in the log), the servermap is considered
 done if all servers to the left of the boundary have answered.
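
 The stopping rule above can be sketched as follows. This is a simplified
 illustration, not the actual tahoe-lafs code: the function names
 (find_boundary, servermap_done) and the list-of-answers representation are
 assumptions made for the example.

```python
from typing import Optional, Sequence


def find_boundary(answers: Sequence[Optional[bool]], epsilon: int) -> Optional[int]:
    """Return the index where the first run of `epsilon` consecutive
    'no share' answers starts, or None if there is no such run yet.

    answers[i] is True (has a share), False (no share) or None (no answer
    yet), listed in the order in which the servers are queried.
    """
    run = 0
    for i, answer in enumerate(answers):
        if answer is False:
            run += 1
            if run == epsilon:
                return i - epsilon + 1
        else:
            # A share holder or a pending answer breaks the run.
            run = 0
    return None


def servermap_done(answers: Sequence[Optional[bool]], epsilon: int) -> bool:
    """Done once a boundary exists and every server left of it has answered."""
    boundary = find_boundary(answers, epsilon)
    if boundary is None:
        return False
    return all(a is not None for a in answers[:boundary])
```

 Note that the rule only looks at whether a server has answered, not at
 whether its share information has been recorded in the servermap.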

 In some corner cases, all those servers have answered, but the timing is
 such that a server is marked as having a share while its share information
 has not been processed yet. Because there are several concurrent calls to
 check_for_done, one of them might decide that the servermap update can
 stop running, which prevents the last share from being processed.
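
 The interleaving can be reproduced with a toy model. This is only a sketch
 of the ordering problem, not the real updater: ToyUpdater, reply_arrived
 and process_share are hypothetical names, and only check_for_done is taken
 from the description above.

```python
class ToyUpdater:
    """Toy model of the race; NOT the real tahoe-lafs servermap updater."""

    def __init__(self, servers_left_of_boundary):
        self.expected = set(servers_left_of_boundary)
        self.answered = set()   # servers whose reply has arrived
        self.servermap = {}     # server -> share number, filled in later
        self.running = True

    def reply_arrived(self, server):
        # Step 1: the server is recorded as having answered.
        self.answered.add(server)

    def process_share(self, server, shnum):
        # Step 2: runs later; results are filtered once the update stopped.
        if not self.running:
            return False
        self.servermap[server] = shnum
        return True

    def check_for_done(self):
        # Called from several reply handlers; only looks at `answered`.
        if self.expected <= self.answered:
            self.running = False
        return not self.running


def demo():
    u = ToyUpdater(["A", "B"])
    u.reply_arrived("A")
    u.process_share("A", 0)
    u.reply_arrived("B")        # B holds share 1, not processed yet
    u.check_for_done()          # fired by some other server's reply
    u.process_share("B", 1)     # dropped: the update already stopped
    return u.servermap


print(demo())  # {'A': 0}
```

 Server B answered and holds a share, yet it never makes it into the
 servermap.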

 This results in a partial servermap. When the Publish operation starts, it
 might select the last server - the one missing from the servermap - as a
 candidate for the missing share. It will then issue a testv (test vector)
 that checks for the absence of a share. This testv fails because the
 server does hold a share, and a UCW (UncoordinatedWriteError) is triggered.
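
 A minimal sketch of why the write is rejected, assuming a heavily
 simplified test-and-write operation. The exception class is a local
 stand-in, and toy_test_and_write and publish_to are hypothetical helpers,
 not the storage server API.

```python
class UncoordinatedWriteError(Exception):
    """Local stand-in for the error named in this ticket."""


def toy_test_and_write(server_shares, shnum, expect_absent, new_data):
    """Write new_data only if the test passes; return True on success.

    Simplification: the only test modelled is 'share must be absent'.
    """
    present = shnum in server_shares
    if expect_absent and present:
        return False  # test vector failed, nothing is written
    server_shares[shnum] = new_data
    return True


def publish_to(server_shares, servermap_entry, shnum, new_data):
    # With no servermap entry, the publisher believes the server is empty
    # and therefore tests for the absence of the share.
    expect_absent = servermap_entry is None
    if not toy_test_and_write(server_shares, shnum, expect_absent, new_data):
        raise UncoordinatedWriteError("test vector failed for share %d" % shnum)
```

 With a partial servermap, servermap_entry is None for the 'hidden' server
 even though it already stores the share, so publish_to raises.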

 This can be seen in the attached log starting from event 6750: the
 boundary is found at 6898, and 6899 stops the servermap update. Event 6900
 has the partial servermap, and events 6903 and 6904 show the last share
 being processed and then filtered out because the servermap update has
 already been stopped. 6918 and 6920 show the servermap before (partial)
 and after (unfortunately choosing the 'hidden' server whose answer was
 discarded). This leads to the eventual UCW at event 6955, triggered by the
 failed testv at 6953.

 In our testing environment, we use the attached workaround, which moves
 the addition to the good_servers list to the very bottom of the
 DeferredList that is built per server. This is expected to cause problems
 when servers have multiple shares, but it is only a temporary fix anyway.
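
 The effect of the reordering can be illustrated with a toy model. This is
 not the attached patch: run_reply and its arguments are hypothetical, and
 only the good_servers name comes from the description above.

```python
def run_reply(expected, shares, mark_first):
    """Handle one reply from server 'B' in two steps, with a done-check
    (triggered by another server's reply) firing between the two steps.

    Returns the resulting toy servermap.
    """
    good_servers = set()
    servermap = {}
    running = True

    def mark():
        good_servers.add("B")

    def process():
        # Share results are dropped once the update has stopped.
        if running:
            servermap["B"] = shares

    steps = [mark, process] if mark_first else [process, mark]
    steps[0]()
    # Concurrent done-check: stops the update if every expected server
    # is already in good_servers.
    if expected <= good_servers:
        running = False
    steps[1]()
    return servermap


print(run_reply({"B"}, [1], mark_first=True))   # {}
print(run_reply({"B"}, [1], mark_first=False))  # {'B': [1]}
```

 Marking the server last means the done-check cannot stop the update
 before that server's share has been recorded.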

-- 
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1795>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage

