[tahoe-lafs-trac-stream] [tahoe-lafs] #1280: if bucket_counter.state or lease_checker.state can't be written, stop the node with an error message (was: bucket_counter.state and lease_checker.state might get corrupted after hard system shutdown)

Thu Aug 18 09:41:27 PDT 2011

#1280: if bucket_counter.state or lease_checker.state can't be written, stop the
node with an error message
--------------------------------+--------------------------------
     Reporter:  francois        |      Owner:  zooko
         Type:  defect          |     Status:  reopened
     Priority:  major           |  Milestone:  1.9.0
    Component:  code-nodeadmin  |    Version:  1.8.1
   Resolution:                  |   Keywords:  pickle reliability
Launchpad Bug:                  |
--------------------------------+--------------------------------
Changes (by zooko):

 * status:  closed => reopened
 * resolution:  fixed =>

Comment:

 Replying to [comment:1 warner]:
 >
 > We should probably also investigate using the
 > "write-to-temp-file/close/rename" dance.

 It appears to me that Brian had already implemented the
 "write-to-temp-file&&close&&rename" dance
 [20090219034633-4233b-3abc69ace6ea2f453b1fdbefa754e7ecbc6b4516 about a
 year before he wrote the comment above]. :-)
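
 For reference, the dance can be sketched in Python roughly as follows.
 This is a minimal sketch, not the code from the patch referenced above:
 the function name `save_state_atomically`, the temp-file prefix, and the
 use of `os.fsync` and `os.replace` (Python 3.3+) are assumptions made for
 illustration.

```python
import os
import pickle
import tempfile


def save_state_atomically(state, path):
    """Pickle `state` to a temp file next to `path`, then rename it over `path`.

    Hypothetical helper: readers of `path` see either the old contents or
    the complete new contents, never a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem) as the
    # target, otherwise the final rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-state-")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # atomically replaces any existing file
    except BaseException:
        # On any failure, remove the temp file and leave the old state alone.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
```

 If any step fails, the exception propagates to the caller and the
 previous state file is left untouched.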

 > One reason not to do this is performance (we should measure the disk
 > I/O hit for a big server). We should also think through the atomicity
 > expectations: if the rename fails, will the state of the pickle
 > adequately match the state of the servers (probably yes, but it's worth
 > a few more minutes of thought)?

 This is a crawler state file, so if the rename fails then the previous
 version of the file is left in place, right? So the crawler will redo
 whatever work it did since the last successful rename. If there is a
 persistent problem that causes the rename to fail (e.g. the permissions
 on the old state file and its parent directory forbid overwriting it),
 then this would be a bad failure mode where the crawler always appears to
 be doing work but never finishes. (The "Sisyphus" Failure Mode. :-))
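
 The "Sisyphus" failure mode described above can be illustrated with a toy
 simulation. Everything here is hypothetical (the names `run_crawler_once`,
 `load_checkpoint` and `failing_save`, and the in-memory "state file"); it
 only models a crawler that silently ignores a failed state write, and is
 not the actual crawler code.

```python
def run_crawler_once(buckets, load_checkpoint, save_checkpoint):
    """Process buckets starting at the last saved checkpoint.

    Returns the number of buckets processed in this run.
    """
    start = load_checkpoint()
    processed = 0
    for index in range(start, len(buckets)):
        processed += 1  # stand-in for the real per-bucket work
        try:
            save_checkpoint(index + 1)
        except OSError:
            # The write/rename failed: the old checkpoint is still in place,
            # and nothing tells the operator about it.
            pass
    return processed


# Simulate a state file that can be read but never overwritten.
checkpoint = {"next_index": 0}


def load_checkpoint():
    return checkpoint["next_index"]


def failing_save(next_index):
    raise OSError("rename failed: permission denied")


buckets = ["b0", "b1", "b2"]
first_run = run_crawler_once(buckets, load_checkpoint, failing_save)
second_run = run_crawler_once(buckets, load_checkpoint, failing_save)
```

 Both runs process every bucket again, and the saved checkpoint never
 advances.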

 Oh, except the failure to do the rename will cause the node to stop with
 an error message, right? Re-opening this ticket and changing it to say "If
 the write fails, stop the node with a clear error message." Why stop the
 node? Because there is no other reliable way to get the operator's
 attention. Also because there is something screwed up here, and stopping
 is the safest course to prevent worse failures.
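
 A minimal sketch of the requested behaviour, under stated assumptions:
 `save_state_or_stop`, `stop_node` and `log` are hypothetical names, and
 `stop_node` stands in for however the node actually shuts itself down.
 This is not the fix that was eventually applied.

```python
import sys


def _log_to_stderr(message):
    sys.stderr.write(message + "\n")


def save_state_or_stop(save_state, state, path, stop_node, log=_log_to_stderr):
    """Try to persist crawler state; on failure, report clearly and stop.

    `save_state(state, path)` is any function that writes the state file and
    raises OSError when it cannot. `stop_node()` is a hypothetical callback
    that shuts the node down. Returns True if the state was written.
    """
    try:
        save_state(state, path)
    except OSError as e:
        log("FATAL: could not write crawler state file %r: %s; "
            "stopping the node" % (path, e))
        stop_node()
        return False
    return True
```

 The point of the design is that the failure is loud and happens exactly
 once, instead of repeating silently on every crawler cycle.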

-- 
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1280#comment:6>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage