[tahoe-lafs-trac-stream] [tahoe-lafs] #1280: if bucket_counter.state or lease_checker.state can't be written, stop the node with an error message (was: bucket_counter.state and lease_checker.state might get corrupted after hard system shutdown)
tahoe-lafs
trac at tahoe-lafs.org
Thu Aug 18 09:41:27 PDT 2011
#1280: if bucket_counter.state or lease_checker.state can't be written, stop the
node with an error message
--------------------------------+--------------------------------
Reporter: francois | Owner: zooko
Type: defect | Status: reopened
Priority: major | Milestone: 1.9.0
Component: code-nodeadmin | Version: 1.8.1
Resolution: | Keywords: pickle reliability
Launchpad Bug: |
--------------------------------+--------------------------------
Changes (by zooko):
* status: closed => reopened
* resolution: fixed =>
Comment:
Replying to [comment:1 warner]:
>
> We should probably also investigate using the
> "write-to-temp-file/close/rename" dance.
It appears to me that Brian had already implemented the
"write-to-temp-file&&close&&rename" dance in
[20090219034633-4233b-3abc69ace6ea2f453b1fdbefa754e7ecbc6b4516], about a
year before he wrote the comment above. :-)
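
For readers unfamiliar with the dance, here is a minimal sketch in Python.
It is not the code from the patch referenced above; the function name
save_state_atomically and the temp-file prefix are hypothetical, and the
fsync call is an extra precaution that the patch may or may not take.

```python
import os
import pickle
import tempfile


def save_state_atomically(state, path):
    """Hypothetical helper: pickle `state` to `path` via temp file + rename."""
    dirname = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory as the target, so the
    # final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dirname, prefix=".tmp-state-")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        # Replace the old state file in one step. If this fails, the
        # previous version of the file is left in place.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file, then let the caller see the error.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
```

A reader of the state file then sees either the old pickle or the new one,
never a half-written file.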
> One reason to not do this is performance (we should measure the disk I/O
> hit for a big server). We should also think through the atomicity
> expectations: if the rename fails, will the state of the pickle adequately
> match the state of the servers (probably yes, but it's worth a few more
> minutes of thought).
This is a crawler state file, so if the rename fails then the previous
version of the file is left there, right? So the crawler will re-do
whatever work it did since the last successful rename. If there is a
persistent problem which causes rename to fail (e.g. the permissions on
the old state file and its parent directory forbid you to overwrite it?),
then this would be a bad failure mode where the crawler always appears to
be doing work but never finishes. (The "Sisyphus" Failure Mode. :-))
Oh, except the failure to do the rename will cause the node to stop with
an error message, right? Re-opening this ticket and changing it to say "If
the write fails, stop the node with a clear error message.". Why stop the
node? Because there is no other reliable way to get the operator's
attention. Also because there is something screwed-up here, and stopping
is the safest course to prevent worse failures.
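
A rough sketch of the proposed behaviour, in Python. Everything here is
an assumption made for illustration: save_state_or_stop is a hypothetical
name, and stop_node stands in for whatever mechanism the node would
actually use to shut itself down. It is not an existing Tahoe-LAFS API.

```python
import os
import pickle
import sys


def save_state_or_stop(state, path, stop_node):
    """Hypothetical: write the state file, or stop the node if that fails.

    `stop_node` is an assumed callback taking one error-message string.
    Returns True if the state was written, False if the node was stopped.
    """
    tmp_path = path + ".tmp"
    try:
        with open(tmp_path, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp_path, path)
    except (OSError, pickle.PicklingError) as e:
        # Do not carry on silently: that is the "Sisyphus" failure mode.
        msg = "could not write state file %r: %s; stopping node" % (path, e)
        print(msg, file=sys.stderr)
        stop_node(msg)
        return False
    return True
```

The point of passing the message to stop_node as well as printing it is
that the operator sees the reason for the shutdown wherever they look
first.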
--
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1280#comment:6>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage