[tahoe-dev] [tahoe-lafs] #1212: Repairing fails if less than 7 servers available
tahoe-lafs
trac at tahoe-lafs.org
Thu Oct 14 04:40:38 UTC 2010
#1212: Repairing fails if less than 7 servers available
------------------------------+---------------------------------------------
Reporter: eurekafag | Owner:
Type: defect | Status: reopened
Priority: major | Milestone: 1.8.1
Component: code-network | Version: 1.8.0
Resolution: | Keywords: reviewed regression repair
Launchpad Bug: |
------------------------------+---------------------------------------------
Comment (by zooko):
I guess something that I haven't made up my mind about yet is how repair
jobs (either the {{{tahoe repair}}} command on the CLI or clicking the
"check-and-repair" button on the WUI) should handle the case where the
upload/repair fails, or partially fails, on some of the files.
Should it proceed to completion, generate a report saying to what degree
each attempt to repair a file succeeded, and exit with a "success" code
(i.e. exit code 0 from {{{tahoe repair}}})? Or should it abort the attempt
to repair that one file, and should it also abort any other file-repair
attempts from the current deep-repair job?
For example, suppose you ask it to repair a single file with {{{K=3, H=7,
N=10}}}, and it finds out that there are only two storage servers
currently connected. One storage server has 3 shares and the other has 0.
Should it abort the upload immediately? Or should it upload a few
shares (3?) to the second storage server, which currently has none, and
then report to you that the file is still unhealthy?
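To make the numbers in this example concrete, here is a minimal sketch
(not Tahoe-LAFS code; the names are made up for illustration). It
simplifies "happiness" to the number of distinct connected servers holding
at least one share, which is enough for this example, although the real
servers-of-happiness measure is defined via a maximum matching between
servers and shares:
{{{
#!python
# Minimal sketch (not Tahoe-LAFS code) of the numbers in the example above.
K, H, N = 3, 7, 10          # shares needed / happy threshold / total shares
shares_per_server = [3, 0]  # two connected servers: one holds 3 shares, one none

def happiness(counts):
    # Simplified: count the distinct servers holding at least one share.
    return sum(1 for c in counts if c > 0)

current = happiness(shares_per_server)            # 1
# At best one distinct server can be counted per connected server, so
# happiness can never exceed the number of connected servers.
best_achievable = min(len(shares_per_server), N)  # 2

print(current, best_achievable, best_achievable >= H)  # 1 2 False
# Repair could raise happiness from 1 to 2 by uploading shares to the empty
# server, but with only two servers it can never reach H=7.
}}}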
Here is one set of principles to answer this question (not sure if this is
the best set):
1. ''Idempotence'': if you run an upload-or-repair job, and it does some
work (uploads some shares), and then you run it again when nothing has
changed among the servers (no servers have joined or left, and none of
them has acquired or lost shares), then the second run will not upload any
shares.
2. ''Forward progress'': if you run a repair job (not necessarily an
upload job!), and it is possible for it to make {{{|M|}}} greater than it
was before, then it will do so.
If we use these principles then we give up on an alternate principle:
3. ''Network efficiency'': if you run an upload or repair job, and it is
impossible for it to make {{{|M| >= H}}}, then it does not use any bulk
network bandwidth. (Also, if it looks possible at first, but after it has
started uploading one of the servers fails and it becomes impossible, then
it aborts right then and does not use any ''more'' of your network
bandwidth.)
I think people (including me) intuitively wanted principle 3 for uploads,
but now that we are thinking about repairs instead of uploads, we
intuitively want principle 2.
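To make the contrast concrete, here is a hypothetical sketch of how a
repairer following principle 2 behaves differently from one following
principle 3 in the example above; none of these function or variable names
come from the Tahoe-LAFS codebase.
{{{
#!python
# Hypothetical sketch (not Tahoe-LAFS code) contrasting principle 2 with
# principle 3 for a single repair decision.

def plan_repair(current_happiness, achievable_happiness, H):
    # Principle 1 (idempotence): if nothing can be improved, do nothing,
    # so a second run over an unchanged grid uploads no shares.
    if achievable_happiness <= current_happiness:
        return "do nothing"
    # Principle 3 (network efficiency) would abort here when H is
    # unreachable:
    #     if achievable_happiness < H:
    #         return "abort without uploading, to save bandwidth"
    # Principle 2 (forward progress): upload anyway, because |M| can still
    # be made larger, even though the file will remain unhealthy.
    if achievable_happiness < H:
        return "upload shares, then report that the file is still unhealthy"
    return "upload shares until the file is healthy"

# The example above: |M| = 1 now, at most 2 achievable, H = 7.
print(plan_repair(1, 2, 7))
# -> "upload shares, then report that the file is still unhealthy"
}}}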
--
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1212#comment:26>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage