[tahoe-lafs-trac-stream] [tahoe-lafs] #755: if there is an unrecoverable subdirectory, the deep-check report (both WUI and CLI) loses other information
tahoe-lafs
trac at tahoe-lafs.org
Mon Jan 17 09:21:03 UTC 2011
#755: if there is an unrecoverable subdirectory, the deep-check report (both WUI
and CLI) loses other information
-------------------------------+--------------------------------------------
     Reporter:  zooko          |      Owner:  francois
         Type:  defect         |     Status:  new
     Priority:  critical       |  Milestone:  1.8.2
    Component:  code-dirnodes  |    Version:  1.4.1
   Resolution:                 |   Keywords:  usability error tahoe-check wui verify repair
Launchpad Bug:                 |
-------------------------------+--------------------------------------------
Comment (by warner):
Replying to [comment:24 francois]:
> Thanks for the review! My comments are inline.
>
> Replying to [comment:22 warner]:
>
> > * we usually repair files that are unhealthy but recoverable. If
> > repair fails, the file should still be recoverable. The
> > post-repair-results are pessimistically being set to healthy=False
> > recoverable=False needs_rebalancing=False, when it's probably (and
> > sometimes certainly) more accurate to copy these values from the
> > pre-repair-results.
>
> I agree with what davidsarah said in [comment:23]: it is difficult to
> know the actual status when an exception was raised during the check
> operation. However, it seems that simply removing the fields from the
> results would necessitate other changes, because I guess that many
> parts of the code expect them to be present.
>
> What would you think about setting healthy to its value before the
> repair (most likely {{{False}}}) and other fields to {{{None}}}?
> Something along those lines?
>
> {{{
> def _repair_error(f):
>     # cr is (presumably) the pre-repair CheckResults captured by the
>     # enclosing closure, and crr the CheckAndRepairResults being
>     # filled in; needs "import copy" at module level
>     prr = CheckResults(cr.uri, cr.storage_index)
>     prr.data = copy.deepcopy(cr.data)
>     prr.set_healthy(crr.pre_repair_results.is_healthy())
>     prr.set_recoverable(None)
>     prr.set_needs_rebalancing(None)
>     crr.post_repair_results = prr
>     crr.repair_successful = False
>     crr.repair_failure = f
>     return crr
> }}}
Ok, but {{{set_recoverable()}}} and {{{set_needs_rebalancing()}}} should
be copied from the pre-repair values too. For immutable files it's
certainly the case that repair cannot make things any worse, so if the
file was recoverable before repair, it will be recoverable afterwards
too. For mutable files, it's fuzzier, but once we get #1209 fixed, then
repair that doesn't involve UCWE collisions or multiple versions should
be strictly an improvement too. I think {{{set_needs_rebalancing()}}} is
roughly the same.
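Concretely, that would turn the sketch above into something like this
(same assumed {{{cr}}}/{{{crr}}} objects as in the snippet; the only
change is copying all three values from the pre-repair results instead
of pessimistically resetting them):
{{{
def _repair_error(f):
    # sketch: report post-repair status no worse than pre-repair status
    prr = CheckResults(cr.uri, cr.storage_index)
    prr.data = copy.deepcopy(cr.data)
    pre = crr.pre_repair_results
    prr.set_healthy(pre.is_healthy())
    prr.set_recoverable(pre.is_recoverable())
    prr.set_needs_rebalancing(pre.needs_rebalancing())
    crr.post_repair_results = prr
    crr.repair_successful = False
    crr.repair_failure = f
    return crr
}}}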
My big concern is a deep-repair run while you're missing a few
servers: every file is missing a few shares, so none of them are
healthy and we try to repair them all, but too many servers are gone
to meet the servers-of-happiness threshold, so repair fails on every
single file. All the files are actually still recoverable, but the
post-repair results suggest that they are not. What I want to avoid is
the deep-repair summary message telling users that 4000 out of 4000
files are now unrecoverable and scaring the socks off them.
> > * it's probably worth checking the code coverage when we exercise
> > {{{test_mutable}}} and make sure the new code is getting run
>
> I don't remember how the code coverage infrastructure in the build
> system actually works. Would you be so kind as to tell me which
> command I should run?
I usually do "{{{make quicktest-coverage}}}", but I think "{{{python
setup.py trial --coverage}}}" (or perhaps "{{{python setup.py trial
--coverage --test-suite test_mutable}}}" to be a bit more selective)
should do the same. That will create a .coverage file with the raw
data. "{{{make coverage-output}}}", or following the commands listed in
that section of the Makefile, will give you an HTML summary with
color-coded source lines.
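In other words, roughly this (target names per the Makefile, behavior
as described above):
{{{
make quicktest-coverage   # run the test suite under coverage; writes .coverage
make coverage-output      # render an HTML summary with color-coded source lines
}}}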
> > * do we have any tests that confirm deep-repair on a tree with an
> > unrecoverable file (or directory) makes it through to the end
> > without an errback? We probably do but I'd like to be sure..
> > probably something in {{{test_deepcheck}}} exercises this.
>
> This is what I think calling {{{do_web_stream_check()}}} inside
> {{{DeepCheckWebBad.test_bad()}}} should be doing, isn't it?
I think that's mostly correct: it looks like {{{set_up_damaged_tree()}}}
creates a root directory with 8 files (half mutable, half immutable),
some of which are unrecoverable. But (1) {{{do_web_stream_check()}}}
doesn't attempt repair, merely deep-check, and (2) there are no
directories in that root, only files. Adding an unrecoverable directory
is the important bit, since I think deep-repair and deep-check have
enough common code paths that exercising deep-check is sufficient. (Note
that I think the 'broken' directory set up there is not used by
{{{do_web_stream_check()}}}.)
> > * I see
> > {{{test_deepcheck.py:DeepCheckWebBad.do_deepcheck_broken()}}}
> > asserts that an unrecoverable dirnode causes the traversal to
> > halt. Is this what we want? Is this ticket about making sure an
> > unrecoverable *file* doesn't halt a deep-repair, or about an
> > unrecoverable *dirnode*? (broken dirnodes are more significant
> > than files, because it means you've probably lost access to even
> > more data). We certainly want the deep-traversal to keep going and
> > repair more things, but we also need to make sure the user learns
> > about the dead dirnode.
>
> Yes, the traversal must continue in both cases. I was under the
> impression that unrecoverable immutable files were already handled,
> and I understand this issue as being about unrecoverable dirnodes.
Yeah, {{{do_web_stream_check()}}} should cover the
unrecoverable-immutable-file case (well, unless there's a difference in
behavior between a web-based {{{t=stream-deep-check}}} and an internal
dirnode-based {{{dirnode.start_deep_check()}}}, which is worth testing).
So I agree, unrecoverable dirnodes is the important thing to check.
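(For reference, the two entry points look roughly like this; the
Monitor-based shape of the internal API is from memory, so double-check
it against the interfaces:)
{{{
# internal, dirnode-based deep-check (sketch):
monitor = dirnode.start_deep_check(verify=False)
d = monitor.when_done()  # fires with the results once traversal finishes
# web-based form, which streams results as it traverses:
#   POST /uri/$DIRCAP?t=stream-deep-check
}}}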
So my hunch here is that we should add an unrecoverable directory to the
'root' tree created in {{{set_up_damaged_tree()}}}, and adjust the
counters to match, and then maybe we should get rid of the 'broken' tree
and {{{do_deepcheck_broken()}}}.
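Roughly something like this inside {{{set_up_damaged_tree()}}} (a
sketch only: {{{create_subdirectory()}}} is the real dirnode API, but
the damage helper below is a placeholder for whatever share-deletion
helpers that test file already uses):
{{{
# sketch: add an unrecoverable subdirectory to the damaged tree
d.addCallback(lambda ign:
              self.root.create_subdirectory(u"subdir-unrecoverable"))
def _break_subdir(subdir):
    # hypothetical helper: delete enough shares that the dirnode can
    # no longer be recovered, mirroring the existing file-damage steps
    self._delete_most_shares(subdir)
    return subdir
d.addCallback(_break_subdir)
# ...and bump the expected unhealthy/unrecoverable counters to match
}}}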
--
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/755#comment:25>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage