[tahoe-lafs-trac-stream] [tahoe-lafs] #755: if there is an unrecoverable subdirectory, the deep-check report (both WUI and CLI) loses other information
tahoe-lafs
trac at tahoe-lafs.org
Mon Jan 17 09:21:03 UTC 2011
#755: if there is an unrecoverable subdirectory, the deep-check report (both WUI
and CLI) loses other information
-------------------------------+--------------------------------------------
     Reporter:  zooko          |      Owner:  francois
         Type:  defect         |     Status:  new
     Priority:  critical       |  Milestone:  1.8.2
    Component:  code-dirnodes  |    Version:  1.4.1
   Resolution:                 |   Keywords:  usability error tahoe-check wui verify repair
Launchpad Bug:                 |
-------------------------------+--------------------------------------------
Comment (by warner):
Replying to [comment:24 francois]:
> Thanks for the review! My comments are inline.
>
> Replying to [comment:22 warner]:
>
> > * we usually repair files that are unhealthy but recoverable. If
> > repair fails, the file should still be recoverable. The
> > post-repair-results are pessimistically being set to healthy=False
> > recoverable=False needs_rebalancing=False, when it's probably (and
> > sometimes certainly) more accurate to copy these values from the
> > pre-repair-results.
>
> I agree with what davidsarah said in [comment:23]: it is difficult to
> know the actual status when an exception was raised during the check
> operation. However, it seems that simply removing the fields from the
> results would necessitate other changes, because I guess that many
> parts of the code expect them to be present.
>
> What would you think about setting healthy to its value before the
> repair (most likely {{{False}}}) and other fields to {{{None}}}?
> Something along those lines?
>
> {{{
> def _repair_error(f):
>     # cr is (presumably) the pre-repair CheckResults captured by the
>     # enclosing closure, and crr the CheckAndRepairResults being
>     # filled in; needs "import copy" at module level
>     prr = CheckResults(cr.uri, cr.storage_index)
>     prr.data = copy.deepcopy(cr.data)
>     prr.set_healthy(crr.pre_repair_results.is_healthy())
>     prr.set_recoverable(None)
>     prr.set_needs_rebalancing(None)
>     crr.post_repair_results = prr
>     crr.repair_successful = False
>     crr.repair_failure = f
>     return crr
> }}}
Ok, but {{{set_recoverable()}}} and {{{set_needs_rebalancing()}}} should
be copied from the pre-repair values too. For immutable files it's
certainly the case that repair cannot make things any worse, so if the
file was recoverable before repair, it will be recoverable afterwards
too. For mutable files, it's fuzzier, but once we get #1209 fixed, then
repair that doesn't involve UCWE collisions or multiple versions should
be strictly an improvement too. I think {{{set_needs_rebalancing()}}} is
roughly the same.
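Concretely, that would turn the sketch above into something like this
(same assumed {{{cr}}}/{{{crr}}} objects as in the snippet; the only
change is copying all three values from the pre-repair results instead
of pessimistically resetting them):
{{{
def _repair_error(f):
    # sketch: report post-repair status no worse than pre-repair status
    prr = CheckResults(cr.uri, cr.storage_index)
    prr.data = copy.deepcopy(cr.data)
    pre = crr.pre_repair_results
    prr.set_healthy(pre.is_healthy())
    prr.set_recoverable(pre.is_recoverable())
    prr.set_needs_rebalancing(pre.needs_rebalancing())
    crr.post_repair_results = prr
    crr.repair_successful = False
    crr.repair_failure = f
    return crr
}}}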
My big concern is a deep-repair run while you're missing a few
servers: every file is missing a few shares, so none of them are
healthy and we try to repair them all, but too many servers are gone
to meet the servers-of-happiness threshold, so repair fails on every
single file. All the files are actually still recoverable, but the
post-repair results suggest that they are not. What I want to avoid is
the deep-repair summary message telling users that 4000 out of 4000
files are now unrecoverable and scaring the socks off them.
> > * it's probably worth checking the code coverage when we exercise
> > {{{test_mutable}}} and make sure the new code is getting run
>
> I don't remember how the code coverage infrastructure in the build
> system actually works. Would you be so kind as to tell me which
> command I should run?
I usually do "{{{make quicktest-coverage}}}", but I think "{{{python
setup.py trial --coverage}}}" (or perhaps "{{{python setup.py trial
--coverage --test-suite test_mutable}}}" to be a bit more selective)
should do the same. That will create a .coverage file with the raw
data. "{{{make coverage-output}}}", or following the commands listed in
that section of the Makefile, will give you an HTML summary with
color-coded source lines.
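In other words, roughly this (target names per the Makefile, behavior
as described above):
{{{
make quicktest-coverage   # run the test suite under coverage; writes .coverage
make coverage-output      # render an HTML summary with color-coded source lines
}}}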
> > * do we have any tests that confirm deep-repair on a tree with an
> > unrecoverable file (or directory) makes it through to the end
> > without an errback? We probably do but I'd like to be sure..
> > probably something in {{{test_deepcheck}}} exercises this.
>
> This is what I think calling {{{do_web_stream_check()}}} inside
> {{{DeepCheckWebBad.test_bad()}}} should be doing, isn't it?
I think that's mostly correct: it looks like {{{set_up_damaged_tree()}}}
creates a root directory with 8 files (half mutable, half immutable),
some of which are unrecoverable. But (1) {{{do_web_stream_check()}}}
doesn't attempt repair, merely deep-check, and (2) there are no
directories in that root, only files. Adding an unrecoverable directory
is the important bit, since I think deep-repair and deep-check have
enough common code paths that exercising deep-check is sufficient. (Note
that I think the 'broken' directory set up there is not used by
{{{do_web_stream_check()}}}.)
> > * I see
> > {{{test_deepcheck.py:DeepCheckWebBad.do_deepcheck_broken()}}}
> > asserts that an unrecoverable dirnode causes the traversal to
> > halt. Is this what we want? Is this ticket about making sure an
> > unrecoverable *file* doesn't halt a deep-repair, or about an
> > unrecoverable *dirnode*? (broken dirnodes are more significant
> > than files, because it means you've probably lost access to even
> > more data). We certainly want the deep-traversal to keep going and
> > repair more things, but we also need to make sure the user learns
> > about the dead dirnode.
>
> Yes, the traversal must continue in both cases. I was under the
> impression that unrecoverable immutable files were already handled,
> and I understand this issue as being about unrecoverable dirnodes.
Yeah, {{{do_web_stream_check()}}} should cover the
unrecoverable-immutable-file case (well, unless there's a difference in
behavior between a web-based {{{t=stream-deep-check}}} and an internal
dirnode-based {{{dirnode.start_deep_check()}}}, which is worth testing).
So I agree, unrecoverable dirnodes is the important thing to check.
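(For reference, the two entry points look roughly like this; the
Monitor-based shape of the internal API is from memory, so double-check
it against the interfaces:)
{{{
# internal, dirnode-based deep-check (sketch):
monitor = dirnode.start_deep_check(verify=False)
d = monitor.when_done()  # fires with the results once traversal finishes
# web-based form, which streams results as it traverses:
#   POST /uri/$DIRCAP?t=stream-deep-check
}}}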
So my hunch here is that we should add an unrecoverable directory to the
'root' tree created in {{{set_up_damaged_tree()}}}, and adjust the
counters to match, and then maybe we should get rid of the 'broken' tree
and {{{do_deepcheck_broken()}}}.
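Roughly something like this inside {{{set_up_damaged_tree()}}} (a
sketch only: {{{create_subdirectory()}}} is the real dirnode API, but
the damage helper below is a placeholder for whatever share-deletion
helpers that test file already uses):
{{{
# sketch: add an unrecoverable subdirectory to the damaged tree
d.addCallback(lambda ign:
              self.root.create_subdirectory(u"subdir-unrecoverable"))
def _break_subdir(subdir):
    # hypothetical helper: delete enough shares that the dirnode can
    # no longer be recovered, mirroring the existing file-damage steps
    self._delete_most_shares(subdir)
    return subdir
d.addCallback(_break_subdir)
# ...and bump the expected unhealthy/unrecoverable counters to match
}}}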
--
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/755#comment:25>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage