#301 closed enhancement (fixed)

t=deep-check with JSON output, for automated checking

Reported by: zooko
Owned by:
Priority: major
Milestone: 1.3.0
Component: code-encoding
Version: 0.7.0
Keywords:
Cc:
Launchpad Bug:

Description

Run "check" on files and directories in an automated, regular way.

It's not clear how the checker process should get the verifier caps that it needs. See bigger, more general ticket #119 -- "lease expiration / deletion / filechecking / quotas".

Change History (7)

comment:1 Changed at 2008-02-07T19:32:00Z by zooko

I think it might be good to go ahead and implement the auto-checker without knowing how it is going to get its verifier caps. Its interface can simply require the client to supply the verifier caps.

comment:2 Changed at 2008-06-01T21:05:14Z by warner

  • Milestone changed from eventually to undecided

comment:3 Changed at 2008-09-03T01:35:36Z by warner

  • Milestone changed from undecided to 1.3.0

comment:4 Changed at 2008-09-04T18:58:06Z by warner

  • Summary changed from automate checking to t=deep-check with JSON output, for automated checking

The plan for this is:

  • provide a deep-check webapi with machine-readable (JSON) output
  • give responsibility for running deep-check to users or grid admins. They should periodically run deep-check (possibly with verify=true, probably with repair=true) on their root-caps.
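
A minimal sketch of how a user or grid admin might script that periodic run, assuming Python's standard library on the client side. The helper names and the example root URL are hypothetical; the query arguments (t=deep-check, output=JSON, verify=true, repair=true) are the ones named in this ticket.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def deep_check_url(target_url, verify=False, repair=True):
    """Build the t=deep-check URL with JSON output for one root."""
    params = [("t", "deep-check"), ("output", "JSON")]
    if verify:
        params.append(("verify", "true"))
    if repair:
        params.append(("repair", "true"))
    return target_url + "?" + urlencode(params)


def run_deep_check(target_url, **options):
    """POST the deep-check request and return the parsed JSON dictionary."""
    url = deep_check_url(target_url, **options)
    request = Request(url, data=b"", method="POST")
    with urlopen(request) as response:
        return json.load(response)


# Hypothetical root URL, for illustration only.
print(deep_check_url("http://127.0.0.1:3456/uri/EXAMPLE-ROOT-CAP", verify=True))
```

A cron job (or similar scheduler) could call run_deep_check() once per root-cap and alert on the returned counts.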

comment:5 Changed at 2008-09-04T20:12:25Z by warner

Here are my docs/webapi.txt additions describing the JSON output for t=check and t=deep-check. I'm working on implementing this now.

POST $URL?t=check

  This triggers the FileChecker to determine the current "health" of the
  given file or directory, by counting how many shares are available. The
  page that is returned will display the results. This can be used as a "show
  me detailed information about this file" page.

  If a when_done=url argument is provided, the return value will be a redirect
  to that URL instead of the checker results.

  If a return_to=url argument is provided, the returned page will include a
  link to the given URL entitled "Return to the parent directory".

  If a verify=true argument is provided, the node will perform a more
  intensive check, downloading and verifying every single bit of every share.

  If an output=JSON argument is provided, the response will be
  machine-readable JSON instead of human-oriented HTML. The data is a
  dictionary with the following keys:

   storage-index: a base32-encoded string with the object's storage index,
                  or an empty string for LIT files
   repair-attempted: (bool) True if repair was attempted
   repair-successful: (bool) True if repair was attempted and the file was
                      fully healthy afterwards.
   pre-repair-results: a dictionary that describes the state of the file
                       before any repair was performed. For LIT files, this
                       dictionary has only the 'healthy' key, which will
                       always be True. For distributed files, this dictionary
                       has the following keys:
     count-shares-good: the number of good shares that were found
     count-shares-needed: 'k', the number of shares required for recovery
     count-shares-expected: 'N', the number of total shares generated
     count-good-share-hosts: the number of distinct storage servers with
                             good shares. If this number is less than
                             count-shares-good, then some shares are doubled
                             up, increasing the correlation of failures. This
                             indicates that one or more shares should be
                             moved to an otherwise unused server, if one is
                             available.
     count-corrupt-shares: the number of shares with integrity failures
     list-corrupt-shares: a list of "share identifiers", one for each share
                          that was found to be corrupt. Each share identifier
                          is a list of (serverid, storage_index, sharenum).
     needs-rebalancing: (bool) True if there are multiple shares on a single
                        storage server, indicating a reduction in reliability
                        that could be resolved by moving shares to new
                        servers.
     servers-responding: list of base32-encoded storage server identifiers,
                         one for each server which responded to the share
                         query.
     healthy: (bool) True if the file is completely healthy, False otherwise.
              Healthy files have at least N good shares. Overlapping shares
              (indicated by count-good-share-hosts < count-shares-good) do not
              currently cause a file to be marked unhealthy. If there are at
              least N good shares, then corrupt shares do not cause the file to
              be marked unhealthy, although the corrupt shares will be listed
              in the results (list-corrupt-shares) and should be manually
              removed to avoid wasting time in subsequent downloads (as the
              downloader rediscovers the corruption and uses alternate shares).
   post-repair-results: a dictionary (with the same keys as
                        pre-repair-results) that describes the state of the
                        file after any repair was performed. If no repair was
                        requested or required, 'pre-repair-results' and
                        'post-repair-results' will be identical. Note that
                        since immutable shares cannot be modified by clients,
                        any corrupt immutable shares in pre-repair-results
                        will remain in post-repair-results.
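
As a rough illustration of how a script might consume the t=check JSON dictionary described above, here is a small sketch. The function name, the message wording, and the SAMPLE values are made up for the example; only the key names come from the text above.

```python
def describe_check(results):
    """Return a list of human-readable problems found in a t=check result."""
    pre = results["pre-repair-results"]
    problems = []
    if not pre["healthy"]:
        problems.append(
            "only %d of %d shares are good (%d needed for recovery)"
            % (
                pre["count-shares-good"],
                pre["count-shares-expected"],
                pre["count-shares-needed"],
            )
        )
    # LIT files carry only the 'healthy' key, so the other keys use .get().
    if pre.get("needs-rebalancing"):
        problems.append("several shares sit on one server; rebalancing advised")
    if pre.get("count-corrupt-shares", 0) > 0:
        problems.append("%d corrupt share(s) found" % pre["count-corrupt-shares"])
    if results["repair-attempted"] and not results["repair-successful"]:
        problems.append("repair was attempted but did not fully succeed")
    return problems


# Illustrative data only, not captured from a real node.
_PRE = {
    "count-shares-good": 7,
    "count-shares-needed": 3,
    "count-shares-expected": 10,
    "count-good-share-hosts": 6,
    "count-corrupt-shares": 0,
    "list-corrupt-shares": [],
    "needs-rebalancing": True,
    "servers-responding": [],
    "healthy": False,
}
SAMPLE = {
    "storage-index": "exampleindex",
    "repair-attempted": False,
    "repair-successful": False,
    "pre-repair-results": _PRE,
    "post-repair-results": _PRE,
}

for problem in describe_check(SAMPLE):
    print(problem)
```

An empty list means nothing in the pre-repair results needs attention.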

POST $URL?t=deep-check

  This triggers a recursive walk of all files and directories reachable from
  the target, performing a check on each one just like t=check. The result
  page will contain a summary of the results, including details on any
  file/directory that was not fully healthy.

  t=deep-check is most useful to invoke on a directory. If invoked on a file,
  it will just check that single object. The recursive walker will deal with
  loops safely.

  This accepts the same verify=, when_done=, and return_to= arguments as
  t=check.

  Be aware that this can take a long time: perhaps a second per object. No
  progress information is currently provided: the server will be silent until
  the full tree has been traversed, then will emit the complete response.

  If an output=JSON argument is provided, the response will be
  machine-readable JSON instead of human-oriented HTML. The data is a
  dictionary with the following keys:

   count-objects-checked: count of how many objects were checked
   count-objects-healthy: how many of those objects were completely healthy
   count-objects-unhealthy: how many were damaged in some way
   count-repairs-attempted: how many objects had a repair attempted. The
                            count-repairs- keys are always provided, but
                            they will all be zero unless repair=true is
                            present.
   count-repairs-successful: how many repairs resulted in healthy objects
   count-repairs-unsuccessful: how many repairs did not result in
                               completely healthy objects
   count-corrupt-shares: how many shares were found to have corruption,
                         summed over all objects examined
   list-corrupt-shares: a list of "share identifiers", one for each share
                        that was found to be corrupt. Each share identifier
                        is a list of (serverid, storage_index, sharenum).
   list-remaining-corrupt-shares: like list-corrupt-shares, but mutable shares
                                  that were successfully repaired are not
                                  included. These are shares that need
                                  manual processing. Since immutable shares
                                  cannot be modified by clients, all corruption
                                  in immutable shares will be listed here.
   list-unhealthy-files: a list of (pathname, check-results) tuples, for
                         each file that was not fully healthy. 'pathname' is
                         relative to the directory on which deep-check was
                         invoked. The 'check-results' field is the same as
                         that returned by t=check&output=JSON, described
                         above.
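
The deep-check dictionary above lends itself to a pass/fail report for automated runs. The sketch below is one possible reading, not part of the webapi: the function name and the rule "fail if any object is unhealthy or any corrupt share remains" are assumptions, and the sample values (including the abbreviated check-results entry) are invented.

```python
def report_deep_check(summary):
    """Return (ok, lines) for a parsed t=deep-check&output=JSON response."""
    lines = [
        "%d of %d objects healthy"
        % (summary["count-objects-healthy"], summary["count-objects-checked"])
    ]
    if summary["count-repairs-attempted"]:
        lines.append(
            "repairs: %d successful, %d unsuccessful"
            % (
                summary["count-repairs-successful"],
                summary["count-repairs-unsuccessful"],
            )
        )
    # Each entry is a (pathname, check-results) pair; JSON delivers it as a list.
    for pathname, _check_results in summary["list-unhealthy-files"]:
        lines.append("unhealthy: %s" % (pathname,))
    remaining = summary["list-remaining-corrupt-shares"]
    if remaining:
        lines.append("%d corrupt share(s) need manual processing" % len(remaining))
    # Assumed policy: any unhealthy object or leftover corruption is a failure.
    ok = summary["count-objects-unhealthy"] == 0 and not remaining
    return ok, lines


# Illustrative data only; the check-results entry is abbreviated.
SAMPLE = {
    "count-objects-checked": 3,
    "count-objects-healthy": 2,
    "count-objects-unhealthy": 1,
    "count-repairs-attempted": 0,
    "count-repairs-successful": 0,
    "count-repairs-unsuccessful": 0,
    "count-corrupt-shares": 0,
    "list-corrupt-shares": [],
    "list-remaining-corrupt-shares": [],
    "list-unhealthy-files": [["photos/cat.jpg", {"healthy": False}]],
}

ok, lines = report_deep_check(SAMPLE)
print("\n".join(lines))
```

A wrapper script could exit non-zero when ok is False so that a scheduler flags the run.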

comment:6 Changed at 2008-09-06T05:44:40Z by warner

Ok, I just split check() and check_and_repair() into separate methods, because they return significantly different results. check() returns a single ICheckerResults instance, whereas the return value of check_and_repair() needs to have two such instances (pre-repair and post-repair), as well as indicating whether repair was attempted or not, and whether it was successful or not. This means there is also a deep_check() and deep_check_and_repair().

I left the webapi alone, but internally the POST t=check (and t=deep-check) implementation calls different methods depending upon the value of the repair= argument.
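
Roughly, the dispatch looks like the sketch below. This is a simplified illustration rather than the actual webapi code: the dispatch function, the FakeNode stand-in, and the single verify argument are assumptions; only the four method names come from the description above.

```python
def dispatch_check(node, deep=False, repair=False, verify=False):
    """Pick the checker method that matches the t= and repair= arguments."""
    if deep:
        method = node.deep_check_and_repair if repair else node.deep_check
    else:
        method = node.check_and_repair if repair else node.check
    return method(verify)


class FakeNode:
    """Stand-in for a file or directory node; each method reports its name."""

    def check(self, verify):
        return ("check", verify)

    def check_and_repair(self, verify):
        return ("check_and_repair", verify)

    def deep_check(self, verify):
        return ("deep_check", verify)

    def deep_check_and_repair(self, verify):
        return ("deep_check_and_repair", verify)


print(dispatch_check(FakeNode(), deep=True, repair=True))
```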

comment:7 Changed at 2008-09-18T05:20:36Z by warner

  • Resolution set to fixed
  • Status changed from new to closed

The webapi now does the right thing, and both mutable and immutable checkers provide the right sort of output. (zooko is still working on the immutable verifier).
