#590 closed enhancement (fixed)

add streaming manifest/deep-checker/repairer

Reported by: warner Owned by: warner
Priority: major Milestone: 1.4.1
Component: code-frontend-web Version: 1.2.0
Keywords: Cc:
Launchpad Bug:

Description

Our current deep-traversal webapi operations (manifest, deep-check, deep-repair, and to a lesser extent deep-stats) run asynchronously. The "start-operation" POST provides an unguessable "ophandle", the node builds up a big table of results, and every once in a while the client polls to see if the operation has completed (by doing a GET to /ophandle/$HANDLE), retrieving a representation of the big table when it's done.
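For reference, the client's side of this polling flow looks roughly like the following sketch (Python 2 with simplejson, matching our other tools; the URL syntax, the poll interval, and the ophandle value are illustrative, not the exact webapi details):

{{{
#!python
# Minimal sketch of the current polling model. URL details are
# illustrative placeholders, not the precise webapi syntax.
import time, urllib2, simplejson

BASE = "http://127.0.0.1:3456"   # assumed local tahoe node webapi
HANDLE = "f3x9q1"                # unguessable ophandle chosen by the client
root_dircap = "$DIRCAP"          # placeholder for the real dircap

# start-operation POST: the node begins building the big table
urllib2.urlopen(BASE + "/uri/" + root_dircap
                + "?t=start-deep-check&ophandle=" + HANDLE, data="")

# poll until the operation finishes, then retrieve the whole table at once
while True:
    resp = urllib2.urlopen(BASE + "/ophandle/" + HANDLE).read()
    table = simplejson.loads(resp)
    if table.get("finished"):
        break
    time.sleep(60)
}}}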

We changed to this async mode from a synchronous model in which the client starts with a GET or a POST, the node builds up a big table of results, then the node returns the representation of that table to the initial GET/POST request. We made this change because a deep-check (in particular deep-verify or repair) can take a very long time, days or weeks for very large/deep directory structures, and HTTP requests felt fragile: browsers get interrupted, network addresses change, routes flap, etc. The idea of losing a lot of progress on a deep-traversal operation felt unacceptable to us.

However, the big table that is generated by any traversal operation (except for the trivial deep-stats) on a large directory structure is a problem all by itself. For several allmydata.com accounts, we're seeing upwards of 500k directories, and millions of files. This results in the tahoe node using something like 700MB of RAM to hold this table, which (depending upon the host) results in swap thrash. In addition, the act of polling for the results causes large memory usage, both in the server (to translate the table into, say, JSON) and in the client (to translate it back).

We're currently trying to improve the efficiency and usability of deep-repair. We'd like to try a more streaming approach: one tool traverses the directory structure and builds up a list of filecaps/dircaps. A second tool then walks that list, performing a check on each file, writing the results to a second list. A third tool reads that check-results list, decides which files need repair, and performs a repair on those files. A dispatch tool could run these components on multiple allmydata.com accounts with whatever degree of parallelism seems appropriate.

All of these tools except the first (the deep-traversal tool) could be interrupted and pick up where they left off, if they kept suitable state about how far they'd gotten through their (linear) list of work to do. The deep-traversal tool would not have this luxury (since its state is basically embedded in the Python stack of the deep-traversal code inside a tahoe node), but hopefully the chances of interrupting a manifest operation are relatively low.
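A minimal sketch of how one of the middle tools could checkpoint its progress (the file layout and the caller-supplied check_one function are assumptions for illustration, not an existing interface):

{{{
#!python
# Hypothetical second-stage tool: walk a manifest (one cap per line),
# check each file, and record how far we've gotten so we can resume.
import os

def run_checks(manifest_path, results_path, state_path, check_one):
    done = 0
    if os.path.exists(state_path):
        done = int(open(state_path).read())
    results = open(results_path, "a")
    for i, cap in enumerate(open(manifest_path)):
        if i < done:
            continue            # already processed before an interruption
        results.write(check_one(cap.strip()) + "\n")
        results.flush()
        # record progress so a restart skips the finished entries
        open(state_path, "w").write(str(i + 1))
}}}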

So, to accomplish this, and to reduce the memory usage of the node and the CLI tool in question, we'd like to have streaming forms of 'build-manifest' and 'deep-check'. These forms will execute in a single HTTP operation (perhaps GET for manifest, since it has no side-effects, and POST for deep-check, since with repair=true it *does* have side-effects). Each form will emit incrementally-parseable units of status as part of its HTTP response body, one unit per file visited. When the traversal is complete, a second kind of unit may be emitted to report e.g. the deep-stats summary.

The CLI tools which use these interfaces will be able to read the response body incrementally, parsing each unit as it arrives, and then writing a summary to a file for further processing. At no point will either the Tahoe node or the CLI tool be holding information about more than a single file at a time. The Tahoe node's stack will, of course, hold state to manage the deep-traversal operation, but the size of this state is related to the depth of the directory tree (the worst case is the size of all ancestors and uncle-nodes of the deepest child node), which is roughly log(N) instead of N.
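The client side then reduces to something like this sketch (the URL and the shape of each unit depend on the final webapi; handle_unit is a hypothetical caller-supplied callback):

{{{
#!python
# Sketch of an incremental client: handle one unit at a time, so memory
# use stays constant no matter how large the directory tree is.
import urllib2, simplejson

def stream_units(url, handle_unit):
    resp = urllib2.urlopen(url, data="")  # POST, since repair=true has side-effects
    while True:
        line = resp.readline()
        if not line:
            break                         # stream closed: traversal is complete
        line = line.strip()
        if line:
            handle_unit(simplejson.loads(line))
}}}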

The individual units can be JSON, as long as the stream has some easily parseable delimiters so the client can know where one unit ends and the next one begins. If we can guarantee that the JSON does not contain a newline, then we can use newlines as delimiters. If we can't enforce any such restrictions on the JSON, then we'd have to use netstrings for each unit, and parsing them is a bit harder.
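For the netstring alternative, each unit would be framed as "LENGTH:DATA," in the standard way; a reader is only slightly more work, as in this sketch:

{{{
#!python
# Sketch of a netstring reader: units may contain newlines, but the
# client has to do explicit length-prefixed framing.
def read_netstring(f):
    length = ""
    while True:
        c = f.read(1)
        if not c:
            return None        # end of stream
        if c == ":":
            break
        length += c
    data = f.read(int(length))
    assert f.read(1) == ","    # each netstring is terminated by a comma
    return data
}}}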

If the individual units have a more custom format (like "\nSI=xxx;CAP=xxx;SIZE=xxx;HEALTHY=true\n"), then we can use tools like 'grep' on the output to e.g. filter out the files that need repair. It may be easier to use JSON and write our tools in Python.
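Even with the custom format, our own tools would likely end up parsing it in Python anyway; a hypothetical filter (using the example field names above, which are not a real schema) might look like:

{{{
#!python
# Hypothetical filter over the "SI=xxx;CAP=xxx;SIZE=xxx;HEALTHY=true"
# format: print the caps of the files that need repair.
import sys

for line in sys.stdin:
    fields = dict(p.split("=", 1) for p in line.strip().split(";") if "=" in p)
    if fields.get("HEALTHY") == "false":
        print fields.get("CAP")
}}}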

Change History (3)

comment:1 Changed at 2009-02-17T05:42:41Z by warner

  • Milestone changed from undecided to 1.3.1
  • Owner set to warner
  • Status changed from new to assigned

476a5c8fac9909d5 adds the stream-deep-check webapi command, which includes repair=true. An earlier patch (which made it into 1.3.0, unlike 476a5c8fac9909d5) added stream-manifest.

The units are JSON, with no internal newlines, so there is one line of output per file/directory examined, plus one at the end with the aggregate stats. We can't use grep on the output, but the Python tool that uses {{{[simplejson.loads(line) for line in out.splitlines()]}}} is easy to write.
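For example, a grep-replacement that pulls out the unhealthy files might look like this (the unit field names here are illustrative, not the exact output schema):

{{{
#!python
# Read stream-deep-check output on stdin, one JSON unit per line, and
# print the caps of files that were found unhealthy. Field names are
# illustrative assumptions.
import sys, simplejson

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    unit = simplejson.loads(line)
    if unit.get("type") == "file" and not unit.get("healthy", True):
        print unit.get("cap")
}}}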

The only piece left is to modify the {{{tahoe deep-check}}} tool to use this streaming API instead of the old polling one. {{{tahoe manifest}}} has already been updated.

I think this is now complete.

comment:2 Changed at 2009-02-17T23:36:32Z by warner

  • Resolution set to fixed
  • Status changed from assigned to closed

fde2289e7b1fda8a updates the CLI "tahoe deep-check" command to use the streaming form.

comment:3 Changed at 2009-02-25T05:48:13Z by warner

And fd4ceb6a8762924c + a3c1fe35d9eda0df update the CLI commands to report errors properly.
