#514 closed enhancement (fixed)

deep-check: progress and cancel

Reported by: warner
Owned by:
Priority: major
Milestone: 1.3.0
Component: code-frontend-web
Version: 1.2.0
Keywords:
Cc:
Launchpad Bug:

Description

Peter pointed out a usability problem with deep-check over lunch today: it takes a long time, and you don't get any feedback as it's running. Also, there's no way to cancel it when you get tired of waiting.

The cleanest solution I could think of would be to add an "operation-handle": the t=deep-check webapi request could include a "handle=XYZ" argument. If so, then instead of waiting until the operation was complete and then showing the results page, it would redirect to an "operation progress" page, keyed by the handle. Each time this page is loaded, it should show a count of how many files/directories have been visited. When the deep-check operation is complete, that page should switch to showing the regular results (the same page that would appear without the handle= argument).

The idea would be that clients would pick a random nonce for the handle= argument, and then keep polling the progress page until it indicates that the operation is complete. The wui could make this easier by putting a random handle= argument in the deep-check form that it provides for each directory.
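The client side of that idea can be sketched as follows (a minimal sketch with illustrative helper names; the actual webapi details were still being designed at this point):

```python
import os
import time

def make_handle():
    # A random nonce is enough here: the handle only needs to be
    # unique, not secret.
    return os.urandom(8).hex()

def poll_until_done(fetch_status, handle, interval=30):
    """Keep polling until the operation completes. fetch_status(handle)
    is assumed to return a dict with at least a 'finished' key (a real
    client would GET the progress page and parse the response)."""
    while True:
        status = fetch_status(handle)
        if status.get("finished"):
            return status
        time.sleep(interval)
```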

The progress page should also have a cancel button. This implies changes to the implementation of dirnode.deep_check to allow the traversal to be cancelled. It isn't strictly necessary to terminate checks that are already in progress; it's enough to prevent any new ones from being started. This suggests a generic progress/cancel object, to be passed into long-running operations, which they can poll occasionally to see if they should keep running.
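A generic progress/cancel object of that kind might look like the following (a sketch; the class, method, and node-structure names are illustrative, not Tahoe's actual API):

```python
class OperationMonitor:
    """Passed into a long-running operation. The traversal polls
    is_cancelled() before starting each new unit of work and calls
    bump() as objects are visited, so a progress page can report a
    running count."""
    def __init__(self):
        self._cancelled = False
        self._visited = 0

    def cancel(self):
        self._cancelled = True

    def is_cancelled(self):
        return self._cancelled

    def bump(self, count=1):
        self._visited += count

    def get_progress(self):
        return self._visited

def deep_check(start_node, monitor):
    # Illustrative traversal over dict nodes: checks already started
    # are allowed to finish, but no new ones begin after cancel().
    pending = [start_node]
    while pending:
        if monitor.is_cancelled():
            break
        node = pending.pop()
        # ... run the per-object check here ...
        monitor.bump()
        pending.extend(node.get("children", []))
    return monitor.get_progress()
```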

Change History (8)

comment:1 Changed at 2008-09-14T15:59:27Z by zooko

The handle=XYZ sounds like it would work, but what about the alternative of having progress or early results sent back in the HTTP response immediately, and the user can cancel the operation by aborting the HTTP connection?

This would be equivalent to the Unix style, where the tool prints out results as it goes, and you hit control-C to stop it. It also seems "RESTful".

comment:2 Changed at 2008-09-17T06:13:47Z by warner

Do you have any ideas how the "early results" could be expressed in human-readable HTML? The only thing I can think of is to emit something like "1%..\n2%..\n3%..\n", and ask the user to scroll down a whole lot to see the full results at the bottom of all the progress messages.

This would require bypassing the nevow XHTML rendering stuff for that page, but I'm ok with that. (I kind of plan to rip out nevow anyway, since it doesn't really add enough functionality to justify the extra build-time dependency).

I can't think of any way to make this work for machine-readable JSON.

comment:3 Changed at 2008-10-07T23:08:53Z by warner

Ok, we've basically agreed on having two ways to drive long-running operations (including deep-size, deep-stats, deep-check, and deep-repair):

  • start, poll(output=HTML)
  • start, poll(output=JSON)

The 'start' operation would look like POST /uri/$URI/?t=deep-check&pollhandle=XYZ, where XYZ is an arbitrary string: the client is obligated to generate a random, unique one.

The 'poll' operation would look like GET /pollresults/XYZ?t=HTML or GET /pollresults/XYZ?t=JSON (where HTML would probably be the default). The HTML pollresults would include a meta-refresh tag (for say 30 seconds) until the operation was complete.

One outstanding design question: should the pollresults contain a link back to the original resource? Probably not: to do so would add the requirement that 'pollhandle' be unguessable as well as unique, which would be a lurking security problem.

A more important question: how long should pollhandles remain valid? Clearly the pollhandle should be kept alive by the operation itself, and they need to stay alive for at least a while after it completes (until the client next polls for results). Should they stay alive after that? What if our policy is:

  • if the operation is still running, the handle remains alive
  • if the operation has completed, but nobody has polled the results since completion, the handle remains alive for one hour or for however long the operation took to complete, whichever is higher
  • if the op is complete and the results have been polled since completion, the handle remains alive for two minutes since the last poll

The choice is a tradeoff between memory consumption (to retain handles that nobody cares about) and making sure that someone sees the results that they asked for.

Another question is the cancel behavior. Perhaps POST /pollresults/XYZ?t=cancel.

comment:4 Changed at 2008-10-08T00:53:21Z by warner

Here's the proposed documentation update:

Slow Operations, Progress, and Cancelling

Certain operations can be expected to take a long time. The "t=deep-check" operation, described below, will recursively visit every file and directory reachable from a given starting point, which can take minutes or even hours for extremely large directory structures. A single long-running HTTP request is a fragile thing: proxies, NAT boxes, browsers, and users may all grow impatient with waiting and give up on the connection.

For this reason, long-running operations have an "operation handle", which can be used to poll for status/progress messages while the operation proceeds. This handle can also be used to cancel the operation. These handles are created by the client, and passed in as an "ophandle=" query argument to the POST or PUT request which starts the operation. The following operations can then be used to retrieve status:

GET /operations/$HANDLE?t=status&output=HTML
GET /operations/$HANDLE?t=status&output=JSON

These two retrieve the current status of the given operation. Each operation presents a different sort of information, but in general the page retrieved will indicate:

  • whether the operation is complete, or if it is still running
  • how much of the operation is complete, and how much is left, if possible

The HTML form will include a meta-refresh tag, which will cause a regular web browser to reload the status page about 30 seconds later. This tag will be removed once the operation has completed.

POST /operations/$HANDLE?t=cancel

This terminates the operation, and returns an HTML page explaining what was cancelled. If the operation handle has already expired (see below), this POST will return a 404, which indicates that the operation is no longer running (either it was completed or terminated).

The operation handle will eventually expire, to avoid consuming an unbounded amount of memory. The rules for handle lifetime are:

  • handles will remain valid at least until their operation finishes
  • uncollected handles for finished operations (i.e. handles for operations which have finished but for which the t=status page has not been accessed since completion) will remain valid for an hour, or for the total time consumed by the operation, whichever is greater. Clients can override this timeout by providing a retain-for= queryarg in the t=status request: the handle will be expired this many seconds after the request.
  • collected handles (i.e. the t=status page has been retrieved at least once since the operation completed) will remain valid for ten minutes, unless overridden by the retain-for= queryarg.
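The lifetime rules above, including the retain-for= override, reduce to a small expiry computation; a sketch (times in seconds, function name illustrative, not the actual implementation):

```python
def handle_timeout(finished, collected, runtime, retain_for=None):
    """Return how many seconds a handle stays valid after the relevant
    event, per the rules above. None means the handle is kept alive by
    the still-running operation itself; retain-for= overrides both
    post-completion defaults."""
    if not finished:
        return None
    if retain_for is not None:
        return retain_for
    if not collected:
        return max(3600, runtime)   # one hour, or the op's runtime
    return 600                      # ten minutes once collected
```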

...

POST $URL?t=deep-check (must add &ophandle=XYZ)

This triggers a recursive walk of all files and directories reachable from the target, performing a check on each one just like t=check. The result page will contain a summary of the results, including details on any file/directory that was not fully healthy.

t=deep-check can only be invoked on a directory. An error (400 BAD_REQUEST) will be signalled if it is invoked on a file. The recursive walker will deal with loops safely.

This accepts the same verify=, when_done=, and return_to= arguments as t=check.

Since this operation can take a long time (perhaps a second per object), the ophandle= argument is required (see "Slow Operations, Progress, and Cancelling" above). The response to this POST will be a redirect to the corresponding /operations/$HANDLE?t=status page (with output=HTML or output=JSON to match the output= argument given to the POST). The deep-check operation will continue to run in the background, and the /operations page should be used to find out when the operation is done.
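As a client-side sketch of the URLs involved (construction only; a real client would URL-quote the directory cap and issue the actual HTTP requests):

```python
def deep_check_urls(dircap, ophandle, output="JSON"):
    """Build the start/poll URL pair for a deep-check, following the
    webapi described above. dircap is assumed to be URL-quoted
    already; a real client would escape it first."""
    start = "/uri/%s/?t=deep-check&ophandle=%s&output=%s" % (
        dircap, ophandle, output)
    poll = "/operations/%s?t=status&output=%s" % (ophandle, output)
    return start, poll
```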

The HTML /operations/$HANDLE?t=status page for incomplete operations will contain a meta-refresh tag, set to 30 seconds, so that a browser which uses deep-check will automatically poll until the operation has completed.

The JSON page (/operations/$HANDLE?t=status&output=JSON) will contain a machine-readable JSON dictionary with the following keys:

  • finished: a boolean, True if the operation is complete, else False. Some of the remaining keys may not be present until the operation is complete.
  • root-storage-index: a base32-encoded string with the storage index of the starting point of the deep-check operation

...
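A client consuming that JSON might do something like this (a sketch; only the two keys listed above are assumed):

```python
import json

def summarize_status(body):
    """Parse a t=status&output=JSON response body. Only 'finished' is
    guaranteed to be present before the operation completes."""
    status = json.loads(body)
    if not status["finished"]:
        return "still running"
    return "finished; root storage index %s" % status.get(
        "root-storage-index", "?")
```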

comment:5 Changed at 2008-10-08T12:53:48Z by zooko

I like the general design of poll-handles having different timeouts if they are unqueried vs. queried at least once.

Maybe the user could specify how long to leave the poll handles valid?

And maybe there could be an explicit "release" operation to invalidate one?

For some deployments, say a friendnet grid in which I'm backing up my stuff to storage servers owned by friends, using a gateway owned either by me or by a friend, and a client owned by me, I would probably want unqueried poll-handles to live for a few days.

comment:6 Changed at 2008-10-08T19:29:28Z by warner

yeah, I suppose it entirely depends upon what the operation is and what you're planning to do with it. A deep-repair has value even if you never look at the results, but if/when we extend this approach to include uploads themselves, then an unlinked upload's results are pretty critical.

What if we just let each t=status query include an argument for how much longer the handle should be kept alive, with some reasonable defaults. The "release" operation would then just be a "t=status&retainfor=0" query. The "reasonable" default would be the timeouts listed above, but an application which knows how/when it plans to do the queries could completely ignore+override them.
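Under that scheme, an explicit release is just a status poll with zero retention; sketched as URL construction (using the retain-for= spelling that the final implementation settled on):

```python
def release_url(handle):
    # Polling with a zero retention expires the handle right after
    # this status request, which serves as an explicit "release".
    return "/operations/%s?t=status&retain-for=0" % handle
```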

comment:7 Changed at 2008-10-09T17:39:35Z by zooko

Okay, and also can I control how long the handle will be kept before my first t=status query? Like perhaps pass retainfor=1000 to the t=deep-check?

comment:8 Changed at 2008-11-06T19:46:12Z by warner

  • Milestone changed from undecided to 1.3.0
  • Resolution set to fixed
  • Status changed from new to closed

This code is all in place. The queryargs are:

  • retain-for=
  • release-after-complete=True

and the default timer values (if you don't pass retain-for=) are:

  • handles will remain valid at least until their operation finishes
  • uncollected handles for finished operations (i.e. handles for operations which have finished but for which the GET page has not been accessed since completion) will remain valid for one hour, or for the total time consumed by the operation, whichever is greater.
  • collected handles (i.e. the GET page has been retrieved at least once since the operation completed) will remain valid for ten minutes.