[tahoe-dev] automatic repair/renewal : where should it go?

Brian Warner warner at lothar.com
Thu Aug 27 01:18:52 PDT 2009


I'd like to provoke a discussion about file repair, to see if we can
make some progress on improving the process. The main tickets on this
issue are #643 "(automatic) Tahoe Repair process", in which newcomers
sensibly think that files are repaired automatically (and are
disappointed to learn that they are not), and #483 "repairer service".
#543 (rebalancing manager) and #777 (CLI command to deep-renew all
aliases, not just one) are peripherally related.

There are actually three periodic operations that need to be done. The
first is lease-renewal (locating all shares and updating their lease
timers). This may or may not be necessary in the future, depending upon
how we do Accounting, but for now, if your servers are configured to
expire shares at all, then clients are obligated to update a lease on
every file on a regular basis. The second is file-checking (locating and
counting all shares). The third is file-repair (if the file-checker says
there are too many missing shares, make new ones).

Lease-renewal currently uses a distinct backend storage-server call. The
file-checker code has an option to additionally renew leases at the same
time (the "do you have a share" and the "please renew my lease on this
share" calls are pipelined, so it does not add roundtrips). I felt that
it made more sense to add it to the file-checker code than, say, the
download code, because the Downloader can stop after finding "k" shares,
whereas the Checker is obligated to find all of them, and of course
lease-renewal needs to hit all shares too.
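
To make "pipelined" concrete: both requests go out before either response
comes back, so the renewal rides along in the same round trip. The sketch
below is only illustrative, not the actual Checker code, and the method
names on the storage-server reference are stand-ins for the real remote
interface:

    from twisted.internet import defer

    def check_and_renew(rref, storage_index, renew_secret):
        # Send both requests back-to-back, without waiting for the first
        # response before issuing the second: no extra round trip.
        d1 = rref.callRemote("get_buckets", storage_index)
        d2 = rref.callRemote("renew_lease", storage_index, renew_secret)
        return defer.DeferredList([d1, d2], consumeErrors=True)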

Repair is currently implemented as a Downloader and an Uploader glued
together, bypassing the decryption step. I think this is the best
design, although I'm also looking forward to the Uploader being taught
to handle existing shares better. The current Repairer is pretty
inefficient when the server list has changed slightly: it will readily
put multiple shares on the same server, and you'll easily wind up with
multiple copies of any given share. Some day that will be better (#610),
but for now the process works well enough.
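
In outline (with hypothetical helper names, not the real Downloader and
Uploader classes), the repair flow is just:

    def repair(verifycap):
        # Downloader half: fetch any k shares and FEC-decode them back
        # into ciphertext. No readcap is involved, so nothing is decrypted.
        ciphertext = download_ciphertext(verifycap)
        # Uploader half: re-encode into N shares and place them. Today
        # this can duplicate shares that already exist somewhere (#610).
        return upload_ciphertext(ciphertext)

with all the real complexity hiding inside those two helpers.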


Now, I've gone back and forth on where I've thought the higher-level
"Repair Process" functionality ought to be. There are two main use
cases. One is the normal end-user, who has one client node, and nobody
else who will do work on their behalf. This user needs to do any
checking/repairing/renewing all by themselves. The other is the
allmydata.com case, where end-users are paying a central service to
handle maintenance tasks like this, so repair/etc will be done by a
different machine (which preferably does not have access to the
plaintext). Smaller grids might also use a central service for this sort
of thing: some of the storage-server operators might also be willing to
run repair services for their friends.

Now, I've tried to build the Tahoe code and its interfaces from the
bottom up: when we weren't sure how to build the high-level things, I
tried to at least provide users/developers with the low-level tools to
assemble things like a Repair process on their own. Adding --repair to
the deep-check webapi operation and CLI command is an example of this. A
normal end-user can nominally get all of their renew/repair needs taken
care of with a cron job that runs "tahoe deep-check --repair --add-lease
ALIAS:".

But, that's not very convenient. You don't know how long it will take,
and you don't want subsequent runs to overlap, so you don't know how
frequently to schedule the cronjob. Very large directory structures
could take days (one allmydata customer's tree took a few weeks to
traverse). If it gets interrupted, you lose all the progress you've made
in that time. And there's no way to prioritize file-checking over
repair, or repair of some objects (like directories) over files, or the
most damaged objects over less-damaged objects, or to defer repair until
later. Transient unavailability of a server or two will look just like
damage, so if you repair right away, you'll be doing a lot more work
than necessary (you might want to defer repair until you've seen the
same "damage" persist for a couple of days, or weeks).
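
The overlapping-runs problem, at least, can be papered over with a small
wrapper around the cron job. This is a hypothetical stdlib-only sketch
(the lock-file path is just an example), not a shipped tool:

    import fcntl, os, subprocess, sys

    LOCKFILE = os.path.expanduser("~/.tahoe/deep-check.lock")  # example path

    def main():
        lock = open(LOCKFILE, "w")
        try:
            # If the previous deep-check is still running, skip this run
            # instead of letting overlapping traversals pile up.
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except IOError:
            sys.stderr.write("previous deep-check still running; skipping\n")
            return 0
        return subprocess.call(["tahoe", "deep-check", "--repair",
                                "--add-lease", "ALIAS:"])

    if __name__ == "__main__":
        sys.exit(main())

But that only addresses one of the problems, and only crudely.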

There are ways to address some of these with additional "userspace"
tools (ones which live above the webapi or CLI layer), but it's
difficult and touchy. Many of those goals require a persistence layer
which can remember what needs to be done, and a long-running process to
manage scheduling, and then the code would spend most of its life
speaking HTTP and JSON to a nearby client node, telling it what to do
and interpreting the results.

Since the Tahoe node already has these things, it probably makes sense
to perform these tasks in "nodespace" (below the webapi layer, in the
tahoe client process itself), with perhaps some kind of webapi/CLI layer
to manage or monitor it. The code that runs these tasks would then run
faster (direct access to IClient and IFileSystemNode objects, no HTTP or
JSON parsing in the way), and there would only be one service to start.

The second use case (allmydata.com, central services) *really* wants a
prioritizing scheduler, since it is combining jobs from thousands of
users, examining tens or hundreds of millions of files. Also, if those
users have files in common, the central process can save time by only
checking each file once.

We've gone back and forth over the design of these services, as we
alternately try to emphasize the "run your own node, let it manage
things for you" use case, or the "central services will manage things
for you" case, as well as the "tahoe will do this for you" vs "here are
the tools, go write it yourself" thing. There aren't really too many
differences between the goals for a node-local Repairer service and a
centrally-managed one:

 * a local repair service would be allowed to look at real filecaps,
   whereas a central one should be limited to repaircaps/verifycaps
 * a local repair service may run few enough jobs to be satisfied with a
   single worker client node. A central service, providing for thousands
   of users, may require dozens of worker nodes running in parallel to
   make reasonable progress
 * a local repair service would be a part of the tahoe client node,
   displaying status through the webapi, and configured through
   tahoe.cfg and CLI commands. Its presence should not increase the
   install-time load of Tahoe (i.e. no additional dependencies or GUI
   libraries, etc). A central service, living outside the context of a
   client node, may have other UI avenues (Gtk?), and can justify
   additional dependencies (MySQL or something).

We cannot yet repair read-only mutable files (#625, #746), and we
require readcaps for directory traversal (#308), so deep-repair
currently requires directory writecaps. This may be marginally
acceptable for the allmydata.com central server, but not for a
friendnet's repair services. One plan we've discussed would be to have
client nodes build a "manifest" of repaircaps and submit it to a central
service. The service would maintain those files until the client
replaced the manifest with a new version. (this would bypass the #308
problem, by performing traversal on the client, but would still hit the
can-only-repair-mutable-writecaps problem). So, some of these goals must
be deferred.
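
For what it's worth, the client half of that manifest plan might be no
more than gathering repaircaps and handing them over; the service URL and
the JSON-over-HTTP format below are made up purely for illustration:

    import json, urllib2

    REPAIR_SERVICE_URL = "http://repair.example.org/manifest"  # hypothetical

    def submit_manifest(repaircaps):
        # Replace the previously-submitted manifest wholesale; the service
        # keeps maintaining these caps until the next replacement arrives.
        body = json.dumps({"repaircaps": list(repaircaps)})
        req = urllib2.Request(REPAIR_SERVICE_URL, body,
                              {"Content-Type": "application/json"})
        return urllib2.urlopen(req)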


Now, the "repair service" design that we've considered, the kind that
would live inside a tahoe client node and work on the behalf of a single
local user, would probably look like this:

 * use a persistent data structure (probably SQLite), to track history
   and manage event scheduling across node reboots (one guess at a
   schema is sketched just after this list)

 * periodically (perhaps weekly), do a "deep-check --add-lease" on
   certain rootcaps, as named by tahoe.cfg. Keep track of which
   dirnodes have been visited, to avoid losing too much progress when a
   node bounce occurs during the scan.

 * occasionally do a "check --verify" to run the Verifier on each file,
   probably as a random sampling, to confirm that share data is still
   retrievable and intact. This is significantly more bandwidth
   intensive than merely sending "do you have share" queries (even more
   intensive than regular download, since it usually downloads all N
   shares, not just k). So it must be rate-limited and balanced against
   other needs.

 * record damaged files in the DB. Maybe record a deep-size value for
   quick queries. Maybe record information about all files.

 * a separate process would examine the records of damaged files, sort
   the weakest ones to the top, apply repair policy to decide which
   should be repaired, and begin repair work

 * bandwidth/CPU used by the checker loop and the repairer loop should
   be limited, to prioritize other Tahoe requests and other non-Tahoe
   uses of the same host and network

 * provide status on the repair process, how much work is left to go,
   ETA, etc

 * maybe, if we cache where-are-the-shares information about all files,
   then we can provide an interface that says "server X is going away /
   has gone away, please repair everything that used it". This could
   provide faster response to server loss than a full Checker pass of
   all files. The Downloader might also take advantage of this cache to
   speed up peer-selection.
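
To make the persistence question concrete, here is one guess at what the
SQLite side could look like. Every table and column name is up for
debate; this is only meant to seed discussion:

    import sqlite3

    SCHEMA = """
    CREATE TABLE IF NOT EXISTS objects (
      repaircap      TEXT PRIMARY KEY, -- or verifycap, for a central service
      is_dirnode     INTEGER,          -- directories get repair priority
      deep_size      INTEGER,          -- cached for quick queries
      last_checked   INTEGER,          -- unix time of the last Checker pass
      last_verified  INTEGER,          -- unix time of the last Verifier pass
      shares_found   INTEGER,          -- result of the most recent check
      shares_needed  INTEGER,          -- k
      shares_total   INTEGER,          -- N
      first_damaged  INTEGER           -- when damage was first seen, or NULL
    );
    CREATE TABLE IF NOT EXISTS share_locations (
      repaircap  TEXT,
      shnum      INTEGER,
      serverid   TEXT,                 -- answers "which files used server X"
      PRIMARY KEY (repaircap, shnum, serverid)
    );
    """

    def open_db(path="repair.sqlite"):
        db = sqlite3.connect(path)
        db.executescript(SCHEMA)
        return db

The first_damaged column is what would let a policy defer repair until
the same damage has persisted for a couple of days or weeks, and the
share_locations table is the cache that both the "server X is going away"
query and the Downloader speedup would lean on.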

One big challenge with the Checker/Repairer is that, for the most part,
its job will be very very bursty. It will spend months looking at files
to determine that, yes, that share is still there. Then, a server will
go away, and boom, there are thousands of files that need repair. Or,
the server will go away for a few hours, and any files that happen to
get checked during that time will appear damaged, but if the Checker is
run again later, they'll be back to normal. When a server really has
gone away, there will be a lot of repair work to do.

The distribution gets better as you have more servers, but even so, it
will probably help to make the "R" repair threshold be fuzzy instead of
a strict cutoff. The idea would be:

 * be willing to spend lots of bandwidth on repairing the really weak
   files (those closest to the edge of unrecoverability). If you have a
   3-of-10 encoded file with only 3 shares left, drop everything and
   repair it quick
 * then spend less bandwidth on repairing the less-damaged files
 * once you're down to all files having >R shares, still spend some
   bandwidth randomly repairing some of them, slowly

You want to slowly gnaw away at the lightly-damaged files, making them a
little bit healthier, so that when a whole server disappears, you'll
have less work to do. The Repairer should be able to make some
predictions/plans about how much bandwidth is needed to do repair: if
it's losing ground, it should tell you about it, and/or raise the
bandwidth cap to catch up again.
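
That fuzzy threshold might boil down to a scoring function shaped
something like this (the exact curve is a placeholder; the point is that
urgency spikes as a file approaches unrecoverability but never drops all
the way to zero):

    def repair_urgency(shares_found, k, N):
        # 1.0 means "drop everything and repair now"; a small nonzero
        # value means "gnaw away at it when bandwidth is spare".
        if shares_found <= k:
            return 1.0    # at (or past) the edge of unrecoverability
        if shares_found >= N:
            return 0.05   # fully healthy: still repair once in a while
        # Linear falloff between "barely recoverable" and "fully healthy";
        # a real policy would probably weight the weak end more heavily.
        margin = float(shares_found - k) / (N - k)
        return max(0.05, 1.0 - margin)

Sorting the damaged-file records by that score, highest first, gives the
"weakest ones to the top" ordering described above, and the bandwidth
caps then decide how far down the list the Repairer gets each day.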


So... does this seem reasonable? Can people imagine what the schema of
this persistent store would look like? What sort of statistics or trends
might we want to extract from this database, and how would that
influence the data that we put into it? In allmydata.com's pre-Tahoe
"MV" system, I really wanted to track some files (specifically excluded
from repair) and graph how they degraded over time (to learn more about
what the repair policy should be). It might be useful to get similar
graphs out of this scheme. Should we / can we use this DB to track
server availability too?

How should the process be managed? Should there be a "pause" button? A
"go faster" button? Where should bandwidth limits be imposed? Can we do
all of this through the webapi? How can we make that safe? (i.e. does
the status page need to be on an unguessable URL? how about the control
page and its POST buttons?). And what's the best way to manage a
loop-avoiding depth-first directed graph traversal such that it can be
interrupted and resumed with minimal loss of progress? (this might be a
reason to store information about every node in the DB, and use that as
a "been here already, move along" reminder).
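
For that last question, one plausible shape (using the DB as the "been
here already" reminder; the helper names are hypothetical) is an explicit
stack plus a persistent visited set:

    def deep_traverse(db, rootcap, visit):
        # Interruptible, loop-avoiding depth-first traversal. The visited
        # set lives in the database, so a node bounce only costs us the
        # directories that were in flight when it happened.
        stack = [rootcap]
        while stack:
            dircap = stack.pop()
            if db_already_visited(db, dircap):  # hypothetical DB lookup
                continue                        # loop avoidance + resume point
            visit(dircap)
            for childcap, is_dir in list_children(dircap):  # hypothetical
                if is_dir:
                    stack.append(childcap)
                else:
                    visit(childcap)
            db_mark_visited(db, dircap)         # mark only after children queued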

cheers,
 -Brian

