[tahoe-dev] Services, Managers, and Methods: on recent consolidation of tickets

Brian Warner warner at lothar.com
Tue Dec 29 22:24:49 PST 2009


Our Trac instance has been very busy lately! It's been exciting to see
dozens of ticket updates happening every day. David-Sarah Hopwood, in
particular, is like a tagging/commenting/categorizing machine! Way to
go!

Since some of the work has been to consolidate tickets that appear to
address the same issue in different ways, I thought I should mention
some ideas and frameworkish concepts that should be made more explicit.
There are some notions that have been in our heads and on whiteboards
for a while, and implicitly referenced by the terminology we've used in
tickets (especially older ones), which trac editors should be aware of.

A Tahoe grid wants maintenance. Some tasks which want to be done on a
periodic basis include:

 * enumeration of reachable files for any given rootcap
 * lease renewal
 * checking of files, with optional verification of ciphertext integrity
 * repair of files found to be unhealthy by the checker
 * rebalancing of shares after server churn
 * accounting
   * currently this means calculating deep-size for user rootcaps
   * eventually it will include other stuff

There are several different places where these tasks might be performed,
with good and bad points to each (for different parties). Our ideas
about where these tasks should run have changed over time, as different
use cases rise and fall and as the realities of implementation become
more or less obvious:

 * by end-user client machines
 * by other client machines, offered by benevolent grid members
 * by centralized services offered by a company like AllMyData

In addition, the implementation and invocation of these tasks can
conceptually (and code-wise) reside in a few different places (the
first two styles are sketched in code after this list):

 * method calls on FileNode instances inside a Tahoe process, like
   client.create_node_from_uri(filecap).check_and_repair()
 * specialized services that live inside a Tahoe process, like
   client.getServiceByName("checker").add_job(filecap)
 * standalone processes living in a centralized host, reached via HTTP
   or other RPC method
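
To make the first two styles concrete, here's a rough Python sketch
reusing the calls quoted above. Note that add_job() and the "checker"
service name are made up for illustration, and the real
check_and_repair() takes more arguments than shown here:

  # Illustrative only: the FileNode-method style vs. the in-node
  # service style. `client` is a running Tahoe client node; add_job()
  # and the "checker" service name are hypothetical, not existing APIs.

  def check_one_file_now(client, filecap):
      # Style 1: build a FileNode and do the work immediately.
      node = client.create_node_from_uri(filecap)
      return node.check_and_repair()  # simplified signature

  def hand_file_to_service(client, filecap):
      # Style 2: hand the filecap to a long-lived service inside the
      # node, which schedules the work according to its own policy.
      client.getServiceByName("checker").add_job(filecap)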

We first imagined a lot of these tasks being run on AllMyData hardware
on behalf of its customers. Checking and repairing in a centralized
grid, in particular, is a lot faster if you do it on a machine with
gigabit local bandwidth to the servers than if you do it from the wrong
end of a DSL connection. We designed a lot of the encoding formats to
support this goal, in particular the choice to put integrity information
outside the ciphertext rather than inside (encrypt-then-sign, instead of
sign-then-encrypt). This allows a ciphertext-only Verifier. The
erasure-coding approach was also chosen to enable this, so that a
Repairer (at least the immutable repairer) can create new shares without
plaintext access.

What we imagined was a big machine with a database of work to be done,
indexed by verifycaps or storage-index values. It would be fed by
end-user clients who ran a "tahoe manifest" (to get a list of verifycaps
for all reachable files) every once in a while. The deal would be that
anything on your most-recently-submitted list would be maintained for
you, as part of your AllMyData subscription price. It would walk this
list on a regular basis, doing a Checker operation on everything.
Anything that showed signs of trouble would be put on a queue to the
Repairer, which would use the same database and spend all its time
repairing things. It would also keep track of file health over time,
allowing us to learn valuable data about how files in a large real-life
Tahoe grid degrade without repair (we could seed it with files that were
excluded from repair, to track their dissolution, and thus set repair
thresholds and check periods and stuff).
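
To pin down what "a database of work to be done" might look like,
here's a hypothetical sqlite schema for that big machine. None of this
exists; the table and column names are invented to illustrate the
indexing by storage-index/verifycap, the manifest feed, and the repair
queue:

  import sqlite3

  # Hypothetical schema for the central maintenance database: one row
  # per object, a record of which customer manifests reference it, and
  # a queue of things the Checker has flagged for the Repairer.
  db = sqlite3.connect("maintenance.sqlite")
  db.executescript("""
  CREATE TABLE IF NOT EXISTS objects (
      storage_index TEXT PRIMARY KEY,
      verifycap     TEXT NOT NULL,
      size          INTEGER,  -- enables deep-size style accounting
      last_checked  INTEGER,  -- unix time of the last Checker pass
      healthy       INTEGER   -- result of that pass; NULL = unknown
  );
  CREATE TABLE IF NOT EXISTS manifest_refs (
      account_id    TEXT NOT NULL,  -- whose manifest listed this object
      storage_index TEXT NOT NULL,
      submitted_at  INTEGER,
      PRIMARY KEY (account_id, storage_index)
  );
  CREATE TABLE IF NOT EXISTS repair_queue (
      storage_index TEXT PRIMARY KEY,
      queued_at     INTEGER
  );
  """)
  db.commit()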

Since this machine would basically know where every share was located,
it could also remember how large they are, and whose manifests reference
them, and could therefore provide a useful measure of Accounting
information. Users could choose to hide files from this machine, and
therefore from the accounting and/or billing system, but then they would
have to take responsibility for their own maintenance on those files. We
figured this was a sufficient soft-enforcement mechanism to start with.

But we knew that end-users would need access to these tools too, and
that really the end-user access should come first, with the central
service being a convenient optimization. AllMyData customers could run
the end-user tools themselves if they wanted. So we built the
client-side tools first. And we weren't really sure how the
database-driven manager service thingy should work, so we built really
small tools: methods on individual filenode objects, and a few
deep-traversal tools to drive them, and some webapi calls to invoke
them, and some CLI tools to make those calls. This is as far as we've
gotten along this path: we haven't yet really built any of the larger
tools.

So we currently have bin/tahoe's manifest, deep-size, check, and
deep-check commands, along with their --repair and --add-lease options.
With these, in theory, end users can simply run a periodic deep-check
--repair --add-lease call to perform maintenance on everything they can
reach from their rootcap.
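
For concreteness, here's a minimal sketch of the periodic job this
implies, written as a small Python script you could run from cron. It
assumes a running node with the CLI on your PATH and that the default
"tahoe:" alias points at the rootcap you want maintained:

  import subprocess
  import sys

  # Minimal periodic-maintenance driver, suitable for a weekly cronjob.
  # Assumes a running Tahoe node and that the "tahoe:" alias points at
  # the rootcap whose reachable files should be checked and repaired.
  def run_maintenance(alias="tahoe:"):
      result = subprocess.run(
          ["tahoe", "deep-check", "--repair", "--add-lease", alias])
      return result.returncode

  if __name__ == "__main__":
      sys.exit(run_maintenance())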

Where that first runs into trouble is on larger filesystems, where it
takes a very long time to traverse everything (there was one allmydata
customer who had something like a million dirnodes, and that account
took several weeks to simply build a manifest: actually running the
Checker on each object would take even longer). The "deep-check" webapi
operation is not persistent across node reboots, and the CLI command
that drives it is not persistent across HTTP connections.

Another problem is that the most common reason for the Checker to notice
missing shares is that the hosting server is temporarily offline.
Servers get bounced all the time, they have kernel crashes, and recently
the USB sticks that we oh-so-cleverly used to host the prodnet servers'
kernel and root filesystem have demonstrated that flash memory doesn't
like to be treated like a fixed disk (I still think it's a clever idea,
but either you have to replace them every six months, or spend real
engineering time to mount them read-only or something).

So it's not really appropriate for the Repairer to be invoked at the
first sign of trouble. We'd really prefer the Checker to pay attention
to changes over time: if it sees that a share used to be on server XYZ
but isn't there now, it should tolerate that at first; only once we
haven't heard from XYZ in a while is it time to find a new home for
that share.
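
Something like the following captures the policy I mean; the 30-day
grace period and the shape of the inputs are made up for illustration:

  import time

  # Illustrative policy only: tolerate shares on a briefly-absent
  # server, but ask for a repair once that server has been silent for
  # longer than some grace period (30 days here, picked arbitrarily).
  GRACE_PERIOD = 30 * 24 * 3600

  def needs_repair(share_locations, last_seen, now=None):
      # share_locations: dict of shnum -> serverid where the share was
      # last known to live. last_seen: dict of serverid -> unix time we
      # last heard from that server.
      now = now if now is not None else time.time()
      for shnum, serverid in share_locations.items():
          if now - last_seen.get(serverid, 0) > GRACE_PERIOD:
              return True  # that server is really gone: re-home it
      return False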

So we'd like to have something more sophisticated than a single
short-lived short-memory short-sighted CLI command. The "maintenance
service" we've envisioned to address these problems would be a
specialized service (i.e. a twisted.application.service.Service
instance) that lives inside the Tahoe client node and keeps track of
the work it needs to do in a local sqlite database (a rough skeleton
is sketched after the list below). This client node is the end-user's
agent for interacting with the Tahoe grid, and we expect it to be
running most of the time (especially if it's providing storage service
to others). So a component that lives inside it is in a good position
to:

 * hold user secrets
 * manage bandwidth and other resources, allocating more or less to
   maintenance according to what else the user is doing
 * perform maintenance work on behalf of its user which no other node is
   incentivized to do
 * communicate with other services, to offload work to nodes which do
   have an incentive to help, and which the user is willing to rely upon
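
Here's a rough skeleton of what such an in-node service might look
like, assuming Twisted's Service and LoopingCall plus a local sqlite
file; the class name, table, and interval are all invented:

  import sqlite3
  from twisted.application import service
  from twisted.internet import task

  class MaintenanceService(service.Service):
      # Sketch only: keep the work list in a local sqlite database and
      # wake up periodically to do a slice of checking/repair work.
      name = "maintenance"

      def __init__(self, dbfile="maintenance.sqlite", interval=3600):
          self.db = sqlite3.connect(dbfile)
          self.db.execute(
              "CREATE TABLE IF NOT EXISTS verifycaps"
              " (cap TEXT PRIMARY KEY, last_checked INTEGER,"
              "  healthy INTEGER)")
          self.loop = task.LoopingCall(self.do_some_work)
          self.interval = interval

      def startService(self):
          service.Service.startService(self)
          self.loop.start(self.interval, now=False)

      def stopService(self):
          if self.loop.running:
              self.loop.stop()
          return service.Service.stopService(self)

      def do_some_work(self):
          # Placeholder: refresh the verifycap list from the user's
          # rootcaps, check the stalest entries, and queue anything
          # unhealthy for repair (see the next sketch).
          pass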

A likely behavior for this maintenance service or "manager" would be to
periodically traverse the user's rootcaps and maintain an up-to-date
list of verifycaps, in a database table. Then it would spend some amount
of bandwidth running a Checker on each one (favoring the ones that had
stale results, or which were unhealthy the last time). Anything that
indicated problems would be passed to a Repairer queue, just like with
the "big machine" we were imagining. The maintenance service would churn
away in the background, and relieve the user from the inconvenient and
touchy "tahoe deep-check --add-lease --repair" cronjob, and do a better
job of it too.
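
Under the same invented schema, one maintenance cycle might look like
this; run_checker() and queue_repair() are stand-ins for whatever
checker/repairer primitives the node actually exposes:

  def one_maintenance_cycle(db, run_checker, queue_repair, batch=10):
      # Check a small batch per cycle, favoring entries whose results
      # are unknown, unhealthy, or stale; anything still unhealthy goes
      # onto the repair queue. run_checker(cap) returns True or False.
      rows = db.execute(
          "SELECT cap FROM verifycaps"
          " ORDER BY healthy ASC, last_checked ASC LIMIT ?",
          (batch,)).fetchall()
      for (cap,) in rows:
          healthy = run_checker(cap)
          db.execute(
              "UPDATE verifycaps"
              " SET last_checked=strftime('%s','now'), healthy=?"
              " WHERE cap=?", (int(healthy), cap))
          if not healthy:
              queue_repair(cap)
      db.commit()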

If there were other nodes that would help with this work, the manager
could be configured to talk to them and send work their way. If factored
correctly, this service could be split into pieces, with some parts run
on one node and others run remotely. The end-user machine could create
the manifest of verifycaps and send them to a remote node for storage in
a database. That remote node could manage the database but farm out
checker/repair work to other worker nodes. More worker nodes could be
added as load demanded. And so on.

Eventually, that would turn into the "big machine" we first envisioned
for the allmydata grid. It would be built as a farm of worker nodes, but
end-user machines could just be configured with a couple of FURLs in
their tahoe.cfgs to tell them where the manifests should be sent. Then
AllMyData could give their customers a choice of operation modes:

 * keep your own rootcaps, perform all maintenance yourself (your
   machine must be online much of the time)
 * keep your own rootcaps, send us a manifest every week, we'll maintain
   whatever you send us. You need to get online sometimes.
 * give us your rootcaps (or traversalcaps once we invent them), we'll
   maintain everything. Turn off your computer for a year, that's fine:
   as long as you keep your account paid up, everything will be kept
   maintained.

And of course friendnets and other grids could use these same tools to
implement whatever sort of arrangements they wanted.


So, while we haven't built a lot of this yet, and the allmydata use case
is not paying my salary anymore, I still think it's the right direction
to go. I built "tahoe deep-check" before building a maintenance service
because I knew that'd be the minimum necessary tool, but I think the
full Service must be built before Tahoe will feel like a comfortable and
easy-to-use long-term data storage system.

Anyways, the reason I wanted to do a braindump of this bit of historical
direction-setting was that we have a lot of tickets that reflect these
various wandering goals. We started with a simple "milestones.txt" file
in the source tree, with categories like "Repairer" and "Uploader". The
lines of that file were turned into tickets long ago, and many of those
tickets refer to things like "build a Repairer service" and "managers"
and stuff. Later tickets were used to track individual tasks like adding
FileNode.check_and_repair(). Occasionally new ones were added to remind
us that we still wanted those "big machine" services someday.

So, for example, #543 "rebalancing manager" is about a centralized
service that concentrates on moving shares around into their "correct"
places. Individual clients could do the same job when they perform a
repair, but if we assume that these clients have minimal bandwidth to
the servers, they'd probably prefer to hold off until the situation got
really bad. A central rebalancing manager would probably be given more
authority than clients: it would be allowed to tell servers to delete
shares (once they'd been copied elsewhere), which clients should
probably not be able to do (unless/until we create destroycaps or
something).

You could imagine this kind of functionality living in several different
places: as a subroutine of the "tahoe check --repair" call, as an
in-client service which walks a list of verifycaps, or as a central
service with additional authority and global knowledge of how full each
server was and sysadmin buttons like "prepare to decommission server
XYZ". #543 is about the central service aspect of this, while #699
(rebalance during repair or upload) and #232 (rebalance shares on mutable
publish) are about the small-scale check-time or upload-time form. And
#864 is a storage-server-centric form (not centralized, and not on the
client, but where each server participates in the process), and #481 was
the manual-tool run-on-central-machine form, something that an allmydata
sysadmin would run (with storage-server authority) on the servers as
necessary.

Similarly, #643 "automatically schedule repair process", despite its
initial confusion about the role of the Introducer, is about this sort
of automatic in-Node repair service. Whereas #483 "repairer service" was
the earlier expression of the centralized process, in which we treated
repair the other way around (instead of FileNode.repair(), it
talks about Repairer.repair(verifycap)). Etc. There are several other
instances of this pattern in Trac.

So, when we're gardening these tickets and consolidating and redefining
them, keep in mind this bit of tangled history and our long-term goals.
We'll need primitives to provide much of this functionality on a
small-scale basis, like maybe FileNode.rebalance or
FileNode.repair(rebalance=True) or something. But a lot of it needs to
be done in a yet-to-be-written in-node persistent service, and we should
eventually be able to build such services into centralized standalone
processes.

If we can find better terminology for the different ways of deploying
these functions, I'm all for using it. "service", "manager", "process",
"component", "node": all of them are pretty vague. If we'd had better
words, then much of what I've written above would be more obvious and I
could have written less :).

cheers,
 -Brian

