[tahoe-dev] behavior of an immutable file repairer
zooko
zooko at zooko.com
Sun Oct 26 19:56:11 PDT 2008
Folks:

I was on an airplane today headed for ACM CCS 2008, and I did some
work on the immutable file repairer.  Here is the current docstring
and constructor signature.  Comments welcome!

Regards,
Zooko
class ImmutableFileRepairer(object):
    """ I have two phases -- check phase and repair phase.  In the
    first phase -- the check phase -- I query servers (in
    permuted-by-storage-index order) until I am satisfied that all M
    uniquely-numbered shares are available (or I run out of servers).
    If the verify flag was passed to my constructor, then for each
    share I download every data block and all metadata from each
    server and perform a cryptographic integrity check on all of it.
    If not, I just ask each server "Which shares do you have?" and
    believe its answer.

    In either case, I wait until I have either gotten satisfactory
    information about all M uniquely-numbered shares, or have run out
    of servers to ask.  (This fact -- that I wait -- means that an
    ill-behaved server which fails to answer my questions will make me
    wait indefinitely.  If it is ill-behaved in a way that triggers the
    underlying foolscap timeouts, then I will wait only as long as
    those foolscap timeouts, but if it is ill-behaved in a way which
    placates the foolscap timeouts but still doesn't answer my question
    then I will wait indefinitely.)

    Then, if I was not satisfied that all M of the shares are
    available from at least one server, and if the repair flag was
    passed to my constructor, I enter the repair phase.  In the repair
    phase, I generate any shares which were not available and upload
    them to servers.

    Which servers?  Well, I take the list of servers, and if I was in
    verify mode during the check phase then I exclude any servers
    which claimed to have a share but then failed to serve it up, or
    served up a corrupted one, when I asked for it.  (If I was not in
    verify mode, then I don't exclude any servers, not even servers
    which, when I subsequently attempt to download the file during
    repair, claim to have a share but then fail to produce it, or
    produce a corrupted share, because when I am not in verify mode
    I am dumb.)  Then I perform the normal server-selection process
    of permuting the order of the servers by the storage index, and
    choosing the next server which doesn't already have more shares
    than the others.

    My process of uploading replacement shares proceeds in a
    segment-wise fashion -- first I ask servers if they can hold the
    new shares, and once enough have agreed then I download the first
    segment of the file and upload the first block of each replacement
    share, and only after all those blocks have been uploaded do I
    download the second segment of the file and upload the second
    block of each replacement share to its respective server.  (I do
    it this way in order to minimize the amount of downloading I have
    to do and the amount of memory I have to use at any one time.)

    If any of the servers to which I am uploading replacement shares
    fails to accept the blocks during this process, then I just stop
    using that server, abandon any share-uploads that were going to
    that server, and proceed to finish uploading the remaining shares
    to their respective servers.  At the end of my work, I produce an
    object which satisfies the ICheckAndRepairResults interface (by
    firing the deferred that I returned from start() and passing that
    check-and-repair-results object).

    Along the way, before I send another request on the network I
    always ask the "monitor" object that was passed into my
    constructor whether this task has been cancelled (by invoking its
    raise_if_cancelled() method).
    """

    def __init__(self, client, verifycap, servers, verify, repair,
                 monitor):
        precondition(isinstance(verifycap, CHKFileVerifierURI))
        precondition(isinstance(servers, set))
        for (serverid, serverref) in servers:
            precondition(isinstance(serverid, str))
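
For concreteness, here is a rough, synchronous sketch of the check
phase, ignoring all of the deferred/foolscap plumbing.  The helper
names (permute_servers, query_shares, verify_share) are just
placeholders for illustration, not real Tahoe or foolscap APIs, and
needed_sharenums would simply be set(range(M)):

def check_phase(servers, storage_index, needed_sharenums, verify, monitor):
    """Return (available, bad_servers): the share numbers I believe exist,
    and the servers which claimed a share but failed verification.
    permute_servers, query_shares, verify_share are hypothetical helpers."""
    available = set()
    bad_servers = set()
    for (serverid, serverref) in permute_servers(servers, storage_index):
        if available >= needed_sharenums:
            break  # satisfied: all M uniquely-numbered shares accounted for
        monitor.raise_if_cancelled()  # check for cancellation before each request
        sharenums = query_shares(serverref, storage_index)  # "which shares do you have?"
        if not verify:
            available |= set(sharenums)  # just believe the server's answer
            continue
        for sharenum in sharenums:
            monitor.raise_if_cancelled()
            # download every data block and all metadata for this share
            # and check its cryptographic integrity
            if verify_share(serverref, storage_index, sharenum):
                available.add(sharenum)
            else:
                bad_servers.add(serverid)
    return available, bad_servers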
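
And here is one way to read the server-selection rule for replacement
shares: permute by storage index, skip any excluded servers, and hand
each missing share to the candidate currently holding the fewest
shares (ties go to whoever comes first in the permuted order).  Again,
permute_servers and the shares_held bookkeeping are just illustrative
names, not the actual server-selection code:

def choose_servers_for_missing_shares(servers, storage_index, missing_sharenums,
                                      shares_held, excluded_serverids):
    """Return a dict mapping sharenum -> (serverid, serverref).
    permute_servers is a hypothetical helper; shares_held maps
    serverid -> number of shares that server already holds."""
    candidates = [(serverid, serverref)
                  for (serverid, serverref) in permute_servers(servers, storage_index)
                  if serverid not in excluded_serverids]
    assignments = {}
    for sharenum in sorted(missing_sharenums):
        if not candidates:
            break  # ran out of usable servers
        # pick the server which doesn't already have more shares than the
        # others; min() keeps the permuted order for ties
        serverid, serverref = min(candidates,
                                  key=lambda s: shares_held.get(s[0], 0))
        assignments[sharenum] = (serverid, serverref)
        shares_held[serverid] = shares_held.get(serverid, 0) + 1
    return assignments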
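
Finally, a sketch of the segment-wise upload of replacement shares,
again synchronous and with made-up helper names (allocate_share,
download_segment, encode_segment, upload_block).  The real thing would
drive this with deferreds and hand the outcome to the
check-and-repair-results object passed to the deferred returned from
start():

def upload_replacement_shares(assignments, num_segments, monitor):
    """assignments maps sharenum -> (serverid, serverref) from the
    server-selection step.  allocate_share, download_segment,
    encode_segment, upload_block are hypothetical helpers."""
    # first, ask each chosen server whether it will hold the new share
    # (the real code would also check that enough servers agreed
    # before proceeding)
    active = {}
    for sharenum, (serverid, serverref) in assignments.items():
        if allocate_share(serverref, sharenum):
            active[sharenum] = (serverid, serverref)

    for segnum in range(num_segments):
        monitor.raise_if_cancelled()
        segment = download_segment(segnum)  # only one segment in memory at a time
        blocks = encode_segment(segment)    # sharenum -> block for this segment
        for sharenum, (serverid, serverref) in list(active.items()):
            try:
                upload_block(serverref, sharenum, segnum, blocks[sharenum])
            except Exception:
                # this server failed to accept the block: stop using it and
                # abandon the share-upload that was going to it
                del active[sharenum]
    return set(active)  # share numbers successfully replaced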