[tahoe-dev] behavior of an immutable file repairer
zooko
zooko at zooko.com
Sun Oct 26 19:56:11 PDT 2008
Folks:

I was on an airplane today headed for ACM CCS 2008, and I did some
work on the immutable file repairer.  Here is the current docstring
and constructor signature.  Comments welcome!

Regards,
Zooko
class ImmutableFileRepairer(object):
    """ I have two phases -- check phase and repair phase.  In the
    first phase -- the check phase -- I query servers (in
    permuted-by-storage-index order) until I am satisfied that all M
    uniquely-numbered shares are available (or I run out of servers).
    If the verify flag was passed to my constructor, then for each
    share I download every data block and all metadata from each
    server and perform a cryptographic integrity check on all of it.
    If not, I just ask each server "Which shares do you have?" and
    believe its answer.

    In either case, I wait until I have either gotten satisfactory
    information about all M uniquely-numbered shares, or have run out
    of servers to ask.  (This fact -- that I wait -- means that an
    ill-behaved server which fails to answer my questions will make me
    wait indefinitely.  If it is ill-behaved in a way that triggers the
    underlying foolscap timeouts, then I will wait only as long as
    those foolscap timeouts, but if it is ill-behaved in a way which
    placates the foolscap timeouts but still doesn't answer my question
    then I will wait indefinitely.)

    Then, if I was not satisfied that all M of the shares are
    available from at least one server, and if the repair flag was
    passed to my constructor, I enter the repair phase.  In the repair
    phase, I generate any shares which were not available and upload
    them to servers.

    Which servers?  Well, I take the list of servers, and if I was in
    verify mode during the check phase then I exclude any servers
    which claimed to have a share but then failed to serve it up, or
    served up a corrupted one, when I asked for it.  (If I was not in
    verify mode, then I don't exclude any servers, not even servers
    which, when I subsequently attempt to download the file during
    repair, claim to have a share but then fail to produce it, or
    produce a corrupted share, because when I am not in verify mode
    I am dumb.)  Then I perform the normal server-selection process
    of permuting the order of the servers by the storage index, and
    choosing the next server which doesn't already have more shares
    than the others.

    My process of uploading replacement shares proceeds in a
    segment-wise fashion -- first I ask servers if they can hold the
    new shares, and once enough have agreed then I download the first
    segment of the file and upload the first block of each replacement
    share, and only after all those blocks have been uploaded do I
    download the second segment of the file and upload the second
    block of each replacement share to its respective server.  (I do
    it this way in order to minimize the amount of downloading I have
    to do and the amount of memory I have to use at any one time.)

    If any of the servers to which I am uploading replacement shares
    fails to accept the blocks during this process, then I just stop
    using that server, abandon any share-uploads that were going to
    that server, and proceed to finish uploading the remaining shares
    to their respective servers.  At the end of my work, I produce an
    object which satisfies the ICheckAndRepairResults interface (by
    firing the deferred that I returned from start() and passing that
    check-and-repair-results object).

    Along the way, before I send another request on the network I
    always ask the "monitor" object that was passed into my
    constructor whether this task has been cancelled (by invoking its
    raise_if_cancelled() method).
    """

    def __init__(self, client, verifycap, servers, verify, repair,
                 monitor):
        precondition(isinstance(verifycap, CHKFileVerifierURI))
        precondition(isinstance(servers, set))
        for (serverid, serverref) in servers:
            precondition(isinstance(serverid, str))
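
For concreteness, here is a rough, synchronous sketch of the check
phase, ignoring all of the deferred/foolscap plumbing.  The helper
names (permute_servers, query_shares, verify_share) are just
placeholders for illustration, not real Tahoe or foolscap APIs, and
needed_sharenums would simply be set(range(M)):

def check_phase(servers, storage_index, needed_sharenums, verify, monitor):
    """Return (available, bad_servers): the share numbers I believe exist,
    and the servers which claimed a share but failed verification.
    permute_servers, query_shares, verify_share are hypothetical helpers."""
    available = set()
    bad_servers = set()
    for (serverid, serverref) in permute_servers(servers, storage_index):
        if available >= needed_sharenums:
            break  # satisfied: all M uniquely-numbered shares accounted for
        monitor.raise_if_cancelled()  # check for cancellation before each request
        sharenums = query_shares(serverref, storage_index)  # "which shares do you have?"
        if not verify:
            available |= set(sharenums)  # just believe the server's answer
            continue
        for sharenum in sharenums:
            monitor.raise_if_cancelled()
            # download every data block and all metadata for this share
            # and check its cryptographic integrity
            if verify_share(serverref, storage_index, sharenum):
                available.add(sharenum)
            else:
                bad_servers.add(serverid)
    return available, bad_servers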
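
And here is one way to read the server-selection rule for replacement
shares: permute by storage index, skip any excluded servers, and hand
each missing share to the candidate currently holding the fewest
shares (ties go to whoever comes first in the permuted order).  Again,
permute_servers and the shares_held bookkeeping are just illustrative
names, not the actual server-selection code:

def choose_servers_for_missing_shares(servers, storage_index, missing_sharenums,
                                      shares_held, excluded_serverids):
    """Return a dict mapping sharenum -> (serverid, serverref).
    permute_servers is a hypothetical helper; shares_held maps
    serverid -> number of shares that server already holds."""
    candidates = [(serverid, serverref)
                  for (serverid, serverref) in permute_servers(servers, storage_index)
                  if serverid not in excluded_serverids]
    assignments = {}
    for sharenum in sorted(missing_sharenums):
        if not candidates:
            break  # ran out of usable servers
        # pick the server which doesn't already have more shares than the
        # others; min() keeps the permuted order for ties
        serverid, serverref = min(candidates,
                                  key=lambda s: shares_held.get(s[0], 0))
        assignments[sharenum] = (serverid, serverref)
        shares_held[serverid] = shares_held.get(serverid, 0) + 1
    return assignments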
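
Finally, a sketch of the segment-wise upload of replacement shares,
again synchronous and with made-up helper names (allocate_share,
download_segment, encode_segment, upload_block).  The real thing would
drive this with deferreds and hand the outcome to the
check-and-repair-results object passed to the deferred returned from
start():

def upload_replacement_shares(assignments, num_segments, monitor):
    """assignments maps sharenum -> (serverid, serverref) from the
    server-selection step.  allocate_share, download_segment,
    encode_segment, upload_block are hypothetical helpers."""
    # first, ask each chosen server whether it will hold the new share
    # (the real code would also check that enough servers agreed
    # before proceeding)
    active = {}
    for sharenum, (serverid, serverref) in assignments.items():
        if allocate_share(serverref, sharenum):
            active[sharenum] = (serverid, serverref)

    for segnum in range(num_segments):
        monitor.raise_if_cancelled()
        segment = download_segment(segnum)  # only one segment in memory at a time
        blocks = encode_segment(segment)    # sharenum -> block for this segment
        for sharenum, (serverid, serverref) in list(active.items()):
            try:
                upload_block(serverref, sharenum, segnum, blocks[sharenum])
            except Exception:
                # this server failed to accept the block: stop using it and
                # abandon the share-upload that was going to it
                del active[sharenum]
    return set(active)  # share numbers successfully replaced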