[tahoe-dev] timeouts

Brian Warner warner at lothar.com
Mon Aug 10 12:43:43 PDT 2009


Sam Mason wrote:
> 
> The fixes to those look as though they'd be scattered across the code
> somewhat.

Yeah, this could be a fairly large task, because what really wants to
happen is a rewrite of the Uploader code to behave more like a state
machine than like a big chain of Deferreds. The same thing needs to
happen to the Downloader side (#193/#287).

At the top level, the client has a tradeoff to make, between memory
footprint (and/or temp space on disk and/or having a seekable input file
and being willing to reread it), CPU usage, and overall reliability of
the file. The current code keeps the first two low (accepting some risk
to the third) by doing only one pass through the file. It reads one
segment of plaintext from the input,
encrypts it, encodes it, sends the resulting blocks to all remaining
active servers, waits for all of them to either acknowledge or fail,
then loops to the next segment. (the pauses/timeouts result from a
silently disconnecting server taking a long time to 'fail'). Any request
that fails causes that server to be removed from the active list, and
any shares we were sending to them are abandoned. If we don't make it to
the end of the upload with enough shares in place ("shares of
happiness"), we tell the user that the upload failed, so they can try it
again (which should re-do peer-selection and thus pick up a different
set of servers).
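
As a rough illustration, the control flow looks something like the toy
simulation below. This is not the real Uploader (which is built out of
Twisted Deferreds), and the names here -- FakeServer, the remote_write()
signature with a segment number -- are made up for the sketch:

    # Toy simulation of the current one-pass upload loop: read a segment,
    # encode it into one block per remaining share, send each block to its
    # server, drop any server that fails, and check "shares of happiness"
    # at the end.

    class FakeServer:
        def __init__(self, name, dies_at_segment=None):
            self.name = name
            self.dies_at_segment = dies_at_segment
            self.blocks = []
        def remote_write(self, segnum, block):
            if self.dies_at_segment is not None and segnum >= self.dies_at_segment:
                raise IOError("%s disconnected" % self.name)
            self.blocks.append(block)

    def upload(segments, servers, shares_of_happiness):
        active = list(servers)   # one share per server in this toy model
        for segnum, segment in enumerate(segments):
            still_active = []
            for server in active:
                block = "encoded(%s) for %s" % (segment, server.name)
                try:
                    server.remote_write(segnum, block)
                    still_active.append(server)
                except IOError:
                    pass   # abandon that share; its bandwidth is wasted
            active = still_active
        if len(active) < shares_of_happiness:
            raise RuntimeError("only %d shares placed" % len(active))
        return active

    servers = [FakeServer("s%d" % i) for i in range(10)]
    servers[3].dies_at_segment = 60          # one server silently dies late
    placed = upload(["seg%d" % i for i in range(80)], servers, 7)
    print("placed %d shares" % len(placed))  # 9 shares survive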

 Note: a "share" is made up of many "blocks" concatenated together, one
 block per 128KiB segment of the input file. A 10MiB file uploaded with
 3-of-10 encoding will have 10 shares, each of which will contain 80
 blocks, and each block will be about 128/3=43KiB in size. All the
 blocks of a share live or die as a unit, but during upload, each block
 is sent in a different remote_write() message. A remote_close() message
 is required to finalize the share and make it available for readers.
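
 For example, the arithmetic behind those numbers (ignoring padding and
 hash-tree overhead) works out like this:

     SEGMENT_SIZE = 128 * 1024           # 128 KiB segments
     k, n = 3, 10                        # 3-of-10 encoding
     file_size = 10 * 1024 * 1024        # 10 MiB input file

     num_segments = file_size // SEGMENT_SIZE    # 80 segments
     blocks_per_share = num_segments             # 80 blocks in each share
     block_size = SEGMENT_SIZE // k              # 43690 bytes, about 43 KiB
     share_size = blocks_per_share * block_size  # about 3.3 MiB per share

     print(num_segments, blocks_per_share, block_size, share_size)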

 The other approach, in which each segment's encoded data goes to a
 different set of servers, drastically reduces the chance of recovering
 the entire file; this was known (I think) as the "segmentation problem"
 or the "chunking problem" back when the MojoNation/BitTorrent folks ran
 into it.

The uploader makes one pass and never looks back. If a server dies when
you're 99% of the way through the file, that share is lost completely,
and the bandwidth you've spent on it is wasted. (we've kicked around
some ideas to avoid this, but haven't implemented them yet). To relocate
that share to a different server, you'd have to go back to the first
segment of the file and re-encrypt/re-encode everything. Or, you'd have
to store all of your generated shares on disk (or in RAM) as you went,
just in case you needed to use a different server at some point. (The
pre-Tahoe system known as "Mountain View", which was closely related to
Mnet, wrote everything out to disk first, before uploading it anywhere,
and the transpositional nature of the encoding process thrashed the disk
mightily: shares/blocks fit into a matrix, but it's written row-wise and
then read column-wise).
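
The row-wise/column-wise mismatch is easy to see with a toy matrix
(purely illustrative, not Tahoe code):

    # Encoding emits one row of blocks per segment, but a share is a column
    # of that matrix, so writing segment-at-a-time and later reading
    # share-at-a-time means jumping all over the file on disk.
    num_segments, num_shares = 4, 10

    # written row-wise: for each segment, one block per share
    matrix = [["seg%d/sh%d" % (seg, sh) for sh in range(num_shares)]
              for seg in range(num_segments)]

    # read column-wise: share 3 is block 3 of every segment, concatenated
    share3 = [row[3] for row in matrix]
    print(share3)   # ['seg0/sh3', 'seg1/sh3', 'seg2/sh3', 'seg3/sh3']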

So those were the tradeoffs: we went for simplicity and low
memory/storage footprint, and figured that storage servers would likely
survive long enough to allow most share uploads to succeed. "shares of
happiness" was the failsafe test.


We can probably improve the timeout/stall behavior without drastic
changes to the client's behavior. Timeouts and stalls would be handled
better with a more state-machine based design. Basically, we'd still
make a one-way pass through the segments, but instead of using a
DeferredList to wait for acks/naks/disconnects from all servers before
moving on to the next segment, there'd be a little state machine, run
once per segment. The input events would be "ack/nak/disconnect
received" and "too much time has passed". If we get a solid response
(one way or the other) from everyone, the segment is done.

But if we see a timeout, then we have a choice to make: give up right
away, wait a bit longer, or let the request remain outstanding and
continue on to the next segment (consuming RAM). The state machine could
base its decision upon the estimated memory required for all outstanding
requests (basically the size of the block). Deciding to give up means
lower memory footprint and faster uploads, but sacrificing reliability
if/when servers are merely slow and not actually disconnected. Waiting
longer improves the chances of retaining slow servers and keeps memory
use low, but uploads will stall in the face of silently disconnected
servers. Holding the segment in RAM keeps things moving and tolerates
slow servers but consumes memory.

Deciding what counts as "slow" is tricky: maybe keep track of how long
they took to answer earlier requests, or compare response time to other
servers and look for outliers. The fact that all uploaded data is going
through the same local network pipe may artificially cause the last
segment sent to look slower than the first (because it gets stuck in the
pipeline behind everyone else).
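
A minimal sketch of that per-segment state machine (the class, its
names, and its thresholds are all hypothetical; nothing like this exists
in the Uploader today):

    class SegmentState:
        """Track one segment's outstanding block writes and decide what to
        do when a timeout fires. Thresholds here are invented."""
        def __init__(self, servers, block_size,
                     patience=30.0, ram_budget=10 * 2**20):
            self.pending = set(servers)   # servers we have not heard from
            self.failed = set()
            self.block_size = block_size
            self.patience = patience      # seconds before a server is "slow"
            self.ram_budget = ram_budget  # how much we'll hold for stragglers
            self.carried_bytes = 0        # blocks held for earlier segments

        def on_response(self, server, event):
            # event is "ack", "nak", or "disconnect"
            self.pending.discard(server)
            if event != "ack":
                self.failed.add(server)

        def on_timer(self, elapsed):
            if not self.pending:
                return "segment-done"
            if elapsed < self.patience:
                return "keep-waiting"
            # timed out: carry the blocks forward if we can afford the RAM,
            # otherwise give up on the stragglers and abandon their shares
            needed = self.carried_bytes + len(self.pending) * self.block_size
            if needed <= self.ram_budget:
                return "carry-forward"
            self.failed |= self.pending
            self.pending.clear()
            return "give-up-on-stragglers"

    sm = SegmentState(["s1", "s2", "s3"], block_size=43 * 1024)
    sm.on_response("s1", "ack")
    sm.on_response("s2", "ack")
    print(sm.on_timer(elapsed=45.0))  # "carry-forward": s3 still outstanding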

(note: on the download side, it is easier to switch between servers,
because the only setup work you have to do is to fetch the hash chain so
you can validate the new blocks. So, a state machine will help even
more: instead of giving up completely, the downloader can just lower the
priority of the slow server and pull from another one. The tradeoff will
be between duplicated block downloads and speed. See #287 for details).
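
A similarly hand-wavy sketch of the download side, where a slow server
just gets demoted rather than dropped (the class and its scoring rule
are invented for illustration):

    class ServerChooser:
        """Demote slow servers instead of abandoning them, so they remain
        available as a fallback."""
        def __init__(self, servers):
            self.priority = dict((s, 0.0) for s in servers)  # lower = preferred

        def pick(self):
            return min(self.priority, key=self.priority.get)

        def report(self, server, response_time, slow_threshold=5.0):
            # outliers sink toward the bottom of the list; the cost is that
            # a block they were fetching may end up downloaded twice
            if response_time > slow_threshold:
                self.priority[server] += response_time

    c = ServerChooser(["a", "b", "c"])
    c.report("a", response_time=12.0)   # "a" stalled on its last block
    print(c.pick())                     # now prefers "b" (or "c")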

Another likely improvement would be to trigger an immediate repair when
an upload finishes with fewer than N shares in place. The "UploadResults"
object that comes back from the upload() method contains information
about which shares were placed where, and could be used to make this
sort of decision. It'd be more efficient to re-upload the same source
file, though, because then we wouldn't need a (probably redundant)
download step.
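
Something along these lines, perhaps (the attribute names on the results
object and the extra upload() argument are invented; the real
UploadResults just records which shares landed on which servers):

    def upload_and_repair(source, uploader, total_shares=10, happy=7):
        # First pass: normal upload. Fewer than `happy` shares means the
        # upload failed outright; at least `happy` but fewer than
        # `total_shares` means re-upload the missing shares from the same
        # source file, skipping the redundant download step.
        results = uploader.upload(source)
        placed = len(results.sharemap)            # invented attribute
        if placed < happy:
            raise RuntimeError("upload failed: only %d shares placed" % placed)
        if placed < total_shares:
            uploader.upload(source, avoid_servers=results.servers_used)
        return results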

We'd have to decide what upload() really means though: how the user
should express a difference between "please upload this in the future
when it's convenient for you" and "please upload this right now and
don't tell me you're done until it's really secure". And if they express
the former, should they have a way to ask about the subsequent
autorepair status? Like, should they get two notifications: one when the
initial/weak upload is done, and another when the later repair is
complete? I don't know.

> A better fix would seem to be to send the failing shares off to other
> servers, but if I interpreted the protocol correctly each server knows
> which other servers contain shares and so you'd need some way of
> telling them that things have moved.

As above, doing anything with the failing shares will involve seeking
back to the beginning of the plaintext and re-encoding everything (or
storing everything ahead of time). Also, at the moment, each share is
independent, servers don't talk to each other, and servers do not know
the location of other shares. We've considered adding this information
(#599) as an advisory hint to help downloaders locate other shares
faster, but have not implemented it yet.

cheers,
 -Brian

