[tahoe-lafs-trac-stream] [tahoe-lafs] #1851: new immutable file upload protocol: streaming, fewer round-trips, quota-respecting
tahoe-lafs
trac at tahoe-lafs.org
Sat Nov 16 04:29:19 UTC 2013
#1851: new immutable file upload protocol: streaming, fewer round-trips, quota-
respecting
-------------------------+-------------------------------------------------
Reporter: zooko | Owner:
Type: | Status: new
enhancement | Milestone: undecided
Priority: normal | Version: 1.9.2
Component: code- | Keywords: upload immutable accounting
storage | performance bandwidth latency forward-
Resolution: | compatibility backward-compatibility
Launchpad Bug: |
-------------------------+-------------------------------------------------
Old description:
> Here is a letter Brian wrote in 2008 about an improved upload protocol:
>
> https://tahoe-lafs.org/pipermail/tahoe-dev/2008-May/000630.html
>
> The letter describes several improvements. The first couple of
> improvements are about disk-full conditions, quotas, and read-only mode,
> and we've implemented most or all of that. The second part of the letter
> describes a new upload protocol that would be more efficient. Let's
> implement that! Then you can close this ticket.
New description:
Here is a letter Brian wrote in 2008 about an improved upload protocol:
https://tahoe-lafs.org/pipermail/tahoe-dev/2008-May/000630.html
The letter describes several improvements. The first couple of
improvements are about disk-full conditions, quotas, and read-only mode,
and we've implemented most or all of that. The second part of the letter
describes a new upload protocol that would be more efficient. Let's
implement that! Then you can close this ticket.
--
Comment (by zooko):
Here's the part of Brian's 2008-May letter that I mean for this ticket
(the rest of his letter is already implemented):
"""
Then we plan to modify the immutable-share storage server protocol (which
currently consists of allocate_buckets() and get_buckets()) to get rid of
the
RIBucketWriter objects and instead use a single method as follows:
{{{
def upload_immutable_share(upload_index, storage_index, sharenum,
writev, close=bool,
expected_size=bool_or_None):
return (accepted=bool, remaining_space=int)
}}}
The "{{{upload_index}}}" is an as-yet-unfinished token that allows a
server to upload a share in pieces (one segment per message) without
holding a foolscap Referenceable the whole time. This should improve
resumed uploads. "{{{writev=}}}" is your usual write vector, a list of
{{{(offset, data)}}} pairs. The "{{{close=}}}" flag indicates whether this
is the last segment or not, serving the same purpose as the IPv4 "no more
fragments" bit: when the server sees {{{close=True}}}, it should terminate
the {{{upload_index}}} and make the finished share visible other clients.
If the client doesn't close the {{{upload_index}}} in a timely fashion,
the server can delete the partial share.
{{{expected_size=}}} is advisory, and tells the storage server how large
the client expects this share to become. It is optional: if the client is
streaming a file, it may not know how large the file will be, and cannot
provide an expected size. The server uses this advice to make a guess
about how much free space is left.
If the server accepts the write (i.e. it did not run out of space while
writing the share to disk, and it wasn't in a read-only mode), it returns
{{{accepted=True}}}. It also returns an indication of how much free space
it thinks it has left: this will be the 'df' space, minus the reserved
space, minus the sum of all other {{{expected_size=}}} values (TODO: maybe
it should include this one too, obviously we must be clear about which
approach we take).
The client will use the {{{remaining_space=}}} response to decide whether
it should continue sending segments to this server, or if it thinks that
the server is likely to run out of space before it finishes sending the
share (and therefore might want to switch to a different server before
wasting too much work on the full one).
For single-segment files, the client will generate all shares, then send
them speculatively to N candidate servers (i.e. peer selection will just
return the first N servers in the introducer's list of non-readonly
storage servers). Each share will have just one block, and just one upload
call, in which the {{{close=}}} flag is {{{True}}}. These servers will
either accept the share or reject it (because of insufficient space). Any
share which is rejected will
be submitted to the next candidate server on the permuted list. This
approach gets us a single roundtrip for small files when all servers have
free space. When some servers are full, we lose one block of network
bandwidth for each full server, and add at least one roundtrip. If clients
think that servers are likely to be full and want to avoid the wasted
bandwidth, they could spend an extra roundtrip by doing a small write and
checking the {{{accepted=}}} response before committing to sending the
full block.
For multi-segment files, the client will generate the first segment's
blocks, and send it speculatively to N candidate servers, along with its
{{{expected_size=}}} (if available). These blocks will be retained in
memory until a server accepts them. The client has a choice about how much
pipelining it will do: it may encode additional segments and send them to
the same servers, or it might wait until the responses to the first
segment come back. When those responses come back, the client will drop
any servers which reject the first block, or whose {{{remaining_space=}}}
indicates that the share won't fit.
Dropped servers will be replaced by the next candidate in the permuted
list, and the same blocks are sent again. The client will pipeline some
number of blocks (allowing multiple upload messages to be outstanding at
once, each being retired by a successful ack response) that depends upon
how much memory it wants to spend vs how much of the bandwidth-delay
product it wants to utilize.
The client has a "client soft threshold", which is the minimum
{{{remaining_space=}}} value that it is willing to tolerate. This
implements a tradeoff between storage utilization and chance of uploading
the file successfully on the first try. If this margin is too small, the
client might send the whole share to the server only to have the very last
block be rejected due to lack of space. But if the margin is too high, the
client may forego using mostly-full-but-still-useable servers.
The server cannot provide a guarantee of space. But the probability that a
non-initial block will be rejected can be made very small by:
* all clients providing accurate expected_size= information
* the server maintaining accurate df measurements
* clients paying close attention to the remaining_space= responses
If a client loses this gamble (i.e. the server rejects one of their non-
initial blocks), they must either abandon that share (and wind up with a
less-than-100%-health file, in which fewer than N shares were placed), or
they must find a new home for that share and restart the encoder (which
means more round-trips and possibly more memory consumption.. one approach
would be to stall all other shares while we re-encode the earlier segments
for the new server and catch them up, then proceed forwards with the
remaining segments for all servers in parallel).
Since the chance of being rejected is highest for the first block (since
the client does not yet have any information about the server, indeed they
cannot be sure that the server is still online), it makes sense to hold on
to the first segment's blocks until that response has been received. An
optimistic client which was desperate to reduce memory footprint and
improve throughput could conceivably stream the whole file to candiate
servers without waiting for an ack, then look for responses and restart
encoding if there were any failures.
For streaming/resumeability, the storage protocol could also use a way to
abort an upload (to accelerate the share-unfinished-for-too-long timeout)
when the client decides to move to some other server (because there is not
enough space left).
"""
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1851#comment:2>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage
More information about the tahoe-lafs-trac-stream
mailing list