[tahoe-lafs-trac-stream] [tahoe-lafs] #1851: new immutable file upload protocol: streaming, fewer round-trips, quota-respecting

Sat Nov 16 04:29:19 UTC 2013

#1851: new immutable file upload protocol: streaming, fewer round-trips, quota-respecting
---------------------------------------------------------------------------
     Reporter:  zooko
        Owner:
         Type:  enhancement
       Status:  new
     Priority:  normal
    Milestone:  undecided
    Component:  code-storage
      Version:  1.9.2
   Resolution:
     Keywords:  upload immutable accounting performance bandwidth latency
                forward-compatibility backward-compatibility
Launchpad Bug:
---------------------------------------------------------------------------

New description:

 Here is a letter Brian wrote in 2008 about an improved upload protocol:

 https://tahoe-lafs.org/pipermail/tahoe-dev/2008-May/000630.html

 The letter describes several improvements. The first couple of
 improvements are about disk-full conditions, quotas, and read-only mode,
 and we've implemented most or all of that. The second part of the letter
 describes a new upload protocol that would be more efficient. Let's
 implement that! Then you can close this ticket.

--

Comment (by zooko):

 Here's the part of Brian's 2008-May letter that I mean for this ticket
 (the rest of his letter is already implemented):

 """
 Then we plan to modify the immutable-share storage server protocol (which
 currently consists of allocate_buckets() and get_buckets()) to get rid of
 the RIBucketWriter objects and instead use a single method as follows:

 {{{
  def upload_immutable_share(upload_index, storage_index, sharenum,
                             writev, close=bool,
                             expected_size=int_or_None):
      return (accepted=bool, remaining_space=int)
 }}}

 The "{{{upload_index}}}" is an as-yet-unfinished token that allows a
 client to upload a share in pieces (one segment per message) without
 holding a foolscap Referenceable the whole time. This should improve
 resumed uploads. "{{{writev=}}}" is your usual write vector, a list of
 {{{(offset, data)}}} pairs. The "{{{close=}}}" flag indicates whether this
 is the last segment or not, serving the same purpose as the IPv4 "no more
 fragments" bit: when the server sees {{{close=True}}}, it should terminate
 the {{{upload_index}}} and make the finished share visible to other clients.
 If the client doesn't close the {{{upload_index}}} in a timely fashion,
 the server can delete the partial share.
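
 As a rough illustration only (this sketch is not part of the letter), the
 server side of such a method might look like the following. The names
 ToyStorageServer, PartialShare and capacity are hypothetical, the store is
 held in memory, and free space is measured against a fixed capacity
 rather than 'df'. expected_size is recorded but not used in this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

WriteVector = List[Tuple[int, bytes]]  # list of (offset, data) pairs


@dataclass
class PartialShare:
    data: bytearray = field(default_factory=bytearray)
    expected_size: Optional[int] = None


class ToyStorageServer:
    """Hypothetical in-memory stand-in for a storage server."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.partial: Dict[tuple, PartialShare] = {}  # unfinished uploads
        self.finished: Dict[tuple, bytes] = {}        # visible to readers

    def _used(self) -> int:
        return (sum(len(s.data) for s in self.partial.values())
                + sum(len(d) for d in self.finished.values()))

    def upload_immutable_share(self, upload_index, storage_index, sharenum,
                               writev: WriteVector, close: bool = False,
                               expected_size: Optional[int] = None):
        key = (upload_index, storage_index, sharenum)
        share = self.partial.get(key)
        if share is None:
            share = PartialShare()
        if expected_size is not None:
            share.expected_size = expected_size
        # Length the share would have after applying the whole write vector.
        new_len = max([len(share.data)] + [off + len(d) for off, d in writev])
        growth = new_len - len(share.data)
        if self._used() + growth > self.capacity:
            # Reject without storing anything from this message.
            return (False, max(self.capacity - self._used(), 0))
        share.data.extend(b"\x00" * growth)
        for off, d in writev:
            share.data[off:off + len(d)] = d
        self.partial[key] = share
        if close:
            # Last segment: retire the upload token and publish the share.
            done = self.partial.pop(key)
            self.finished[(storage_index, sharenum)] = bytes(done.data)
        return (True, self.capacity - self._used())
```

 A client would call this once per segment and set close=True on the last
 call; a rejected call leaves no partial state behind in this sketch.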

 {{{expected_size=}}} is advisory, and tells the storage server how large
 the client expects this share to become. It is optional: if the client is
 streaming a file, it may not know how large the file will be, and cannot
 provide an expected size. The server uses this advice to make a guess
 about how much free space is left.

 If the server accepts the write (i.e. it did not run out of space while
 writing the share to disk, and it wasn't in a read-only mode), it returns
 {{{accepted=True}}}. It also returns an indication of how much free space
 it thinks it has left: this will be the 'df' space, minus the reserved
 space, minus the sum of all other {{{expected_size=}}} values (TODO: maybe
 it should include this one too; either way, we must be clear about which
 approach we take).
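
 The arithmetic described above could be sketched as follows (an
 illustration, not part of the letter). The function and parameter names
 are hypothetical, and clamping the result at zero is an assumption; the
 letter does not say what the server reports when the pledged sizes exceed
 the available space.

```python
from typing import Iterable, Optional


def remaining_space(df_free: int, reserved: int,
                    other_expected_sizes: Iterable[Optional[int]]) -> int:
    """'df' space, minus reserved space, minus other uploads' expected sizes."""
    # Uploads that gave no expected_size (None) contribute nothing here.
    pledged = sum(size for size in other_expected_sizes if size is not None)
    # Assumption: never report a negative amount of free space.
    return max(df_free - reserved - pledged, 0)
```

 For example, with 1000 bytes free, 100 reserved, and other uploads that
 expect 200 and 50 bytes, this reports 650.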

 The client will use the {{{remaining_space=}}} response to decide whether
 it should continue sending segments to this server, or if it thinks that
 the server is likely to run out of space before it finishes sending the
 share (and therefore might want to switch to a different server before
 wasting too much work on the full one).

 For single-segment files, the client will generate all shares, then send
 them speculatively to N candidate servers (i.e. peer selection will just
 return the first N servers in the introducer's list of non-readonly
 storage servers). Each share will have just one block, and just one upload
 call, in which the {{{close=}}} flag is {{{True}}}. These servers will
 either accept the share or reject it (because of insufficient space). Any
 share which is rejected will be submitted to the next candidate server on
 the permuted list. This
 approach gets us a single roundtrip for small files when all servers have
 free space. When some servers are full, we lose one block of network
 bandwidth for each full server, and add at least one roundtrip. If clients
 think that servers are likely to be full and want to avoid the wasted
 bandwidth, they could spend an extra roundtrip by doing a small write and
 checking the {{{accepted=}}} response before committing to sending the
 full block.
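
 A minimal sketch of this placement loop, assuming a hypothetical send()
 callable that performs the single upload call (with close=True) and
 returns the accepted flag. This is an illustration, not part of the
 letter; a real client would issue each round's sends in parallel rather
 than in a loop.

```python
from typing import Callable, Dict, Sequence


def place_single_segment_shares(
    shares: Dict[int, bytes],
    permuted_servers: Sequence[str],
    send: Callable[[str, int, bytes], bool],
) -> Dict[int, str]:
    """Return {sharenum: server} for every share that found a home."""
    candidates = list(permuted_servers)
    placements: Dict[int, str] = {}
    pending = sorted(shares)
    while pending and candidates:
        # Take one fresh candidate per share that still needs a home.
        round_servers = candidates[:len(pending)]
        candidates = candidates[len(pending):]
        still_pending = []
        for sharenum, server in zip(pending, round_servers):
            if send(server, sharenum, shares[sharenum]):
                placements[sharenum] = server
            else:
                still_pending.append(sharenum)  # retry on the next candidate
        # Shares that got no server this round (candidate list ran short).
        still_pending.extend(pending[len(round_servers):])
        pending = still_pending
    return placements
```

 When every server accepts, the loop runs once, which corresponds to the
 single-roundtrip case. Each full server costs one wasted block and at
 least one more pass through the loop.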

 For multi-segment files, the client will generate the first segment's
 blocks, and send it speculatively to N candidate servers, along with its
 {{{expected_size=}}} (if available). These blocks will be retained in
 memory until a server accepts them. The client has a choice about how much
 pipelining it will do: it may encode additional segments and send them to
 the same servers, or it might wait until the responses to the first
 segment come back. When those responses come back, the client will drop
 any servers which reject the first block, or whose {{{remaining_space=}}}
 indicates that the share won't fit.

 Dropped servers will be replaced by the next candidate in the permuted
 list, and the same blocks are sent again. The client will pipeline some
 number of blocks (allowing multiple upload messages to be outstanding at
 once, each being retired by a successful ack response) that depends upon
 how much memory it wants to spend vs how much of the bandwidth-delay
 product it wants to utilize.
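
 One way to express this tradeoff is sketched below (an illustration, not
 part of the letter). The function name and its parameters are
 hypothetical, and it assumes that each outstanding upload message pins
 block_size bytes of client memory until it is acked.

```python
import math


def pipeline_depth(block_size: int, memory_budget: int,
                   bandwidth_bytes_per_s: float, rtt_s: float) -> int:
    """How many upload messages to keep outstanding at once."""
    # Bytes that must be in flight to keep the link busy for one round trip.
    bdp = bandwidth_bytes_per_s * rtt_s
    fill_pipe = math.ceil(bdp / block_size)
    # Blocks we can afford to hold in memory while waiting for acks.
    affordable = memory_budget // block_size
    # Always allow at least one outstanding message so the upload progresses.
    return max(1, min(fill_pipe, affordable))
```

 With 100000-byte blocks on a 1 MB/s link with a 0.5 s round trip, filling
 the pipe takes 5 outstanding blocks; a 250000-byte memory budget would cap
 that at 2.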

 The client has a "client soft threshold", which is the minimum
 {{{remaining_space=}}} value that it is willing to tolerate. This
 implements a tradeoff between storage utilization and chance of uploading
 the file successfully on the first try. If this margin is too small, the
 client might send the whole share to the server only to have the very last
 block be rejected due to lack of space. But if the margin is too high, the
 client may forgo using mostly-full-but-still-usable servers.
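
 The client-side check could be sketched like this (an illustration, not
 part of the letter). The names are hypothetical, and the sketch assumes
 that remaining_space does not already subtract this share's own
 expected_size, which the letter leaves as an open TODO.

```python
from typing import Optional


def keep_server(accepted: bool, remaining_space: int,
                soft_threshold: int,
                bytes_still_to_send: Optional[int]) -> bool:
    """Decide whether to keep sending this share to the same server."""
    if not accepted:
        return False  # the server rejected the block outright
    if remaining_space < soft_threshold:
        return False  # below the margin the client is willing to tolerate
    if bytes_still_to_send is not None and remaining_space < bytes_still_to_send:
        return False  # the rest of the share would not fit
    return True
```

 A streaming client that does not know the final size would pass None for
 bytes_still_to_send and rely on the soft threshold alone.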

 The server cannot provide a guarantee of space. But the probability that a
 non-initial block will be rejected can be made very small by:

  * all clients providing accurate expected_size= information
  * the server maintaining accurate df measurements
  * clients paying close attention to the remaining_space= responses

 If a client loses this gamble (i.e. the server rejects one of their
 non-initial blocks), they must either abandon that share (and wind up with
 a less-than-100%-health file, in which fewer than N shares were placed), or
 they must find a new home for that share and restart the encoder (which
 means more round-trips and possibly more memory consumption). One approach
 would be to stall all other shares while we re-encode the earlier segments
 for the new server and catch them up, then proceed forwards with the
 remaining segments for all servers in parallel.

 Since the chance of being rejected is highest for the first block (since
 the client does not yet have any information about the server, indeed they
 cannot be sure that the server is still online), it makes sense to hold on
 to the first segment's blocks until that response has been received. An
 optimistic client which was desperate to reduce memory footprint and
 improve throughput could conceivably stream the whole file to candidate
 servers without waiting for an ack, then look for responses and restart
 encoding if there were any failures.

 For streaming/resumability, the storage protocol could also use a way to
 abort an upload (to accelerate the share-unfinished-for-too-long timeout)
 when the client decides to move to some other server (because there is not
 enough space left).
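
 The letter does not define an abort call, so the following is only a
 guess at how explicit aborts and the unfinished-for-too-long timeout might
 share one table on the server. PartialUploadTable and its methods are
 hypothetical names.

```python
import time
from typing import Callable, Dict, List


class PartialUploadTable:
    """Tracks unfinished uploads so stale ones can be deleted."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_write: Dict[str, float] = {}  # upload_index -> last activity

    def touch(self, upload_index: str) -> None:
        """Record activity; call on every accepted write for this upload."""
        self.last_write[upload_index] = self.clock()

    def abort(self, upload_index: str) -> bool:
        """Client-requested abort; returns False if the upload was unknown."""
        return self.last_write.pop(upload_index, None) is not None

    def expire(self) -> List[str]:
        """Drop and return uploads idle for longer than timeout_s."""
        now = self.clock()
        stale = [u for u, t in self.last_write.items()
                 if now - t > self.timeout_s]
        for u in stale:
            del self.last_write[u]
        return stale
```

 An abort simply removes the entry immediately instead of waiting for
 expire() to find it; the caller would delete the partial share data in
 both cases.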
 """

-- 
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1851#comment:2>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage