[tahoe-lafs-trac-stream] [Tahoe-LAFS] #3793: allocate_buckets() should always list all shares in its response
Tahoe-LAFS
trac at tahoe-lafs.org
Fri Sep 17 14:21:07 UTC 2021
#3793: allocate_buckets() should always list all shares in its response
--------------------------+-----------------------------------
Reporter: itamarst | Owner: itamarst
Type: defect | Status: new
Priority: normal | Milestone: HTTP Storage Protocol
Component: unknown | Version: n/a
Resolution: | Keywords:
Launchpad Bug: |
--------------------------+-----------------------------------
Comment (by exarkun):
`allocate_buckets` is the network storage API for preparing to upload new
immutable shares.
`storage_index` and `sharenums` serve to address the user data and
`allocated_size` tells the storage server how much user data the client
wants to upload. Note that the given share numbers are only those the
client intends to upload to this storage server at this time. There may
be other shares being uploaded to other servers and a client may choose to
upload more shares for the same storage index to this server at a later
time.
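
To make the shape of that exchange concrete, here is a rough client-side
sketch. The helper name and result names are illustrative, and the lease
secrets and canary that the real Foolscap method also takes are omitted:

{{{#!python
from twisted.internet import defer

@defer.inlineCallbacks
def upload_shares(storage_server, storage_index, share_data, share_size):
    # share_data maps share number -> the bytes this client wants to upload
    # to this server right now; other shares may go to other servers, and
    # more shares may be uploaded here later.
    alreadygot, bucketwriters = yield storage_server.allocate_buckets(
        storage_index,      # addresses the user data
        set(share_data),    # sharenums: the shares we intend to upload here, now
        share_size,         # allocated_size: how much data each share will hold
    )
    # "allocated" (alreadygot): share numbers the server already holds completely.
    # "writeable buckets" (bucketwriters): share number -> writer we may use now.
    for sharenum, writer in bucketwriters.items():
        yield writer.write(0, share_data[sharenum])  # whole share at offset 0
        yield writer.close()
    defer.returnValue(alreadygot)
}}}
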
Assuming every client follows Tahoe's CHK protocol, there is at most one
complete set of shares per storage index. This means that there is only
ever one possible length for a particular storage index. It also means
that there is only ever one possible sequence of bytes for any (storage
index, share number) tuple.
What's the point? If the missing shares are going to be represented
somewhere - where? There are three options:
* In *allocated*
* In *writeable buckets*
* Somewhere else
The points about what data can possibly be uploaded seem relevant in
reasoning about the consequences of resolving this issue by putting the
shares into *writeable buckets* (thus allowing more than one caller to
write data for a given (storage index, share number) tuple). It *seems*
somewhat encouraging for this case. If both callers decide to use their
bucket writer, they *should* both write exactly the same byte sequence.
This means first-wins, last-wins, or reject-duplicate should all result in
the same shares on the server in the end.
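
A toy model of that claim (not Tahoe code) makes it easy to see:

{{{#!python
# CHK makes the bytes for a given (storage index, share number) pair
# deterministic, so concurrent writers carry identical payloads and every
# conflict policy ends with the same bytes stored.
def store(existing, incoming, policy):
    if existing is None:
        return incoming
    if policy == "first-wins":
        return existing
    if policy == "last-wins":
        return incoming
    if policy == "reject-duplicate":
        raise ValueError("share is already being written")
    raise ValueError("unknown policy: %s" % (policy,))

share_bytes = b"deterministic CHK share data"
for policy in ("first-wins", "last-wins"):
    assert store(share_bytes, share_bytes, policy) == share_bytes
# reject-duplicate also leaves share_bytes on the server; it just refuses
# the redundant second writer instead of accepting it.
}}}
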
First-wins and last-wins both let a client upload redundant data. Reject-
duplicate *could* avoid allowing redundant uploads if it rejected the
upload soon enough. It seems like the current behavior is an attempt at
that: if a bucket writer exists already, deny the new caller the chance to
get a redundant bucket writer. It does this at the cost of making part of
the invariant subtle (subtler than I first appreciated): the union of
allocated and writeable buckets equals the requested shares minus the
*in-progress* uploads. That is, the shares missing from the response are
exactly the requested shares whose uploads are already underway.
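
Written out as a check over toy sets (again, not real server code), that
looks like:

{{{#!python
# The shares missing from the response are exactly the requested shares
# whose uploads are already in progress.
requested   = {0, 1, 2, 3}
in_progress = {1}            # another writer already exists for share 1
allocated   = {0}            # the server already holds all of share 0
writeable   = {2, 3}         # new bucket writers handed out by this call

assert allocated | writeable == requested - in_progress
}}}
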
With that understanding of the invariant, I don't think the current
storage server interface or behavior is wrong - though the implicit third
group is maybe not ideal.
However, the real motivating question here is how this interface can be
mapped to Great Black Swamp. The Foolscap protocol makes this work by
maintaining the in-progress uploads as connection state with the client
performing the upload. If the connection is destroyed and the upload is
incomplete, the data is wiped and the share is moved from the in-progress set
to the "writeable buckets" set (and a subsequent call can try the upload
again).
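
A rough sketch of how I understand that cleanup fits together on the
Foolscap side, with made-up class and method names:

{{{#!python
# Illustrative only: each in-progress upload is tied to the uploading
# client's connection via the Foolscap "canary", and a disconnect
# notification abandons the partial share so a later allocate_buckets can
# hand it out again.
class InProgressUpload(object):
    def __init__(self, canary, bucketwriter, abandon):
        # Foolscap calls `abandon(bucketwriter)` if the connection is lost.
        self._canary = canary
        self._marker = canary.notifyOnDisconnect(abandon, bucketwriter)

    def finished(self):
        # The upload completed normally; stop watching the connection.
        self._canary.dontNotifyOnDisconnect(self._marker)
}}}
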
Because of its HTTP foundation, Great Black Swamp can only have connection
state that lives as long as a single request (let's just call it "request
state" instead). Great Black Swamp also splits uploads across multiple
requests. Therefore the *connection* gives the server no way to recognize
an abandoned upload attempt. Say a client allocates (storage index x,
share number y) and then writes the first half of the share with one
request. If it never makes another request to upload the rest of the
data, how does the server know (x, y) is abandoned?
Or put more practically, how do a client and server collaborate to allow
interrupted uploads to be resumed? Maybe it is worth noting that
"interrupted" here probably involves more than just a network
interruption. Both Foolscap and GBS should deal with network
interruptions fine. Instead, this interruption is more like the storage
server being restarted, the client being restarted, or maybe a mostly
unrelated client trying to upload the same data as was previously
partially uploaded.
Here are some possible answers:
1. Require detection that the share is abandoned before allowing a new
`allocate_buckets` call to retry the upload. This is like the current
implicit third group of partially uploaded shares that appear nowhere in
the response of `allocate_buckets`.
* Assume an upload is abandoned after it is untouched for a set period
of time (a sketch of this rule appears after this list).
* Assume an upload is abandoned after the storage server restarts (this
is what is currently implemented).
* Assume an upload is abandoned when disk space begins to be constrained
(preferring to clean up older partial shares first).
* Require an explicit API call for abandoning an upload (this is also
currently implemented).
2. Give out a bucket writer for a partially written share. This is like
putting partially uploaded shares into *writeable buckets* in the
`allocate_buckets` response.
* First write wins
* Last write wins
* Reject overlapping writes (current GBS behavior)
3. Change the protocol so the response can carry an explicit third set.
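
For what it's worth, the "untouched for a set period of time" rule from
option 1 is easy to sketch; the names and the 30 minute threshold below are
assumptions, not anything the current server implements:

{{{#!python
import time

ABANDON_AFTER = 30 * 60  # seconds of inactivity before a partial share is reclaimed

class PartialShare(object):
    def __init__(self):
        self.last_write = time.monotonic()

    def touch(self):
        # Call on every write to the partially uploaded share.
        self.last_write = time.monotonic()

    def abandoned(self, now=None):
        now = time.monotonic() if now is None else now
        return (now - self.last_write) > ABANDON_AFTER

def reclaimable(partial_shares):
    # partial_shares: (storage index, share number) -> PartialShare
    return [key for key, share in partial_shares.items() if share.abandoned()]
}}}
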
I'm just gonna stop there; possibly I should have stopped a while ago. I
feel like this problem would benefit a lot from a more rigorous form of
analysis than I've attempted here. Maybe some of the ideas here at least
provide some useful context for such an analysis?
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/3793#comment:1>
Tahoe-LAFS <https://Tahoe-LAFS.org>
secure decentralized storage