[tahoe-lafs-trac-stream] [Tahoe-LAFS] #3793: allocate_buckets() should always list all shares in its response
Tahoe-LAFS
trac at tahoe-lafs.org
Fri Sep 17 14:21:07 UTC 2021
#3793: allocate_buckets() should always list all shares in its response
--------------------------+-----------------------------------
Reporter: itamarst | Owner: itamarst
Type: defect | Status: new
Priority: normal | Milestone: HTTP Storage Protocol
Component: unknown | Version: n/a
Resolution: | Keywords:
Launchpad Bug: |
--------------------------+-----------------------------------
Comment (by exarkun):
`allocate_buckets` is the network storage API for preparing to upload new
immutable shares.
`storage_index` and `sharenums` serve to address the user data and
`allocated_size` tells the storage server how much user data the client
wants to upload. Note that the given share numbers are only those the
client intends to upload to this storage server at this time. There may
be other shares being uploaded to other servers and a client may choose to
upload more shares for the same storage index to this server at a later
time.
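
To make the shape of that exchange concrete, here is a rough client-side
sketch. The helper name and result names are illustrative, and the lease
secrets and canary that the real Foolscap method also takes are omitted:

{{{#!python
from twisted.internet import defer

@defer.inlineCallbacks
def upload_shares(storage_server, storage_index, share_data, share_size):
    # share_data maps share number -> the bytes this client wants to upload
    # to this server right now; other shares may go to other servers, and
    # more shares may be uploaded here later.
    alreadygot, bucketwriters = yield storage_server.allocate_buckets(
        storage_index,      # addresses the user data
        set(share_data),    # sharenums: the shares we intend to upload here, now
        share_size,         # allocated_size: how much data each share will hold
    )
    # "allocated" (alreadygot): share numbers the server already holds completely.
    # "writeable buckets" (bucketwriters): share number -> writer we may use now.
    for sharenum, writer in bucketwriters.items():
        yield writer.write(0, share_data[sharenum])  # whole share at offset 0
        yield writer.close()
    defer.returnValue(alreadygot)
}}}
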
Assuming every client follows Tahoe's CHK protocol, there is at most one
complete set of shares per storage index. This means that there is only
ever one possible length for a particular storage index. It also means
that there is only ever one possible sequence of bytes for any (storage
index, share number) tuple.
What's the point? If the missing shares are going to be represented
somewhere - where? There are three options:
* In *allocated*
* In *writeable buckets*
* Somewhere else
The points about what data can possibly be uploaded seem relevant in
reasoning about the consequences of resolving this issue by putting the
shares into *writeable buckets* (thus allowing more than one caller to
write data for a given (storage index, share number) tuple). It *seems*
somewhat encouraging for this case. If both callers decide to use their
bucket writer, they *should* both write exactly the same byte sequence.
This means first-wins, last-wins, or reject-duplicate should all result in
the same shares on the server in the end.
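
A toy model of that claim (not Tahoe code) makes it easy to see:

{{{#!python
# CHK makes the bytes for a given (storage index, share number) pair
# deterministic, so concurrent writers carry identical payloads and every
# conflict policy ends with the same bytes stored.
def store(existing, incoming, policy):
    if existing is None:
        return incoming
    if policy == "first-wins":
        return existing
    if policy == "last-wins":
        return incoming
    if policy == "reject-duplicate":
        raise ValueError("share is already being written")
    raise ValueError("unknown policy: %s" % (policy,))

share_bytes = b"deterministic CHK share data"
for policy in ("first-wins", "last-wins"):
    assert store(share_bytes, share_bytes, policy) == share_bytes
# reject-duplicate also leaves share_bytes on the server; it just refuses
# the redundant second writer instead of accepting it.
}}}
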
First-wins and last-wins both let a client upload redundant data. Reject-
duplicate *could* avoid allowing redundant uploads if it rejected the
upload soon enough. It seems like the current behavior is an attempt at
that: if a bucket writer exists already, deny the new caller the chance to
get a redundant bucket writer. It does this at the cost of making part of
the invariant subtle (subtler than I first appreciated): the union of
allocated and writeable buckets equals the requested shares minus the
*in-progress* uploads. That is, the shares missing from the response are
exactly the requested shares whose uploads are already underway.
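
Written out as a check over toy sets (again, not real server code), that
looks like:

{{{#!python
# The shares missing from the response are exactly the requested shares
# whose uploads are already in progress.
requested   = {0, 1, 2, 3}
in_progress = {1}            # another writer already exists for share 1
allocated   = {0}            # the server already holds all of share 0
writeable   = {2, 3}         # new bucket writers handed out by this call

assert allocated | writeable == requested - in_progress
}}}
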
With that understanding of the invariant, I don't think the current
storage server interface or behavior is wrong - though the implicit third
group is maybe not ideal.
However, the real motivating question here is how this interface can be
mapped to Great Black Swamp. The Foolscap protocol makes this work by
maintaining the in-progress uploads as connection state with the client
performing the upload. If the connection is destroyed and the upload is
incomplete, the data is wiped and the share is moved from the in-progress set
to the "writeable buckets" set (and a subsequent call can try the upload
again).
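
A rough sketch of how I understand that cleanup fits together on the
Foolscap side, with made-up class and method names:

{{{#!python
# Illustrative only: each in-progress upload is tied to the uploading
# client's connection via the Foolscap "canary", and a disconnect
# notification abandons the partial share so a later allocate_buckets can
# hand it out again.
class InProgressUpload(object):
    def __init__(self, canary, bucketwriter, abandon):
        # Foolscap calls `abandon(bucketwriter)` if the connection is lost.
        self._canary = canary
        self._marker = canary.notifyOnDisconnect(abandon, bucketwriter)

    def finished(self):
        # The upload completed normally; stop watching the connection.
        self._canary.dontNotifyOnDisconnect(self._marker)
}}}
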
Because of its HTTP foundation, Great Black Swamp can only have connection
state that lives as long as a single request (let's just call it "request
state" instead). Great Black Swamp also splits uploads across multiple
requests. Therefore the *connection* gives the server no way to recognize
an abandoned upload attempt. Say a client allocates (storage index x,
share number y) and then writes the first half of the share with one
request. If it never makes another request to upload the rest of the
data, how does the server know (x, y) is abandoned?
Or put more practically, how do a client and server collaborate to allow
interrupted uploads to be resumed? Maybe it is worth noting that
"interrupted" here probably involves more than just a network
interruption. Both Foolscap and GBS should deal with network
interruptions fine. Instead, this interruption is more like the storage
server being restarted, the client being restarted, or maybe a mostly
unrelated client trying to upload the same data as was previously
partially uploaded.
Here are some possible answers:
1. Require detection that the share is abandoned before allowing a new
`allocate_buckets` call to retry the upload. This is like the current
implicit third group of partially uploaded shares that appear nowhere in
the response of `allocate_buckets`.
* Assume an upload is abandoned after it is untouched for a set period
of time (a sketch of this rule appears after this list).
* Assume an upload is abandoned after the storage server restarts (this
is what is currently implemented).
* Assume an upload is abandoned when disk space begins to be constrained
(preferring to clean up older partial shares first).
* Require an explicit API call for abandoning an upload (this is also
currently implemented).
2. Give out a bucket writer for a partially written share. This is like
putting partially uploaded shares into *writeable buckets* in the
`allocate_buckets` response.
* First write wins
* Last write wins
* Reject overlapping writes (current GBS behavior)
3. Change the protocol so the response can carry an explicit third set.
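
For what it's worth, the "untouched for a set period of time" rule from
option 1 is easy to sketch; the names and the 30 minute threshold below are
assumptions, not anything the current server implements:

{{{#!python
import time

ABANDON_AFTER = 30 * 60  # seconds of inactivity before a partial share is reclaimed

class PartialShare(object):
    def __init__(self):
        self.last_write = time.monotonic()

    def touch(self):
        # Call on every write to the partially uploaded share.
        self.last_write = time.monotonic()

    def abandoned(self, now=None):
        now = time.monotonic() if now is None else now
        return (now - self.last_write) > ABANDON_AFTER

def reclaimable(partial_shares):
    # partial_shares: (storage index, share number) -> PartialShare
    return [key for key, share in partial_shares.items() if share.abandoned()]
}}}
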
I'm just gonna stop there; possibly I should have stopped a while ago. I
feel like this problem would benefit a lot from a more rigorous form of
analysis than I've attempted here. Maybe some of the ideas here at least
provide some useful context for such an analysis?
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/3793#comment:1>
Tahoe-LAFS <https://Tahoe-LAFS.org>
secure decentralized storage