[tahoe-lafs-trac-stream] [tahoe-lafs] #510: use plain HTTP for storage server protocol
tahoe-lafs
trac at tahoe-lafs.org
Thu May 30 00:13:15 UTC 2013
#510: use plain HTTP for storage server protocol
------------------------------+---------------------------------
     Reporter:  warner        |       Owner:  taral
         Type:  enhancement   |      Status:  new
     Priority:  major         |   Milestone:  2.0.0
    Component:  code-storage  |     Version:  1.2.0
   Resolution:                |    Keywords:  standards gsoc http
Launchpad Bug:                |
------------------------------+---------------------------------
New description:
Zooko told me about an idea: use plain HTTP for the storage server
protocol, instead of foolscap. Here are some thoughts:

 * it could make Tahoe easier to standardize: the spec wouldn't have to
   include foolscap too
 * the description of the share format (all the hashes/signatures/etc)
   becomes the most important thing: most other aspects of the system can
   be inferred from this format (with peer selection being a significant
   omission)
 * download is easy: use GET and a URL of /shares/STORAGEINDEX/SHNUM,
   perhaps with an HTTP Range request header if you only want a portion
   of the share (see the sketch after this list)
 * upload for immutable files is easy: PUT /shares/SI/SHNUM, which works
   only once
 * upload for mutable files:
   * implement DSA-based mutable files, in which the storage index is the
     hash of the public key (or maybe even equal to the public key)
   * the storage server is obligated to validate every bit of the share
     against the roothash, validate the roothash signature against the
     pubkey, and validate the pubkey against the storage index
   * the storage server will accept any share that validates up to the SI
     and has a seqnum higher than any existing share
   * if there is no existing share, the server will accept any valid share
   * when using Content-Range: (in some one-message equivalent of writev),
     the server validates the resulting share, which is some combination
     of the existing share and the deltas being written. (this is for
     MDMF, where we're trying to modify just one segment, plus the
     modified hash chains, root hash, and signature)
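
For concreteness, here is a minimal client-side sketch of what the read
path and the immutable write path could look like over plain HTTP. The
base URL, the helper functions, and the use of {{{If-None-Match: *}}} to
get write-once semantics are assumptions for illustration, not anything
specified in this ticket:

{{{
#!python
# Hypothetical client for the proposed HTTP share protocol.
import requests

BASE = "http://storage.example:8080"  # hypothetical storage server

def get_share_range(storage_index, shnum, offset, length):
    """Fetch part of a share with an HTTP Range request."""
    url = "%s/shares/%s/%d" % (BASE, storage_index, shnum)
    headers = {"Range": "bytes=%d-%d" % (offset, offset + length - 1)}
    resp = requests.get(url, headers=headers)
    resp.raise_for_status()  # expect 206 Partial Content (or 200 with full body)
    return resp.content

def put_immutable_share(storage_index, shnum, share_bytes):
    """Upload an immutable share; the PUT should only ever work once."""
    url = "%s/shares/%s/%d" % (BASE, storage_index, shnum)
    # 'If-None-Match: *' asks the server to refuse the PUT if the share
    # already exists, one possible way to get "works only once" semantics.
    resp = requests.put(url, data=share_bytes,
                        headers={"If-None-Match": "*"})
    if resp.status_code == 412:
        raise RuntimeError("share already exists")
    resp.raise_for_status()
}}}
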
Switching to a validate-the-share scheme to control write access is good
and bad:

 * + repairers can create valid, readable, overwritable shares without
   access to the writecap.
 * - storage servers must do a lot of hashing and public key computation
   on every upload (see the validation sketch after this list)
 * - storage servers must know the format of the uploaded share, so
   clients cannot start using new formats without first upgrading all the
   storage servers
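
The "lots of hashing and public key computation" point is the crux of the
write path. A rough sketch of the acceptance check a server would have to
run on every mutable-share write might look like the following; the field
names ({{{pubkey}}}, {{{roothash}}}, {{{seqnum}}}, {{{signature}}}), the
choice of SHA-256, and the two helper callables are assumptions, since
this ticket only says what must be validated, not how the share is laid
out:

{{{
#!python
# Hypothetical server-side check for accepting a mutable-share write.
import hashlib

def accept_mutable_write(storage_index, new_share, existing_share,
                         pubkey_verify, compute_roothash):
    """Return True if the uploaded share may replace the existing one.

    pubkey_verify(pubkey, signature, message) and compute_roothash(share)
    stand in for whatever signature scheme and hash-tree layout the share
    format ends up using.
    """
    # 1. The pubkey embedded in the share must match the storage index
    #    (assuming SI = hash of the public key).
    if hashlib.sha256(new_share.pubkey).digest() != storage_index:
        return False
    # 2. The signature over the root hash must verify against the pubkey.
    if not pubkey_verify(new_share.pubkey, new_share.signature,
                         new_share.roothash):
        return False
    # 3. Every block must hash up, via the share's hash tree, to the
    #    signed root hash.
    if compute_roothash(new_share) != new_share.roothash:
        return False
    # 4. Only accept a seqnum strictly higher than any existing share.
    if (existing_share is not None
            and new_share.seqnum <= existing_share.seqnum):
        return False
    return True
}}}
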
The result would be a share-transfer protocol that would look exactly like
HTTP; however, it could not be safely implemented by a simple HTTP server,
because the PUT requests must be constrained by validating the share. (A
simple HTTP server doesn't really implement PUT anyway.) There is a
benefit to using "plain HTTP", but some of that benefit is lost when it is
really HTTP being used as an RPC mechanism (think of the way S3 uses
HTTP).
It might be useful to have storage servers declare two separate
interfaces: a plain HTTP interface for read, and a separate port or
something for write. The read side could indeed be provided by a dumb HTTP
server like apache; the write side would need something slightly more
complicated. An apache module to provide the necessary share-write
checking would be fairly straightforward, though.
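
To make that split concrete, the write side could be as small as a
stand-alone WSGI service that drops validated shares into the same
directory the dumb read-only server exports. Everything here (paths, the
port, the empty {{{validate()}}} hook, the write-once-via-409 response)
is a placeholder, not a design:

{{{
#!python
# Sketch of a "write side" for immutable shares: apache (or any dumb HTTP
# server) serves GETs from SHARE_ROOT; this app handles the PUTs.
import os
from wsgiref.simple_server import make_server

SHARE_ROOT = "/var/tahoe/shares"   # hypothetical; also the read server's root

def validate(share_bytes):
    """Placeholder for the share-format checks described above."""
    return len(share_bytes) > 0

def write_app(environ, start_response):
    if environ["REQUEST_METHOD"] != "PUT":
        start_response("405 Method Not Allowed", [("Allow", "PUT")])
        return [b""]
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = environ["wsgi.input"].read(length)
    # NOTE: a real implementation must sanitize PATH_INFO against traversal.
    path = os.path.join(SHARE_ROOT, environ["PATH_INFO"].lstrip("/"))
    if os.path.exists(path):
        start_response("409 Conflict", [])   # immutable shares are write-once
        return [b"share already exists"]
    if not validate(body):
        start_response("400 Bad Request", [])
        return [b"share failed validation"]
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(body)
    start_response("201 Created", [])
    return [b""]

if __name__ == "__main__":
    make_server("", 8081, write_app).serve_forever()
}}}
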
Hm, that makes me curious about the potential to write the entire Tahoe
node as an apache module: it could convert requests for /ROOT/uri/FILECAP
etc into share requests and FEC decoding...
--
Comment (by daira):
The cloud backend, which uses HTTP or HTTPS to connect to the cloud
storage service, provides some interesting data on how an HTTP-only
storage protocol might perform. With request pipelining and connection
pooling, it seems to do a pretty good job of maxing out the upstream
bandwidth to the cloud on my home Internet connection, although it would
be interesting to test it with a fatter pipe. (For downloads, performance
appears to be limited by inefficiencies in the downloader rather than in
the cloud backend.)

Currently, the cloud backend splits shares into "chunks" to limit the
amount of data that needs to be held in memory or in a store object (see
[source:docs/specifications/backends/raic.rst]). This is somewhat
redundant with segmentation: ciphertext "segments" are erasure-encoded
into "blocks" (a segment is k = {{{shares.needed}}} times larger than a
block), and stored in a share together with a header and metadata, which
is then chunked. Blocks and chunks are not aligned (for two reasons: the
share header, and the typical block size of 128 KiB / 3, which is not a
factor of the 512 KiB default chunk size). So:

 * a sequential scan over blocks will reference the same chunk for several
   consecutive requests.
 * a single block may span chunks.
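
A quick back-of-the-envelope sketch of that misalignment (the header size
here is invented; the segment size, k, and chunk size come from the
paragraph above):

{{{
#!python
SEGMENT_SIZE = 128 * 1024           # typical ciphertext segment
K = 3                               # shares.needed
BLOCK_SIZE = -(-SEGMENT_SIZE // K)  # ceiling division: 43691 bytes
CHUNK_SIZE = 512 * 1024             # default store-object ("chunk") size
HEADER_SIZE = 68                    # hypothetical share header size

def chunks_for_block(block_num):
    """Which chunks must be fetched to read one block of a share?"""
    start = HEADER_SIZE + block_num * BLOCK_SIZE
    end = start + BLOCK_SIZE - 1
    return list(range(start // CHUNK_SIZE, end // CHUNK_SIZE + 1))

for b in range(14):
    print(b, chunks_for_block(b))
# With these (partly made-up) numbers, blocks 0-10 fall entirely inside
# chunk 0, block 11 straddles chunks 0 and 1, and blocks 12-13 fall in
# chunk 1: a sequential scan re-reads the same chunk many times, and some
# blocks need two chunk fetches.
}}}
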
The cloud backend uses [https://github.com/LeastAuthority/tahoe-
lafs/blob/1c1aff2b58c121dbf74a781525d3f65060deb54d/src/allmydata/storage/backends/cloud/cloud_common.py#L575
caching] to mitigate any resulting inefficiency. However, this is only of
limited help because the storage client lacks information about the
behaviour of the chunk cache, and the storage server lacks information
about the access patterns of the uploader or downloader.

A possible improvement that I'm quite enthusiastic about for an HTTP-based
protocol is to make blocks the same thing as chunks. That is, the segment
size would be k times the chunk size, and the uploader and downloader
would directly store or request chunks, rather than blocks, from the
backend storage.
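
A sketch of the arithmetic, using the same example numbers as above: the
chunk size would drive the segment size, and each block read or write
would map one-to-one onto a store-object request, with no cache needed to
paper over misaligned boundaries.

{{{
#!python
K = 3                          # shares.needed
CHUNK_SIZE = 512 * 1024        # one store object == one block
SEGMENT_SIZE = K * CHUNK_SIZE  # 1536 KiB ciphertext segments
# Reading block n of a share is then exactly one GET of store object n.
}}}
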
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/510#comment:22>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage