[tahoe-lafs-trac-stream] [tahoe-lafs] #510: use plain HTTP for storage server protocol
tahoe-lafs
trac at tahoe-lafs.org
Thu May 30 00:13:15 UTC 2013
#510: use plain HTTP for storage server protocol
------------------------------+---------------------------------
     Reporter:  warner        |       Owner:  taral
         Type:  enhancement   |      Status:  new
     Priority:  major         |   Milestone:  2.0.0
    Component:  code-storage  |     Version:  1.2.0
   Resolution:                |    Keywords:  standards gsoc http
Launchpad Bug:                |
------------------------------+---------------------------------
New description:
Zooko told me about an idea: use plain HTTP for the storage server
protocol, instead of foolscap. Here are some thoughts:

 * it could make Tahoe easier to standardize: the spec wouldn't have to
   include foolscap too
 * the description of the share format (all the hashes/signatures/etc)
   becomes the most important thing: most other aspects of the system can
   be inferred from this format (with peer selection being a significant
   omission)
 * download is easy: use GET and a URL of /shares/STORAGEINDEX/SHNUM,
   perhaps with an HTTP Range request header if you only want a portion
   of the share (see the sketch after this list)
 * upload for immutable files is easy: PUT /shares/SI/SHNUM, which works
   only once
 * upload for mutable files:
   * implement DSA-based mutable files, in which the storage index is the
     hash of the public key (or maybe even equal to the public key)
   * the storage server is obligated to validate every bit of the share
     against the roothash, validate the roothash signature against the
     pubkey, and validate the pubkey against the storage index
   * the storage server will accept any share that validates up to the SI
     and has a seqnum higher than any existing share
   * if there is no existing share, the server will accept any valid share
   * when using Content-Range: (in some one-message equivalent of writev),
     the server validates the resulting share, which is some combination
     of the existing share and the deltas being written. (this is for
     MDMF, where we're trying to modify just one segment, plus the
     modified hash chains, root hash, and signature)
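
For concreteness, here is a minimal client-side sketch of what the read
path and the immutable write path could look like over plain HTTP. The
base URL, the helper functions, and the use of {{{If-None-Match: *}}} to
get write-once semantics are assumptions for illustration, not anything
specified in this ticket:

{{{
#!python
# Hypothetical client for the proposed HTTP share protocol.
import requests

BASE = "http://storage.example:8080"  # hypothetical storage server

def get_share_range(storage_index, shnum, offset, length):
    """Fetch part of a share with an HTTP Range request."""
    url = "%s/shares/%s/%d" % (BASE, storage_index, shnum)
    headers = {"Range": "bytes=%d-%d" % (offset, offset + length - 1)}
    resp = requests.get(url, headers=headers)
    resp.raise_for_status()  # expect 206 Partial Content (or 200 with full body)
    return resp.content

def put_immutable_share(storage_index, shnum, share_bytes):
    """Upload an immutable share; the PUT should only ever work once."""
    url = "%s/shares/%s/%d" % (BASE, storage_index, shnum)
    # 'If-None-Match: *' asks the server to refuse the PUT if the share
    # already exists, one possible way to get "works only once" semantics.
    resp = requests.put(url, data=share_bytes,
                        headers={"If-None-Match": "*"})
    if resp.status_code == 412:
        raise RuntimeError("share already exists")
    resp.raise_for_status()
}}}
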
Switching to a validate-the-share scheme to control write access is good
and bad:

 * + repairers can create valid, readable, overwritable shares without
   access to the writecap.
 * - storage servers must do a lot of hashing and public key computation
   on every upload (see the validation sketch after this list)
 * - storage servers must know the format of the uploaded share, so
   clients cannot start using new formats without first upgrading all the
   storage servers
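
The "lots of hashing and public key computation" point is the crux of the
write path. A rough sketch of the acceptance check a server would have to
run on every mutable-share write might look like the following; the field
names ({{{pubkey}}}, {{{roothash}}}, {{{seqnum}}}, {{{signature}}}), the
choice of SHA-256, and the two helper callables are assumptions, since
this ticket only says what must be validated, not how the share is laid
out:

{{{
#!python
# Hypothetical server-side check for accepting a mutable-share write.
import hashlib

def accept_mutable_write(storage_index, new_share, existing_share,
                         pubkey_verify, compute_roothash):
    """Return True if the uploaded share may replace the existing one.

    pubkey_verify(pubkey, signature, message) and compute_roothash(share)
    stand in for whatever signature scheme and hash-tree layout the share
    format ends up using.
    """
    # 1. The pubkey embedded in the share must match the storage index
    #    (assuming SI = hash of the public key).
    if hashlib.sha256(new_share.pubkey).digest() != storage_index:
        return False
    # 2. The signature over the root hash must verify against the pubkey.
    if not pubkey_verify(new_share.pubkey, new_share.signature,
                         new_share.roothash):
        return False
    # 3. Every block must hash up, via the share's hash tree, to the
    #    signed root hash.
    if compute_roothash(new_share) != new_share.roothash:
        return False
    # 4. Only accept a seqnum strictly higher than any existing share.
    if (existing_share is not None
            and new_share.seqnum <= existing_share.seqnum):
        return False
    return True
}}}
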
The result would be a share-transfer protocol that would look exactly like
HTTP; however, it could not be safely implemented by a simple HTTP server,
because the PUT requests must be constrained by validating the share. (A
simple HTTP server doesn't really implement PUT anyway.) There is a
benefit to using "plain HTTP", but some of that benefit is lost when it is
really HTTP being used as an RPC mechanism (think of the way S3 uses
HTTP).
It might be useful to have storage servers declare two separate
interfaces: a plain HTTP interface for read, and a separate port or
something for write. The read side could indeed be provided by a dumb HTTP
server like apache; the write side would need something slightly more
complicated. An apache module to provide the necessary share-write
checking would be fairly straightforward, though.
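
To make that split concrete, the write side could be as small as a
stand-alone WSGI service that drops validated shares into the same
directory the dumb read-only server exports. Everything here (paths, the
port, the empty {{{validate()}}} hook, the write-once-via-409 response)
is a placeholder, not a design:

{{{
#!python
# Sketch of a "write side" for immutable shares: apache (or any dumb HTTP
# server) serves GETs from SHARE_ROOT; this app handles the PUTs.
import os
from wsgiref.simple_server import make_server

SHARE_ROOT = "/var/tahoe/shares"   # hypothetical; also the read server's root

def validate(share_bytes):
    """Placeholder for the share-format checks described above."""
    return len(share_bytes) > 0

def write_app(environ, start_response):
    if environ["REQUEST_METHOD"] != "PUT":
        start_response("405 Method Not Allowed", [("Allow", "PUT")])
        return [b""]
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = environ["wsgi.input"].read(length)
    # NOTE: a real implementation must sanitize PATH_INFO against traversal.
    path = os.path.join(SHARE_ROOT, environ["PATH_INFO"].lstrip("/"))
    if os.path.exists(path):
        start_response("409 Conflict", [])   # immutable shares are write-once
        return [b"share already exists"]
    if not validate(body):
        start_response("400 Bad Request", [])
        return [b"share failed validation"]
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(body)
    start_response("201 Created", [])
    return [b""]

if __name__ == "__main__":
    make_server("", 8081, write_app).serve_forever()
}}}
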
Hm, that makes me curious about the potential to write the entire Tahoe
node as an apache module: it could convert requests for /ROOT/uri/FILECAP
etc into share requests and FEC decoding...
--
Comment (by daira):
The cloud backend, which uses HTTP or HTTPS to connect to the cloud
storage service, provides some interesting data on how an HTTP-only
storage protocol might perform. With request pipelining and connection
pooling, it seems to do a pretty good job of maxing out the upstream
bandwidth to the cloud on my home Internet connection, although it would
be interesting to test it with a fatter pipe. (For downloads, performance
appears to be limited by inefficiencies in the downloader rather than in
the cloud backend.)

Currently, the cloud backend splits shares into "chunks" to limit the
amount of data that needs to be held in memory or in a store object (see
[source:docs/specifications/backends/raic.rst]). This is somewhat
redundant with segmentation: ciphertext "segments" are erasure-encoded
into "blocks" (a segment is k = {{{shares.needed}}} times larger than a
block), and stored in a share together with a header and metadata, which
is then chunked. Blocks and chunks are not aligned (for two reasons: the
share header, and the typical block size of 128 KiB / 3, which is not a
factor of the 512 KiB default chunk size). So:

 * a sequential scan over blocks will reference the same chunk for several
   consecutive requests.
 * a single block may span chunks.
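
A quick back-of-the-envelope sketch of that misalignment (the header size
here is invented; the segment size, k, and chunk size come from the
paragraph above):

{{{
#!python
SEGMENT_SIZE = 128 * 1024           # typical ciphertext segment
K = 3                               # shares.needed
BLOCK_SIZE = -(-SEGMENT_SIZE // K)  # ceiling division: 43691 bytes
CHUNK_SIZE = 512 * 1024             # default store-object ("chunk") size
HEADER_SIZE = 68                    # hypothetical share header size

def chunks_for_block(block_num):
    """Which chunks must be fetched to read one block of a share?"""
    start = HEADER_SIZE + block_num * BLOCK_SIZE
    end = start + BLOCK_SIZE - 1
    return list(range(start // CHUNK_SIZE, end // CHUNK_SIZE + 1))

for b in range(14):
    print(b, chunks_for_block(b))
# With these (partly made-up) numbers, blocks 0-10 fall entirely inside
# chunk 0, block 11 straddles chunks 0 and 1, and blocks 12-13 fall in
# chunk 1: a sequential scan re-reads the same chunk many times, and some
# blocks need two chunk fetches.
}}}
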
The cloud backend uses [https://github.com/LeastAuthority/tahoe-
lafs/blob/1c1aff2b58c121dbf74a781525d3f65060deb54d/src/allmydata/storage/backends/cloud/cloud_common.py#L575
caching] to mitigate any resulting inefficiency. However, this is only of
limited help because the storage client lacks information about the
behaviour of the chunk cache, and the storage server lacks information
about the access patterns of the uploader or downloader.

A possible improvement that I'm quite enthusiastic about for an HTTP-based
protocol is to make blocks the same thing as chunks. That is, the segment
size would be k times the chunk size, and the uploader and downloader
would directly store or request chunks, rather than blocks, from the
backend storage.
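
A sketch of the arithmetic, using the same example numbers as above: the
chunk size would drive the segment size, and each block read or write
would map one-to-one onto a store-object request, with no cache needed to
paper over misaligned boundaries.

{{{
#!python
K = 3                          # shares.needed
CHUNK_SIZE = 512 * 1024        # one store object == one block
SEGMENT_SIZE = K * CHUNK_SIZE  # 1536 KiB ciphertext segments
# Reading block n of a share is then exactly one GET of store object n.
}}}
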
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/510#comment:22>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage