[tahoe-lafs-trac-stream] [Tahoe-LAFS] #2387: transverse block-hash-trees, plus segment-hash-tree

Tahoe-LAFS trac at tahoe-lafs.org
Tue Feb 17 20:53:49 UTC 2015


#2387: transverse block-hash-trees, plus segment-hash-tree
---------------------------+---------------------------
 Reporter:  warner         |          Owner:
     Type:  enhancement    |         Status:  new
 Priority:  normal         |      Milestone:  undecided
Component:  code-encoding  |        Version:  1.10.0
 Keywords:                 |  Launchpad Bug:
---------------------------+---------------------------
 While discussing #2386, Daira identified an alternate encoding scheme
 which might resolve the efficient-mutable-update problem.

 The current encoding scheme starts with block-hash-trees (one per share,
 with one leaf per block, each block being an erasure-coded fragment of a
 different ciphertext segment) that culminate in the block-hash-tree-root
 (one per share). On top of that, a "transverse" share-hash-tree is built,
 which contains one leaf per share (leaf==block-hash-tree-root),
 culminating in the share-hash-tree-root. The share-hash-tree-root goes
 into the UEB (immutable) and the signed RSA block (mutable).
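 The construction above can be sketched roughly as follows (a minimal sketch with hypothetical helper names; the hash is simplified to plain SHA-256, where the real code uses tagged SHA-256d):

```python
import hashlib

def h(data):
    # Simplified hash; the real code uses tagged SHA-256d.
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    # Pairwise-hash a list of leaf hashes up to a single root.
    layer = list(leaves)
    while len(layer) > 1:
        if len(layer) % 2:
            layer.append(layer[-1])  # pad odd layers (a simplification)
        layer = [h(layer[i] + layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]

def share_hash_tree_root(shares):
    # shares[y][x] is the block for share y, segment x.
    # One block-hash-tree per share; its leaves are that share's blocks.
    bht_roots = [merkle_root([h(block) for block in blocks])
                 for blocks in shares]
    # Transverse share-hash-tree: one leaf per share (== its bht root).
    return merkle_root(bht_roots)
```

 Note how changing any single block changes that share's block-hash-tree-root, and therefore the leaf set of the transverse tree.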

 The drawback to this scheme is that modifying a single block causes new
 block-hash-tree-roots to be created for all shares, so we need block-hash-
 tree data from all shares to modify any of them. As #2386 describes, this
 means we can't efficiently modify unhealthy files.

 An alternate encoding would rotate the trees. In this approach, the lower
 hash tree would be the transverse one. It is still a block-hash-tree, but instead
 of one leaf per segment (and one block-hash-tree per share), it would
 contain one leaf per share (and we'd have one block-hash-tree per
 segment). The block-hash-tree for seg0 uses the first block of each share
 as its N leaves, and contains enough information to prove that any one
 block (for segment X and share Y, X in range(numsegs), Y in
 range(num_servers==N)) belongs to a given segment (`block-hash-tree-
 root[X]`).

 On top of that, we'd have a non-transverse "segment-hash-tree", with one
 leaf per ''segment''. The segment-hash-tree-root would be included in the
 UEB as before. The entire segment-hash-tree would be replicated into each
 share (this is 64 bytes per segment).
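 A sketch of the rotated construction, under the same simplifications (hypothetical names, plain SHA-256 in place of the real tagged hashes):

```python
import hashlib

def h(data):
    # Simplified hash; the real code uses tagged SHA-256d.
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    # Pairwise-hash a list of leaf hashes up to a single root.
    layer = list(leaves)
    while len(layer) > 1:
        if len(layer) % 2:
            layer.append(layer[-1])  # pad odd layers (a simplification)
        layer = [h(layer[i] + layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]

def segment_hash_tree_root(shares):
    # shares[y][x] is the block for share y, segment x.
    numsegs = len(shares[0])
    # One transverse block-hash-tree per SEGMENT: its N leaves are that
    # segment's block from each of the N shares.
    bht_roots = [merkle_root([h(shares[y][x]) for y in range(len(shares))])
                 for x in range(numsegs)]
    # Non-transverse segment-hash-tree, one leaf per segment; the root
    # goes in the UEB, and the whole tree is replicated into each share.
    return merkle_root(bht_roots)
```

 With this layout, modifying the block at (segment X, share Y) changes only `block-hash-tree-root[X]` and the segment-hash-tree above it; no other segment's tree is touched.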

 Each block would be paired with the block-hash-tree Merkle chain needed to
 verify the block's membership in the block-hash-tree up to the block-hash-
 tree root (32 bytes * log(N)). Every time the client fetches a block, they
 also fetch this Merkle chain. Fetching this chain is so common that it
 might be best to interleave the hashes with the data blocks (to reduce
 seeks), rather than putting all the block-hash-tree data in a separate
 portion of the share (like #1687, but without the random-access
 requirement during upload).
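 Generating and checking such a chain might look like this (a sketch with hypothetical names; the real share layout and hash tags differ):

```python
import hashlib

def h(data):
    # Simplified hash; the real code uses tagged SHA-256d.
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    # Reduce a list of leaf hashes to the tree root.
    layer = list(leaves)
    while len(layer) > 1:
        if len(layer) % 2:
            layer.append(layer[-1])  # pad odd layers
        layer = [h(layer[i] + layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]

def merkle_chain(leaves, index):
    # Collect the sibling ("uncle") hash at each level: the proof is
    # about 32 bytes * log2(len(leaves)) long.
    chain, layer, i = [], list(leaves), index
    while len(layer) > 1:
        if len(layer) % 2:
            layer.append(layer[-1])
        chain.append(layer[i ^ 1])  # sibling of the current node
        layer = [h(layer[j] + layer[j + 1]) for j in range(0, len(layer), 2)]
        i //= 2
    return chain

def verify_chain(leaf_hash, index, chain, root):
    # Hash upward, combining with each sibling on the correct side.
    node, i = leaf_hash, index
    for sibling in chain:
        node = h(sibling + node) if i % 2 else h(node + sibling)
        i //= 2
    return node == root
```

 Interleaving the `merkle_chain` output next to each data block would let a single contiguous read fetch both.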

 If it weren't for this last interleaving trick, it might be possible to
 add this encoding to the current (immutable) scheme, which is somewhat
 open-ended. We can add new fields to the UEB (which is simple JSON-
 encoded), and the offsets table doesn't need to cover the entire file, so
 the UEB could also contain the offset of the new hash tree data. There's
 not much point to doing this, however, since fixing #2386 requires
 ''removing'' the non-transverse block-hash-trees.

 Our encoding scheme is over eight years old; in fact, the very first
 commit [4968ca9] was to calculate the storage/bandwidth/alacrity overhead
 of this scheme. It's hard to remember, but I don't think we considered
 rotating the trees back then. If we did, we might have dismissed it for
 the following reasons:

 * It makes the shares less independent. There was a brief period when we
 wanted to make it possible to re-encode a file (with higher N) and reuse
 (most of) the original shares. Transverse block-hash-trees would interfere
 with that: increasing N would require changing more data. (our thinking
 here was also influenced by Tahoe's predecessor, which used Fountain Codes
 to produce arbitrary numbers of shares, so fixed-N was a new thing at the
 time). #678 and #711 are related.
 * In particular, we've considered making N fixed and large, like 256, and
 only producing a subset of the shares. If this only costs encoding
 '''time''', it's a fairly cheap way to enable after-the-fact "reencoding".
 If it costs space in a Merkle chain, then it's not quite as cheap. Fixed-N
 would also enable an easier-to-understand share-at-a-time uploader
 (#1340).
 * It requires downloading extra data all the time, like
 `O(filesize*k*ln(N))`. For each segment, you need k blocks, and for each
 block, you need an uncle chain of length `ln(N)`. The original encoding
 scheme lets you decide between alacrity and bandwidth overhead: if you
 know you're downloading the entire file and don't need to deliver the
 first bytes early, you can skip fetching the block-hash-tree and just
 recompute it entirely. The bandwidth overhead then reduces to the
 transverse share-hash-tree, which is `O(k*ln(N))`. Since N is generally
 small, and filesizes can be large, this is a win. Of course if you want
 alacrity, you need the block-hash-tree Merkle chain for each segment, so
 you must download `O(k*ln(filesize))` bytes for the first segment, and
 somewhat less for subsequent ones, requiring something like
 `O(filesize*k*ln(N))/2` for a complete linear read.
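 To put rough numbers on that last trade-off (illustrative figures only; the 128 KiB segment size and the function names are assumptions of this example):

```python
import math

HASH = 32  # bytes per hash

def overhead_current_full_read(k, N):
    # Current scheme, whole-file read: recompute the block-hash-trees
    # locally, fetch only the transverse share-hash-tree chains,
    # roughly k * log2(N) hashes total.
    return k * math.ceil(math.log2(N)) * HASH

def overhead_rotated_full_read(k, N, numsegs):
    # Rotated scheme: each of the k blocks per segment carries an uncle
    # chain of length ~log2(N), for every one of numsegs segments.
    return numsegs * k * math.ceil(math.log2(N)) * HASH

# e.g. k=3, N=10, a 1 GiB file in 128 KiB segments (8192 segments):
print(overhead_current_full_read(3, 10))        # 384 bytes
print(overhead_rotated_full_read(3, 10, 8192))  # 3145728 bytes (~3 MiB)
```

 Even ~3 MiB of hash overhead on a 1 GiB read is under 0.3%, but it is unavoidable in the rotated scheme, whereas today it can be traded away when alacrity doesn't matter.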

 It might be fair to say we've optimized for single-segment reads, at the
 expense of mutable updates.

--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/2387>
Tahoe-LAFS <https://Tahoe-LAFS.org>
secure decentralized storage

