[tahoe-lafs-trac-stream] [tahoe-lafs] #1687: store copy of block-hash-chain with each block

tahoe-lafs trac at tahoe-lafs.org
Wed Mar 14 20:57:52 UTC 2012


#1687: store copy of block-hash-chain with each block
---------------------------+---------------------------
 Reporter:  warner         |          Owner:
     Type:  enhancement    |         Status:  new
 Priority:  major          |      Milestone:  undecided
Component:  code-encoding  |        Version:  1.9.1
 Keywords:                 |  Launchpad Bug:
---------------------------+---------------------------
 One idea that Zooko and I came up with at PyCon last week was to
 make downloads more efficient by rearranging the shares to append a
 copy of the necessary block-hash-tree nodes (the block's "uncle
 chain") to each block.

 The current share layout puts the whole block-hash-tree at the end of
 the share. To read block 0, the client must fetch the block data,
 plus log(N) hashes from the BHT. If it then reads block 1, it doesn't
 need to get any new hashes (it already has the necessary internal
 nodes). Reading block 2 requires it to get one new hash. The number
 of hash nodes needed for each block of a sequential read follows a
 funny sort of inverse-binary pattern: I think it equals the number of
 bits that change between the binary representations of the current
 segnum and the previous one, so it's log(N)-ish at the large
 power-of-two segnums and zero for every-other segment. This is
 pretty visible in the download-timeline visualizer. This layout
 minimizes the number of bytes needed to download any given segment
 (download alacrity), and minimizes the storage overhead spent on
 integrity data: 2*N hashes (where N is the number of segments:
 ceil(filesize/128KiB)).
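
 To make the "bits that change" pattern concrete, here is a small
 Python sketch (not Tahoe code) that simulates a sequential reader
 which caches every hash it fetches or computes, over a complete
 binary tree padded up to a power of two; the heap-style node
 numbering and the padding are assumptions of the sketch, not the
 share format:

{{{
# count the new block-hash-tree nodes a sequential reader must fetch
# for each block, given a cache of previously seen/derived hashes
def new_hashes_per_block(nleaves):
    size = 1
    while size < nleaves:
        size *= 2                 # pad the leaf count to a power of two
    # heap-style indexing: root=1, children of k are 2k and 2k+1,
    # leaf i lives at index size+i
    known = {1}                   # the root hash comes with the share
    fetches = []
    for i in range(nleaves):
        node = size + i
        needed = 0
        while node > 1:
            sib = node ^ 1        # the sibling needed at this level
            if sib not in known:
                needed += 1
                known.add(sib)
            known.add(node)       # this node is now computable locally
            node //= 2
        fetches.append(needed)
    return fetches

print(new_hashes_per_block(16))
# -> [4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0]
}}}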

 The overhead of fetching those extra nodes has a noticeable effect on
 the speed of the new immutable downloader. It's not the number of
 bytes being fetched, but rather some per-request overhead: either
 Foolscap serialization time, the method-invocation overhead, or
 perhaps the extra disk IO (although I expect the filesystem to cache
 that part pretty effectively).

 We could instead store an extra copy of each chain with each block. So
 block 0 would immediately be followed by the log(N)-long uncle-chain
 necessary to do an independent "cold" verification of block 0, block
 1 would be followed by the chain for block 1, etc. The hashes would
 be sorted to put the higher-level nodes at the end, so that you'd
 only need to read to the end of the chain on the first block.
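
 For illustration, here is a hypothetical sketch of which tree nodes
 would ride along with each block under this layout (the heap-style
 node numbering is my own convention for the sketch, not the actual
 share encoding):

{{{
# the uncle chain for one block: the sibling hash needed at each level,
# ordered leaf-level-first so the higher-level nodes sit at the end
def uncle_chain(leaf_index, nleaves):
    size = 1
    while size < nleaves:
        size *= 2                 # pad the leaf count to a power of two
    node = size + leaf_index
    chain = []
    while node > 1:
        chain.append(node ^ 1)    # sibling needed at this level
        node //= 2
    return chain

# block 0 of a 16-segment file needs its sibling leaf plus three uncles
print(uncle_chain(0, 16))   # -> [17, 9, 5, 3]
print(uncle_chain(5, 16))   # -> [20, 11, 4, 3]
}}}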

 Then the downloader could make exactly one contiguous read for block
 0, a shorter single read for block 1, etc. The storage overhead would
 grow: I think it would need N*log2(N) hashes, maybe N*(1+log2(N)),
 I'm not quite sure. But I think this would be pretty reasonable; I
 need to build a simulator and measure it.
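
 As a first rough pass at that measurement, here is a
 back-of-the-envelope sketch (assuming the same padded complete
 binary tree as above, and ignoring segment size and hash width) that
 compares the total hash count of the per-block-chain layout against
 the current single block-hash-tree:

{{{
import math

# total hashes stored: one log2(N)-long chain per block, versus the
# 2*N estimate for the current single block-hash-tree layout
def overhead(nleaves):
    depth = max(1, math.ceil(math.log2(nleaves)))
    per_block_chains = nleaves * depth
    single_tree = 2 * nleaves
    return per_block_chains, single_tree

for n in (16, 256, 1024, 8192):
    chains, tree = overhead(n)
    print(n, chains, tree, round(chains / tree, 1))
# 16:   64     vs 32     (2.0x)
# 256:  2048   vs 512    (4.0x)
# 1024: 10240  vs 2048   (5.0x)
# 8192: 106496 vs 16384  (6.5x)
}}}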

 The biggest problem is that this requires random-access during the
 write process: the hash nodes aren't available until after the whole
 file has been processed. So we'd need to enhance the storage-server
 protocol to let you seek backwards to an earlier position in the file
 and overwrite some placeholder bytes that you put there earlier. Or
 define shares to be made up of multiple chunks and allow uploaders to
 write those chunks out-of-order. Either approach complicates the
 share-upload process, and probably makes certain backends (e.g. S3)
 trickier.
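
 For concreteness, here is one hypothetical shape the seek-back write
 path could take on the uploader side; the class and method names are
 invented for this sketch and are not part of the real storage-server
 protocol:

{{{
HASH_SIZE = 32   # Tahoe's SHA-256d hashes are 32 bytes

class SeekableShareWriter:
    """Streams blocks as they are encoded, leaving placeholder bytes
    where each block's uncle chain will go, then seeks back and fills
    the placeholders in once the whole hash tree is known."""
    def __init__(self, f):
        self.f = f                  # file-like object allowing overwrite
        self.chain_offsets = {}     # block number -> placeholder offset

    def write_block(self, blocknum, block_data, chain_len):
        self.f.seek(0, 2)           # append at the current end
        self.f.write(block_data)
        self.chain_offsets[blocknum] = self.f.tell()
        self.f.write(b"\x00" * (chain_len * HASH_SIZE))  # placeholder

    def fill_chain(self, blocknum, chain_hashes):
        # called after encoding finishes and the block-hash-tree exists
        self.f.seek(self.chain_offsets[blocknum])
        self.f.write(b"".join(chain_hashes))

    def close(self):
        self.f.close()
}}}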

 We could also have the client encode the entire file locally, fill in
 the hash nodes, then upload the resulting shares. But our experience
 with Tahoe's predecessor (the Allmydata derivative of the
 !Mnet/Hivecache that we retroactively named "Mountain View") was that
 this gives horrible disk access patterns: it was like filling in a
 grid along one axis, then reading it back out along the other, and
 the disk went into a massive seek frenzy. To fix this in Tahoe, we
 moved the matrix-transposition burden to the network, and stream each
 segment as we finish encoding it. So I'd definitely go for an updated
 write-seek-write-close storage protocol before trying to encode the
 whole file on the client side.

-- 
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1687>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage

