#1687 new enhancement

store copy of block-hash-chain with each block

Reported by: warner Owned by:
Priority: normal Milestone: undecided
Component: code-encoding Version: 1.9.1
Keywords: performance bandwidth Cc:
Launchpad Bug:


One idea that Zooko and I came up with at pycon last week was to make downloads more efficient by rearranging the shares to append a copy of the necessary block-hash-tree nodes (its "uncle chain") to each block.

The current share layout puts the whole block-hash-tree at the end of the share. To read block 0, the client must fetch the block data, plus log(N) hashes from the BHT. If it then reads block 1, it doesn't need to get any new hashes (it already has the necessary internal nodes). Reading block 2 requires it to get one new hash. The number of hash nodes needed for each block of a sequential read follows a funny sort of inverse-binary pattern: I think it equals the number of bits that change between a binary representation of the current segnum and the previous one, so it's log(N)-ish large for large powers of two, and small for every-other segment. This is pretty visible in the download-timeline visualizer layout. This layout minimizes the number of bytes needed to download any given segment (download alacrity), and minimizes the storage overhead spent on integrity data : 2*N hashes (where N is the number of segments: ceil(filesize/128KiB)).

The overhead of fetching those extra nodes has a noticeable effect on the speed of the new immutable downloader. It's not the number of bytes being fetched, but rather some per-request overhead: either Foolscap serialization time, the method-invocation overhead, or perhaps the extra disk IO (although I expect the filesystem to cache that part pretty effectively).

We could instead put an extra copy of each chain on each block. So block 0 would immediately be followed by the log(N)-long uncle-chain necessary to do an independent "cold" verification of block 0, block 1 would be followed by the chain for block 1, etc. The hashes would be sorted to put the higher-level nodes at the end, so that you'd only need to read to the end of the chain on the first block.

Then the downloader could make exactly one contiguous read for block 0, a shorter single read for block 1, etc. The storage overhead would grow: I think it would need N*ln2(N) hashes, maybe N*(1+ln2(N)), not quite sure. But I think this would be pretty reasonable.. I need to build a simulator and measure it.

The biggest problem is that this requires random-access during the write process: the hash nodes aren't available until after the whole file has been processed. So we'd need to enhance the storage-server protocol to let you seek backwards to an earlier position in the file and overwrite some placeholder bytes that you put there earlier. Or define shares to be made up of multiple chunks and allow uploaders to write those chunks out-of-order. Either way complicates the share-upload process, and probably makes certain backends (i.e. S3) trickier.

We could also have the client encode the entire file locally, fill in the hash nodes, then upload the resulting shares. But our experience with Tahoe's predecessor (the Allmydata derivative of the Mnet/Hivecache that we retroactively named "Mountain View") was that this gives horrible disk access patterns: it was like filling in a grid along one axis, then reading it back out along the other, and the disk went into a massive seek frenzy. To fix this in Tahoe, we moved the matrix-transposition burden to the network, and stream each segment as we finish encoding it. So I'd definitely go for an updated write-seek-write-close storage protocol before trying to encode the whole file on the client side.

Change History (2)

comment:1 Changed at 2012-03-29T19:10:35Z by davidsarah

  • Keywords performance bandwidth added
  • Priority changed from major to normal

comment:2 Changed at 2012-03-29T19:22:12Z by davidsarah

Closely related to #1543; maybe we should merge these tickets.

Note: See TracTickets for help on using tickets.