[tahoe-lafs-trac-stream] [tahoe-lafs] #1543: rearrange share format to make downloads faster

tahoe-lafs trac at tahoe-lafs.org
Mon Sep 26 01:34:06 PDT 2011


#1543: rearrange share format to make downloads faster
-------------------------------+-------------------------
     Reporter:  warner         |      Owner:
         Type:  enhancement    |     Status:  new
     Priority:  major          |  Milestone:  undecided
    Component:  code-encoding  |    Version:  1.9.0a1
   Resolution:                 |   Keywords:  performance
Launchpad Bug:                 |
-------------------------------+-------------------------

Comment (by zooko):

 >  * Zooko's idea was to store extra hashnodes. We could put the full
 merkle-chain after each data block. My thought was to further append them
 with the bottom-most nodes first. This would increase storage
 requirements, because we'd be storing multiple redundant copies of the
 hash tree nodes. But reads would be much faster: each segment would
 require a block fetch of the data block plus 0 or more trailing hash nodes
 (depending upon which nodes were already known). That means one contiguous
 IO operation per block, which will be the fastest/most-efficient for the
 storage servers.
 >   * integrity storage overhead with this approach becomes O(n log n)
 instead of O(n). Rough estimate: for a 10MB file, segsize=128KiB, I think
 that needs 18kB overhead instead of 5kB (per share), about N*0.1% total,
 maybe double that because there are two file-sized hash trees
 (crypttext_hash_tree, block_hash_tree).
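 For what it's worth, a quick back-of-the-envelope calculation (assuming
 32-byte SHA-256d hash nodes and one Merkle-tree leaf per segment; exact
 node counts depend on padding and rounding, so this is only approximate)
 reproduces the roughly 5 kB-vs-18 kB figures quoted above:

{{{
#!python
from math import ceil, log2

HASH_SIZE = 32                    # bytes per SHA-256d hash node (assumed)
filesize  = 10 * 1000 * 1000      # the "10MB" file from the estimate above
segsize   = 128 * 1024            # 128 KiB segments

nsegs = ceil(filesize / segsize)  # data blocks per share (~77-80)
depth = ceil(log2(nsegs))         # length of each block's Merkle chain (~7)

# Today: one copy of the block hash tree per share, roughly 2*nsegs-1 nodes.
single_tree = (2 * nsegs - 1) * HASH_SIZE

# Proposed: each block is followed by its full Merkle chain, so hash nodes
# near the root get stored many times over.
redundant_chains = nsegs * depth * HASH_SIZE

print(single_tree, redundant_chains)   # roughly 5 kB vs 17-18 kB per share
}}}

 (And, as noted above, there are two file-sized hash trees per share, so
 roughly double both numbers.)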

 Another alternative would be to start using multiple variable-length
 records (i.e. files) on the backend. In all three of our current backend
 designs, we chose to use the minimal number of variable-length records
 (i.e. files) per share. That minimal number is 1 -- each share is stored
 in one file, including the share data, the hash tree nodes, and the
 leases. This choice minimizes the number of disk seeks needed to find
 the share data and the hash tree nodes, but if the share data is large
 and the reader isn't going to read all of it in order, then we may end
 up jumping around within the file (or reading over data we don't need),
 which might cost us an extra seek or a few anyway.
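
 To make that seek tradeoff concrete, here is a purely illustrative
 single-file layout sketch (the header fields, sizes, and helper names are
 invented for this example and are not the actual Tahoe-LAFS share format).
 A reader who wants one data block plus a few hash nodes has to hop between
 regions of the same file:

{{{
#!python
import struct

# Hypothetical single-file share: a small header of offsets, then the
# share data region, then the hash-tree region, then the leases.
# (Illustrative only -- not the real Tahoe-LAFS share layout.)
HEADER = struct.Struct(">QQQ")   # offsets of: data, hash tree, leases

def read_block_and_chain(f, blocknum, blocksize, chain_offsets):
    """Fetch one data block and its Merkle chain from a single-file share."""
    f.seek(0)
    data_off, hash_off, lease_off = HEADER.unpack(f.read(HEADER.size))
    f.seek(data_off + blocknum * blocksize)
    block = f.read(blocksize)            # one seek+read for the data
    chain = []
    for off in chain_offsets:            # extra seeks (or over-reads) to
        f.seek(hash_off + off)           # pick hash nodes out of the
        chain.append(f.read(32))         # hash-tree region
    return block, chain
}}}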

 Another problem with storing both share data and hash tree nodes together
 in a single file is what to do if the share data shrinks or grows, which
 can happen with SDMF and MDMF but not with immutables, or if the uploader
 doesn't know the size of the share data when they begin the upload, which
 currently can't happen but I hope will happen in the future (#1288).

 Designing a layout and an algorithm to manage all of that while also
 optimizing performance can get complicated, and the recent security issue
 #1528 has really driven home to me how even moderate complexity hides
 bugs. (Mea culpa: it hid the bug from me, not from Brian -- Brian
 implemented this moderate complexity correctly, originally.) Even if we
 get it right, it takes engineering time to do so.

 So an alternative that makes things simpler for us is to use more
 variable-length records (files) per share, i.e. the storage server creates
 one file to hold the share data and another to hold the hash tree nodes.
 This costs, I think, one extra disk seek to find the second file (given
 that you've already paid the several disk seeks to find the first file),
 but it makes things easier and might actually win back some of the
 performance lost to the extra seek (see the sketch after this list):

 * for the writer it makes things simpler because you can append more and
 more share data without pre-allocating the space, which allows you to
 start appending share data without even knowing how much you'll ultimately
 append (re: #1288), and which means you can shrink or extend an existing
 mutable file without needing to move the existing hash tree. Whenever the
 writer appends new share data, it also appends the corresponding hash
 tree nodes to the file that stores the hash tree nodes. Likewise, whenever
 an MDMF writer truncates share data, it truncates the corresponding hash
 tree nodes from the file that holds the hash tree. (Overwriting existing
 data requires changes that touch more parts of the hash tree, so the
 effect on the hash tree isn't as localized as appending or truncating.)
 * for the reader it makes things simple and efficient because:
  * It knows where all the share data and all the hash tree data live
 before it receives the results of any reads back, and even (for mutables)
 before it knows the current size of the file! This minimizes the number
 of required round-trips.
  * The share data is contiguous -- uninterrupted by hash node data (as it
 would be in my proposal to intersperse the two, mentioned in the initial
 description of this ticket), so bulk reads of share data get optimal
 throughput.
  * The hash tree data ''can'' still be laid out in interesting ways to
 facilitate fast reads. There is never bulk data interspersed with the hash
 tree nodes that you are trying to read.
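
 Here is a rough sketch of what the two-files-per-share append and read
 paths might look like (the file names "data" and "hashtree" and the helper
 functions are made up for illustration; this is not a proposal for the
 actual on-disk format):

{{{
#!python
import os

HASH_SIZE = 32   # bytes per hash node (assumed)

def append_segment(sharedir, blockdata, new_chain_nodes):
    """Append one data block to <sharedir>/data and its new Merkle-chain
    nodes to <sharedir>/hashtree.  Both are pure appends, so the writer
    never needs to know the final size up front (re: #1288) and never has
    to relocate the existing hash tree."""
    with open(os.path.join(sharedir, "data"), "ab") as f:
        f.write(blockdata)
    with open(os.path.join(sharedir, "hashtree"), "ab") as f:
        for node in new_chain_nodes:
            assert len(node) == HASH_SIZE
            f.write(node)

def read_block(sharedir, blocknum, blocksize):
    """The reader can compute this offset before any response comes back,
    because block data lives at predictable positions in its own file,
    uninterrupted by hash nodes."""
    with open(os.path.join(sharedir, "data"), "rb") as f:
        f.seek(blocknum * blocksize)
        return f.read(blocksize)
}}}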

-- 
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1543#comment:2>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage

