[tahoe-lafs-trac-stream] [tahoe-lafs] #1543: rearrange share format to make downloads faster
tahoe-lafs
trac at tahoe-lafs.org
Mon Sep 26 01:34:06 PDT 2011
#1543: rearrange share format to make downloads faster
-------------------------------+-------------------------
Reporter: warner | Owner:
Type: enhancement | Status: new
Priority: major | Milestone: undecided
Component: code-encoding | Version: 1.9.0a1
Resolution: | Keywords: performance
Launchpad Bug: |
-------------------------------+-------------------------
Comment (by zooko):
> * Zooko's idea was to store extra hash nodes. We could put the full
> Merkle chain after each data block. My thought was to order those
> chains with the bottom-most nodes first. This would increase storage
> requirements, because we'd be storing multiple redundant copies of the
> hash tree nodes. But reads would be much faster: each segment would
> require a block fetch of the data block plus 0 or more trailing hash
> nodes (depending upon which nodes were already known). That means one
> contiguous IO operation per block, which will be the fastest/most-
> efficient for the storage servers.
> * Integrity storage overhead with this approach becomes O(N*log(N))
> instead of O(N). Rough estimate: for a 10MB file, segsize=128KiB, I
> think that needs 18kB overhead instead of 5kB (per share), about
> N*0.1% total, maybe double that because there are two file-sized hash
> trees (crypttext_hash_tree, block_hash_tree).
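For what it's worth, a quick back-of-the-envelope check of that estimate
(my arithmetic, not from the ticket; it assumes 32-byte hash nodes, a
binary Merkle tree, and that "10MB" means 10 MiB) comes out about the
same:
{{{
import math

filesize = 10 * 1024 * 1024      # "10MB" file, taken as 10 MiB
segsize = 128 * 1024             # 128 KiB segments
hashsize = 32                    # bytes per hash tree node

num_segments = filesize // segsize                 # 80 segments
chain_depth = math.ceil(math.log2(num_segments))   # 7 nodes per chain

# current layout: one shared block_hash_tree, roughly 2*N nodes per share
plain_tree = 2 * num_segments * hashsize           # 5120 bytes ~= 5 kB

# proposed layout: a full Merkle chain trailing every data block
per_block_chains = num_segments * chain_depth * hashsize   # 17920 ~= 18 kB

print(plain_tree, per_block_chains)
}}}
And that's per hash tree, so doubling it for crypttext_hash_tree plus
block_hash_tree matches the "maybe double that" above.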
Another alternative would be to start using multiple variable-length
records (i.e. files) on the backend. In all three of our current backend
designs, we chose to use the minimal number of variable-length records
(i.e. files) per share. That minimal number is 1 -- each share is stored
in one file, including the share data, the hash tree nodes, and the
leases. This choice minimizes the number of disk seeks necessary to find
the share data and the hash tree nodes. But if the share data is large
and the reader isn't going to read all of it in order, then we may end
up needing to jump around within that one file (or read over data that
we don't need), which might end up costing us an extra seek or a few
anyway.
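To make that concrete, here is a toy sketch of what a sparse reader does
against a single-file share. The offsets and helper names are
hypothetical, not the real on-disk share format; the shape of the I/O is
the point: one seek into the data region plus separate seeks into the
hash-tree region for every segment read out of order.
{{{
# Toy sketch only -- hypothetical offsets and names, not the real
# on-disk share format.
HASH_SIZE = 32

def read_segment(f, blocknum, block_size, data_offset, hashtree_offset,
                 needed_nodes):
    f.seek(data_offset + blocknum * block_size)   # seek into the data region
    block = f.read(block_size)
    nodes = []
    for i in needed_nodes:                        # seek(s) into the tree region
        f.seek(hashtree_offset + i * HASH_SIZE)
        nodes.append(f.read(HASH_SIZE))
    return block, nodes
}}}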
Another problem with storing both share data and hash tree nodes together
in a single file is what to do if the share data shrinks or grows, which
can happen with SDMF and MDMF but not with immutables, or if the uploader
doesn't know the size of the share data when they begin the upload, which
currently can't happen but which I hope will become possible in the
future (#1288). Designing a layout, and an algorithm to manage it, that
also optimizes performance can get complicated, and the recent security
issue #1528 has really driven home to me how even moderate complexity
hides bugs. (Mea culpa: it hid the bug from me, not from Brian -- Brian
implemented this moderate complexity correctly, originally.) Even if we
get it right, it takes engineering time to do so.
So an alternative that keeps things simpler for us is to use more
variable-length records (files) per share: the storage server creates one
file to hold the share data and another to hold the hash tree nodes. This
costs, I think, one extra disk seek to find the second file (given that
you've already paid the several disk seeks to find the first file), but
it makes things easier and might actually win back some of the
performance lost to the extra seek:
* for the writer it makes things simpler because you can append more and
more share data without pre-allocating the space, which allows you to
start appending share data without even knowing how much you'll
ultimately append (re: #1288), and which means you can shrink or extend
an existing mutable file without needing to move the existing hash tree.
Whenever the writer appends new share data, it also appends the
corresponding hash tree nodes to the file that stores the hash tree
nodes. Likewise, whenever an MDMF writer truncates share data, it
truncates the corresponding hash tree nodes from the file that holds the
hash tree. (Overwriting existing data requires changes that touch more
parts of the hash tree, so the effect on the hash tree isn't as localized
as appending or truncating.) There is a minimal sketch of this after the
end of this list.
* for the reader it makes things simple and efficient because:
  * It knows where all the share data and all the hash tree data live
before it receives the results of any reads back, and even (for mutables)
before it knows the current size of the file! This minimizes the number
of required round-trips.
  * The share data is contiguous -- uninterrupted by hash node data (as
it would be in my proposal to intersperse the two, mentioned in the
initial description of this ticket) -- so bulk reads of share data get
optimal throughput.
  * The hash tree data ''can'' still be laid out in interesting ways to
facilitate fast reads. There is never bulk data interspersed with the
hash tree nodes that you are trying to read.
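Here is the minimal sketch promised above of how a writer and reader
might use the two files. The filenames, helper names, and fixed-size-node
assumption are mine, just to illustrate the append and contiguous-read
properties; this is not a proposed storage-server API.
{{{
# Minimal sketch, assuming hypothetical filenames ("share_data",
# "hash_tree") and 32-byte hash nodes; not the real storage-server API.
import os

HASH_SIZE = 32

def append_segment(sharedir, block, hash_nodes):
    # Writer: append the new block and its hash nodes to their own files.
    # Nothing already on disk has to move, so the final size need not be
    # known up front (re: #1288).
    with open(os.path.join(sharedir, "share_data"), "ab") as f:
        f.write(block)
    with open(os.path.join(sharedir, "hash_tree"), "ab") as f:
        for node in hash_nodes:
            f.write(node)

def read_segment(sharedir, blocknum, block_size, needed_nodes):
    # Reader: one contiguous read for the block, and the hash-node reads
    # come from a file that holds nothing but hash nodes, so they are
    # never interrupted by bulk share data.
    with open(os.path.join(sharedir, "share_data"), "rb") as f:
        f.seek(blocknum * block_size)
        block = f.read(block_size)
    nodes = []
    with open(os.path.join(sharedir, "hash_tree"), "rb") as f:
        for i in needed_nodes:
            f.seek(i * HASH_SIZE)
            nodes.append(f.read(HASH_SIZE))
    return block, nodes
}}}
An MDMF-style shrink would then just truncate both files to their new
lengths.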
--
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1543#comment:2>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage