[tahoe-dev] Choice of tree-hash

Wed Sep 19 18:23:55 UTC 2012

On 9/19/12 9:59 AM, CodesInChaos wrote:

> One topic I'm particularly interesting in is designing and specifying
> a good tree-hash, especially the tree-hash of the plaintext.

For reference, here's what Tahoe currently does.

(Note that we currently only apply the tree-hash to ciphertext. We used
to build a plaintext hash tree too, but removed it because it enabled
partial-information-guessing attacks, just like revealing a flat hash of
the plaintext, but worse. We hope to reintroduce it safely, by uploading
encrypted(plaintext-hash-tree) in addition to
not-encrypted(ciphertext-hash-tree). This would let us detect failures
in the last decryption step.)

(We also use a hash-tree over the erasure-coded segments, the
"block-hash-tree". In this case, the leaves are individual segments, not
portions of a "file" that you'd ever work with in its entirety.
Similarly, the share-hash-tree's leaves are the roots of the
block-hash-tree. So it's convenient for us to define a tree hash in
which the leaf hashes are provided explicitly, rather than always being
some hash of file[leafnum*leafsize:(leafnum+1)*leafsize]. )

> = 1. Hash size =
> = 2. Hash primitive =

SHA256(SHA256(leaf)), aka "SHA256d". This is easy to specify and
prevents length-extension attacks.

> = 3. Leaf size =

Configurable by the uploader (although not necessarily exposed in an
API) as "min_segsize". Defaults to 128KiB, expected to be a power of
two, expected to be larger for large files.

The storage overhead is always linear with the number of leaves, so
small leaves result in a higher overhead. The very first code committed
to Tahoe 6 years ago was a tool to compute this overhead. With 256-bit
hashes (32 bytes), and since the tree has as many internal nodes as
leaves (so you must hold 2*numleaves hashes), each leaf incurs 64 bytes
of overhead. 1KiB leaves need 6.25% extra storage. We picked 128KiB to
reduce that percentage to 0.05%, and estimated that downloading clients
could retrieve 128KiB in at few seconds at most (minimum alacrity).

> = 4. Which tree construction? =

Tahoe uses tagged hashes exclusively: every use of a hash function gets
a distinct tag:

 SHA256(SHA256(netstring(tag)+value))

and we also define a "tagged pair hash":

 SHA256(SHA256(netstring(tag)+netstring(value1)+netstring(value2)))

For Merkle trees, internal nodes use a tagged_pair_hash with a tag of
"Merkle tree internal node". Empty leaves get a tagged_hash with tag
"Merkle tree empty leaf" and value of the leaf number (as a decimal
string). Callers are responsible for providing non-empty leaf hashes
themselves; for the ciphertext hash tree, we use a tagged_hash with tag
of "allmydata_crypttext_segment_v1". By providing a leaf hash, issues
like "how to treat non-power-of-two sizes" (i.e. the "tail") are
bypassed: the caller just hashes each segment without padding.

Note that this does not prevent length-extension attacks against the
overall tree: if you have a two-deep tree (but don't know the leaves),
you can construct a three-deep tree with additional leaves (under your
control) that includes the unknown leaves and compute a valid hash. So
don't use Tahoe's scheme as an tree-level HMAC. But it is safe against
confusing internal nodes with leaves (so avoids the two-faced-block
attack that Bitcoin recently had to work around).

> = 5. Extra info in the root? =

Tahoe doesn't prepend anything to the file before hashing.. I'd be
worried about people accidentally corrupting files by not adding or
removing the right data at the right time. Similarly, plaintext length
is transmitted elsewhere (in an authenticated location: either included
in or covered by the filecap).

hope that's useful,
 -Brian