[tahoe-dev] Choice of tree-hash

Wed Sep 19 16:59:35 UTC 2012

I've also opened a trac ticked for this:
https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1813

One topic I'm particularly interesting in is designing and specifying
a good tree-hash, especially the tree-hash of the plaintext.
Standardizing on such a hash would create a protocol transcending file
id. Reusing the same hash-functions across projects such as
tahoe-lafs, cryptosphere or my own project could lead to nice synergy
effects.  In many such systems, the plaintext hash serves as
identifier for the file, similar to what git does with SHA-1. One
could use this to merge downloads from different sources, or as a key
to look up a file in an online database.

Obviously a binary tree of unlimited depth is the best choice here.
The hashfunction must be collision and second pre-image resistant.
Still there are a number of interesting detail choices:

= 1. Hash size =

What output size should the hashfunction have? To get collision
resistance, 256 bit seems to be the minimum acceptable value. Values
between 256 and 512 bit seem reasonable.

My suggestion: 256 bits

= 2. Hash primitive =

Of the current generation hash-functions, SHA-2 seems like the best
choice. It's security margin seems to be a bit on the low side, so I'd
prefer a SHA-3 finalist, since those seem to have a larger margin. The
selection of SHA-3 is around the corner, so waiting for that seems
reasonable. On the other hand, SHA-3 candidates have probably seen
less cryptoanalysis than SHA-2.

My suggestion: SHA-3-256, whatever primitive might be chosen.

= 3. Leaf size =

Should be a power of two. The leaf size should be independent from any
higher level parameters, such as segment size. Larger leaves reduce
the CPU cost somewhat, but have no space benefit. Smaller leaves
increase the CPU cost a bit, but allow verification of smaller pieces.
Values around 1KiB seem like a reasonable trade-off, but perhaps we
should benchmark this.

My suggestion: 1 KiB

= 4. Which tree construction? =

The simplest merkle tree adds a prefix of 0 for leaves and 1 for inner
nodes. Alternatively one could include the depth as a 1 byte integer.

Should we add personalization/purpose strings at the leaf level?

How to treat non power-of-two sizes? Pull up the lower layer hashes?
Or pad the lowest layer with all zero hashes?

Should leaves include an offset from the beginning of the file, like skein does?

Should the last leaf be tagged with a `Final` bit? Else the tree
suffers from length-extensions. But that shouldn't be a practical
problem.

In my current draft I use a 7 byte `purpose` string and a one byte
depth integer as the prefix to all hashes. I pad the leaf hashes with
zeros to the next power of two. But there are probably better choices.

= 5. Extra info in the root? =

One interesting idea I had, was to hash some additional information
together with the root of the tree, and use that as the identifying
hash. My idea for the extra information were the first 24 bytes of the
file, and the file size as an 8 byte integer.

This has some nice properties:

1. The size of the complete hash tree, truncated at an arbitrary layer
is a power of two.
2. If you only know the identifying hash, you can ask somebody else
about the first bytes and file size, and they can't lie. The first few
bytes tell you the file type for most files(magic bytes), and the
known file size allows you to preallocate the necessary space, and
prevents bombing attacks.

= 6. Draft from my project =

For completeness I reproduce the hashing part from my current draft
specification:

-------------------------------------------------------------------------------

== Basic ideas ==

I want a FileID that's independent from the encrypted transport
representation. This allows merging downloads from different transport
representations. This is important if one wants to change the sharing
protocol in the future. For example by adding splitting and error
correction like in Tahoe-LAFS, or using block wise transports like in
Freenet.

I decided to use a 256 bit hash, which offers 256 bit resistance
against second pre-images, and 128 bit resistance against collisions.
This seems like an appropriate level, since second pre-images would
totally break the system, and collisions are pretty annoying but not
fatal.

When creating my treehash system, I noticed that the total size tree
is a power of two, apart from a single unused hash-sized part. I
decided to fill those with the size of the file and the first 24 bytes
of the file. This allows quick sanity checking of the most important
properties of a file without downloading the file itself: its size and
the file type.

== Hash Tree ==

########### ToDo: Is 1024 bytes the optimal leaf size?

########### If SHA3 gets selected before this project gets popular,
I'll probably replace all uses of SHA-256 with SHA3, keeping the hash
size at 256 bits.

########### Skein-256-256 in tree mode would be an alternative. It
also has a built-in personalization feature. Unfortunately
Skein-512-256 doesn't support 256 bit intermediate hashes :(

This defines the tree-hash. It's a binary merkle-tree based on
SHA-256. It takes the `LeafSize` and a 7 byte ASCII string as purpose
parameter.

Define the function `Hash(Depth, Message)` as `SHA-256(Purpose ||
Depth || Message)` where `||` denotes concatenation.

To calculate the tree-hash of a file:

1. Divide the file into `LeafSize` byte blocks. The last block may be shorter.

2. For each of those blocks calculate `Hash(0, LeafBlock)`. If the
number of hashes is not a power-of-two, pad the list of hashes(not the
file) to the next power-of-two with `00` bytes.

3. For each pair of hashes from the previous step, calculate
`Hash(Depth, HashPair)` where `HashPair` is the concatenation of the
two child hashes. If the right child hashes is all `00` simply keep
the left hash. `Depth` is `1` for the first level above the leaves,
and gets incremented with each level.

4. Repeat step 3 until there is only one hash left.

5. Concat the following to form the 64 byte `RootBlock`:
  * The size of the file as 8 byte big-endian integer
  * The first 24 bytes of the file (padded with `00` if the file is
shorter than that)
  * The root of the original tree (result of step 4)

6. Calculate `Hash(255, RootBlock)`.

== TreeBlock format: ==

Concat:

* The size of the file as 8 byte big-endian integer
* The first 24 bytes of the file (padded with `00` if the file is
shorter than that)
* The individual hashes in the tree, layer wise, starting with the root.

Note: This implies that the `TreeBlock` starts with the `RootBlock`.

This tree may be truncated at the end of any layer. Typically at a
fixed size, called `PieceSize` (e.g. 2^15 bytes = 32 KiB)

It must at least contain the first 64 bytes (the root block). For
files smaller than the `PieceSize`, it will always consist of 64
bytes.
When doing this, the size of the tree in bytes will always be a power-of-two.

== FileID ==

Set `Purpose` to `"FileID\0"` and `LeafSize` to 1024. The hash of the
root block (result of step 6) is the FileID.

-------------------------------------------------------------------------------