[Tahoe-dev] marginal improvements of different kinds of integrity check

Wed May 30 17:45:35 PDT 2007

Rob and Brian and I (three hackers employed by Allmydata, Inc.) have been
having a sometimes-heated conversation about different kinds of cryptographic
integrity check that one can perform on a stored file.  Here are some notes on
the "marginal improvement" justifications for different kinds.

1.  Merkle Tree (whether of ciphertext or of cleartext) marginal improvement on
plain hash of the whole thing:  the Merkle Tree allows incremental integrity
check -- before you've seen (or processed) the whole thing.  For plaintext,
this means you can consume or store the data in a streaming fashion during
download.  For ciphertext, this means that a checker/repairer who is not
authorized to read the plaintext can check the integrity of the ciphertext in a
streaming fashion, which means they can do it with only a small fixed amount of
storage.

2.a.  Integrity check on plaintext marginal improvement on integrity check on
ciphertext:  it is conceivable to have the right hash-or-merkle-root-of-
ciphertext, guaranteeing the correct ciphertext, but the wrong decryption key.  

2.b.  There's another ostensible marginal improvement of integrity check on
plaintext vs. integrity check on ciphertext, which is that our implementation
of encryption may have bugs that are caught by integrity check on the
plaintext.  I find this to be a very weak argument -- if our quality control
and testing are so poor that there are bugs in our encryption code, then there
might be bugs in our integrity-check-on-plaintext code too.  Indeed, it could
be that our encryption code was correct, but that bugs in our added integrity
check cause your data to be lost.  So whether adding an integrity-check-on-
plaintext makes your data more or less safe is not clear from that argument.

Indeed, my experience with a related issue at Mojo Nation was that we went
ahead and added some redundant never-going-to-need-it check and then it turned
out that the added code *did* have bugs and we thought the better of it and
removed that code altogether.

Anyway, whether 2.b. is a good reason or not is immaterial since 2.a. is a good
reason.  However, the same argument will come up again, below...

3.  Integrity check on ciphertext marginal improvement on integrity check on
plaintext:  so people not authorized to read the plaintext can check the
integrity of the ciphertext.

4.  Plain hash of the whole thing marginal improvement on Merkle Tree: this is
the point where argument 2.b. comes back.  The only possible case where a
Merkle Tree can fail you and a plain hash of the same data save you is the case
that you have bugs in your Merkle Tree implementation.  Ours [1] is currently
196 lines of code, excluding comments, and it is covered by the unit tests [2].
An implementation of "also do a plain hash of the whole file" would probably be
about the same order of magnitude of lines of code.  Even if the extra check
were only 1/2 as many lines of code, it is still probably a loss as the added
complexity (conceptual complexity as well as code complexity) would exceed the
negligible value of the check.

5.  Integrity check on shares marginal improvement on integrity check on
ciphertext: this allows us to identify *which* shares were bad.  This is
very valuable.

This note doesn't talk about the costs of the checks (other than the cost of
conceptual complexity).  My current belief is that the computation, storage,
and transport costs of all of these forms of integrity will be sufficiently
small.

Regards,

Zooko

[1] http://allmydata.org/trac/tahoe/browser/src/allmydata/hashtree.py
[2] http://allmydata.org/tahoe-figleaf/figleaf-edgy-204/allmydata.hashtree.html