[tahoe-dev] Interesting hashing result

Shawn Willden shawn-tahoe at willden.org
Mon Feb 16 00:51:35 PST 2009


On Sunday 15 February 2009 09:04:33 am zooko wrote:
> Thanks for the information.  Could you give me a couple more details
> -- what sizes of files are in your test set, how long does it take
> with cold cache, and how long with warm cache?

I tested on various sizes of files, but of course you can only see the 
difference on large ones.  I was using VMware virtual disk files, around 2 
GiB in size.  With cold cache, there's no significant time difference, since 
the operation is I/O bound.

> Also, what is this signature that you are generating?  Is it
> something that is generated and used just by your backup tool?

The rsync signature.

The way rsync works for remote synchronization is that one side generates a 
signature -- a list of per-block checksums -- and sends that signature to the 
other side.  The other side uses it to determine which parts of its local 
copy of the file differ from the sender's version.  Then it requests updates 
of just those portions, and that's how the file is brought into sync.

The checksum is actually implemented as a pair of checksums, one weak (an 
Adler-32-style rolling sum, not a true CRC) and one strong (MD4 -- which 
isn't, of course, all that strong any more).
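To make that concrete, here's a minimal sketch of an rsync-style weak rolling 
checksum in Python (my own rendering of the scheme described in the rsync 
technical report; the function names are mine, not rsync's):

    M = 1 << 16  # both halves of the checksum are taken mod 2^16

    def weak_checksum(block):
        # Compute both halves of the weak checksum from scratch.
        a = sum(block) % M
        b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
        return a, b

    def roll(a, b, out_byte, in_byte, block_len):
        # Slide the window one byte: drop out_byte, take in in_byte.
        # This O(1) update is what makes the checksum "rolling".
        a = (a - out_byte + in_byte) % M
        b = (b - block_len * out_byte + a) % M
        return a, b

A candidate match found with the cheap weak checksum is then confirmed with 
the strong (MD4) hash of the same block.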

And, yes, the signature is obviously something my backup tool needs.

> Tahoe uses a strong hash function to generate the immutable file caps
> from the ciphertext so that we get this property:
>
>     The "At Most One File Per Immutable File Cap" Property:
>
>     There can exist (in this universe, in the foreseeable future) at
> most one file matching any given immutable file capability.
>
> In cryptographic terms, this is called "collision-resistance".

Yes, and I need the same property -- and so does rsync.  If two different 
files were to generate the same signature, then rsync would believe that 
those two files were the same.  In addition, a remote file being updated 
would request the same set of patches to bring it into synchronization with 
either of the two putative same-signature files -- and those patches would be 
wrong in one of the two cases.

Hmmm.

I've been basing my lack of concern about the MD4 weaknesses on the rsync 
community's (assumed) lack of concern.  Your questions prompted me to check 
that assumption, however, and I see that there IS concern about MD4 in the 
rsync community.  Their (as yet unimplemented) solution seems to be to 
provide a random seed to MD4, probably the output of a strong hash of the 
file content, and also to send a whole-file hash for final integrity 
checking.
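In rough Python terms, that idea would look something like this (my own 
sketch of the proposal, not rsync code; note that hashlib only exposes MD4 
when the underlying OpenSSL build provides it):

    import hashlib, os

    def seeded_block_hash(seed, block):
        # Keying the per-block MD4 with a per-transfer random seed means
        # an attacker can't precompute colliding blocks in advance.
        h = hashlib.new('md4')  # availability depends on the OpenSSL build
        h.update(seed)
        h.update(block)
        return h.digest()

    def whole_file_hash(path):
        # A strong whole-file hash for the final integrity check.
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 16), b''):
                h.update(chunk)
        return h.digest()

    seed = os.urandom(16)  # chosen fresh for each transfer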

MD4 is perfectly functional in a non-malicious environment, of course.  The 
probability of two "real" files colliding is negligible.

But what about in the presence of a malicious attacker who wants to corrupt a 
file?

Suppose the attacker knows a file on your system and knows your convergence 
secret.  Thus, the attacker knows the CHK of that file and therefore the 
storage ID.  Suppose also that the hash algorithm is weak and that the 
attacker can generate a different file with the same hash, key and SID.

The attacker can then replace your file in the grid with his.  This will 
effectively destroy your file.  You won't mistakenly download his file 
thinking it's yours, though, because the UEB hash from your read cap will not 
match his.  Someone who has the encryption key but not the read cap could 
download the file and believe it to be the original.

I don't think this is a problem.  Or at least, it's not a problem introduced 
by the weak hash, because it exists even with a strong one: if the attacker 
knows the storage ID of your file, he can replace it in the grid -- he 
doesn't need to be able to generate another file that hashes to the same 
value.

> This property might not be needed for some applications, but it is
> very tricky for application writers to know when they can safely rely
> on a weaker property, such as the property that cryptographers call
> "second-pre-image-resistance".

Actually, MD4 is vulnerable to second pre-image attacks (with success 
probability 2^-56, or 2^-27 if the attacker can modify parts of the original 
file), so it doesn't even have that property.  As I explained above, I don't 
think this is an issue, since file integrity does not depend on the 
encryption key derivation.  Integrity is assured by the other hashes, which 
are all SHAD-256.
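(For reference, SHAD-256 is simply SHA-256 composed with itself, which among 
other things blocks length-extension attacks on plain SHA-256; a minimal 
Python sketch:)

    import hashlib

    def shad256(data):
        # SHAD-256(x) = SHA-256(SHA-256(x))
        return hashlib.sha256(hashlib.sha256(data).digest()).digest()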

> Unfortunately the SHA-256d secure hash function imposes a significant
> CPU overhead.  Currently I think that all actual uses of Tahoe are
> I/O-bound and have ridiculously overpowered CPUs anyway, so I
> don't think this CPU-overhead is causing an actual performance
> problem for anyone in practice, but hopefully Tahoe will move into
> more and more use cases, and as it does this might become a problem.

The process is I/O bound on most machines.  There are exceptions, however.  
One such is my current home file server, which is only a 1 GHz Athlon but 
has six SATA drives in a software RAID-6 configuration.  That machine can 
sustain read rates of nearly 100 MBps, BUT uses basically 100% of the CPU to 
do it.  Every bit of additional computation overhead decreases the 
throughput -- and I doubt the machine could do SHAD-256 calculations at 100 
MBps even if it weren't doing anything else (I should test that).
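Testing that is easy enough -- something like this rough single-core 
measurement using Python's hashlib (a throwaway sketch of mine; the numbers 
will of course vary by machine):

    import hashlib, time

    def shad256_rate(total=256 * 2**20, chunk=2**20):
        # Hash `total` bytes of zeros through the inner SHA-256, then
        # apply the outer pass over the 32-byte digest (the "D" in SHAD).
        data = b'\0' * chunk
        start = time.time()
        inner = hashlib.sha256()
        for _ in range(total // chunk):
            inner.update(data)
        hashlib.sha256(inner.digest()).digest()
        return total / (time.time() - start) / 2**20  # MiB/s

    print("%.1f MiB/s" % shad256_rate())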

Another use case that I plan to try in the near future is to attach a big USB 
drive to a Linksys router running custom firmware, and use that as a Tahoe 
node.  That's one way to work around the NAT problem, and it also provides an 
easy way to add a "backup server" to a home network.  Put a CIFS
share on the router, let users drop any files they want backed up into that 
share, and then let the Tahoe node running on the router handle backing the 
data up to the grid.

There are all sorts of ways in which that latter idea may be impractical -- 
maybe the router doesn't have enough RAM to run a node, for example.  But 
it's definitely one I want to explore, and it's one in which the CPU is 
pretty anemic.

However, even if that use case were really crucial, I don't think you want to 
look into swapping out SHAD-256 for the file integrity computations.  A 
weaker hash doesn't seem to hurt key derivation, but file integrity is a 
different kettle of fish.

> In the year 2012 (hey, we're living in the future!), the new SHA-3
> hash function will be chosen.  That function will also, I hope,
> require about 1/3 as many CPU cycles as SHA-256 does while being a
> safer long-term bet.

If the result parallels the success of the AES selection process, it may be 
even faster than that.

	Shawn.

