[tahoe-dev] Interesting hashing result
Shawn Willden
shawn-tahoe at willden.org
Mon Feb 16 00:51:35 PST 2009
On Sunday 15 February 2009 09:04:33 am zooko wrote:
> Thanks for the information. Could you give me a couple more details
> -- what sizes of files are in your test set, how long does it take
> with cold cache, and how long with warm cache?
I tested on various sizes of files, but of course you can only see the
difference on large ones. I was using VMware virtual disk files, around 2
GiB in size. With cold cache, there's no significant time difference, since
the operation is I/O bound.
> Also, what is this signature that you are generating? Is it
> something that is generated and used just by your backup tool?
The rsync signature.
The way rsync works for remote synchronization is that it generates a
signature -- essentially a list of per-block checksums -- on one side and
then sends that signature to the other side. The other side uses it to
determine which parts of its local copy of the file differ from the sender's
version. It then requests updates of those portions, and that's how the file
is brought into sync.
The signature actually contains a pair of checksums per block: one weak and
rolling (an Adler-32-style sum) and one strong (MD4 -- which, of course,
isn't all that strong any more).
And, yes, the signature is obviously something my backup tool needs.
> Tahoe uses a strong hash function to generate the immutable file caps
> from the ciphertext so that we get this property:
>
> The "At Most One File Per Immutable File Cap" Property:
>
> There can exist (in this universe, in the foreseeable future) at
> most one file matching any given immutable file capability.
>
> In cryptographic terms, this is called "collision-resistance".
Yes, and I need the same property -- and so does rsync. If two different
files were to generate the same signature, then rsync would believe that
those two files were the same. In addition, a remote file being updated would
request the same set of patches to bring it into synchronization with either
of the two putative same-signature files -- and those patches would be wrong
in one of the cases.
Hmmm.
I've been basing my lack of concern about the MD4 weaknesses on the (assumed)
lack of concern in the rsync community. Your questions prompted me to look
into that assumption, however, and I see that there IS concern about MD4 in
the rsync community. Their (as yet unimplemented) solution seems to be to
provide a random seed to MD4 -- possibly derived from a strong hash of the
file content -- and also to send a whole-file strong hash for a final
integrity check.
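Something like this, I would guess (a sketch only; the function name and
seed handling are my own illustration, not rsync's, and hashlib exposes MD4
only if the underlying OpenSSL build provides it):

    import hashlib
    import os

    def seeded_block_sig(seed, block):
        # With a per-transfer random seed prepended, an attacker cannot
        # precompute a pair of blocks with colliding MD4 signatures,
        # because the seed isn't known until the transfer starts.
        md4 = hashlib.new('md4')   # requires OpenSSL with MD4 enabled
        md4.update(seed + block)
        return md4.digest()

    seed = os.urandom(16)          # chosen fresh for each file
    sig = seeded_block_sig(seed, b'block data here')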
MD4 is perfectly functional in a non-malicious environment, of course. The
probability of two "real" files colliding is negligible.
But what about in the presence of a malicious attacker who wants to corrupt a
file?
Suppose the attacker knows a file on your system and knows your convergence
secret. Thus, the attacker knows the CHK of that file and therefore the
storage ID. Suppose also that the hash algorithm is weak and that the
attacker can generate a different file with the same hash, key and SID.
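Roughly speaking -- glossing over Tahoe's tagged-hash details and exact
truncation -- the dependency chain looks like this, which is why knowing the
plaintext and the convergence secret is enough to predict the storage ID:

    import hashlib

    def sha256d(data):
        # SHA-256 applied twice, as Tahoe does.
        return hashlib.sha256(hashlib.sha256(data).digest()).digest()

    def chk_and_storage_id(convergence_secret, plaintext):
        # Simplified: convergent encryption derives the AES key from the
        # secret plus the file content, and the storage ID from the key.
        key = sha256d(convergence_secret + plaintext)[:16]
        storage_id = sha256d(key)[:16]
        return key, storage_id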
The attacker can then replace your file in the grid with his. This will
effectively destroy your file. You won't mistakenly download his file
thinking it's yours, though, because the UEB hash from your read cap will not
match his. Someone who has the encryption key but not the read cap, however,
could download the file and believe it to be the original.
I don't think this is a problem. Or at least, it's not a problem introduced
by the weak hash: if the attacker knows the storage ID of your file, he can
replace it in the grid anyway -- he doesn't need to be able to generate
another file that hashes to the same value.
> This property might not be needed for some applications, but it is
> very tricky for application writers to know when they can safely rely
> on a weaker property, such as the property that cryptographers call
> "second-pre-image-resistance".
Actually, MD4 is vulnerable to second-preimage attacks (succeeding with
probability 2^-56, or 2^-27 if the attacker can modify parts of the original
file), so it doesn't even have that property. As I explained above, I don't
think this is an issue, given that file integrity does not depend on the
encryption-key derivation; integrity is assured by the other hashes, which
are all SHAD-256.
> Unfortunately the SHA-256d secure hash function imposes a significant
> CPU overhead. Currently I think that all actual uses of Tahoe are
> I/O-bound and have ridiculously overpowered CPUs anyway, so I
> don't think this CPU-overhead is causing an actual performance
> problem for anyone in practice, but hopefully Tahoe will move into
> more and more use cases, and as it does this might become a problem.
The process is I/O bound on most machines. There are exceptions, however.
One such is my current home file server, which is only a 1 GHz Athlon, but
has six SATA drives in a software RAID-6 configuration. That machine can
sustain read rates of nearly 100 MBps, BUT uses basically 100% of the CPU to
do it. Every bit of additional computation overhead decreases the
throughput -- and I doubt the machine could do SHAD-256 calculations at 100
MBps even if it weren't doing anything else (I should test that).
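For what it's worth, a throwaway benchmark like this would answer the
question; the numbers will of course vary with the Python build and
hardware:

    import hashlib
    import time

    def sha256d(data):
        return hashlib.sha256(hashlib.sha256(data).digest()).digest()

    buf = b'\x00' * (16 * 1024 * 1024)    # hash 16 MiB per iteration
    iterations = 16
    start = time.time()
    for _ in range(iterations):
        sha256d(buf)
    elapsed = time.time() - start
    mbps = iterations * len(buf) / elapsed / 1e6
    print('SHAD-256 throughput: %.1f MBps' % mbps)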
Another use case that I plan to try in the near future is to attach a big USB
drive to a Linksys router running custom firmware, and use that as a Tahoe
node. That's one easy way to work around the NAT problem, and it also
provides an easy way to add a "backup server" to a home network. Put a CIFS
share on the router, let users drop any files they want backed up into that
share, and then let the Tahoe node running on the router handle backing the
data up to the grid.
There are all sorts of ways in which that latter idea may be impractical --
maybe the router doesn't have enough RAM to run a node, for example. But
it's definitely one I want to explore, and it's one in which the CPU is
pretty anemic.
However, even if that use case were really crucial, I don't think you want to
look into swapping out SHAD-256 for the file integrity computations. A
weaker hash doesn't seem to hurt key derivation, but file integrity is a
different kettle of fish.
> In the year 2012 (hey, we're living in the future!), the new SHA-3
> hash function will be chosen. That function will also, I hope,
> require about 1/3 as many CPU cycles as SHA-256 does while being a
> safer long-term bet.
If the result parallels the success of the AES selection process, it may be
even faster than that.
Shawn.