[tahoe-dev] Global deduplication of encrypted files
Brian Warner
warner at lothar.com
Sat May 7 11:17:49 PDT 2011
On 5/5/11 2:21 PM, Kenny Taylor wrote:
> So global file-level deduplication = bad. Not necessarily true for
> block-level dedup. Let's say we break a file into 8kB chunks, encrypt
> each chunk to the user's private key, then push those chunks to the
> network. The same file uploaded by different users would produce
> completely different block sets. Maybe each storage node maintains a
> hash table of the blocks it's storing. So when the client node pushes
> out a block, it queries the known storage nodes to see if someone is
> already holding a block with that hash. The block size might need to
> be <= 4kB for that to be effective.
>
> I realize that's a big departure from the existing tahoe architecture.
> Food for thought :)
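For concreteness, here's a rough sketch of that query-before-upload
flow in Python (encrypt_block, have_block(), and put_block() are
hypothetical stand-ins, not existing Tahoe APIs):

    import hashlib

    CHUNK_SIZE = 8 * 1024  # 8kB blocks, per the example above

    def chunk_file(path):
        # Yield fixed-size chunks of the file; the last may be short.
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    return
                yield chunk

    def upload_with_dedup(path, encrypt_block, storage_servers):
        # Query-before-upload: only push blocks that no known server
        # already holds. encrypt_block() is whatever per-user encryption
        # the scheme uses; block placement policy is ignored here.
        for chunk in chunk_file(path):
            block = encrypt_block(chunk)
            block_id = hashlib.sha256(block).hexdigest()
            if not any(s.have_block(block_id) for s in storage_servers):
                storage_servers[0].put_block(block_id, block)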
Hm, if each client is asking the storage nodes about the hash of each
block, then you're basically publishing the whole list of block
identifiers (even though the contents of each block are only known to
someone else who already has those contents). And you're probably
publishing them in order, since it's easiest to upload the file from
start to finish. So the storage servers get to see even more information
about who's uploading what than if you hash whole files at once. They
could determine that both Alice and Bob are uploading files that share
most of the same contents, but only differ in some specific blocks.
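Concretely, the server only needs to do something like this with the
two identifier streams it observes (the list-of-hashes input is just
for illustration):

    def compare_uploads(alice_ids, bob_ids):
        # What a storage server can compute from two clients' queries:
        # how many block identifiers they share, and which positions
        # differ (zip truncates to the shorter list; fine for a sketch).
        shared = set(alice_ids) & set(bob_ids)
        differing = [i for i, (a, b) in enumerate(zip(alice_ids, bob_ids))
                     if a != b]
        return len(shared), differing

    # e.g. (9, [0]) says "identical except for block 0" -- exactly the
    # kind of correlation described above.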
And... the partial-information-guessing attack is most interesting
against something like an /etc/mysql.conf file, which could be large,
but varies only in a single line, and that line contains a
relatively-low-entropy password. Suppose this means that all copies of
mysql.conf are 10 blocks long (4kB each), and all are identical except
for the first block. If you publish the hashes of each block, the server
gets to use the last 9 blocks to deduce that this is a mysql.conf file
for sure, then does a dictionary attack on the hash of the first block
to figure out your password. If you did whole-file hashing, the server
would have less evidence that this was a mysql.conf file (maybe
filesize, or sequencing of this upload relative to the start of what
looks like a whole-disk backup), and could therefore not justify
spending as much CPU time on the attack. Also, if there were two
passwords in the file, say one near the start and one near the end, then
whole-file hashing would necessitate a harder dictionary attack (against
both passwords at once), while smaller block sizes would let the
attacker perform two separate (easier) attacks.
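Here's a sketch of that per-block dictionary attack, assuming the
published block identifier is derived from the plaintext alone (no
convergence secret); the %s-template and padding details are elided to
keep the example short:

    import hashlib

    def guess_password(published_block_id, block_template, candidates):
        # block_template is the known first 4kB of mysql.conf with a
        # %s hole where the password goes. Each guess is formatted in,
        # hashed the same way the client would hash it, and compared
        # against the identifier the client published.
        for pw in candidates:
            candidate_block = (block_template % pw).encode("utf-8")
            if hashlib.sha256(candidate_block).hexdigest() == published_block_id:
                return pw  # the published hash confirmed the guess
        return None

The two-password case above is the difference between searching the
product of two dictionaries at once and running two independent,
much cheaper searches like this one.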
I'm still in favor of the convergence secret and whole-file hashing,
especially if there's a UI which lets users decide when they want
privacy and when they want deduplication. The lack of a comfortable
heuristic is what prompted us to make all Tahoe uploads use the
convergence secret, but if I were building a backup product these days,
I'd consider using OS-directory-layout-specific cues to distinguish
between "public" and "private" files. On Linux, /usr could be public,
while /home and /etc would be private. On OS X, /Applications and
/System could be public (and more likely to be a dedup win), while of
course /Users is private (and less likely to be a win). This would have
to be implemented in the CLI tools that drive the backup, since by the
time the upload's file data arrives at the Tahoe client node (via an
HTTP PUT to /uri), the filename has been lost.
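A sketch of what that policy might look like in the backup CLI,
assuming a helper that picks the convergence secret per path (none of
these names exist in Tahoe today):

    import os

    # Directories whose contents tend to be identical across machines,
    # following the examples above; everything else defaults to private.
    PUBLIC_PREFIXES = ("/usr/", "/Applications/", "/System/")

    def convergence_secret_for(path, per_user_secret):
        # Returning an empty secret here is shorthand for "use a shared,
        # well-known convergence secret so identical files dedup across
        # users"; /home, /etc, /Users, etc. keep the per-user secret.
        if os.path.abspath(path).startswith(PUBLIC_PREFIXES):
            return b""
        return per_user_secret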
cheers,
-Brian