[tahoe-dev] Global deduplication of encrypted files

Kenny Taylor kenny at corvettekenny.com
Thu May 5 14:21:43 PDT 2011


Regarding encrypted file stores and deduplication, SpiderOak published a 
good article on file-level deduplication of encrypted files:

https://spideroak.com/blog/20100827150530-why-spideroak-doesnt-de-duplicate-data-across-users-and-why-it-should-worry-you-if-we-did

Wuala seems to use the method SpiderOak cautions against.  When a user 
tries to upload a file, the client app encrypts it, hashes it, and asks the 
network if an encrypted file already exists with the same hash.  If so, the 
existing file is linked into the user's account (no upload needed!).  It's 
a neat concept, but it has one big disadvantage:  the network can see each 
user who is sharing a file with a given hash.
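
Here's a rough Python sketch of that lookup flow.  It assumes convergent 
encryption (the key is derived from the file itself) so that identical 
plaintexts encrypt to identical ciphertexts across users, which is what 
makes the cross-user hash match possible; the "network" object and its 
methods are made-up placeholders, not any real Wuala or Tahoe API:

    import hashlib

    def convergent_encrypt(plaintext: bytes) -> bytes:
        # Placeholder cipher: a real client would use AES keyed by the
        # derived key.  The point is only that the result is deterministic.
        key = hashlib.sha256(plaintext).digest()
        return bytes(b ^ key[i % len(key)] for i, b in enumerate(plaintext))

    def upload_file(plaintext: bytes, network) -> str:
        ciphertext = convergent_encrypt(plaintext)
        file_hash = hashlib.sha256(ciphertext).hexdigest()
        if network.has_file(file_hash):           # hypothetical query
            network.link_to_account(file_hash)    # existing copy linked, no upload
        else:
            network.store(file_hash, ciphertext)
            network.link_to_account(file_hash)
        return file_hash

The has_file() query is exactly where the privacy leak lives: whoever 
answers it learns which accounts end up linked to a given hash.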

So global file-level deduplication = bad.  That's not necessarily true for 
block-level dedup, though.  Let's say we break a file into 8 kB chunks, 
encrypt each chunk with the user's private key, then push those chunks to 
the network.  The same file uploaded by different users would produce 
completely different block sets.  Maybe each storage node maintains a hash 
table of the blocks it's storing, so when the client node pushes out a 
block, it first queries the known storage nodes to see if someone is 
already holding a block with that hash.  The block size might need to be 
<= 4 kB for that to be effective.
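
In rough Python, something like the sketch below.  The block size, the 
per-user key handling, and the has_block()/store_block() calls are all 
assumptions for illustration, not existing tahoe interfaces:

    import hashlib

    BLOCK_SIZE = 4 * 1024  # <= 4 kB, as suggested above

    def encrypt_block(block: bytes, user_key: bytes) -> bytes:
        # Placeholder for a real cipher keyed to the user's own key.
        return bytes(b ^ user_key[i % len(user_key)] for i, b in enumerate(block))

    def push_file(plaintext: bytes, user_key: bytes, storage_nodes) -> list:
        block_hashes = []
        for offset in range(0, len(plaintext), BLOCK_SIZE):
            block = encrypt_block(plaintext[offset:offset + BLOCK_SIZE], user_key)
            h = hashlib.sha256(block).hexdigest()
            # Ask the known storage nodes whether a block with this hash exists.
            if not any(node.has_block(h) for node in storage_nodes):
                # Nobody holds it yet; pick a node and store the block.
                storage_nodes[0].store_block(h, block)
            block_hashes.append(h)
        return block_hashes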

I realize that's a big departure from the existing tahoe architecture.  
Food for thought :)
