[tahoe-dev] [p2p-hackers] convergent encryption reconsidered

Jim McCoy mccoy at mad-scientist.com
Thu Mar 20 21:56:58 PDT 2008


On Mar 20, 2008, at 12:42 PM, zooko wrote:

 >   Security engineers have always appreciated that convergent
 >   encryption allows an attacker to perform a
 >   confirmation-of-a-file attack -- if the attacker already knows
 >   the full plaintext of a file, then they can check whether a
 >   given user has a copy of that file.

The truth of this depends on implementation details, and it is an
assertion that cannot be said to cover all or even most of the
potential use-cases for this technique.  This property only holds if
it is possible for the attacker to link a selected ciphertext/file to
a user.  Systems which use convergent encryption to populate a shared
storage pool _might_ have this property, but it is by no means a
certainty; if a system is implemented correctly it is not necessary
for users to expose their list of files in order to maintain this
shared storage space.
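
To make the distinction concrete, here is a minimal Python sketch of
the confirmation check in question, under a deliberately simplified
Tahoe-like model in which the convergent key is H(plaintext) and the
storage index is H(key); the real derivations differ in detail, so
treat these functions as illustrative assumptions only:

    import hashlib

    def storage_index(plaintext):
        # Simplified model: convergent key = H(plaintext),
        # storage index = H(key), truncated.
        key = hashlib.sha256(plaintext).digest()
        return hashlib.sha256(key).digest()[:16]

    def confirm_file(known_plaintext, pool_indexes):
        # An attacker who already holds the full plaintext recomputes
        # the index and tests for its presence in the pool -- but this
        # only tells them something about a *user* if the index can
        # also be linked to that user, which is the point above.
        return storage_index(known_plaintext) in pool_indexes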

 >   basically people can tell which files you are storing if they
 >   are "publically known" files, but they can't learn anything
 >   about your own personal files.

It sounds like you have a design problem.  If nodes that participate  
in the system can distinguish between publication and _re_-publication/ 
replication (or whatever you want to call the random sharing of  
arbitrary data blocks for the purposes of increasing file  
availability) then you have a problem.  If these two activities are  
indistinguishable then an observer knows you have some blocks of a  
file but should not be able to distinguish between you publishing the  
blocks and the act of re-distribution to increase block availability.

 >  The Learn-Partial-Information Attack [...]

A better title for this would be "Chosen-Plaintext attack on  
Convergent Encryption", since what you are talking about is really a  
chosen plaintext attack.  To be a bit more specific, this is really  
just a version of a standard dictionary attack.  The solution to this  
problem is to look at similar systems that suffered from dictionary  
attacks and see what solutions were created to solve the problem.

The most widely known and studied version of this is the old crypt()/ 
passwd problem.
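
A small Python sketch of that analogy may help: an unsalted hash can
be attacked with a single precomputed dictionary or rainbow table that
works against every user at once, while a per-entry salt forces the
attacker to redo the work for each target.  The hash function here is
an arbitrary choice, and the salt plays the same role an added secret
would play for convergent encryption:

    import hashlib, os

    def unsalted_hash(password):
        # One precomputed dictionary breaks every user at once.
        return hashlib.sha256(password).digest()

    def salted_hash(password, salt=None):
        # A per-entry salt makes precomputed tables useless; the
        # dictionary work has to be repeated for each salt value.
        salt = os.urandom(16) if salt is None else salt
        return salt, hashlib.sha256(salt + password).digest()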

 >   For another example, if you use Tahoe to backup your entire
 >   home directory, or your entire filesystem, then the attacker
 >   gains the opportunity to try to learn partial information about
 >   various files which are of predictable format but have
 >   sensitive fields in them, such as .my.cnf (MySQL configuration
 >   files), .htpasswd, .cvspass, .netrc, web browser cookie files,
 >   etc..

The problems with this imagined attack are twofold.  I will use your  
Tahoe example for my explanations because I have a passing familiarity  
with the architecture.  The first problem is isolating the original  
ciphertext in the pool of storage.  If a file is encrypted using  
convergent encryption and then run through an error-correction  
mechanism to generate a number of shares that make up the file, an  
attacker first needs to be able to isolate these shares to regenerate  
the original ciphertext.  FEC decoding speeds may be reasonably fast,  
but they are not without some cost.  If the storage pool is  
sufficiently large and you are doing your job to limit the ability of  
an attacker to see which blocks are linked to the same FEC operation  
then the computational complexity of this attack is significantly  
higher than you suggest.
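
For illustration, here is a rough Python sketch of the pipeline being
described; a toy SHA-256 counter keystream stands in for AES-CTR and a
simple stripe-plus-XOR-parity code stands in for Tahoe's real FEC, so
both are assumptions made purely to show why shares must first be
collected and recombined before any guessed ciphertext can be
compared:

    import hashlib

    def _keystream(key, length):
        # Toy counter-mode keystream standing in for AES-CTR.
        out = b""
        counter = 0
        while len(out) < length:
            out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return out[:length]

    def convergent_encrypt(plaintext):
        key = hashlib.sha256(plaintext).digest()   # key derived from content
        stream = _keystream(key, len(plaintext))
        return key, bytes(a ^ b for a, b in zip(plaintext, stream))

    def make_shares(ciphertext, k=3):
        # Stripe the ciphertext into k data shares plus one XOR-parity
        # share (a minimal erasure code).  An observer holding
        # scattered shares must gather and recombine them before a
        # confirmation or dictionary check against the original
        # ciphertext is even possible.
        padded = ciphertext + b"\x00" * ((-len(ciphertext)) % k)
        stripes = [padded[i::k] for i in range(k)]
        parity = stripes[0]
        for s in stripes[1:]:
            parity = bytes(a ^ b for a, b in zip(parity, s))
        return stripes + [parity]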

Assuming an all-seeing oracle who can watch every bit sent into the  
storage pool will get us around this first problem, but it does raise  
the bar for potential attackers.

The second problem an attacker now faces is deciding what sort of  
format a file might have, what the low-entropy content might be, and  
then filling in values for these unknowns.  If your block size is  
small (and I mean "really small" in the context of the sort of systems  
we are talking about) there might be only a few kilobits of entropy in  
the first couple of blocks of a file so either a rainbow-table attack  
on known file formats or a dedicated effort to grab a specific file  
might be possible, but this is by no means certain.  Increase your  
block size and this problem becomes much harder for the attacker.
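
A sketch of what such an attack looks like, reusing the simplified
storage-index model from the earlier sketch; the .my.cnf-style
template, the alphabet, and the field length are all hypothetical
values chosen only to show how the search space scales:

    import hashlib, itertools, string

    def storage_index(plaintext):
        # Same simplified model as before: H(H(plaintext)), truncated.
        return hashlib.sha256(hashlib.sha256(plaintext).digest()).digest()[:16]

    TEMPLATE = b"[client]\nuser=root\npassword=%s\n"   # hypothetical layout

    def guess_password_field(target_index, max_len=4):
        # Enumerate candidates for the single unknown, low-entropy
        # field and compare the resulting index against the observed
        # one.  Each extra unknown character (or a larger block size
        # mixing in more unknown content) multiplies the work.
        alphabet = (string.ascii_lowercase + string.digits).encode()
        for n in range(1, max_len + 1):
            for combo in itertools.product(alphabet, repeat=n):
                candidate = bytes(combo)
                if storage_index(TEMPLATE % candidate) == target_index:
                    return candidate
        return None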

 >  Defense Against Both Attacks
 >
 >   [...]
 >   However, we can do better than that by creating a secret value
 >   and mixing that value into the per-file encryption key (so
 >   instead of symmetric_key = H(plaintext), you have symmetric_key
 >   = H(added_secret, plaintext), where "," denotes an unambiguous
 >   encoding of both operands). This idea is due to Brian Warner
 >   and Drew Perttula.

Bad idea, for a couple of reasons.  The first problem is that you are  
assuming the added secret contributes a significant amount of entropy,  
and the second is that you are throwing out the advantage of  
convergent encryption.  If the secret is shared across multiple files  
then I can run a known-plaintext attack on your system using a file  
that I assume I have in common with my target (e.g. a standard, small,  
OS file) to get their added_secret and then once I know my target's  
secret I move on to the file I really want to go after.  If your  
userbase consists of paranoid cypherpunks then the added secret just  
becomes a lookup in a rainbow table, and if they are the sort of  
userbase that some people like to call "the consumer market" then the  
entropy being added here is negligible.
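
A sketch of that attack on a shared, low-entropy added secret, again
under the assumed model that the observable identifier is derived as
H(H(added_secret || plaintext)); the common file and the candidate
wordlist are whatever the attacker chooses to supply:

    import hashlib

    def index_with_secret(secret, plaintext):
        # Assumed model: key = H(secret || plaintext), index = H(key).
        key = hashlib.sha256(secret + plaintext).digest()
        return hashlib.sha256(key).digest()[:16]

    def recover_secret(target_index, known_common_file, candidate_secrets):
        # Try each candidate secret against a file the attacker assumes
        # the target also stores (e.g. a stock OS file).  Once the
        # secret is recovered, every other file keyed with it is back
        # to plain convergent encryption from the attacker's view.
        for candidate in candidate_secrets:
            if index_with_secret(candidate, known_common_file) == target_index:
                return candidate
        return None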

A better solution would be to look at something like bcrypt() (see "A  
Future-Adaptable Password Scheme") and use this mechanism for files  
below a certain size threshold.  I do not think that the expensive  
key-scheduling operation precludes the use of CTR mode, so you are  
safe on that side of things, but it would mean that you step away from  
AES...
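
A sketch of that direction, using the standard library's PBKDF2 as a
stand-in for bcrypt's adaptable cost; the size threshold and iteration
count are illustrative assumptions, and because the salt is fixed,
identical plaintexts still converge on the same key:

    import hashlib

    SMALL_FILE_THRESHOLD = 64 * 1024   # assumed cut-off for "small" files
    ITERATIONS = 200000                # tunable cost parameter

    def convergence_key(plaintext):
        digest = hashlib.sha256(plaintext).digest()
        if len(plaintext) >= SMALL_FILE_THRESHOLD:
            return digest              # large, high-entropy files: cheap path
        # Small files: run the content digest through an expensive KDF
        # so that each dictionary guess costs the full iteration count.
        return hashlib.pbkdf2_hmac("sha256", digest, b"convergence", ITERATIONS)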

 >  A Comment About Space Savings
 >
 >   One of the original motivations for convergent encryption, as
 >   expressed in Doceur's technical report, is to conserve storage
 >   space by coalescing identical files. [...] My suspicion is
 >   that the gains available for modern uses are nowhere near that
 >   good -- I wouldn't be surprised if it were less than 5% [...]

I would be very, very surprised if you were correct.

 >   My reasoning is:
 >    1. The Doceur et al. test set was probably the ideal test set
 >       to highlight the advantages of coalescing: it was hundreds
 >       of workstations, which probably all followed the same
 >       Information Technology Department policies, which were
 >       probably homogeneous, and which were probably packed with
 >       many of the same software products (especially Microsoft
 >       products).

Are you suggesting that the rest of the world uses software products  
that are radically different from those used in a corporate  
environment? :)  We all use a small set of operating systems that  
share the same basic set of files.  We all use the same small set of  
application software, and even if there is a "long tail" in  
application software, the amount of storage it occupies compared to  
the operating system and the most common applications is likely to be  
a very small fraction of the total being stored.

 >    2. Movies and music. In 2002, on Windows workstations in an
 >       office at Microsoft, there probably weren't any movies. For
 >       some of the probable use cases for Tahoe, movies and
 >       collections of music are the largest component. The
 >       presence of movie files, which are large, could increase or
 >       decrease the proportion of savings from coalescing files,
 >       depending on the degree to which those files are shared
 >       among multiple users, which brings us to the next point:

This was actually a concern of mine when originally designing the  
system, but for a different reason.  At that time the large content  
files we were concerned with were music.  We knew that the amount of  
available music was a bounded total, but at that point in time the  
world had not converged on a standard set of codecs and standard  
bitrates for these codecs.  We worried about the fact that the same  
song would end up with multiple copies in the pool because some people  
would use LAME, some would use WinAMP, others a pirated Fraunhofer  
codec, and none of them would use the same bitrate.  This problem  
still exists somewhat, but the fact that most people use only a few  
packages for CD ripping, codecs have standardized, and most encoding  
uses only a few standard bitrates makes the job much easier and the  
storage savings to be realized significantly larger.

When it comes to movies, the fact that a lot of them are *cough*  
"distributed" from a few master copies and seldom re-encoded means  
that you are likely to see many copies of the same bits when it comes  
to this category.  DVD ripping is also a fairly standardized process,  
so you win again.

Given the fact that people are not buying terabyte drives to store  
their scans of Proust folios I think we can make an educated guess as  
to what is taking up a lot of space on these drives and how much  
duplication there is among these bits.

 >    3. The Long Tail; Today there seems to be a more diverse set
 >      of things taking up a bigger portion of the total.

For all the talk of "the long tail" it seems that we all tend to visit  
the same set of YouTube clips and archive a lot of the same  
information.  More info is being generated by diverse groups, but the  
amount being generated still seems to be growing a lot slower than  
available storage capacity.  One side-effect of the long tail is that  
as groups become more diverse in their tastes in information, they  
still tend to share the information that is specific to their  
interests.   
You might see the storage advantages decline somewhat until you have a  
significant userbase, but when considered in relation to point #2 this  
is really a minor issue.

 >    4. Personally-produced photos and movies. It seems like some
 >       of the files that people are most eager to back up and to
 >       share are the product of their own cameras. Such files will
 >       typically be shared by only one or a few users.

This is the only point I can agree upon.  Personally-generated media  
(which is distinct from "user-generated content") is what makes  
individual systems unique.  On a generic user's computer this breaks  
down into three basic categories:

	-Personal documents
	-Photos
	-"Home" movies

The size of the first is negligible.  The second is large and getting  
larger, and the third is probably a lot smaller than most of us  
think.  Photos are really the only category that will skew this result.

In conclusion, I would have to say that the sky is not really falling.   
The attack outlined is a known problem that can be solved in ways that  
still preserve the advantages of convergent encryption.


