[tahoe-dev] convergent encryption reconsidered
zooko
zooko at zooko.com
Thu Mar 20 12:42:18 PDT 2008
(This is an ASCII rendering of https://zooko.com/
convergent_encryption_reconsidered.html .)
Convergent Encryption Reconsidered
Written by Zooko Wilcox-O'Hearn, documenting ideas due to Drew
Perttula, Brian Warner, and Zooko Wilcox-O'Hearn, 2008-03-20.
Abstract
Convergent encryption is already known to suffer from a
confirmation-of-a-file attack. We show that it suffers also
from a learn-partial-information attack. The conditions under
which this attack works cannot be predicted by a computer
program nor by an unsophisticated user. We propose a solution
which trades away part of the space savings benefits of
convergent encryption in order to prevent this new attack. Our
defense also prevents the old attack. The issues are presented
in the context of the Tahoe Least-AUthority Grid File System, a
secure decentralized filesystem.
Background -- The Confirmation-Of-A-File Attack
Convergent encryption, also known as content hash keying, was
first mentioned by John Pettitt on the cypherpunks list in 1996
[1], was used by Freenet [2] and Mojo Nation [3] in 2000, and
was analyzed in a technical report by John Doceur et al. in
2002 [4]. Today it is used by at least Freenet, GNUnet [5],
flud [6], and the Tahoe Least-AUthority Grid File System [7].
The remainder of this note will focus on the Tahoe LAUGFS
filesystem. The use of convergent encryption in other systems
may have different consequences than described here, because of
the different use cases or added defenses that those systems
may have.
Convergent encryption is simply encrypting a file using a
symmetric encryption key which is the secure hash of the
plaintext of the file.
Security engineers have always appreciated that convergent
encryption allows an attacker to perform a
confirmation-of-a-file attack -- if the attacker already knows
the full plaintext of a file, then they can check whether a
given user has a copy of that file.
Whether this confirmation-of-a-file attack is a security or
privacy problem depends on the situation. If you want to store
banned books or political pamphlets without attracting the
attention of an oppressive government, or store pirated copies
of music or movies without attracting the attention of
copyright holders, then the confirmation-of-a-file attack is
potentially a critical problem. On the other hand, if the
sensitive parts of your data are secret personal things like
your bank account number, passwords, and so forth, then it
isn't a problem. Or so I -- and as far as I know everyone else
-- thought until March 16, 2008.
I had planned to inform users of the current version of Tahoe
-- version 0.9.0 -- about the confirmation-of-a-file attack by
adding a FAQ entry:
Q: Can anyone else see the contents of files that I have not
shared?
A: The files that you store are encrypted so that nobody can
see a file's contents (unless of course you intentionally
share the file with them). However, if the file that you
store is something that someone has already seen, such as if
it is a file that you downloaded from the Internet in the
first place, then they can recognize it as being the same
file when you store it, even though it is encrypted. So
basically people can tell which files you are storing if they
are "publically known" files, but they can't learn anything
about your own personal files.
However, four days ago (on March 16, 2008) Drew Perttula and
Brian Warner came up with an attack that shows that the above
FAQ is wrong.
The Learn-Partial-Information Attack
They extended the confirmation-of-a-file attack into the
learn-partial-information attack. In this new attack, the
attacker learns some information from the file. This is done by
trying possible values for unknown parts of a file and then
checking whether the result matches the observed ciphertext.
For example, if you store a document such as a form letter from
your bank, which contains a few pages of boilerplate legal text
plus a few important parts, such as your bank account number
and password, then an attacker who knows the boilerplate might
be able to learn your account number and password.
For another example, if you use Tahoe to backup your entire
home directory, or your entire filesystem, then the attacker
gains the opportunity to try to learn partial information about
various files which are of predictable format but have
sensitive fields in them, such as .my.cnf (MySQL configuration
files), .htpasswd, .cvspass, .netrc, web browser cookie files,
etc.. In some cases, files such as these will contain too much
entropy from the perspective of the attacker to allow this
attack, but in other cases the attacker will know, or be able
to guess, most of the fields, and brute force the remainder.
Designers of these systems -- MySQL, Apache, etc. -- know that
user secrets are often guessable (increasingly often,
nowadays), which is why such applications limit the rate at
which clients can attempt to login and try passwords. From a
cryptographer's perspective, using Tahoe with convergent
encryption on all files allows an attacker to use off-line
attacks instead of on-line attacks, which renders vulnerable
some secrets which were formerly safe.
The amount of information that the attacker can learn using
this attack is upper-bounded by two things: first, by the
attacker's computational capacity. If the attacker can perform
no more than 2^64 computations, then he can learn no more than
64 bits worth of information from any file. Note that he does
not have to spend all of this computation attacking a single
file owned by a single user -- instead he can attack many users
and many files in parallel with the same computation. (He can
also burn his resulting values into a rainbow table on DVD and
sell it on the Web so that the buyer can attack other users
using the result of the attacker's computation. This is
currently done for hashed or encrypted passwords.)
The second upper bound is the amount of entropy in the file
from the attacker's perspective. If the file contains more
entropy, from the attacker's perspective, than his
computational capacity, then he learns nothing about the file
(except that the file was not any of the things that he guessed
that it might be). Note that the amount of entropy is relative
to the attacker, not intrinsic to the file. You may think that
a PDF file with millions of bytes in it would have a lot of
entropy, but if it happens to be a form letter from your bank,
and if the attacker happened to receive the same letter with
only a few fields different, then it contains little entropy to
him. This subtlety may underlie the failure of many, including
myself, to understand this issue sooner. Observe also that it
is not possible for a computer program to determine whether a
given file has sufficient entropy -- the answer to that
question depends on what the attacker knows and on how
sophisticated and accurate is his model of the victim.
(Note: ideas like this are often Obvious in Retrospect. If this
one was Obvious in Forespect to anyone, I would appreciate
references. I've scoured the citations mentioned in this note
and found no hint of it.)
Defense Against Both Attacks
What can we do about this? Well, in Tahoe the application which
uses the secure filesystem, or even the human user which uses
the application, can choose to use convergent encryption or not
on a per-file basis, and there is no backwards-compatibility
problem (Tahoe v0.9.0 will be able read files which are written
with or without convergent encryption).
However, we can do better than that by creating a secret value
and mixing that value into the per-file encryption key (so
instead of symmetric_key = H(plaintext), you have symmetric_key
= H(added_secret, plaintext), where "," denotes an unambiguous
encoding of both operands). This idea is due to Brian Warner
and Drew Perttula.
The set of people who know this added secret is the set of
people whose files can converge, and it is also the set of
people who are able to perform either of the two attacks
described above. This means that attackers with whom you do not
share this added secret are prevented from performing either
attack on your files.
One policy which is easy to implement is for each user to keep
the added secret to themselves. This would make successive
uploads of the same file by the same user converge, but would
not achieve convergence with any other user's files, and would
fully protect the users against these two attacks.
There are various possibilities for how to automatically decide
whether or not to use convergent encryption on a given file or
set of files -- heuristics based on file size, name or
location, white lists or black lists of files, and perhaps ways
that the user interface could make the alternatives apparent to
the user. These ideas are out of the scope of this note (and,
for now, of the Tahoe decentralized filesystem itself -- the
application written atop Tahoe can decide).
It should be noted that if the exact size of a file is divulged
then this information can be used for a confirmation-of-a-file
attack, if there are few likely files of that exact size.
Adding padding to files before encryption can substantially
reduce the effectiveness of that attack vector, as is already
well known.
A Comment About Space Savings
One of the original motivations for convergent encryption, as
expressed in Doceur's technical report, is to conserve storage
space by coalescing identical files. That technical report
cited an experiment by Doceur et al. at Microsoft in 2002 which
showed that coalescing all files on a set of 585 Windows
workstations resulted in a 50% space savings. My suspicion is
that the gains available for modern uses are nowhere near that
good -- I wouldn't be surprised if it were less than 5% for
typical uses of the Tahoe Least-AUthority Grid File System.
My reasoning is:
1. The Doceur et al. test set was probably the ideal test set
to highlight the advantages of coalescing: it was hundreds
of workstations, which probably all followed the same
Information Technology Department policies, which were
probably homogeneous, and which were probably packed with
many of the same software products (especially Microsoft
products).
2. Movies and music. In 2002, on Windows workstations in an
office at Microsoft, there probably weren't any movies. For
some of the probable use cases for Tahoe, movies and
collections of music are the largest component. The
presence of movie files, which are large, could increase or
decrease the proportion of savings from coalescing files,
depending on the degree to which those files are shared
among multiple users, which brings us to the next point:
3. The Long Tail; Today there seems to be a more diverse set
of things taking up a bigger portion of the total.
4. Personally-produced photos and movies. It seems like some
of the files that people are most eager to back up and to
share are the product of their own cameras. Such files will
typically be shared by only one or a few users.
The question of how much space is saved by convergence in a
typical use of Tahoe remains open. Hopefully someone will
perform empirical measurements to answer this question in the
near future.
Conclusions
Convergent encryption renders user files vulnerable to a
confirmation-of-a-file attack. We already knew that. It also
renders user files vulnerable to a learn-partial-information
attack in subtle ways. We didn't think of this until now. My
search of the literature suggests that nobody else did either.
The addition of an added secret to be mixed into the symmetric
encryption key allows you to limit the scope of users with whom
your files will converge. Attackers who are outside of this set
of users cannot use the new learn-partial-information, nor can
they use the old confirmation-of-a-file attack.
The Tahoe Least-AUthority Grid File System will probably
release v0.9.1, to succeed the current v0.9.0 release, which
will add the feature of mixing in an added secret to the
symmetric encryption key when doing convergent encryption,
which will turn that feature on by default, and which will
perhaps offer a way through the API and or UI to control that
feature on a per-user and per-file basis.
Citations
[1] http://cypherpunks.venona.com/date/1996/02/msg02013.html
[2]
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.4919
[3]
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.59.9607
[4]
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.8.7586
[5] http://gnunet.org
[6] http://flud.org
[7] http://allmydata.org
Zooko
Last modified: Sat May 5 15:13:38 MDT 2007
More information about the tahoe-dev
mailing list