[tahoe-dev] convergent encryption reconsidered

Thu Mar 20 12:42:18 PDT 2008

(This is an ASCII rendering of https://zooko.com/ 
convergent_encryption_reconsidered.html .)

                  Convergent Encryption Reconsidered

    Written by Zooko Wilcox-O'Hearn, documenting ideas due to Drew
    Perttula, Brian Warner, and Zooko Wilcox-O'Hearn, 2008-03-20.

   Abstract

    Convergent encryption is already known to suffer from a
    confirmation-of-a-file attack. We show that it suffers also
    from a learn-partial-information attack. The conditions under
    which this attack works cannot be predicted by a computer
    program nor by an unsophisticated user. We propose a solution
    which trades away part of the space savings benefits of
    convergent encryption in order to prevent this new attack. Our
    defense also prevents the old attack. The issues are presented
    in the context of the Tahoe Least-AUthority Grid File System, a
    secure decentralized filesystem.

   Background -- The Confirmation-Of-A-File Attack

    Convergent encryption, also known as content hash keying, was
    first mentioned by John Pettitt on the cypherpunks list in 1996
    [1], was used by Freenet [2] and Mojo Nation [3] in 2000, and
    was analyzed in a technical report by John Doceur et al. in
    2002 [4]. Today it is used by at least Freenet, GNUnet [5],
    flud [6], and the Tahoe Least-AUthority Grid File System [7].
    The remainder of this note will focus on the Tahoe LAUGFS
    filesystem. The use of convergent encryption in other systems
    may have different consequences than described here, because of
    the different use cases or added defenses that those systems
    may have.

    Convergent encryption is simply encrypting a file using a
    symmetric encryption key which is the secure hash of the
    plaintext of the file.

    Security engineers have always appreciated that convergent
    encryption allows an attacker to perform a
    confirmation-of-a-file attack -- if the attacker already knows
    the full plaintext of a file, then they can check whether a
    given user has a copy of that file.

    Whether this confirmation-of-a-file attack is a security or
    privacy problem depends on the situation. If you want to store
    banned books or political pamphlets without attracting the
    attention of an oppressive government, or store pirated copies
    of music or movies without attracting the attention of
    copyright holders, then the confirmation-of-a-file attack is
    potentially a critical problem. On the other hand, if the
    sensitive parts of your data are secret personal things like
    your bank account number, passwords, and so forth, then it
    isn't a problem. Or so I -- and as far as I know everyone else
    -- thought until March 16, 2008.

    I had planned to inform users of the current version of Tahoe
    -- version 0.9.0 -- about the confirmation-of-a-file attack by
    adding a FAQ entry:

      Q: Can anyone else see the contents of files that I have not
      shared?

      A: The files that you store are encrypted so that nobody can
      see a file's contents (unless of course you intentionally
      share the file with them). However, if the file that you
      store is something that someone has already seen, such as if
      it is a file that you downloaded from the Internet in the
      first place, then they can recognize it as being the same
      file when you store it, even though it is encrypted. So
      basically people can tell which files you are storing if they
      are "publically known" files, but they can't learn anything
      about your own personal files.

    However, four days ago (on March 16, 2008) Drew Perttula and
    Brian Warner came up with an attack that shows that the above
    FAQ is wrong.

   The Learn-Partial-Information Attack

    They extended the confirmation-of-a-file attack into the
    learn-partial-information attack. In this new attack, the
    attacker learns some information from the file. This is done by
    trying possible values for unknown parts of a file and then
    checking whether the result matches the observed ciphertext.
    For example, if you store a document such as a form letter from
    your bank, which contains a few pages of boilerplate legal text
    plus a few important parts, such as your bank account number
    and password, then an attacker who knows the boilerplate might
    be able to learn your account number and password.

    For another example, if you use Tahoe to backup your entire
    home directory, or your entire filesystem, then the attacker
    gains the opportunity to try to learn partial information about
    various files which are of predictable format but have
    sensitive fields in them, such as .my.cnf (MySQL configuration
    files), .htpasswd, .cvspass, .netrc, web browser cookie files,
    etc.. In some cases, files such as these will contain too much
    entropy from the perspective of the attacker to allow this
    attack, but in other cases the attacker will know, or be able
    to guess, most of the fields, and brute force the remainder.

    Designers of these systems -- MySQL, Apache, etc. -- know that
    user secrets are often guessable (increasingly often,
    nowadays), which is why such applications limit the rate at
    which clients can attempt to login and try passwords. From a
    cryptographer's perspective, using Tahoe with convergent
    encryption on all files allows an attacker to use off-line
    attacks instead of on-line attacks, which renders vulnerable
    some secrets which were formerly safe.

    The amount of information that the attacker can learn using
    this attack is upper-bounded by two things: first, by the
    attacker's computational capacity. If the attacker can perform
    no more than 2^64 computations, then he can learn no more than
    64 bits worth of information from any file. Note that he does
    not have to spend all of this computation attacking a single
    file owned by a single user -- instead he can attack many users
    and many files in parallel with the same computation. (He can
    also burn his resulting values into a rainbow table on DVD and
    sell it on the Web so that the buyer can attack other users
    using the result of the attacker's computation. This is
    currently done for hashed or encrypted passwords.)

    The second upper bound is the amount of entropy in the file
    from the attacker's perspective. If the file contains more
    entropy, from the attacker's perspective, than his
    computational capacity, then he learns nothing about the file
    (except that the file was not any of the things that he guessed
    that it might be). Note that the amount of entropy is relative
    to the attacker, not intrinsic to the file. You may think that
    a PDF file with millions of bytes in it would have a lot of
    entropy, but if it happens to be a form letter from your bank,
    and if the attacker happened to receive the same letter with
    only a few fields different, then it contains little entropy to
    him. This subtlety may underlie the failure of many, including
    myself, to understand this issue sooner. Observe also that it
    is not possible for a computer program to determine whether a
    given file has sufficient entropy -- the answer to that
    question depends on what the attacker knows and on how
    sophisticated and accurate is his model of the victim.

    (Note: ideas like this are often Obvious in Retrospect. If this
    one was Obvious in Forespect to anyone, I would appreciate
    references. I've scoured the citations mentioned in this note
    and found no hint of it.)

   Defense Against Both Attacks

    What can we do about this? Well, in Tahoe the application which
    uses the secure filesystem, or even the human user which uses
    the application, can choose to use convergent encryption or not
    on a per-file basis, and there is no backwards-compatibility
    problem (Tahoe v0.9.0 will be able read files which are written
    with or without convergent encryption).

    However, we can do better than that by creating a secret value
    and mixing that value into the per-file encryption key (so
    instead of symmetric_key = H(plaintext), you have symmetric_key
    = H(added_secret, plaintext), where "," denotes an unambiguous
    encoding of both operands). This idea is due to Brian Warner
    and Drew Perttula.

    The set of people who know this added secret is the set of
    people whose files can converge, and it is also the set of
    people who are able to perform either of the two attacks
    described above. This means that attackers with whom you do not
    share this added secret are prevented from performing either
    attack on your files.

    One policy which is easy to implement is for each user to keep
    the added secret to themselves. This would make successive
    uploads of the same file by the same user converge, but would
    not achieve convergence with any other user's files, and would
    fully protect the users against these two attacks.

    There are various possibilities for how to automatically decide
    whether or not to use convergent encryption on a given file or
    set of files -- heuristics based on file size, name or
    location, white lists or black lists of files, and perhaps ways
    that the user interface could make the alternatives apparent to
    the user. These ideas are out of the scope of this note (and,
    for now, of the Tahoe decentralized filesystem itself -- the
    application written atop Tahoe can decide).

    It should be noted that if the exact size of a file is divulged
    then this information can be used for a confirmation-of-a-file
    attack, if there are few likely files of that exact size.
    Adding padding to files before encryption can substantially
    reduce the effectiveness of that attack vector, as is already
    well known.

   A Comment About Space Savings

    One of the original motivations for convergent encryption, as
    expressed in Doceur's technical report, is to conserve storage
    space by coalescing identical files. That technical report
    cited an experiment by Doceur et al. at Microsoft in 2002 which
    showed that coalescing all files on a set of 585 Windows
    workstations resulted in a 50% space savings. My suspicion is
    that the gains available for modern uses are nowhere near that
    good -- I wouldn't be surprised if it were less than 5% for
    typical uses of the Tahoe Least-AUthority Grid File System.

    My reasoning is:
     1. The Doceur et al. test set was probably the ideal test set
        to highlight the advantages of coalescing: it was hundreds
        of workstations, which probably all followed the same
        Information Technology Department policies, which were
        probably homogeneous, and which were probably packed with
        many of the same software products (especially Microsoft
        products).
     2. Movies and music. In 2002, on Windows workstations in an
        office at Microsoft, there probably weren't any movies. For
        some of the probable use cases for Tahoe, movies and
        collections of music are the largest component. The
        presence of movie files, which are large, could increase or
        decrease the proportion of savings from coalescing files,
        depending on the degree to which those files are shared
        among multiple users, which brings us to the next point:
     3. The Long Tail; Today there seems to be a more diverse set
        of things taking up a bigger portion of the total.
     4. Personally-produced photos and movies. It seems like some
        of the files that people are most eager to back up and to
        share are the product of their own cameras. Such files will
        typically be shared by only one or a few users.

    The question of how much space is saved by convergence in a
    typical use of Tahoe remains open. Hopefully someone will
    perform empirical measurements to answer this question in the
    near future.

   Conclusions

    Convergent encryption renders user files vulnerable to a
    confirmation-of-a-file attack. We already knew that. It also
    renders user files vulnerable to a learn-partial-information
    attack in subtle ways. We didn't think of this until now. My
    search of the literature suggests that nobody else did either.

    The addition of an added secret to be mixed into the symmetric
    encryption key allows you to limit the scope of users with whom
    your files will converge. Attackers who are outside of this set
    of users cannot use the new learn-partial-information, nor can
    they use the old confirmation-of-a-file attack.

    The Tahoe Least-AUthority Grid File System will probably
    release v0.9.1, to succeed the current v0.9.0 release, which
    will add the feature of mixing in an added secret to the
    symmetric encryption key when doing convergent encryption,
    which will turn that feature on by default, and which will
    perhaps offer a way through the API and or UI to control that
    feature on a per-user and per-file basis.

     Citations

    [1] http://cypherpunks.venona.com/date/1996/02/msg02013.html

    [2]
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.4919

    [3]
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.59.9607

    [4]
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.8.7586

    [5] http://gnunet.org

    [6] http://flud.org

    [7] http://allmydata.org

     Zooko

    Last modified: Sat May 5 15:13:38 MDT 2007