[tahoe-dev] [p2p-hackers] convergent encryption reconsidered
Jim McCoy
mccoy at mad-scientist.com
Thu Mar 20 21:56:58 PDT 2008
On Mar 20, 2008, at 12:42 PM, zooko wrote:
> Security engineers have always appreciated that convergent
> encryption allows an attacker to perform a
> confirmation-of-a-file attack -- if the attacker already knows
> the full plaintext of a file, then they can check whether a
> given user has a copy of that file.
The truth of this depends on implementation details, and is an
assertion that cannot be said to cover all or even most of the
potential use-cases for this technique. This property only holds if
it is possible for the attacker to link a selected ciphertext/file to
a user. Systems which use convergent encryption to populate a shared
storage pool _might_ have this property, but it is by no means a
certainty; if a system is implemented correctly it is not necessary
for users to expose their list of files in order to maintain this
shared storage space.
> basically people can tell which files you are storing if they
> are "publically known" files, but they can't learn anything
> about your own personal files.
It sounds like you have a design problem. If nodes that participate
in the system can distinguish between publication and _re_-publication/
replication (or whatever you want to call the random sharing of
arbitrary data blocks for the purposes of increasing file
availability) then you have a problem. If these two activities are
indistinguishable then an observer knows you hold some blocks of a
file but should not be able to distinguish between you publishing the
blocks and re-distributing them to increase block availability.
> The Learn-Partial-Information Attack [...]
A better title for this would be "Chosen-Plaintext attack on
Convergent Encryption", since what you are talking about is really a
chosen plaintext attack. To be a bit more specific, this is really
just a version of a standard dictionary attack. The solution to this
problem is to look at similar systems that suffered from dictionary
attacks and see how they solved it.
The most widely known and studied version of this is the old crypt()/
passwd problem.
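To make the analogy concrete, here is a minimal sketch in Python.  It
assumes a bare key = SHA-256(plaintext) construction; the function and
file names are made up, none of this is Tahoe's actual code, and a
real attacker would compare ciphertexts or storage indexes rather
than keys.  The point is that an unsalted convergent key behaves like
an unsalted crypt() entry:

    import hashlib

    def convergence_key(plaintext: bytes) -> bytes:
        # Unsalted convergent key: anyone holding the same plaintext
        # derives the same key, and hence the same ciphertext and
        # storage index.
        return hashlib.sha256(plaintext).digest()

    def confirm_file(guess: bytes, observed_key: bytes) -> bool:
        # Confirmation-of-a-file attack: exactly like testing a
        # password guess against an unsalted crypt() hash.
        return convergence_key(guess) == observed_key

    # The "dictionary" is a collection of publicly known files.
    dictionary = [b"GPLv2 license text", b"ubuntu-8.04.iso",
                  b"leaked-memo.pdf"]
    observed = convergence_key(b"leaked-memo.pdf")
    print([d for d in dictionary if confirm_file(d, observed)])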
> For another example, if you use Tahoe to backup your entire
> home directory, or your entire filesystem, then the attacker
> gains the opportunity to try to learn partial information about
> various files which are of predictable format but have
> sensitive fields in them, such as .my.cnf (MySQL configuration
> files), .htpasswd, .cvspass, .netrc, web browser cookie files,
> etc..
The problems with this imagined attack are twofold. I will use your
Tahoe example for my explanations because I have a passing familiarity
with the architecture. The first problem is isolating the original
ciphertext in the pool of storage. If a file is encrypted using
convergent encryption and then run through an error-correction
mechanism to generate a number of shares that make up the file, an
attacker first needs to be able to isolate those shares to regenerate
the original ciphertext. FEC decoding speeds may be reasonably fast,
but they are not without some cost. If the storage pool is
sufficiently large and you are doing your job to limit the ability of
an attacker to see which blocks are linked to the same FEC operation
then the computational complexity of this attack is significantly
higher than you suggest.
Assuming an all-seeing oracle who can watch every bit sent into the
storage pool gets us around this first problem, but it does raise the
bar for potential attackers.
The second problem an attacker now faces is deciding what sort of
format a file might have, what the low-entropy content might be, and
then filling in values for these unknowns. If your block size is
small (and I mean "really small" in the context of the sort of systems
we are talking about), there might be only a few kilobits of entropy
in the first couple of blocks of a file, so either a rainbow-table
attack on known file formats or a dedicated effort to grab a specific
file might be possible, but this is by no means certain. Increase
your block size and this problem becomes much harder for the attacker.
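As a concrete illustration of the attack under discussion, here is a
hedged sketch.  It grants the all-seeing oracle above (the attacker
has already reassembled the original ciphertext/key), and the
.htpasswd-style template, the guessed field, and the bare SHA-256
convergence key are all assumptions made for the sake of a
self-contained example:

    import hashlib
    import itertools
    import string

    def convergence_key(plaintext: bytes) -> bytes:
        # Assumed convergent key derivation: key = H(plaintext).
        return hashlib.sha256(plaintext).digest()

    # The attacker knows the victim backs up a file of predictable
    # format whose only unknown part is a short, low-entropy field.
    TEMPLATE = "admin:{secret}\n"

    def learn_partial_information(observed_key, alphabet, max_len):
        # Enumerate candidate values for the unknown field, rebuild
        # the full plaintext, and test it against the key (or
        # ciphertext) observed in the storage pool.
        for length in range(1, max_len + 1):
            for combo in itertools.product(alphabet, repeat=length):
                guess = TEMPLATE.format(secret="".join(combo)).encode()
                if convergence_key(guess) == observed_key:
                    return "".join(combo)
        return None

    victim_file = TEMPLATE.format(secret="dogs").encode()
    observed = convergence_key(victim_file)
    print(learn_partial_information(observed, string.ascii_lowercase, 4))

The cost of the enumeration is set entirely by the entropy of the
unknown content the attacker has to guess per attempt.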
> Defense Against Both Attacks
>
> [...]
> However, we can do better than that by creating a secret value
> and mixing that value into the per-file encryption key (so
> instead of symmetric_key = H(plaintext), you have symmetric_key
> = H(added_secret, plaintext), where "," denotes an unambiguous
> encoding of both operands). This idea is due to Brian Warner
> and Drew Perttula.
Bad idea, for a couple of reasons. The first problem is that you are
assuming the added secret is adding a significant amount of entropy,
and the second is that you are throwing out the advantage of
convergent encryption. If the secret is shared across multiple files
then I can run a known-plaintext attack on your system using a file
that I assume I have in common with my target (e.g. a standard, small,
OS file) to get their added_secret and then once I know my target's
secret I move on to the file I really want to go after. If your
userbase consists of paranoid cypherpunks then the added secret just
becomes a lookup in a rainbow table, and if they are the sort of
userbase that some people like to call "the consumer market" then the
entropy being added here is negligible.
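To spell out that sequence, here is a sketch only: the
H(added_secret, plaintext) construction with a length prefix standing
in for "an unambiguous encoding", and the toy wordlist standing in
for a rainbow table, are assumptions, not anyone's real design.  A
shared low-entropy added_secret falls to one known common file:

    import hashlib

    def convergence_key(added_secret: bytes, plaintext: bytes) -> bytes:
        # Assumed construction: key = H(added_secret, plaintext).
        blob = (len(added_secret).to_bytes(4, "big")
                + added_secret + plaintext)
        return hashlib.sha256(blob).digest()

    # A small, standard OS file the attacker assumes they share with
    # the target.
    common_file = b"# /etc/services (standard distribution copy) ..."

    # Toy list of candidate secrets; a real attacker would use a
    # large wordlist or a precomputed rainbow table.
    candidates = [b"hunter2", b"letmein", b"correct horse", b"tahoe"]

    def recover_added_secret(observed_key: bytes):
        # Once the shared secret is recovered via the common file,
        # every other file protected by that same secret is open to
        # the same dictionary attacks as plain convergent encryption.
        for secret in candidates:
            if convergence_key(secret, common_file) == observed_key:
                return secret
        return None

    observed = convergence_key(b"letmein", common_file)
    print(recover_added_secret(observed))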
A better solution would be to look at something like bcrypt() (see "A
Future-Adaptable Password Scheme") and use this mechanism for files
below a certain threshold. I do not think that the expensive key
scheduling operation rules out the use of CTR mode, so you are safe on
that side of things, but it would mean that you step away from AES...
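A rough sketch of that direction (hedged: it substitutes PBKDF2 from
Python's standard library for bcrypt so the example is
self-contained, derives the salt from the plaintext itself so the
scheme stays convergent, and the size threshold and iteration count
are arbitrary numbers, not recommendations):

    import hashlib

    SMALL_FILE_THRESHOLD = 64 * 1024  # arbitrary cutoff for "small" files
    ITERATIONS = 200000               # tunable work factor, as in bcrypt

    def convergence_key(plaintext: bytes) -> bytes:
        digest = hashlib.sha256(plaintext).digest()
        if len(plaintext) >= SMALL_FILE_THRESHOLD:
            # Large files are poor dictionary-attack targets anyway;
            # a plain hash keeps them cheap to process.
            return digest
        # Small, guessable files get an expensive derivation. The salt
        # comes from the plaintext so identical files still converge,
        # while each guess now costs the attacker ITERATIONS hashes.
        salt = hashlib.sha256(b"convergence-salt" + digest).digest()
        return hashlib.pbkdf2_hmac("sha256", digest, salt, ITERATIONS)

This only raises the per-guess cost; it does not eliminate the
dictionary attack, which is the same trade-off bcrypt accepts for
passwords.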
> A Comment About Space Savings
>
> One of the original motivations for convergent encryption, as
> expressed in Doceur's technical report, is to conserve storage
> space by coalescing identical files. [...] My suspicion is
> that the gains available for modern uses are nowhere near that
> good -- I wouldn't be surprised if it were less than 5% [...]
I would be very, very surprised if you were correct.
> My reasoning is:
> 1. The Doceur et al. test set was probably the ideal test set
> to highlight the advantages of coalescing: it was hundreds
> of workstations, which probably all followed the same
> Information Technology Department policies, which were
> probably homogeneous, and which were probably packed with
> many of the same software products (especially Microsoft
> products).
Are you suggesting the rest of the world uses software products that
are radically different from those used in a corporate environment? :)
We all use a small set of operating systems that share the same basic
set of files. We all use the same small set of application software
and even if there is a "long tail" in application software the amount
of storage this occupies compared to the operating system and most
common applications is likely to be a very small fraction of the total
being stored.
> 2. Movies and music. In 2002, on Windows workstations in an
> office at Microsoft, there probably weren't any movies. For
> some of the probable use cases for Tahoe, movies and
> collections of music are the largest component. The
> presence of movie files, which are large, could increase or
> decrease the proportion of savings from coalescing files,
> depending on the degree to which those files are shared
> among multiple users, which brings us to the next point:
This was actually a concern of mine when originally designing the
system, but for a different reason. At that time the large content
files we were concerned with were music. We knew that the amount of
available music was a bounded total, but at that point in time the
world had not converged on a standard set of codecs and standard
bitrates for these codecs. We worried about the fact that the same
song would end up with multiple copies in the pool because some people
would use LAME, some would use WinAMP, others a pirated Fraunhofer
codec, and none of them would use the same bitrate. This problem
still exists somewhat, but the fact that most people use only a few
packages for CD ripping, codecs have standardized, and most encoding
uses only a few standard bitrates makes the job much easier and the
realizable storage savings significantly larger.
When it comes to movies, the fact that a lot of them are *cough*
"distributed" from a few master copies and seldom re-encoded means
that you are likely to see many copies of the same bits when it comes
to this category. DVD ripping is also a fairly standardized process,
so you win again.
Given the fact that people are not buying terabyte drives to store
their scans of Proust folios I think we can make an educated guess as
to what is taking up a lot of space on these drives and how much
duplication there is among these bits.
> 3. The Long Tail; Today there seems to be a more diverse set
> of things taking up a bigger portion of the total.
For all the talk of "the long tail" it seems that we all tend to visit
the same set of YouTube clips and archive a lot of the same
information. More info is being generated by diverse groups, but the
amount being generated still seems to be growing a lot slower than
available storage capacity. One side-effect of the long-tail is that
as groups become more diverse in their taste in information they still
tend to share the information that is specific to their interests.
You might see the storage advantages decline somewhat until you have a
significant userbase, but when considered in relation to point #2 this
is really a minor issue.
> 4. Personally-produced photos and movies. It seems like some
> of the files that people are most eager to back up and to
> share are the product of their own cameras. Such files will
> typically be shared by only one or a few users.
This is the only point I can agree upon. Personally-generated media
(which is distinct from "user-generated content") is what makes
individual systems unique. On a generic user's computer this breaks
down into three basic categories:
-Personal documents
-Photos
-"Home" movies
The size of the first is negligible. The second is large and getting
larger, and the third is probably a lot smaller than most of us
think. Photos are really the only one that will skew this result.
In conclusion I would have to say that the sky is not really falling.
The attack outlined is a known problem that can be solved in ways that
still preserve the advantages of convergent encryption.