[tahoe-dev] Storing a small file leads to a weird read capability

Wed Apr 7 11:57:06 PDT 2010

On 4/7/10 5:09 AM, Francois Deppierraz wrote:
>
> URI:LIT:ge3qaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

> It looks weird to me to see a cryptographically-secure identifier
> which doesn't look random enough. Wouldn't such feature lead to
> potential attacks?

Ah! What an excellent question! The short answer is no, this is
perfectly secure, but the reason why is a great thing to discuss.

(if you have time to think but not to read, just ponder the implications
of this:
  % echo -n "" | tahoe put
  URI:LIT:
)

Filecaps (because they are capabilities) are required to be "necessary
and sufficient" to access the resource that they represent (in this
case, the ability to read/get/learn the bytes of the file).

 * necessary: if you don't know the filecap, you cannot get the file.
              This requires unforgeability (I cannot independently
              create a valid filecap for data that I don't know). In a
              distributed system, unforgeability is implemented with
              unguessable strings.

 * sufficient: If I have the filecap, I don't need any other secrets or
               abilities ("non-ambient authority") to access the file.
               Filecaps are transferable without appealing to some
               central admin or gatekeeper.

LIT filecaps are simply the base32 encoding of the file data, and are
used for very small files (I think the threshold is 55 bytes, which is
the break-even point at which the LIT filecap is the same length as a
typical CHK filecap). They are sufficient (you don't even need network
access to turn the LIT filecap into the data), and necessary (if you
don't know the filecap for my data, you can't figure out the data).
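
To make "the base32 encoding of the file data" concrete, here is a
minimal sketch using only Python's standard library. The function names
are made up for illustration, and it assumes lowercase base32 with the
"=" padding stripped, which is what the caps quoted in this thread look
like. It is not Tahoe's own code.

```python
import base64

LIT_PREFIX = "URI:LIT:"


def make_lit_cap(data: bytes) -> str:
    """Encode file bytes as a LIT-style cap (illustrative helper, not Tahoe's API)."""
    encoded = base64.b32encode(data).decode("ascii")
    # Assumption: caps use lowercase base32 with the trailing '=' padding removed.
    return LIT_PREFIX + encoded.lower().rstrip("=")


def read_lit_cap(cap: str) -> bytes:
    """Recover the file bytes from a LIT-style cap; no network access is needed."""
    if not cap.startswith(LIT_PREFIX):
        raise ValueError("not a LIT cap")
    body = cap[len(LIT_PREFIX):].upper()
    # Restore the padding that the standard-library decoder insists on.
    body += "=" * (-len(body) % 8)
    return base64.b32decode(body)


if __name__ == "__main__":
    print(make_lit_cap(b""))
    print(read_lit_cap("URI:LIT:mzxw6cq"))
```

Under those assumptions the empty file maps to the bare "URI:LIT:"
prefix, and decoding is a purely local operation: the cap is the data,
just written in a different alphabet.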

I frequently use a physical analogy. Suppose that we're sitting in a
room next to each other (i.e. we have a pre-established secure
connection) and I have a book in my hand that I want you to read,
because I think it's cool. You have pockets. There are marauding
intruders circling outside the room who can snatch things out of your
hands or put other things in them (but your pockets are safe). Our hands
are safe as long as we're inside the room, but rooms are for meetings,
not for reading, so you want to read my book at home later.

You want to read the book too, but your time is very limited, so you
want to make sure you only read my cool book. One of the intruders is a
furniture broker[1] who, the moment you leave the safety of the room,
will fill your hands with an Ikea catalog and interior design magazines,
and you don't want to accidentally read this garbage instead of my book
(this is the integrity/sufficiency property). Also, my book is about
something embarrassing, sensitive, and controversial: our mutual
admiration of the Git version control system, so we want the ability to
keep the identity of the book secret from the Mercurial
torch-and-pitchfork mob outside[2] (this is the
confidentiality/necessity property). Of course, you might elect to
reveal your Git-fondness, which is your own business, and over which I
have no control, but the system must have the property that we *can*
retain confidentiality if we want to.

Now, how can you get home with a copy of the right book, privately?
There are two main options:

 1: I use the xerox machine in this room to copy the whole book, then
    hand you the big stack of paper that comes out. You jam the whole
    stack into your pocket.

 2: I use the xerox machine to copy just the back cover, which includes
    the ISBN number, and hand you the single page that comes out, and you
    put that in your pocket. (we assume you can later buy the book
    anonymously, and that ISBNs are strong/immutable references to a
    specific edition)

The first is equivalent to emailing me a file, or storing the whole file
on your computer. The second is equivalent to uploading the file into
Tahoe and then emailing me the filecap, or storing the filecap on your
computer (perhaps as your "rootcap").

The "sufficient" property is provided either directly (you now have a
copy of the full book in your pocket) or by the combination of the safe
reference in your pocket and the immutable mapping property of ISBNs.
The intruder who wants to cause you to read a different book cannot
intercept+replace the thing you have in your pocket, nor can they
subvert the publishing industry to violate the ISBN-to-content mapping.

The "necessary" property is provided by virtue of the fact that the
intruder cannot see what I'm handing to you inside the room, or look
inside your pocket later. The thing I give you is necessary: the
intruders (who do not have it) cannot access the right book.

The first involves a full copy of the data, which is expensive (in
bandwidth, or storage costs, or pockets), at least in the marginal case
where you've already uploaded the file to Tahoe and are now looking to
hand out a new copy. The cost is proportional to the size of the file.
The second is a cheap fixed cost, proportional to the size of a filecap.

Now, a LIT filecap is analogous to a really tiny book, perhaps just a
single page. It's just as cheap for me to hand you a single page that
contains the whole document as it is to hand you a single page that
contains the ISBN of the document. It is sufficient, because you now
have everything you need to read the book, and it is necessary, because
without knowing what I handed you, the intruder cannot find out what
book you're reading.

(ok, really, I delve into this sort of analogy when I'm talking about
signatures and secure data distribution schemes, but I wanted to work
out some of the terminology. Besides, the idea of stuffing a whole
xeroxed book into my pants pocket makes me laugh.)

Another view:

The confidentiality of a CHK file can be evaluated by assuming the
attacker gets the ciphertext (but not the filecap), and access to some
sort of confirmation mechanism (known as an "oracle" in the
cryptographic literature). If they're trying to guess your login
password, then the oracle is to try to use the password to actually log
in. If the encrypted file contains a secure hash of the plaintext (or
any error-checking mechanism at all), then the oracle is to try to
decrypt the file and then check to see if the error-checking codes look
ok. (in this case the oracle is not perfect: sometimes it will give you
false positives, so a "yes" is only probabilistic evidence that your
guess was right).

The CHK mechanism is considered secure if the effort the attacker must
expend to get your plaintext is sufficiently high (no better than random
guessing). But it's a relative thing: does the attacker who already
knows thing X get any advantage by learning thing Y? For Tahoe, we
assume that attackers (including the storage server) get the shares that
you upload, so they know the filesize and the ciphertext. We don't
currently include a hash of the plaintext, but for argument's sake let's
assume that there is enough error-checking data in the plaintext to
allow the attacker to tell whether they've correctly decrypted the data.

The plaintext could be any possible N-byte string (they know the file
length, so they can rule out strings of all other lengths with no
effort). Each guess requires a decryption attempt (and subsequent
error-checking test) to confirm or deny. So their average effort is
2**(8*N)/2, no better than brute force.
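
As a quick sanity check on that arithmetic, the sketch below (function
names are mine) counts the candidate plaintexts for an N-byte file and
the average number of guesses, assuming the attacker tries candidates in
some fixed order and the right one is equally likely to be anywhere in
that order.

```python
def candidate_count(n_bytes: int) -> int:
    """Number of distinct plaintexts of exactly n_bytes bytes: 2**(8*N)."""
    return 2 ** (8 * n_bytes)


def average_guesses(n_bytes: int) -> int:
    """Average effort from the text: half the candidates, 2**(8*N)/2."""
    return candidate_count(n_bytes) // 2


if __name__ == "__main__":
    for n in (1, 2, 16):
        print(n, candidate_count(n), average_guesses(n))
```

A one-byte file has 256 candidates and takes 128 guesses on average; by
16 bytes the average is already 2**127.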

LIT filecaps have the same property, but not derived from cryptography,
because there is no ciphertext. The attacker gets nothing, and is asked
to distinguish between hypothetical plaintexts. If you reveal to me
that you have a LIT file (perhaps indirectly, by asking my storage
server for a mutable-directory share but then not fetching any immutable
shares immediately afterwards), then I can probably assume that it's
shorter than 55 bytes, but that leaves nearly 2**(8*55) possibilities,
and I have no way to distinguish between them (I don't even have a
SHA256 hash to use as an oracle). Clearly the attacker has nothing to
work with, so they can't do better than random chance. (they don't even
get length with LITs).

Of course, if you tell me that you have a secret file that's only 2
bytes long, then there aren't very many possibilities, so if I have some
other means to ask whether my guess is right or not, then I can figure
out your "secret" file without too much work. I'd bet you a zillion
dollars that I can guess your secret one-byte file in no more than 256
guesses, and your secret zero-byte file is even easier. I can name that
tune in zero notes if it's a work by John Cage :-).
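
The bet is easy to win. A minimal sketch, assuming I have some yes/no
oracle for my guesses (here a hypothetical closure over the secret,
standing in for whatever confirmation mechanism happens to be
available):

```python
from itertools import product
from typing import Callable, Optional, Tuple


def guess_secret(
    oracle: Callable[[bytes], bool], length: int
) -> Tuple[Optional[bytes], int]:
    """Try every byte string of the given length; return (secret, guesses used)."""
    guesses = 0
    for candidate in product(range(256), repeat=length):
        guesses += 1
        attempt = bytes(candidate)
        if oracle(attempt):
            return attempt, guesses
    return None, guesses


def make_oracle(secret: bytes) -> Callable[[bytes], bool]:
    """Hypothetical perfect oracle: answers whether a guess equals the secret."""
    return lambda guess: guess == secret


if __name__ == "__main__":
    found, used = guess_secret(make_oracle(b"\xff"), 1)
    print(found, used)
```

Even the worst-case one-byte secret (0xff, tried last) falls on guess
number 256. The loop grows by a factor of 256 per extra byte, which is
the whole reason longer files stop being guessable this way.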

> URI:LIT:ge3qaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Oh, and the right way to evaluate the "random enough"-ness is to look at
what the attacker gets to see, not what the filecap-holder sees. Two
filecaps that we see might be:

 a: URI:LIT:mzxw6cq
 b: URI:CHK:onu7qbeeukr7hzijiuez27nmfi:jra4rqppihn6k4ki5tovlodr677nnszd255zzbh6ysjiijiluddq:3:10:291

what the attacker (or storage server) sees is basically:

 a: (nothing)
 b: URI:CHK-Verify:yd2musxnsi5lverlvf3hidzgcy:jra4rqppihn6k4ki5tovlodr677nnszd255zzbh6ysjiijiluddq:3:10:291

The "onu7q" encryption key is the thing that must remain unguessable,
and the "yd2mus" storage-index is the thing that the attacker gets to
use to try and guess it. Those strings must be long enough to be secure.
The would-be LIT-cap attacker gets nothing.

Huh, if the LIT file didn't base32-encode the data, this property might
be even more obvious:

 % echo -n "here_is_my_secret" |tahoe put -
 URI:LIT:here_is_my_secret

The base32 encoding ("code", not "crypt") is necessary, of course, but
it's interesting to see how it smells of security, when in fact it is
merely there to let short Tahoe files contain arbitrary 8-bit data while
keeping Tahoe filecaps ASCII-safe.

So, in short, LIT caps are just as secure as CHK caps, because the
attacker never gets to see caps. LIT caps are even more secure than CHK,
because attackers don't get error-checking information or ciphertext. But
small files are just as guessable inside Tahoe as they are anywhere
else.

cheers,
 -Brian

[1]: a furniture broker would, of course, be a middleman who negotiates
     the complex world of furniture sales, matching up buyers with
     sellers, because furniture, like stocks, bonds, and health
     insurance plans, is too complicated to simply buy from a store. I
     very much hope that these people do not actually exist.
[2]: as the PyCon talk comparing Git and hg pointed out: SVN is our
     common enemy, we must destroy them