Why encrypt first?

Brian Warner warner at lothar.com
Mon Aug 3 16:41:21 UTC 2015


On 8/3/15 4:47 AM, Adam Hunt wrote:

> One part of Tahoe-LAFS' design that I'm particularly curious about is
> why each file is encrypted in its entirety prior to "chunking" (my
> term).

Good question!

> Wouldn't it make more sense to fragment/chunk the file /then/ encrypt
> each fragment/chunk/segment?

Given the way Tahoe encrypts/encodes files, the order doesn't matter. To
be specific, what tahoe does is:

* choose an (AES) encryption key
* encrypt the file (CTR mode, starting at offset=0)
* split the ciphertext up into segments
* erasure-code each segment into "blocks"
* concatenate the first block of each segment into "share 0"
* concatenate the second block of each segment into "share 1", etc
* deliver one share to each server (ideally)

In practice, we work one segment at a time, so we never need to hold
more than a single segment in memory at once (this enables the
encryption of GB-sized files without gigabytes of RAM).

CTR ("Counter") mode allows random-access: you could encrypt the *last*
segment of the file first, if you wanted. So we could
encrypt-then-segment, or we could segment-then-encrypt, and the results
would be the same.

> I can see a few possible benefits to this order of operation:
> 
>  1. In the case of file which is inherently linear (e.g. a large media
>     file), the segments could be requested in order allowing the file
>     to be accessed as it is retrieved. This would make it possible to,
>     say, begin watching a large video file prior to the entire file
>     being retrieved. It might also be possible to seek to a point in
>     the file in question prior to the intervening segments being
>     received. Such features would be useful in a VOD (Video On Demand)
>     scenario.

Yep, we've already got that :). Tahoe can retrieve any individual byte
of the file with at most one segment's worth of time/bandwidth/CPU
overhead (we call this "low alacrity"). The encoding scheme needs two
things to achieve low alacrity:

* the ability to fetch/decode/decrypt just a small amount of data
* the ability to validate just a small amount of data

We download just a single segment at a time, un-erasure-code it, then
decrypt it. CTR mode would let us decrypt just a single byte at a time,
it's the erasure-coding that imposes a segment-at-a-time lower limit.

The validation is enabled by the Merkle hash tree: we store the expected
hash of each segment of ciphertext in a list, then we hash each pair of
the list into a shorter (half-size) list, then we hash *those* pairs,
etc, until we wind up with a single root hash. This root hash is
(effectively) copied into the filecap. From the roothash, it only takes
ln2(filesize) hashes to correctly validate any segment of the
ciphertext.

The immutable-file downloader has several layers. The top-most layer
returns arbitrary ranges of plaintext by fetching and decrypting the
same arbitrary ranges of ciphertext. The next layer down takes an
arbitrary range and figures out which segments you want. The next layer
down fetches and validates a single segment by un-erasure-coding a
necessary set of blocks. The bottom layer is asked for a set of blocks
and locates/queries the servers that can provide them.

>  1. Another possibility that such a scheme would potentially allow for
>     is each segment to be encrypted using a different key. Such
>     feature may present issues with the "key-in-URL" nature of
>     Tahoe-LAFS but I don't imagine such a detail is insurmountable.

I don't see how one-key-per-segment would help too much for immutable
files, but that mode is absolutely vital for our larger "MDMF" mutable
files. In MDMF, each segment gets its own salt, which is used to derive
a separate key for each segment. We need this because you can change the
contents, and it's a big crypto no-no to encrypt two different things
(in this case, two different versions of the same segment of the same
file) with the same key. The most common failure is that you reveal the
XOR of the two plaintexts. So we effectively do what you said for MDMF
mutable files. The smaller SDMF mutable files have only one segment, and
use a different key each time. Immutable files never change, so the
key-reuse problem doesn't show up.

cheers,
 -Brian


More information about the tahoe-dev mailing list