[tahoe-dev] Removing the dependency of immutable read caps on UEB computation

Brian Warner warner at lothar.com
Sun Oct 4 13:25:53 PDT 2009


Shawn Willden wrote:
> On Saturday 03 October 2009 01:26:16 am Brian Warner wrote:
>> Incidentally, we removed 1 and 2 forever ago, to squash the
>> partial-information-guessing-attack.
> 
> Makes sense.  The diagrams in the docs should be updated.

Yeah, I'll see if I can get to that today.

> Since this is for immutable files, there is currently no writecap or
> traversalcap, just a readcap and perhaps a verifycap. This scheme
would require either adding a share-update cap or providing a
> master cap (from which share-update and read caps could be computed).

So, one suggestion that follows would be to store the immutable
"share-update" cap in the same dirnode column that contains writecaps
for mutable files. Hm. Part of me says ok, part of me says that's bad
parallelism. Why should a mutable-directory writecap-holder then get
access to the re-encoding caps of the enclosed immutable files? Again,
it gets back to the policy decision that distinguishes re-encoding-cap
holders from read-cap holders: who would you give one-but-not-the-other
to, and why? When would you be willing to be vulnerable to [whatever it
is that a re-encoding cap allows] in exchange for allowing someone else
to help you with [whatever it is that a re-encoding cap allows]? That
sort of thing.

(incidentally, I'm not fond of the term "master cap", because it doesn't
actually convey what authorities the cap provides.. it just says that it
provides more authority than any other cap. "re-encoding cap" feels more
meaningful to me. I suppose it's possible to have a re-encoding cap
which doesn't also provide the ability to read the file, in which case
the master cap that lives above both re-encoding- and read-caps could
be called the read-and-re-encode-cap, or something).
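
(To make that hierarchy concrete: one way such a derivation could work is
to hash the master cap with a per-child tag, so each subordinate cap is
computable from the master but not the other way around. The tags and
names below are made up for illustration; this is not how current Tahoe
caps are actually derived.)

    from hashlib import sha256

    def tagged_hash(tag, data):
        # domain-separate each derived cap with its own tag
        return sha256(tag + b":" + data).digest()

    def derive_caps(master_cap):
        # the master ("read-and-re-encode") cap sits above the other two;
        # holders of a derived cap cannot recover the master from it
        re_encode_cap = tagged_hash(b"re-encode-v1", master_cap)
        read_cap = tagged_hash(b"read-v1", master_cap)
        return re_encode_cap, read_cap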

>> Deriving the filecap without performing FEC doesn't feel like a huge
>> win to me.. it's just a performance difference in testing for
>> convergence, right?
> 
> No, it's more than that. It allows you to produce and store caps for
> files that haven't been uploaded to the grid yet. You can make a "this
> is where the file will be if it ever gets added" cap.

I still don't follow. You could hash+encrypt+FEC, produce shares, hash
the shares, produce the normal CHK readcap, and then throw away the
shares (without ever touching the network): this gives you caps for
files that haven't been uploaded to the grid yet. Removing the share
hashes just reduces the amount of work you have to do to get the readcap
(no FEC).
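
(Very roughly, that local-only flow looks like the sketch below. The
encryption and FEC steps are toy stand-ins, not the real AES/zfec code,
and the "readcap" tuple at the end is just shorthand for the real CHK URI
construction.)

    from hashlib import sha256

    def toy_encrypt(key, plaintext):
        # stand-in for AES-CTR: XOR against a repeated key (NOT real crypto)
        stream = (key * (len(plaintext) // len(key) + 1))[:len(plaintext)]
        return bytes(a ^ b for a, b in zip(plaintext, stream))

    def toy_fec(k, m, ciphertext):
        # stand-in for zfec: just chop the ciphertext into m pieces
        n = -(-len(ciphertext) // m)   # ceiling division
        return [ciphertext[i*n:(i+1)*n] for i in range(m)]

    def local_chk_readcap(plaintext, k=3, m=10):
        key = sha256(b"convergence:" + plaintext).digest()[:16]
        shares = toy_fec(k, m, toy_encrypt(key, plaintext))
        share_hashes = [sha256(s).digest() for s in shares]
        ueb_hash = sha256(b"".join(share_hashes)).digest()
        # the shares can be thrown away now; the readcap only needs these:
        return (key, ueb_hash, k, m, len(plaintext))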

> Also, it would be possible to do it without the actual file contents,
> just the right hashes, which can make a huge performance difference in
> testing for convergence if the actual file doesn't have to be
> delivered to the Tahoe node doing the testing.

Hm, we're assuming a model in which the full file is available to some
process A, and that there is a Tahoe webapi-serving node running in
process B, and that A and B communicate, right? So part of the goal is
to reduce the amount of data that goes between A and B? Or to make it
possible for A to do more stuff without needing to send a lot of data to
node B?

In that case, I'm not sure I see as much of an improvement as you do. A
has to provide B with a significant amount of uncommon data about the
file to compute the FEC-less readcap: A must encrypt the file with the
right key, segment it correctly (and the segment size must be a multiple
of 'k'), build the merkle tree, and then deliver both the flat hashes
and the whole merkle tree. This makes it sound like there's a
considerable amount of Tahoe-derived code running locally on A (so it
can produce this information in exactly the same way that B eventually
will). In fact it starts to sound more and more like a
Helper-ish relationship: some Tahoe code on A, some other Tahoe code
over on B.
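
(Concretely, the bundle A would have to hand to B looks roughly like the
sketch below: flat segment hashes plus the merkle tree over them. Segment
size, padding, and hash tags are all simplified relative to what Tahoe
actually does; and as noted above, the real segment size also has to be a
multiple of 'k'.)

    from hashlib import sha256

    SEGMENT_SIZE = 128 * 1024   # A and B must pick this the same way

    def merkle_root(leaves):
        # minimal binary merkle tree (real Tahoe pads leaves and tags hashes)
        if not leaves:
            return sha256(b"").digest()
        layer = list(leaves)
        while len(layer) > 1:
            if len(layer) % 2:
                layer.append(layer[-1])
            layer = [sha256(layer[i] + layer[i+1]).digest()
                     for i in range(0, len(layer), 2)]
        return layer[0]

    def bundle_for_B(ciphertext):
        segments = [ciphertext[i:i+SEGMENT_SIZE]
                    for i in range(0, len(ciphertext), SEGMENT_SIZE)]
        flat_hashes = [sha256(seg).digest() for seg in segments]
        return {"flat_hashes": flat_hashes,
                "merkle_root": merkle_root(flat_hashes),
                "size": len(ciphertext)}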

If you've got help from your local filesystem to compute and store those
uncommon hashes, then this might help. Or if you've got some other
system on that side (like, say, tahoe's backupdb) to remember things for
you, then it might work. But if you have those, why not just store the
whole filecap there? (hey, wouldn't it be cool if local filesystems
would let you store a bit of metadata about the file which would be
automatically deleted if the file's contents were changed?)
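
(For comparison, the "just store the whole filecap locally" approach is
tiny -- roughly what tahoe's backupdb does. The schema below is a made-up
sketch, not the real backupdb code; a hit only means the path/size/mtime
triple still matches, which is the same staleness tradeoff backupdb makes.)

    import os, sqlite3

    db = sqlite3.connect("filecap-cache.db")
    db.execute("CREATE TABLE IF NOT EXISTS caps"
               " (path TEXT, size INTEGER, mtime REAL, filecap TEXT,"
               "  PRIMARY KEY (path, size, mtime))")

    def remember(path, filecap):
        st = os.stat(path)
        db.execute("INSERT OR REPLACE INTO caps VALUES (?,?,?,?)",
                   (path, st.st_size, st.st_mtime, filecap))
        db.commit()

    def lookup(path):
        st = os.stat(path)
        row = db.execute("SELECT filecap FROM caps"
                         " WHERE path=? AND size=? AND mtime=?",
                         (path, st.st_size, st.st_mtime)).fetchone()
        return row[0] if row else None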

Hm, it sounds like some of the use case might be addressed by making it
easier to run additional code in the tahoe node (i.e. a tahoe plugin),
which might then let you move "B" over to where "A" is, and then
generally tell the tahoe node to upload/examine files directly from disk
instead of over an HTTP control+data channel.

still intrigued,
 -Brian

