[tahoe-dev] Removing the dependency of immutable read caps on UEB computation

Sun Oct 4 17:49:23 PDT 2009

On Sunday 04 October 2009 02:25:53 pm Brian Warner wrote:
> So, one suggestion that follows would be to store the immutable
> "share-update" cap in the same dirnode column that contains writecaps
> for mutable files.

Perhaps.  I think re-encoding caps would have a more specialized purpose:
they would be needed by a repair system.

> I suppose it's possible to have a re-encoding cap
> which doesn't also provide the ability to read the file

The re-encoding caps I described would not provide the ability to decrypt the 
file, only to re-encode the ciphertext.

> in which case 
> the master cap that lives above both re-encoding- and read- caps could
> be called the read-and-re-encode-cap, or something).

read-and-re-encode-and-verify.  'master' is much shorter :)

> I still don't follow. You could hash+encrypt+FEC, produce shares, hash
> the shares, produce the normal CHK readcap, and then throw away the
> shares (without ever touching the network): this gives you caps for
> files that haven't been uploaded to the grid yet.

But you also have to decide what encoding parameters to use.

I want to separate that decision, because I want to allow encoding decisions 
to be made based on reliability requirements, performance issues, grid size 
and perhaps even server reliability estimates.  Many of those factors are 
only known at the point of upload.
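
To make the idea concrete, here is a minimal sketch of choosing k-of-n at
upload time.  Everything in it is an assumption for illustration: the function
names are hypothetical (not Tahoe code), servers are modelled as failing
independently with one shared reliability estimate, and "best" just means the
smallest expansion factor n/k that meets the reliability target.

```python
from math import comb


def survival_probability(k, n, p):
    """Chance that at least k of n shares survive, assuming each share
    survives independently with probability p (a simplifying assumption)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def choose_encoding(num_servers, server_reliability, target, max_expansion=4.0):
    """Return the (k, n) with the smallest expansion n/k that meets the
    reliability target, or None if no candidate does.

    Hypothetical helper: places at most one share per server.
    """
    best = None
    for n in range(1, num_servers + 1):
        for k in range(1, n + 1):
            if n / k > max_expansion:
                continue  # too much storage overhead
            if survival_probability(k, n, server_reliability) < target:
                continue  # not reliable enough
            if best is None or n / k < best[1] / best[0]:
                best = (k, n)
    return best
```

The point is only that the inputs (grid size, reliability estimates) are
things the uploader learns late, so a cap that depends on k and n can't be
computed before then.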

> Hm, we're assuming a model in which the full file is available to some
> process A, and that there is a Tahoe webapi-serving node running in
> process B, and that A and B communicate, right? So part of the goal is
> to reduce the amount of data that goes between A and B? Or to make it
> possible for A to do more stuff without needing to send a lot of data to
> node B?

> In that case, I'm not sure I see as much of an improvement as you do. A
> has to provide B with a significant amount of uncommon data about the
> file to compute the FEC-less readcap: A must encrypt the file with the
> right key, segment it correctly (and the segment size must be a multiple
> of 'k'), build the merkle tree, and then deliver both the flat hashes
> and the whole merkle tree. This makes it sound like there's a
> considerable amount of Tahoe-derived code running locally on A (so it
> can produce this information in the exact same way that B will
> eventually do so). In fact it starts to sound more and more like a
> Helper-ish relationship: some Tahoe code on A, some other Tahoe code
> over on B.

Hmm.  I didn't realize that segment size was dependent on 'k'.  I thought 
segments were fixed at 128 KiB?  Or is that buckets?  Or blocks?  I'm still 
quite hazy on the precise meaning of bucket and block.
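
If the rule is what Brian describes, the two facts might fit together roughly
like this.  It is only a sketch under an assumption I haven't verified against
the encoder: that 128 KiB is a ceiling on the segment size, which is then
rounded up to the next multiple of 'k' so that a segment divides evenly into
'k' pieces.  The names below are mine, not Tahoe's.

```python
# Assumed ceiling, taken from the 128 KiB figure mentioned above.
MAX_SEGMENT_SIZE = 128 * 1024


def next_multiple(n, k):
    """Smallest multiple of k that is >= n."""
    return ((n + k - 1) // k) * k


def segment_size_for(file_size, k, max_segment_size=MAX_SEGMENT_SIZE):
    """Hypothetical rule: cap the segment at max_segment_size (or the file
    size, if smaller), then round up to a multiple of k."""
    return next_multiple(min(file_size, max_segment_size), k)


# With k=3 the "128 KiB" segment is one byte longer than 131072,
# because 131072 is not divisible by 3.
print(segment_size_for(10_000_000, 3))  # 131073
print(segment_size_for(10_000_000, 4))  # 131072
```

If that is right, then 'A' can't even segment the file without knowing 'k'.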

This is a very good point, though.  I wouldn't want 'A' to have to understand 
Tahoe's segmentation decisions.  I'm not sure why it feels acceptable to have 
it know Tahoe's encryption and hash tree generation in detail, but not 
segmentation.  Maybe because segment sizes have changed in the past and it 
seems reasonable that they might change again in the future -- perhaps even 
get chosen dynamically at some point?

It's probably better to assume that all of this knowledge is only in Tahoe and 
the client has to provide the plaintext in order to get a cap.

> (hey, wouldn't it be cool if local filesystems
> would let you store a bit of metadata about the file which would be
> automatically deleted if the file's contents were changed?)

That *would* be cool.

> Hm, it sounds like some of the use case might be addressed by making it
> easier to run additional code in the tahoe node (i.e. a tahoe plugin),
> which might then let you move "B" over to where "A" is, and then
> generally tell the tahoe node to upload/examine files directly from disk
> instead of over an HTTP control+data channel.

That would be very useful.  I have to copy files before uploading them anyway, 
so that they can't change during the upload (I map the file content hash to a 
read cap, so I need to be absolutely sure that the file uploaded is the one I 
hashed).  Tahoe then has to make another copy before it can encode, so being 
able to tell Tahoe where to grab the file from the file system would save one 
copy.
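
For reference, the copy-then-hash step looks roughly like the sketch below.
The names are illustrative only: 'upload' stands in for whatever hands a file
to the Tahoe node and returns a cap, 'cap_cache' is a plain dict here, and
SHA-256 is just an example choice of content hash.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA-256 of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def backup_file(path, cap_cache, upload):
    """Snapshot the file, hash the snapshot, and upload the snapshot only
    if its hash is not already mapped to a cap.

    'upload' is a hypothetical callable: upload(path) -> cap string.
    """
    fd, snapshot = tempfile.mkstemp()
    os.close(fd)
    try:
        # Copy first, so the bytes hashed are exactly the bytes uploaded
        # even if the original changes in the meantime.
        shutil.copyfile(path, snapshot)
        digest = sha256_of(snapshot)
        if digest not in cap_cache:
            cap_cache[digest] = upload(snapshot)
        return cap_cache[digest]
    finally:
        os.remove(snapshot)
```

The snapshot is the copy I can't avoid; the one I'd like to drop is the second 
copy made on the Tahoe side.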

On the "plugin" point, I'm thinking that I want to implement my backup server 
as a Tahoe plugin.  I'm not sure it makes sense to implement it as a part of 
Tahoe, because Tahoe is a more general-purpose system.  From a practical 
perspective, though, my backup server is (or will be) a Twisted application 
that should live right next to a Tahoe node, start up whenever the Tahoe node 
starts, and stop whenever the Tahoe node stops.  Seems like a good case for a 
plugin.

	Shawn.