[tahoe-dev] [tahoe-lafs] #684: let client specify the encryption key

Shawn Willden shawn-tahoe at willden.org
Sun May 24 13:48:34 PDT 2009


On Sunday 24 May 2009 11:12:38 am Zooko Wilcox-O'Hearn wrote:
> On May 24, 2009, at 10:27 AM, Shawn Willden wrote:
> > At present, I don't think I do.  It allowed a useful space
> > optimization for my read cap index files, but for other reasons
> > I've done away with that.
>
> Could you tell me more about how it allowed space optimization?  (I
> can think of a way, but I'm curious how you did it.)  Also could you
> explain your reasons not to use that space optimization technique
> after all?

I need a way to map content hashes to read caps, because my backup log 
contains content hashes.  I can't put read caps in the backup log for a 
couple of reasons which should become clear below.

My solution to that problem is to construct "read cap index" files, which are 
essentially lists of (hash,cap) tuples, sorted by hash and placed in a "burst 
tree" structure that makes both access and storage efficient.
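To make the lookup idea concrete, here's a minimal sketch in Python.  It 
flattens the burst tree into a plain sorted list with binary search; the 
entry values and the use of SHA-256 are illustrative assumptions, not the 
actual on-disk format:

```python
import bisect
import hashlib

# Hypothetical sketch of a read cap index: (hash, cap) tuples sorted
# by hash.  The real structure is a burst tree; a flat sorted list
# with binary search shows the same access pattern.

def build_index(entries):
    """entries: iterable of (content_hash, read_cap) pairs."""
    # Tuples sort by their first element, so this sorts by hash.
    return sorted(entries)

def lookup(index, content_hash):
    """Return the read cap for content_hash, or None if absent."""
    # (content_hash,) sorts just before any (content_hash, cap) entry.
    i = bisect.bisect_left(index, (content_hash,))
    if i < len(index) and index[i][0] == content_hash:
        return index[i][1]
    return None

h = hashlib.sha256(b"some file content").digest()
index = build_index([(h, "URI:CHK:example-read-cap")])
assert lookup(index, h) == "URI:CHK:example-read-cap"
```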

Since the first bytes of a read cap are a content hash, it occurred to me that 
perhaps I could simply omit the hash part of the tuple.  There are a couple 
of reasons why that doesn't work with the Tahoe read caps.

First, CHKs are computed as:

	H(params || conv || content)

This means that the CHK in the read cap is dependent upon the FEC encoding 
parameters.  That makes sense for read caps, but I don't want the backup log 
to be concerned with such details.  So, I wanted to compute my CHKs as:

	H(params || conv) XOR H(content)

H(content) is what is stored in the backup log.  My read cap index file would 
then contain the read cap, but with H(params || conv) XOR'd out.
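A quick sketch of that XOR trick, assuming SHA-256 and made-up values for 
the FEC parameters and convergence secret (the real inputs and hash are 
Tahoe's, not these):

```python
import hashlib

def H(data):
    return hashlib.sha256(data).digest()

def xor(a, b):
    # Bytewise XOR of two equal-length digests.
    return bytes(x ^ y for x, y in zip(a, b))

# Illustrative stand-ins for the FEC params and convergence secret.
params = b"3-of-10"
conv = b"convergence-secret"
content = b"file content"

# The proposed CHK: H(params || conv) XOR H(content).
chk = xor(H(params + conv), H(content))

# The index stores the cap with H(params || conv) XOR'd out, which
# leaves exactly H(content) -- the value already in the backup log,
# so the hash part of the tuple could be omitted.
stored = xor(chk, H(params + conv))
assert stored == H(content)
```

XOR'ing the same mask out again is its own inverse, which is what lets the 
index entry double as the backup log's content hash.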

Second, deltas.  During the scan and generation of the backup log, it's not 
necessarily known whether the backup will be stored as a full version or a 
delta.  So the backup log just contains the full file content hash.  If a 
delta is what actually gets backed up, though, the Tahoe-generated CHK will 
be a hash of the delta, not the original file.  I had a clever scheme for 
encoding source and destination hashes into the delta hash which I don't 
recall right now (and my notes are a few hundred miles away).

The main reason I abandoned it was that my delta hash generation scheme 
wasn't clever enough.  Since there's no way to tell from the backup logs 
which pair of versions was used to generate a delta, the process of 
searching for deltas was too complicated and required reading too much of the 
index tree, especially since there's no certainty that what you're searching 
for even exists.  I could have worked around that by storing some more 
information, but that started cutting into the space saved by the 
optimization.

Overall, I think I would still gain with the optimization, but I decided the 
benefit isn't worth the effort and complexity.  Instead I'm just using 
(hash,cap) tuples.  I'll store them even for missing files (versions that 
never manage to get backed up because the source file changes before the job 
queue gets around to uploading them), with an empty read cap so that it's 
easy to identify missing files.
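The empty-cap sentinel might look like this; the dict-based index and the 
helper names are mine, just to show the idea:

```python
import hashlib

# Hypothetical sketch: record every version, using an empty read cap
# as the sentinel for files that never got uploaded.
EMPTY_CAP = ""

def record(index, content_hash, read_cap=EMPTY_CAP):
    index[content_hash] = read_cap

def is_missing(index, content_hash):
    # Present in the index but with an empty cap == known missing.
    return index.get(content_hash) == EMPTY_CAP

index = {}
h = hashlib.sha256(b"changed before upload").digest()
record(index, h)  # source file changed before the job queue ran
assert is_missing(index, h)
```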

Hmm.  It just occurred to me that I could do that even with the optimization, 
which would solve part of the delta-finding problem... no, not going there.  
Keep it simple, and the space cost isn't that large.

> But, if you can provide other examples of how people writing atop
> Tahoe might mess up, I would really like to hear it.  Your experience
> in actually doing so (writing, that is, not messing-up) are valuable
> and I'd love to get some notes from you while they are still
> relatively fresh in your mind.

Heh.  We should have had this discussion in February, then.  I don't remember 
right now what other issues I came across.  I'll post them as they come to 
mind, though.

	Shawn.


More information about the tahoe-dev mailing list