[tahoe-dev] Keeping local file system and Tahoe store in sync

Fri Feb 6 12:52:56 PST 2009

On Tuesday 03 February 2009 06:40:13 pm Brian Warner wrote:
> So, given a file on disk, you have to do almost the entire Tahoe upload
> process to find out what the eventual Tahoe readcap is going to be. This
> sounds like it's at odds with your plan to upload the "backuplog" before
> you finish uploading some of the actual data files.

Okay, so I think I've settled on a solution for this.  The design goals were:

(1) Don't delay uploading of the backuplog any longer than necessary, or slow 
down file scanning any more than necessary.  This is to preserve 
the "snapshot-ness" of the backup.  As I've said, I probably care way more 
about that than it deserves, but nobody's tried to talk me out of it yet.

(2) Get backed-up files fully online in a reasonable amount of time.  That 
means that we shouldn't delay too long after uploading a file to upload the 
information needed to bind the hash (from the backuplog) to the ability to 
retrieve the file (the read cap).

(3) In case of no local caching, minimize the amount of information that has 
to be downloaded to retrieve a file.  Yes, the whole backuplog has to be 
downloaded (at least until random access to immutable files is implemented), 
but don't require too much more than that.

(4) Minimize the amount of extra uploading that has to be done.  There will be 
some, because there has to be information in the grid to map from backup 
hashes to read caps.

(5) Exploit the need to store read caps to the files to also link in the read 
caps of the rdiff signature files.

So what I think I'm going to do is to create read cap index files which 
contain multiple read caps per file, but not so many that it's expensive to 
upload a new version of a particular file just to add one more cap.  Since 
each read cap is 56 bytes (assuming I added that up right), and since I also 
want to include the last 40 bytes of the rdiff signature read cap, each entry 
will be 96 bytes in size, plus a byte or two for bookkeeping.

I'm initially assuming that each file will contain up to 64 entries (a 6 KiB 
file).  I expect comments from Brian and Zooko, as well as testing, may alter 
that parameter.  Oh, there's no reason it has to be a power of 2.  I just 
like powers of 2.

The read caps in the file will be sorted, but the files are small enough to be 
linearly searched anyway, as long as you can efficiently find the right file 
to search, which you can.

For large file systems, though, that's still a lot of read cap index files.  A 
million-file FS will need over 15,000 such files, and a million files isn't 
even that large ('sudo ls -R / | wc' reports 964,360 files on my desktop 
machine).  Placing them all in one directory means huge dirnode, which must 
be reuploaded every time one of them changes.

So, I'm going to create a dirnode tree to hold them.  I don't want it too 
deep, because that adds dirnode creation and traversal overhead, and I don't 
want it too flat, because that makes for large dirnodes, which are expensive 
to update.  So, I picked a fanout value (which is adjustable but does have to 
be a power of 2) of 64.  It starts out as a one-level tree and gets deeper as 
files get full.

I wrote a prototype of the tree (called a RadixTree) last night and it seems 
to work very well.  Thanks to the fact that the keys are the leading bits of 
SHA256 hash values, the tree stays nicely balanced even though the code does 
nothing to balance it. 

So here's an outline of the backup process as I see it now:

(1)  File system scanner identifies files that may have changed, based on file 
system metadata and the previous backuplog[1] (to do a full hash check, the 
scanner is just configured to report everything as a potential change).

(2)  Change detector hashes possibly-changed files to identify truly-changed 
files, and generates a backuplog and an upload work queue.  The backuplog is 
closed and uploaded.  It contains file content hashes plus metadata.

(3)  Uploader prioritizes upload jobs and starts working through them.  For 
each file it uploads[2], it also generates the rdiff signature[3] and uploads 
that as well, but with a "random" encryption key -- the hash of the original 
file.  Then it assembles the read cap entry (file read cap + signature read 
cap w/o hash) and adds it to another upload queue.

(4)  Based on some heuristics, initially just periodically (maybe hourly), the 
read cap uploader will update read cap files in the tree.  Though I haven't 
prototyped it yet, it should be relatively simple to do it without uploading 
or downloading any files that don't change.

Does that seem workable?  Are you wondering if I'm getting into crazy levels 
of complexity for little gain?

Anyway, lunch break's over; back to work.

	Shawn

[1] I've also been playing with using the FSEvents log on Mac OS X to avoid 
having to scan the file system, and I've read that there's a way to get 
access to the NTFS journal so a system service can record file system changes 
that way.  It should be possible on those platforms to get away from having 
to actually scan the file system.  Linux doesn't seem to have anything 
equivalent, unfortunately (inotify is fine if you only care about a small 
portion of the FS).

[2] The files uploaded may be full revisions or diffs.  In either case, the 
encryption key used will be the full revision hash.

[3] I'm not going to generate and upload rdiff signatures for all files.  
Below a certain size threshold (I don't know what that threshold is), it's 
faster to simply upload full revisions every time.  So for files below the 
size threshold, I won't bother creating rdiff signatures and will instead 
just fill that portion of the read cap file entry with zeros.