[tahoe-dev] Keeping local file system and Tahoe store in sync
Shawn Willden
shawn-tahoe at willden.org
Fri Feb 6 12:52:56 PST 2009
On Tuesday 03 February 2009 06:40:13 pm Brian Warner wrote:
> So, given a file on disk, you have to do almost the entire Tahoe upload
> process to find out what the eventual Tahoe readcap is going to be. This
> sounds like it's at odds with your plan to upload the "backuplog" before
> you finish uploading some of the actual data files.
Okay, so I think I've settled on a solution for this. The design goals were:
(1) Don't delay uploading of the backuplog any longer than necessary, or slow
down file scanning any more than necessary. This is to preserve
the "snapshot-ness" of the backup. As I've said, I probably care way more
about that than it deserves, but nobody's tried to talk me out of it yet.
(2) Get backed-up files fully online in a reasonable amount of time. That
means that we shouldn't delay too long after uploading a file to upload the
information needed to bind the hash (from the backuplog) to the ability to
retrieve the file (the read cap).
(3) When there is no local cache, minimize the amount of information that has
to be downloaded to retrieve a file. Yes, the whole backuplog has to be
downloaded (at least until random access to immutable files is implemented),
but don't require too much more than that.
(4) Minimize the amount of extra uploading that has to be done. There will be
some, because there has to be information in the grid to map from backup
hashes to read caps.
(5) Exploit the need to store read caps for the backed-up files to also link in
the read caps of the rdiff signature files.
So what I think I'm going to do is create read cap index files, each of which
contains multiple read caps, but not so many that it's expensive to upload a
new version of a particular index file just to add one more cap. Since
each read cap is 56 bytes (assuming I added that up right), and since I also
want to include the last 40 bytes of the rdiff signature read cap, each entry
will be 96 bytes in size, plus a byte or two for bookkeeping.
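To make the layout concrete, here's a minimal sketch of how an entry might be
packed, assuming the bookkeeping is a single flag byte (the exact bookkeeping
bytes, field order, and names below are just placeholders):

    import struct

    # Hypothetical on-disk layout for one index entry, using the sizes above:
    #    1 byte  flags ("has rdiff signature" bit; the bookkeeping byte)
    #   56 bytes file read cap
    #   40 bytes tail of the rdiff signature read cap (zero-filled for small
    #            files, per footnote [3])
    ENTRY_FORMAT = ">B56s40s"
    ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)   # 97 bytes

    def pack_entry(file_readcap, sig_cap_tail):
        flags = 0x01 if sig_cap_tail else 0x00
        return struct.pack(ENTRY_FORMAT, flags, file_readcap,
                           sig_cap_tail or b"\0" * 40)

    def unpack_entry(blob):
        flags, file_readcap, sig_cap_tail = struct.unpack(ENTRY_FORMAT, blob)
        return flags, file_readcap, (sig_cap_tail if flags & 0x01 else None)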
I'm initially assuming that each file will contain up to 64 entries (a 6 KiB
file). I expect comments from Brian and Zooko, as well as testing, may alter
that parameter. Oh, there's no reason it has to be a power of 2. I just
like powers of 2.
The read caps in the file will be sorted, but the files are small enough to be
linearly searched anyway, as long as you can efficiently find the right file
to search, which you can.
For large file systems, though, that's still a lot of read cap index files. A
million-file FS will need over 15,000 such files (1,000,000 / 64 = 15,625),
and a million files isn't even that large ('sudo ls -R / | wc' reports 964,360
files on my desktop machine). Placing them all in one directory means a huge
dirnode, which must be re-uploaded every time one of the index files changes.
So, I'm going to create a dirnode tree to hold them. I don't want it too
deep, because that adds dirnode creation and traversal overhead, and I don't
want it too flat, because that makes for large dirnodes, which are expensive
to update. So, I picked a fanout value (which is adjustable but does have to
be a power of 2) of 64. It starts out as a one-level tree and gets deeper as
the index files fill up.
I wrote a prototype of the tree (called a RadixTree) last night and it seems
to work very well. Thanks to the fact that the keys are the leading bits of
SHA256 hash values, the tree stays nicely balanced even though the code does
nothing to balance it.
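Roughly, the shape of it is something like this (an illustrative sketch, not
the actual prototype code): with a fanout of 64, each level of the tree
consumes 6 bits of the SHA-256 hash, and a leaf splits into 64 children once
it overflows.

    import hashlib

    FANOUT = 64       # 64 children per node, i.e. 6 bits of the hash per level
    MAX_LEAF = 64     # split a leaf once it holds more than this many entries

    def bucket(key, depth):
        """Return the depth-th group of 6 leading bits of the key."""
        bits = int.from_bytes(key, "big")
        shift = len(key) * 8 - 6 * (depth + 1)
        return (bits >> shift) & (FANOUT - 1)

    class RadixTree:
        def __init__(self, depth=0):
            self.depth = depth
            self.children = None   # None while this node is still a leaf
            self.entries = {}      # key -> value, used only in leaves

        def insert(self, key, value):
            if self.children is not None:
                return self.children[bucket(key, self.depth)].insert(key, value)
            self.entries[key] = value
            if len(self.entries) > MAX_LEAF:
                # Leaf overflowed: create 64 children and push entries down.
                self.children = [RadixTree(self.depth + 1) for _ in range(FANOUT)]
                for k, v in self.entries.items():
                    self.children[bucket(k, self.depth)].insert(k, v)
                self.entries = {}

        def lookup(self, key):
            if self.children is not None:
                return self.children[bucket(key, self.depth)].lookup(key)
            return self.entries.get(key)

    # Keys are SHA-256 hashes, so entries spread evenly across the buckets
    # and the tree stays balanced without any explicit rebalancing.
    tree = RadixTree()
    key = hashlib.sha256(b"some file contents").digest()
    tree.insert(key, "read cap entry goes here")
    assert tree.lookup(key) == "read cap entry goes here"

In the real thing the leaves would be the read cap index files and the
interior nodes would be dirnodes, but the balancing argument is the same.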
So here's an outline of the backup process as I see it now:
(1) File system scanner identifies files that may have changed, based on file
system metadata and the previous backuplog[1] (to do a full hash check, the
scanner is just configured to report everything as a potential change).
(2) Change detector hashes possibly-changed files to identify truly-changed
files, and generates a backuplog and an upload work queue. The backuplog is
closed and uploaded. It contains file content hashes plus metadata.
(3) Uploader prioritizes upload jobs and starts working through them. For
each file it uploads[2], it also generates the rdiff signature[3] and uploads
that as well, but with a "random" encryption key -- the hash of the original
file. Then it assembles the read cap entry (file read cap + signature read
cap w/o hash) and adds it to another upload queue. (There's a rough sketch of
this step after the list.)
(4) Based on some heuristics, initially just periodically (maybe hourly), the
read cap uploader will update read cap files in the tree. Though I haven't
prototyped it yet, it should be relatively simple to do it without uploading
or downloading any files that don't change.
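To make step (3) a little more concrete, here's a rough sketch of the flow for
a single upload job. Everything in it is hand-waving: tahoe_upload(),
make_rdiff_signature(), the threshold value, and the queue interface are all
placeholders for whatever the real pieces end up looking like.

    import hashlib

    SMALL_FILE_THRESHOLD = 64 * 1024   # placeholder; real cutoff undecided (see [3])

    def process_upload_job(path, tahoe_upload, make_rdiff_signature, readcap_queue):
        with open(path, "rb") as f:
            data = f.read()
        content_hash = hashlib.sha256(data).digest()

        # Upload the file (full revision or diff), encrypted with the
        # full-revision hash as the key (see [2]).
        file_readcap = tahoe_upload(data, key=content_hash)

        if len(data) >= SMALL_FILE_THRESHOLD:
            # Generate the rdiff signature and upload it with the same
            # "random" key; keep only the last 40 bytes of its read cap.
            signature = make_rdiff_signature(data)
            sig_readcap = tahoe_upload(signature, key=content_hash)
            sig_cap_tail = sig_readcap[-40:]
        else:
            sig_cap_tail = None    # entry gets zero-filled (see [3])

        # Hand the assembled entry off to the read cap uploader (step 4),
        # which batches updates to the index files in the RadixTree.
        readcap_queue.put((content_hash, file_readcap, sig_cap_tail))

The idea for step (4) is then that the read cap uploader drains that queue on
its own schedule and rewrites only the index files, and the dirnodes along the
affected paths, that actually changed.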
Does that seem workable? Are you wondering if I'm getting into crazy levels
of complexity for little gain?
Anyway, lunch break's over; back to work.
Shawn
[1] I've also been playing with using the FSEvents log on Mac OS X to avoid
having to scan the file system, and I've read that there's a way to get
access to the NTFS journal so a system service can record file system changes
that way. It should be possible on those platforms to get away from having
to actually scan the file system. Linux doesn't seem to have anything
equivalent, unfortunately (inotify is fine if you only care about a small
portion of the FS).
[2] The files uploaded may be full revisions or diffs. In either case, the
encryption key used will be the full revision hash.
[3] I'm not going to generate and upload rdiff signatures for all files.
Below a certain size threshold (I don't know what that threshold is), it's
faster to simply upload full revisions every time. So for files below the
size threshold, I won't bother creating rdiff signatures and will instead
just fill that portion of the read cap file entry with zeros.