[tahoe-dev] Keeping local file system and Tahoe store in sync

Tue Feb 3 17:11:24 PST 2009

On Tuesday 03 February 2009 04:47:15 pm Brian Warner wrote:
> A few notes about the differences that Shawn mentioned:
> > 4.  Backup of metadata in addition to file contents. Permissions, ACLs,
> > resource forks, etc. My ultimate goal is to be able to do whole-system
> > backups and restores, so this is essential.
>
> I'd like to get these included in "tahoe backup".. if you run into some
> code which can extract these extra pieces of metadata, please let me know..
> I've got a stubbed out function waiting for it. I don't know about resource
> forks, though, they should be attached to the filenode, not the parent
> directory entry. But, with read-only snapshots, I guess there isn't much
> difference.

Take a look in the rdiff-backup source code.  It handles extended attributes 
and ACLs.  It also handles resource forks -- poorly.  The problem with its 
handling of the resource forks is that it base-64 (or something) encodes them 
and tosses them into the text-formatted metadata log file.

This results in some lines in the file which are megabytes in length on my 
wife's Mac and is the source of EXTREMELY poor performance.  I couldn't 
figure out why rdiff-backup ran so well on all my other computers and sucked 
so hard on hers, but that was the reason.  Rather than fix rdiff-backup I 
just stopped trying to back up her Applications directory which is where all 
of the big resource forks lived.

I plan to handle resource forks as a separate file, associated with the data 
fork by referencing both of them from the single backuplog entry.
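
As a rough sketch of that idea (the class, the field names and the JSON
layout below are illustrative assumptions, not a settled backuplog format),
one entry would carry a read cap per fork, so the log line stays short no
matter how big the resource fork is:

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class BackupLogEntry:
    """One hypothetical backuplog record for a single file."""
    path: str
    data_cap: str                   # read cap of the data fork
    rsrc_cap: Optional[str] = None  # read cap of the resource fork, if any


def to_log_line(entry: BackupLogEntry) -> str:
    # Only the caps go into the log; the fork contents are stored as
    # separate files, so nothing large is ever inlined here.
    return json.dumps(asdict(entry), sort_keys=True)
```

A file without a resource fork simply leaves rsrc_cap unset.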

> > 6.  A focus on the issue of initial, large uploads. A backup session can
> > be terminated and resumed, and reasonable timestamping of backups is
> > maintained to facilitate a future "Time Machine"-like view.
>
> Eeeyah, that's a good point. In my scheme, if you interrupt a backup
> process (including the original one), you wind up with nothing in your
> Tahoe-side Backups/ directory, since the new snapshot is only attached to
> Backups/ at the very end. This also may make progress harder to track: some
> people want their Backups/ directory to get noticably larger as the process
> cranks away. On the other hand, all the files that are being uploaded will
> get stashed in the backupdb, and eventually the directories too, so you
> won't lose any actual progress by interrupting the snapshot and starting a
> new one.

I'm probably being excessively picky about this area, but it really bugs me 
that a backup may span days or weeks (or perhaps never finish!).  As a 
practical matter, that's unavoidable, but it means that the backup isn't 
anything like a snapshot, it's some weird view of different parts of the file 
system at different times.  It doesn't bother me so much that my "snapshot" 
covers minutes, but days or weeks really annoys me.

My solution is to "scan fast, upload slow".  The snapshot then spans the time 
required to scan the file system, including hashing the files that need to be 
hashed, which isn't fast but is unavoidable.  When the scan is done, upload 
jobs are generated and the backuplog, which represents the state of the file 
system, is uploaded.  If I understand immutable read caps correctly (I need 
to go read that code), I should be able to upload the log which contains all 
of the read caps before uploading the files those read caps reference.  Then 
I start actually processing the file upload jobs.
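
The scan half might look something like the sketch below.  The names
(ScanRecord, scan, hash_file) are made up for illustration, SHA-256 is just
a stand-in hash, and for simplicity it hashes every file rather than only
the ones that need it:

```python
import hashlib
import os
from dataclasses import dataclass
from typing import List


@dataclass
class ScanRecord:
    """State of one file at scan time (hypothetical structure)."""
    path: str
    size: int
    mtime: float
    sha256: str


def hash_file(path: str, chunk_size: int = 1 << 20) -> str:
    # Read in chunks so large files do not have to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def scan(root: str) -> List[ScanRecord]:
    # The "snapshot" spans only the time this walk takes; no uploading
    # happens here.
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            records.append(
                ScanRecord(path, st.st_size, st.st_mtime, hash_file(path))
            )
    return records
```

The records returned by scan() are what the backuplog and the upload jobs
would both be built from.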

If I encounter a file which has changed since being scanned, it gets skipped, 
leaving a broken link in the uploaded backuplog.  That's okay, because the 
next scan will catch that it has changed and will queue it for upload 
again -- and will put it near the front of the line, because the upload queue 
is sorted by mtime, on the theory that the oldest stuff is the most 
stable stuff and the least likely to change or disappear before it can be 
tucked safely away.
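
A minimal sketch of that upload pass, assuming "changed" means the size or
mtime no longer matches what the scan recorded, and assuming the queue runs
oldest-mtime-first.  UploadJob, has_changed and process_queue are
hypothetical names, and upload is whatever callable actually pushes the
file:

```python
import os
from typing import Callable, Iterable, List, NamedTuple, Tuple


class UploadJob(NamedTuple):
    """What the scan recorded about a file (hypothetical structure)."""
    path: str
    size: int
    mtime: float


def has_changed(job: UploadJob) -> bool:
    # A file that vanished since the scan counts as changed too.
    try:
        st = os.stat(job.path)
    except OSError:
        return True
    return st.st_size != job.size or st.st_mtime != job.mtime


def process_queue(
    jobs: Iterable[UploadJob], upload: Callable[[str], None]
) -> Tuple[List[str], List[str]]:
    uploaded, skipped = [], []
    # Oldest mtime first: the most stable files get uploaded first.
    for job in sorted(jobs, key=lambda j: j.mtime):
        if has_changed(job):
            # Leave the link broken; the next scan will requeue the file.
            skipped.append(job.path)
            continue
        upload(job.path)
        uploaded.append(job.path)
    return uploaded, skipped
```

The skipped list is only there so the caller can report what was left for
the next scan.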

I'm still noodling on how to detect and handle files that change frequently.  
I have some simple ideas that will catch the common cases, but there are 
potential pathological cases that could slip through the cracks.  I'm still 
thinking on those.

This all probably appears needlessly complex, but it has some nice properties, 
and the fact that backups are interruptible is just one of them.  Another is 
the ability to be as smart as you want about upload queueing.  I'm just 
sorting by mtime at the moment, but perhaps size makes sense to consider as 
well, and there's obviously a lot of value in prioritizing /home 
(or "Documents and Settings") over other stuff.  There's more, and I suspect 
other benefits I haven't yet noticed.  Batch, pipelined architectures have 
been popular for ~50 years for lots of reasons.
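
To make the queueing point concrete, here is one possible sort key.  The
ordering (preferred directories first, then oldest mtime, then smallest
size) and the names QueuedFile and priority_key are assumptions for the
sake of the example, not a worked-out policy:

```python
from typing import NamedTuple, Sequence, Tuple


class QueuedFile(NamedTuple):
    """A pending upload (hypothetical structure)."""
    path: str
    size: int
    mtime: float


def priority_key(
    job: QueuedFile, preferred_prefixes: Sequence[str] = ("/home/",)
) -> Tuple[int, float, int]:
    # Tuples compare element by element, so a lower tuple uploads sooner:
    # preferred directories, then older files, then smaller files.
    preferred = 0 if job.path.startswith(tuple(preferred_prefixes)) else 1
    return (preferred, job.mtime, job.size)
```

Sorting the queue with sorted(jobs, key=priority_key) then yields the
upload order, and changing the policy means changing one function.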

	Shawn.