[tahoe-dev] Keeping local file system and Tahoe store in sync
shawn-tahoe at willden.org
Tue Feb 3 17:11:24 PST 2009
On Tuesday 03 February 2009 04:47:15 pm Brian Warner wrote:
> A few notes about the differences that Shawn mentioned:
> > 4. Backup of metadata in addition to file contents. Permissions, ACLs,
> > resource forks, etc. My ultimate goal is to be able to do whole-system
> > backups and restores, so this is essential.
> I'd like to get these included in "tahoe backup".. if you run into some
> code which can extract these extra pieces of metadata, please let me know..
> I've got a stubbed out function waiting for it. I don't know about resource
> forks, though, they should be attached to the filenode, not the parent
> directory entry. But, with read-only snapshots, I guess there isn't much
Take a look in the rdiff-backup source code. It handles extended attributes
and ACLs. It also handles resource forks -- poorly. The problem with its
handling of the resource forks is that it base-64 (or something) encodes them
and tosses them into the text-formatted metadata log file.
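For concreteness, here's a rough sketch of pulling extended attributes off a
file, assuming Python 3's os.listxattr/os.getxattr on Linux. This is just the
shape of the thing: on macOS you'd use the third-party xattr module instead,
and ACLs need a separate library (e.g. pylibacl), so treat it as a stub.

```python
import os

def collect_xattrs(path):
    """Gather extended attributes for one file as a plain dict.

    Sketch only: assumes Linux and Python 3's os.listxattr/os.getxattr.
    macOS needs the third-party xattr module, and POSIX ACLs need
    something like pylibacl; neither is shown here.
    """
    attrs = {}
    for name in os.listxattr(path, follow_symlinks=False):
        attrs[name] = os.getxattr(path, name, follow_symlinks=False)
    return attrs
```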
This produced lines in the metadata file that were megabytes long on my
wife's Mac, and it was the source of the EXTREMELY poor performance. I couldn't
figure out why rdiff-backup ran so well on all my other computers and sucked
so hard on hers, but that was the reason. Rather than fix rdiff-backup I
just stopped trying to back up her Applications directory which is where all
of the big resource forks lived.
I plan to handle resource forks as a separate file, associated with the data
fork by referencing both of them from the single backuplog entry.
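Roughly, each backuplog entry would then look something like this. All the
field names here are invented for illustration -- this is my sketch of the
idea, not Tahoe's actual schema:

```python
import stat

def backuplog_entry(path, st, data_cap, rsrc_cap=None, xattrs=None):
    """Build one backuplog record (field names are hypothetical).

    The data fork and the resource fork are stored as two separate
    files; this single entry ties them together by carrying a read
    cap for each.
    """
    entry = {
        "path": path,
        "mode": stat.S_IMODE(st.st_mode),
        "uid": st.st_uid,
        "gid": st.st_gid,
        "mtime": st.st_mtime,
        "data_cap": data_cap,      # read cap for the data fork
        "xattrs": xattrs or {},
    }
    if rsrc_cap is not None:
        # Resource fork uploaded as its own file, linked from here,
        # so it never bloats the log itself the way rdiff-backup's
        # inline encoding does.
        entry["rsrc_cap"] = rsrc_cap
    return entry
```

The point of the split is that the log stays small and line-oriented no matter
how large the forks are.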
> > 6. A focus on the issue of initial, large uploads. A backup session can
> > be terminated and resumed, and reasonable timestamping of backups is
> > maintained to facilitate a future "Time Machine"-like view.
> Eeeyah, that's a good point. In my scheme, if you interrupt a backup
> process (including the original one), you wind up with nothing in your
> Tahoe-side Backups/ directory, since the new snapshot is only attached to
> Backups/ at the very end. This also may make progress harder to track: some
> people want their Backups/ directory to get noticeably larger as the process
> cranks away. On the other hand, all the files that are being uploaded will
> get stashed in the backupdb, and eventually the directories too, so you
> won't lose any actual progress by interrupting the snapshot and starting a
> new one.
I'm probably being excessively picky about this area, but it really bugs me
that a backup may span days or weeks (or perhaps never finish!). As a
practical matter, that's unavoidable, but it means that the backup isn't
anything like a snapshot, it's some weird view of different parts of the file
system at different times. It doesn't bother me so much that my "snapshot"
covers minutes, but days or weeks really annoys me.
My solution is to "scan fast, upload slow". The snapshot then spans the time
required to scan the file system, including hashing the files that need to be
hashed, which isn't fast but is unavoidable. When the scan is done, upload
jobs are generated and the backuplog, which represents the state of the file
system, is uploaded. If I understand immutable read caps correctly (I need
to go read that code), I should be able to upload the log which contains all
of the read caps before uploading the files those read caps reference. Then
I start actually processing the file upload jobs.
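The scan phase, in rough Python. This is a sketch only -- a real version would
consult the backupdb so already-uploaded content produces no job at all:

```python
import hashlib
import os
import stat

def scan(root):
    """'Scan fast' pass: walk the tree once, hashing every regular
    file, so the snapshot spans only the scan rather than the (slow)
    uploads.

    Returns (snapshot, jobs): snapshot maps path -> (mtime, digest)
    and is what gets written into the backuplog; jobs is the list of
    files still needing upload.
    """
    snapshot, jobs = {}, []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, devices, etc.
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            snapshot[path] = (st.st_mtime, digest)
            jobs.append(path)
    return snapshot, jobs
```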
If I encounter a file which has changed since being scanned, it gets skipped,
leaving a broken link in the uploaded backuplog. That's okay, because the
next scan will catch that it has changed and will queue it for upload
again -- and will put it near the front of the line, because the upload queue
is sorted by mtime, on the theory that the oldest stuff is the most
stable stuff and the least likely to change or disappear before it can be
tucked safely away.
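The upload phase might look like this rough sketch, where `upload` stands in
for the real Tahoe upload call and the mtimes come from the scan-time
snapshot:

```python
import os

def process_jobs(jobs, snapshot, upload):
    """'Upload slow' pass: oldest mtime first, on the theory that
    old files are the least likely to change or disappear.

    A file whose mtime moved since the scan is skipped, leaving a
    broken link in the uploaded backuplog; the next scan will catch
    the change and queue it again. `upload` is a stand-in for the
    real Tahoe upload call.
    """
    skipped = []
    for path in sorted(jobs, key=lambda p: snapshot[p][0]):
        try:
            if os.stat(path).st_mtime != snapshot[path][0]:
                skipped.append(path)   # changed since the scan
                continue
        except FileNotFoundError:
            skipped.append(path)       # disappeared since the scan
            continue
        upload(path)
    return skipped
```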
I'm still noodling on how to detect and handle files that change frequently.
I have some simple ideas that will catch the common cases, but there are
potential pathological cases that could slip through the cracks. I'm still
thinking on those.
This all probably appears needlessly complex, but it has some nice properties,
and the fact that backups are interruptible is just one of them. Another is
the ability to be as smart as you want about upload queueing. I'm just
sorting by mtime at the moment, but perhaps size makes sense to consider as
well, and there's obviously a lot of value in prioritizing /home
(or, "Documents and Settings") over other stuff. There's more, and I suspect
other benefits I haven't yet noticed. Batch, pipelined architectures have
been popular for ~50 years for lots of reasons.