[tahoe-dev] GridBackup functional (barely)
Shawn Willden
shawn-tahoe at willden.org
Thu Jul 2 23:41:39 PDT 2009
After far too long, GridBackup works, for a suitable definition of "works".
Specifically, it scans the file system, produces and stores the backup logs
(full and incremental) containing file metadata, and generates a queue of
upload jobs to back up the actual file content.
It also backs up the file content, and produces the read cap index entries
that link the backup logs to the content. When doing backups, it packs
small files together to help speed up the upload process. Very small files
are actually stored in the read cap index directly, and tiny files (including
empty files) are stored in the backup log. It also prioritizes the uploads,
favoring recently-modified files, small files and user files. The favoritism
for recently-modified files in particular tends to mean that the stuff you're
actually working on gets uploaded very early.
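
The placement and prioritization rules above can be sketched roughly as
follows. This is only an illustration, not GridBackup's actual code: the size
thresholds, the FileInfo fields, and the exact way the three preferences are
combined (user files first, then most recently modified by whole days, then
smallest) are all assumptions, since the post does not give them.

```python
import heapq
import time
from dataclasses import dataclass

# Hypothetical thresholds; the post does not state the real cutoffs.
TINY_MAX = 64            # bytes: content stored inline in the backup log
VERY_SMALL_MAX = 1024    # bytes: content stored in the read cap index
SMALL_MAX = 64 * 1024    # bytes: packed together with other small files
SECONDS_PER_DAY = 86400


@dataclass
class FileInfo:
    path: str
    size: int
    mtime: float
    is_user_file: bool


def placement(info):
    """Decide where a file's content goes, based only on its size."""
    if info.size <= TINY_MAX:
        return "backup_log"
    if info.size <= VERY_SMALL_MAX:
        return "read_cap_index"
    if info.size <= SMALL_MAX:
        return "packed_upload"
    return "single_upload"


def priority(info, now):
    """Lower tuples sort first: user files, then recent, then small."""
    age_days = int((now - info.mtime) // SECONDS_PER_DAY)
    return (0 if info.is_user_file else 1, age_days, info.size)


def upload_order(files, now=None):
    """Yield the files that need an upload job, most urgent first."""
    now = time.time() if now is None else now
    # The index breaks ties so FileInfo objects are never compared.
    heap = [(priority(f, now), i, f) for i, f in enumerate(files)
            if placement(f) in ("packed_upload", "single_upload")]
    heapq.heapify(heap)
    while heap:
        yield heapq.heappop(heap)[2]
```

Files whose content lands in the backup log or the read cap index never
enter the upload queue in this sketch, because they need no separate upload.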
Both the backup scanning and upload processes are completely interruptible and
restartable. The initial full scan is slow (gotta hash your whole drive,
basically), but subsequent incremental scans are pretty fast. Uploading is
slow, but that's pretty much unavoidable (though I do have some plans to
improve it a little via parallelism).
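
One way a scan can be slow the first time, fast afterwards, and safe to
interrupt is sketched below. It is a guess at the general technique, not a
description of GridBackup's scanner: the JSON state file, the size-plus-mtime
change check, SHA-256, and the checkpoint interval are all assumptions made
for the example.

```python
import hashlib
import json
import os


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_state(state_path):
    """Load the previous scan state, or start empty on the first run."""
    try:
        with open(state_path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}


def save_state(state, state_path):
    """Write the state via a temp file so an interruption cannot corrupt it."""
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, state_path)


def scan(root, state_path, checkpoint_every=100):
    """Hash new or modified files under root and return their paths."""
    state = load_state(state_path)
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            signature = [info.st_size, info.st_mtime_ns]
            previous = state.get(path)
            if previous is not None and previous["sig"] == signature:
                continue  # unchanged since the last scan: no hashing needed
            state[path] = {"sig": signature, "hash": hash_file(path)}
            changed.append(path)
            if len(changed) % checkpoint_every == 0:
                save_state(state, state_path)  # progress survives a restart
    save_state(state, state_path)
    return changed
```

The first call hashes everything; later calls only hash files whose size or
modification time differs from the saved state, and a scan that is killed
midway resumes from the last checkpoint. The state file should live outside
the scanned tree.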
I am actually going to begin using GridBackup for my daily backups tomorrow,
after I get an allmydata.com account and figure out how to point my local
Tahoe node at it. In a couple of months I plan to set up a friendnet and
switch to using that.
What doesn't work yet is:
1. Difference-based backups. All of the infrastructure is in place to do
them, and adding that functionality should be nearly trivial, but I want to
focus on testing the basic backup functionality for a while.
2. Restores. Yeah, this is a rather important bit :-). It's my next major
task. It also shouldn't be tremendously difficult, though, since it will use
all of the code used to create the logs, read cap index, etc., to parse them.
3. Cross-platform support. All development and testing has been done on
Linux, and while there is some stuff in place to handle Windows and OS X
issues, testing, debugging and, I'm sure, fixing are required.
4. GUI. Right now the UI consists of two scripts (GridBackup and GridUpload)
and a config file (~/.GridBackup/config.ini). The scripts don't even have
any command line options; you have to configure everything through the config
file.
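
For readers unfamiliar with driving a script purely from an INI file, here is
a minimal sketch using Python's standard configparser. Only the file location
comes from the post; the [scan] section and its root and exclude options are
invented placeholders and do not reflect GridBackup's real config layout.

```python
import configparser
import os

# Path taken from the post. The section and option names used below are
# hypothetical placeholders, not GridBackup's actual config schema.
CONFIG_PATH = os.path.expanduser("~/.GridBackup/config.ini")


def load_settings(path=CONFIG_PATH):
    """Read settings from an INI file, falling back to defaults."""
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is silently skipped
    root = parser.get("scan", "root", fallback=os.path.expanduser("~"))
    raw_exclude = parser.get("scan", "exclude", fallback="")
    exclude = [item.strip() for item in raw_exclude.split(",") if item.strip()]
    return {"root": root, "exclude": exclude}
```

With no command line options, a script would simply call load_settings() at
startup and take everything from the returned dictionary.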
Somewhere in there I'm also going to do a major code restructuring. I've
learned a lot about how Python code should be put together while writing
this, so the current codebase is an inconsistent mishmash of approaches.
Other items that are even further away are: slick platform-appropriate
installers that install and configure both GridBackup and Tahoe; integration
of GridBackup and GridUpload into a single, daemonable (twistd) process that
uses twisted to achieve some level of upload parallelism to improve
performance; and other file system scanners, including one for OS X that uses
fsevents, and a cross-platform scanner that operates remotely, to enable
a 'backup server'.
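
The planned upload parallelism is meant to be built on twisted. Purely to
illustrate the idea of running a bounded number of uploads at once, the sketch
below swaps in the standard library's ThreadPoolExecutor instead, so it is
not the planned design. The upload_one callable and the job values are
hypothetical stand-ins for whatever actually sends a file to Tahoe.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def upload_all(jobs, upload_one, max_workers=4):
    """Run upload_one(job) for every job, at most max_workers at a time.

    Returns (results, failures): results maps each job to its return
    value, failures maps each failed job to the exception it raised,
    so failed jobs can be put back on the queue. Jobs must be hashable.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(upload_one, job): job for job in jobs}
        for future in as_completed(futures):
            job = futures[future]
            try:
                results[job] = future.result()
            except Exception as error:  # keep going if one upload fails
                failures[job] = error
    return results, failures
```

Because each job succeeds or fails independently, an interrupted or partly
failed run can be resumed by resubmitting only the jobs left in failures.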
Anyway, I have lots more work to do, but the code actually successfully makes
backups now.
Anyone who is interested can grab the current code from my github-hosted repo,
at:
http://github.com/divegeek/GridBackup
If you decide to play with it, you'll almost certainly run into problems.
Feel free to ask questions here, or mail me directly. I'll try to actually
pay a little attention to my IRC client, which hangs out on #tahoe, too.
Shawn.