[tahoe-dev] GridBackup functional (barely)

Thu Jul 2 23:41:39 PDT 2009

After far too long, GridBackup works, for a suitable definition of "works".

Specifically, it scans the file system, produces and stores the backup logs 
(full and incremental) containing file metadata, and generates a queue of 
upload jobs to back up the actual file content.

It also backs up the file content, and produces the read cap index entries 
that link the backup logs to the content.  When doing backups, it packs 
small files together to help speed up the upload process.  Very small files 
are actually stored in the read cap index directly, and tiny files (including 
empty files) are stored in the backup log.  It also prioritizes the uploads, 
favoring recently-modified files, small files and user files.  The favoritism 
for recently-modified files in particular tends to mean that the stuff you're 
actually working on gets uploaded very early.
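
As a rough illustration of the size tiers and upload prioritization described above, here is a minimal sketch. It is not GridBackup's actual code: the byte thresholds, the scoring weights, and the names placement, priority, upload_order and FileInfo are all assumptions made up for this example, since the post does not state the real cutoffs or the real ordering rule.

```python
import heapq
import time
from dataclasses import dataclass

# Hypothetical thresholds in bytes; the real cutoffs are not given in the post.
TINY_MAX = 64              # at or below this: content lives in the backup log
VERY_SMALL_MAX = 1024      # at or below this: content lives in the read cap index
SMALL_MAX = 64 * 1024      # at or below this: packed together with other small files


def placement(size):
    """Decide where a file's content is stored, based only on its size."""
    if size <= TINY_MAX:
        return "backup_log"
    if size <= VERY_SMALL_MAX:
        return "read_cap_index"
    if size <= SMALL_MAX:
        return "packed_upload"
    return "single_upload"


@dataclass
class FileInfo:
    path: str
    size: int
    mtime: float        # seconds since the epoch
    is_user_file: bool


def priority(info, now):
    """Return a score; lower scores upload first. Weights are illustrative guesses."""
    age_days = max(0.0, now - info.mtime) / 86400.0
    size_mb = info.size / (1024.0 * 1024.0)
    system_penalty = 0.0 if info.is_user_file else 30.0
    return age_days + size_mb + system_penalty


def upload_order(files, now=None):
    """Yield files in upload order: recent, small, user files first."""
    now = time.time() if now is None else now
    # The index breaks ties so FileInfo objects are never compared directly.
    heap = [(priority(f, now), i, f) for i, f in enumerate(files)]
    heapq.heapify(heap)
    while heap:
        _, _, f = heapq.heappop(heap)
        yield f
```

With a scoring rule of this shape, a small file edited an hour ago sorts ahead of both a large old user file and an untouched system library, which matches the "stuff you're actually working on goes first" behavior.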

Both the backup scanning and upload processes are completely interruptible and 
restartable.  The initial full scan is slow (gotta hash your whole drive, 
basically), but subsequent incremental scans are pretty fast.  Uploading is 
slow, but that's pretty much unavoidable (though I do have some plans to 
improve it a little via parallelism).
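
To make the full-versus-incremental difference concrete, here is a minimal sketch of one common way to get fast, restartable rescans: remember each file's size and mtime, rehash only when they change, and checkpoint the state as you go. The post does not say how GridBackup does it, so treat this as an assumption; the JSON state file, the choice of SHA-256, and the function names are all hypothetical.

```python
import hashlib
import json
import os


def hash_file(path, chunk_size=1 << 20):
    """Hash a file's content in chunks so large files don't fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_state(state_path):
    """Load the saved scan state, or start empty on the first (full) scan."""
    try:
        with open(state_path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def save_state(state, state_path):
    """Write to a temp file and rename, so an interrupt never leaves a torn file."""
    tmp = state_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, state_path)


def scan(root, state_path):
    """Return the paths that are new or changed since the last scan.

    state_path should live outside root so the state file is not scanned.
    """
    state = load_state(state_path)
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
                prev = state.get(path)
                if prev and prev["size"] == st.st_size and prev["mtime"] == st.st_mtime:
                    continue  # unchanged: skip the expensive hash
                entry = {"size": st.st_size, "mtime": st.st_mtime,
                         "sha256": hash_file(path)}
            except OSError:
                continue  # vanished or unreadable; skip it in this sketch
            state[path] = entry
            changed.append(path)
        save_state(state, state_path)  # checkpoint after each directory
    return changed
```

The first run hashes everything. Later runs only stat each file, and a run that is killed partway picks up from the last checkpoint instead of rehashing what it already saw.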

I am actually going to begin using GridBackup for my daily backups tomorrow, 
after I get an allmydata.com account and figure out how to point my local 
Tahoe node at it.  In a couple of months I plan to set up a friendnet and 
switch to using that.

What doesn't work yet is:

1.  Difference-based backups.  All of the infrastructure is in place to do 
them, and adding that functionality should be nearly trivial, but I want to 
focus on testing the basic backup functionality for a while.

2.  Restores.  Yeah, this is a rather important bit :-).  It's my next major 
task.  It also shouldn't be tremendously difficult, though, since it will use 
all of the code used to create the logs, read cap index, etc., to parse them.

3.  Cross-platform support.  All development and testing has been done on 
Linux, and while there is some stuff in place to handle Windows and OS X 
issues, testing, debugging and, I'm sure, fixing are still required.

4.  GUI.  Right now the UI consists of two scripts (GridBackup and GridUpload) 
and a config file (~/.GridBackup/config.ini).  The scripts don't even have 
any command line options; you have to configure everything through the config 
file.
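
For anyone who has not worked with an INI-driven script before, here is a minimal sketch of how a script might read ~/.GridBackup/config.ini with Python's standard configparser module. The section names, option names and defaults below are hypothetical, invented for this example; the real layout of GridBackup's config file is not shown in this post. The node URL default assumes a Tahoe node's web API on its usual local port.

```python
import configparser
import os

# Hypothetical sections and options; not GridBackup's real config layout.
DEFAULTS = {
    "scan": {"roots": "~", "exclude": ""},
    "tahoe": {"node_url": "http://127.0.0.1:3456"},
}


def load_config(path="~/.GridBackup/config.ini"):
    """Load defaults first, then overlay whatever the user's file sets."""
    parser = configparser.ConfigParser()
    parser.read_dict(DEFAULTS)
    parser.read(os.path.expanduser(path))  # a missing file is silently ignored
    return parser


def scan_roots(parser):
    """Split the comma-separated 'roots' option into a list of expanded paths."""
    raw = parser.get("scan", "roots")
    return [os.path.expanduser(p.strip()) for p in raw.split(",") if p.strip()]
```

Loading defaults before the user's file means a script with no command line options still has sane behavior when an option is left out.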

Somewhere in there I'm also going to do a major code restructuring.  I've 
learned a lot about how Python code should be put together along the way, so 
the current codebase is an inconsistent mishmash of approaches.

Other items that are even further away are: slick platform-appropriate 
installers that install and configure both GridBackup and Tahoe; integration 
of GridBackup and GridUpload into a single, daemonable (twistd) process that 
uses twisted to achieve some level of upload parallelism to improve 
performance; and other file system scanners, including one for OS X that uses 
fsevents, and a cross-platform scanner that operates remotely, to enable 
a 'backup server'.

Anyway, I have lots more work to do, but the code actually successfully makes 
backups now.

Anyone who is interested can grab the current code from my github-hosted repo, 
at:

	http://github.com/divegeek/GridBackup

If you decide to play with it, you'll almost certainly run into problems.  
Feel free to ask questions here, or mail me directly.  I'll try to actually 
pay a little attention to my IRC client which hangs out on #tahoe, too.

	Shawn.