[tahoe-dev] Thinking about building a P2P backup system
Shawn Willden
shawn-tahoe at willden.org
Thu Jan 8 22:49:16 PST 2009
On Thursday 08 January 2009 09:17:40 am zooko wrote:
> Have you seen this thread? It might be a good project for you, as it
> is self-contained, requires minimal changes to the tahoe core itself,
> and is closely related to your idea about good backup:
>
> http://allmydata.org/pipermail/tahoe-dev/2008-September/000809.html
I think it's exactly what I want to do. Not exactly the way I want to do it,
but close.
Here's an overview of how I think I'll approach the problem. All suggestions
welcome.
First, I'll run the backup as two separate processes: a scanner, which
determines what to back up, and an uploader, which does the actual backing
up.
There are several reasons for separating the two tasks, but the biggest ones
are (a) I expect there will ultimately be several different scanners and (b)
because a large backup will take a long time (especially an initial backup),
the upload process needs to be resumable anyway, so there will have to be
some way of tracking where it is in the process. Having an upload queue
generated by the scanner and written to disk is a good way. An interrupted
scan can simply be restarted from the beginning, since it's relatively cheap.
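As a rough sketch of that split (all names here are hypothetical, not part of
any existing tool), the scanner could write one path per line to an on-disk
queue, and the uploader could resume by skipping paths it has already handled:

```python
# Hypothetical sketch of the scanner/uploader split with an on-disk queue.
import os

QUEUE_PATH = "upload.queue"

def scan(root, needs_backup):
    """Walk the tree and write one path per line to the on-disk queue.
    An interrupted scan is simply rerun from scratch, since it's cheap."""
    with open(QUEUE_PATH, "w") as queue:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if needs_backup(path):
                    queue.write(path + "\n")

def pending_uploads(done):
    """Yield queued paths not yet uploaded, so an interrupted upload can
    resume; 'done' is the set of paths already processed."""
    with open(QUEUE_PATH) as queue:
        for line in queue:
            path = line.rstrip("\n")
            if path not in done:
                yield path
```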
Change detection will of course be based on size, mtime and ctime. ctime
needs to be examined to catch metadata changes. The scanner will check these
values against a backupdb (more on that below). If a full "hash check" is
desired, the scanner can simply queue every file for the uploader.
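The change check itself is just a tuple comparison against the backupdb
record; here's a minimal sketch, assuming the backupdb is a simple mapping
keyed by path (the real one would be persistent):

```python
# Hedged sketch of the size/mtime/ctime check against the backupdb.
import os

def changed(path, backupdb):
    """True if the file's stat tuple differs from the stored record."""
    st = os.stat(path)
    current = (st.st_size, st.st_mtime, st.st_ctime)
    # ctime is included to catch metadata-only changes (mode, ownership).
    return backupdb.get(path) != current
```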
The uploader will read from the queue and compute an rdiff signature for each
file. If the file is in the backupdb, the uploader will check the computed
rdiff signature against the stored signature. If they differ, it will use
the stored signature to compute an rdiff delta and upload that. It will also
store the current metadata in the backupdb.
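The real uploader would use librsync (rdiff) signatures with rolling
checksums, but a toy stand-in with fixed-size block hashes is enough to show
the compare-signature-then-send-delta shape; everything below is a
simplification, not rdiff's actual format:

```python
# Toy stand-in for the rdiff signature/delta flow (not real librsync).
import hashlib

BLOCK = 4096

def signature(data):
    """One hash per fixed-size block (rdiff also uses rolling checksums)."""
    return [hashlib.md5(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def delta(old_sig, new_data):
    """Blocks whose hashes differ from the stored signature; the rest can
    be reconstructed from the previously uploaded version."""
    out = []
    for n, i in enumerate(range(0, len(new_data), BLOCK)):
        block = new_data[i:i + BLOCK]
        if n >= len(old_sig) or old_sig[n] != hashlib.md5(block).hexdigest():
            out.append((n, block))
    return out
```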
To avoid missing changes due to limited mtime granularity, the stored
metadata will have mtime set to min(mtime, start-1), where "start" is the
time that the uploader started reading the file to compute the rdiff
signature.
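In code the clamp is one line; the point is that if the file is modified
again within the same granularity window, the stored value is already older
than the new mtime, so the next scan re-queues the file:

```python
def clamp_mtime(mtime, start):
    """Store min(mtime, start - 1) so a same-second modification still
    looks newer than the stored record on the next scan."""
    return min(mtime, start - 1)
```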
There is another possible issue, a serious one. If the file changes between
the calculation of the rdiff signature and the generation of the delta, the
chain of forward deltas will be broken. To avoid that, the uploader will
check the
mtime after generating the delta and if it is greater than or equal
to "start", the uploader will calculate a new rdiff while copying the file to
a temporary location, compute the delta from the temporary and use those
rdiff and delta values. The copy is only really necessary if the file is
going to continue changing, but since it changed once while the uploader was
working, it's reasonable to assume that it might be a log file or something
else that's changing frequently.
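That guard might look like the following sketch, where compute_signature,
compute_delta, and upload are hypothetical placeholders for the real rdiff
and Tahoe operations:

```python
# Sketch of the change-during-upload guard: if mtime moved past 'start'
# while we worked, redo signature and delta from a private snapshot so
# the two are guaranteed to agree.
import os
import shutil
import tempfile
import time

def backup_file(path, stored_sig, compute_signature, compute_delta, upload):
    start = time.time()
    sig = compute_signature(path)
    d = compute_delta(stored_sig, path)
    if os.stat(path).st_mtime >= start:
        # The file changed under us; snapshot it and redo both steps
        # from the stable copy.
        tmp = tempfile.NamedTemporaryFile(delete=False)
        tmp.close()
        shutil.copyfile(path, tmp.name)
        sig = compute_signature(tmp.name)
        d = compute_delta(stored_sig, tmp.name)
        os.unlink(tmp.name)
    upload(d)
    return sig
```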
When the uploader has exhausted the queue, it then uploads the backupdb. In
order to make that efficient, for incremental backups it uploads a delta of
the backupdb.
I'm still thinking about how to structure the directories into which all this
stuff gets uploaded. I think it'll probably be a single tree structure that
mirrors the backed-up filesystem structure, but each file will become a
subdirectory, into which the original plus deltas will be placed. The
backupdb and its deltas will go just above the tree. Oh, and all of the
various uploads mentioned are immutable.
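Concretely, a hypothetical layout (names invented for illustration) might
look like:

```
backups/
  backupdb.orig
  backupdb.delta.1
  home/
    shawn/
      notes.txt/        <- each file becomes a subdirectory
        original
        delta.1
        delta.2
```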
The downside of using forward deltas is that you have to have all of them, in
a complete unbroken chain, in order to get from the original to any other
version, including the most recent. This means that the longer the chain of
deltas, the less reliable the most recent version is, so we'll also need
another maintenance process, one with the ability to retrieve the files and
the deltas, reconstruct the most recent version, and then generate reverse
deltas, probably discarding some of the backup intervals along the way.
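The fragility is easy to see in a sketch: reconstructing any version means
applying every forward delta in order, so losing one breaks everything after
it (apply_delta below is a placeholder for the real patch operation):

```python
# Why an unbroken chain matters: version N requires every delta 1..N.
def reconstruct(original, deltas, apply_delta):
    data = original
    for d in deltas:  # losing any one delta breaks the chain here
        data = apply_delta(data, d)
    return data
```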
There are lots of details to fill in, but that's the outline. Comments?
Shawn.