[tahoe-dev] Thinking about building a P2P backup system
Ian Levesque
ian at ephemeronindustries.com
Fri Jan 9 07:40:42 PST 2009
This really does seem to be reinventing rdiff-backup. Are you sure you
can't save yourself a lot of effort by modifying that instead?
-Ian
On Jan 9, 2009, at 1:49 AM, Shawn Willden wrote:
> On Thursday 08 January 2009 09:17:40 am zooko wrote:
>> Have you seen this thread? It might be a good project for you, as it
>> is self-contained, requires minimal changes to the tahoe core itself,
>> and is closely related to your idea about good backup:
>>
>> http://allmydata.org/pipermail/tahoe-dev/2008-September/000809.html
>
> I think it's exactly what I want to do. Not exactly the way I want to
> do it, but close.
>
> Here's an overview of how I think I'll approach the problem. All
> suggestions welcome.
>
> First, I'll run the backup as two separate processes: a scanner,
> which determines what to back up, and an uploader, which does the job
> of backing things up.
>
> There are several reasons for separating the two tasks, but the
> biggest ones are (a) I expect there will ultimately be several
> different scanners, and (b) because a large backup will take a long
> time (especially an initial backup), the upload process needs to be
> resumable anyway, so there will have to be some way of tracking where
> it is in the process. Having an upload queue generated by the scanner
> and written to disk is a good way to do that. An interrupted scan can
> simply be restarted from the beginning, since scanning is relatively
> cheap.
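>
> A rough sketch of the scanner half, assuming a simple line-oriented
> queue file on disk (all of the names and paths here are just
> placeholders):
>
>   # Minimal scanner: walk a tree and append candidate paths to an
>   # on-disk upload queue for the uploader to consume later. The real
>   # scanner would consult the backupdb before queueing anything.
>   import os
>
>   def scan(root, queue_path):
>       with open(queue_path, "a") as queue:
>           for dirpath, dirnames, filenames in os.walk(root):
>               for name in filenames:
>                   queue.write(os.path.join(dirpath, name) + "\n")
>
>   scan("/home/shawn", "/var/backups/upload-queue")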
>
> Change detection will of course be based on size, mtime, and ctime;
> ctime needs to be examined to catch metadata changes. The scanner
> will check these values against a backupdb (more on that below). If a
> full "hash check" is desired, the scanner can simply queue every file
> for the uploader.
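>
> The check itself is trivial; something like this, assuming the
> backupdb hands back the (size, mtime, ctime) recorded at the last
> backup (the backupdb interface here is made up for illustration):
>
>   import os
>
>   def needs_backup(path, backupdb):
>       # Queue the file if size, mtime or ctime differ from what the
>       # backupdb recorded, or if it has never been backed up at all.
>       st = os.lstat(path)
>       recorded = backupdb.get(path)
>       if recorded is None:
>           return True
>       size, mtime, ctime = recorded
>       return (st.st_size != size or
>               st.st_mtime != mtime or
>               st.st_ctime != ctime)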
>
> The uploader will read from the queue and compute an rdiff signature
> for each file. If the file is in the backupdb, the uploader will
> check the computed rdiff signature against the stored signature. If
> they differ, it will use the stored signature to compute an rdiff
> delta and upload that. It will also store the current metadata in the
> backupdb.
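>
> One simple way to produce the signatures and deltas is to shell out
> to the rdiff tool that ships with librsync, roughly like this (error
> handling omitted):
>
>   import subprocess
>
>   def rdiff_signature(path, sig_path):
>       # rdiff signature BASIS SIGNATURE
>       subprocess.check_call(["rdiff", "signature", path, sig_path])
>
>   def rdiff_delta(sig_path, path, delta_path):
>       # rdiff delta SIGNATURE NEWFILE DELTA
>       subprocess.check_call(["rdiff", "delta",
>                              sig_path, path, delta_path])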
>
> To avoid missing changes due to limited mtime granularity, the stored
> metadata will have mtime set to min(mtime, start - 1), where "start"
> is the time that the uploader started reading the file to compute the
> rdiff signature. That way, if the file is modified again within the
> same second, the next scan will still see a mismatch.
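>
> In code that's a one-liner (st being the stat result from the change
> check above):
>
>   import time
>
>   start = int(time.time())             # just before reading the file
>   # ... compute the rdiff signature, generate the delta ...
>   recorded_mtime = min(st.st_mtime, start - 1)  # goes in the backupdb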
>
> There is another possible issue, a serious one: if the file changes
> between the calculation of the rdiff signature and the generation of
> the delta, the chain of forward deltas will be broken. To avoid that,
> the uploader will check the mtime after generating the delta, and if
> it is greater than or equal to "start", it will copy the file to a
> temporary location, calculate a new rdiff signature and delta from
> that copy, and use those values. The copy is only really necessary if
> the file is going to continue changing, but since it changed once
> while the uploader was working, it's reasonable to assume that it
> might be a log file or something else that changes frequently.
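>
> Roughly like this (compute_signature / compute_delta stand in for the
> rdiff calls above, and shutil/tempfile for however the temporary copy
> actually gets made):
>
>   import os, shutil, tempfile
>
>   def make_delta_safely(path, stored_sig, start):
>       # Generate signature and delta, redoing the work from a
>       # snapshot copy if the file changed underneath us while we
>       # were working, so both describe the same bytes.
>       sig = compute_signature(path)
>       delta = compute_delta(stored_sig, path)
>       if os.lstat(path).st_mtime >= start:
>           tmp = os.path.join(tempfile.mkdtemp(),
>                              os.path.basename(path))
>           shutil.copy2(path, tmp)
>           sig = compute_signature(tmp)
>           delta = compute_delta(stored_sig, tmp)
>       return sig, delta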
>
> When the uploader has exhausted the queue, it then uploads the
> backupdb. To make that efficient, for incremental backups it uploads
> a delta of the backupdb instead.
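>
> That's just the same rdiff machinery applied to the backupdb file
> itself, e.g. (the paths and upload() are placeholders for wherever
> the backupdb lives and whatever actually pushes bytes to the grid):
>
>   # Incremental run: upload only a delta of the backupdb, computed
>   # against the signature saved from the previous run.
>   delta_path = "/var/backups/backupdb.delta"
>   rdiff_delta("/var/backups/backupdb.prev.sig",
>               "/var/backups/backupdb", delta_path)
>   upload(delta_path)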
>
> I'm still thinking about how to structure the directories into which
> all this stuff gets uploaded. I think it'll probably be a single tree
> structure that mirrors the backed-up filesystem structure, but each
> file will become a subdirectory, into which the original plus its
> deltas will be placed. The backupdb and its deltas will go just above
> the tree. Oh, and all of the various uploads mentioned are immutable.
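>
> So the grid-side location for a given local file would be computed
> something like this (the naming is purely illustrative):
>
>   import posixpath
>
>   def backup_path(backup_root, local_path, version):
>       # Each backed-up file gets its own directory in the mirrored
>       # tree; version 0 is the original, later versions are deltas.
>       filedir = posixpath.join(backup_root, "tree",
>                                local_path.lstrip("/"))
>       if version == 0:
>           return posixpath.join(filedir, "original")
>       return posixpath.join(filedir, "delta.%d" % version)
>
>   # backup_path("backups", "/home/shawn/notes.txt", 2)
>   #   -> "backups/tree/home/shawn/notes.txt/delta.2"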
>
> The downside of using forward deltas is that you have to have all of
> them, in a complete unbroken chain, in order to get from the original
> to any other version, including the most recent. This means that the
> longer the chain of deltas, the less reliable the most recent version
> is. That in turn means we'll also need another maintenance process,
> one that can retrieve the original files and the deltas, reconstruct
> the most recent version, and then generate reverse deltas, probably
> discarding some of the backup intervals along the way.
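>
> Reconstructing any given version from the original plus its forward
> deltas is just repeated application of "rdiff patch", roughly:
>
>   import shutil, subprocess
>
>   def reconstruct(original, deltas, output):
>       # Apply the chain of forward deltas in order; every link in
>       # the chain has to be retrievable for this to work, which is
>       # what the maintenance process is there to keep bounded.
>       shutil.copyfile(original, output)
>       for delta in deltas:
>           patched = output + ".patched"
>           # rdiff patch BASIS DELTA NEWFILE
>           subprocess.check_call(["rdiff", "patch",
>                                  output, delta, patched])
>           shutil.move(patched, output)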
>
> There are lots of details to fill in, but that's the outline.
> Comments?
>
> Shawn.
> _______________________________________________
> tahoe-dev mailing list
> tahoe-dev at allmydata.org
> http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev