[tahoe-dev] Backup plans, part 1

Shawn Willden shawn-tahoe at willden.org
Thu Jan 22 20:44:52 PST 2009


Here are my current plans for handling incremental backups.  I'm
currently prototyping the approach with some code that backs up to a
local file system.  After I'm comfortable with it, I'll start looking
into how to push the data into the grid.


The Backupdb
============

-- Purpose & Structure --

The core of the backup data structure is a file called the backupdb.
Its purpose is to provide all of the information needed to either:

1.  Generate a new backup, by comparing data about the previous backup
with the current state of the file system; or

2.  Recover files from backup.

The backupdb may be cached locally on the backup server, but will
generally live on the grid.

The backupdb must change with each backup run, to reflect the
locations of backed-up copies of files and metadata, and to record the
new state for comparison with future backups.  To handle this, I plan
to treat the backupdb as a versioned file.  Each revision contains a
snapshot[1] of the complete state of the backed-up file system (or
directory) -- everything but the actual file contents. The same
general mechanisms used to handle versioning of individual files will
be used to handle backupdb versioning.

The backupdb will be in JSON format and contain the following for each
file in the backed-up set[2]:

o  Complete pathname
o  File hash (base 62)
o  Platform-dependent metadata, not including filename/path
o  Write cap of the backup file revlog (see below)
o  Write cap of the backup file index (see below)
o  Librsync signature

The backupdb will also have a header containing the file log format
version number, backup start timestamp and backupdb revision number;
and a footer containing the backup completion timestamp.
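
For concreteness, a single backupdb entry might look something like
this (the field names here are illustrative, not final):

  {
    "path": "/home/shawn/documents/report.txt",
    "hash": "4fQZ1kT9aVbN0xRwGm3cHj",
    "metadata": {"mode": 420, "uid": 1000, "mtime": 1232678400},
    "revlog_writecap": "URI:SSK:...",
    "index_writecap": "URI:SSK:...",
    "rsync_signature": "<base64-encoded librsync signature>"
  }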

-- Efficiency --

The most common use of the backupdb is to generate a new backupdb (and
incidentally back up some files).  To make that efficient, I'll take
care to ensure that all entries sharing a common prefix (i.e., a
common parent directory) are contiguous, even though the entries as a
whole need not be in any particular order.

This provides a nice way to allow streaming uploads and downloads of
backupdbs during a backup, without ever holding a full copy of either
the previous or the current backupdb in local storage.  Basically, as
the old backupdb comes down, I can check its contents against the file
system state and simultaneously generate and upload the new backupdb.
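
A rough sketch of that merge walk, in Python (the entry objects and
the backup_new_file/backup_changed_file/changed helpers are
hypothetical; I also assume both streams arrive in plain path order):

  def merge_walk(old_entries, fs_entries, emit):
      """Walk the previous backupdb and the current file system state
      in parallel, emitting entries for the new backupdb as we go."""
      old = next(old_entries, None)
      cur = next(fs_entries, None)
      while old is not None or cur is not None:
          if cur is None or (old is not None and old.path < cur.path):
              # File vanished since the last backup; drop its entry.
              old = next(old_entries, None)
          elif old is None or cur.path < old.path:
              # New file: do a full backup and emit a fresh entry.
              emit(backup_new_file(cur))
              cur = next(fs_entries, None)
          else:
              # Present in both: upload a delta only if it changed.
              emit(backup_changed_file(old, cur) if changed(old, cur)
                   else old)
              old = next(old_entries, None)
              cur = next(fs_entries, None)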

The other common use of the backupdb is to find and retrieve a
specific revision.  To do that, I have to search the appropriate
backupdb revision for the file entry.  To make finding the appropriate
backupdb revision fast, there will be a simple index of backupdb
revision numbers and ending timestamps.  It might also be nice to have
an index to speed up locating the file within the backupdb, but I'll
leave that for later.
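
With that index in hand, picking the right backupdb revision for a
point in time is a binary search over the ending timestamps.  A
minimal sketch, assuming the index is a timestamp-sorted list of
(completion_timestamp, revision_number) pairs:

  import bisect

  def find_backupdb_revision(index, when):
      """Return the number of the last backupdb revision completed at
      or before `when`, or None if no backup is that old."""
      times = [t for t, _rev in index]
      i = bisect.bisect_right(times, when)
      return index[i - 1][1] if i else None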


File and Increment Storage
==========================

Files will be stored in a revlog-ish format. Like Mercurial's revlogs,
each file will have an index file of compact, fixed-length records,
and a log file of variable-length records containing file content.
Both files are append-only.  Unlike Mercurial's revlogs, the revlog
entries for full revisions will not contain the revision text.
Instead, they'll contain the read cap of an immutable file.  The
reason for separating out the full revisions is to take advantage of
the idempotency of writing immutable files to the grid with a shared
convergence secret.

Specifically, each revlog entry contains:

o  Entry content
o  Entry content size (6 octets)
o  Entry type (1 octet)
o  File hash (16 octets)

The entry type is one of:

  0 - Full revision
  1 - Librsync delta
  2 - bsdiff delta
  3 - xdelta delta
  4 - diff-style patch

The entry content for full revisions will contain a read cap.  For the
delta types, it will contain the delta data.
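
In octets, then, each entry is its variable-length content followed by
a fixed 23-octet trailer.  A sketch of packing that trailer (big-endian
is my assumption; byte order isn't pinned down yet):

  def pack_trailer(content_size, entry_type, file_hash):
      """Pack the fixed trailer that follows each entry's content:
      6-octet size, 1-octet type, 16-octet hash, 23 octets total."""
      assert len(file_hash) == 16
      return (content_size.to_bytes(6, "big")
              + entry_type.to_bytes(1, "big")
              + file_hash)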

Note that the revlog entries are "backwards".  Normally, one would
expect the fixed-length header to precede the variable-length data, so
that the entry can be read front to back.  In this case, the entry
must be read back to front.  This allows entries to be stream-uploaded
as they're generated, without needing a temporary file to hold the
data while the size is computed.
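
Reading an entry back to front then means seeking to its end, peeling
off the trailer, and using the size field to locate the start of the
content (a sketch, matching the trailer layout assumed above):

  def read_entry(f, end_offset):
      """Read the revlog entry whose last octet sits at end_offset - 1.
      Returns (entry_type, file_hash, content, start_offset)."""
      f.seek(end_offset - 23)               # trailer: 6 + 1 + 16 octets
      trailer = f.read(23)
      size = int.from_bytes(trailer[0:6], "big")
      entry_type = trailer[6]
      file_hash = trailer[7:23]
      start = end_offset - 23 - size
      f.seek(start)
      return entry_type, file_hash, f.read(size), start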

The revlog also has a header containing the revlog format version.

The index file contains:

o  File hash (16 octets)
o  Revlog base revision number (last full revision) (4 octets)
o  Revlog current revision offset (6 octets)
o  Revlog current revision size (6 octets)
o  Backupdb revision number (4 octets)
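
Each index record is thus a fixed 36 octets.  A sketch of the packing
(again, big-endian is my assumption):

  INDEX_RECORD_SIZE = 36

  def pack_index_record(file_hash, base_rev, offset, size, db_rev):
      """Pack one index record in the field order listed above."""
      assert len(file_hash) == 16
      return (file_hash
              + base_rev.to_bytes(4, "big")
              + offset.to_bytes(6, "big")
              + size.to_bytes(6, "big")
              + db_rev.to_bytes(4, "big"))

  def unpack_index_record(rec):
      return (rec[0:16],                          # file hash
              int.from_bytes(rec[16:20], "big"),  # base revision number
              int.from_bytes(rec[20:26], "big"),  # revision offset
              int.from_bytes(rec[26:32], "big"),  # revision size
              int.from_bytes(rec[32:36], "big"))  # backupdb revision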

Here I diverge a bit from the Mercurial revlog approach.  Under the
Mercurial approach, all that's needed to find the stream of deltas to
construct a particular revision is the revlog offset of the base
revision (the last full revision) and the hash of the target revision.
With those, it's a simple matter of grabbing the base revision data
and walking down the revlog applying successive deltas until the
desired version is constructed.

Unfortunately, that approach is inconsistent with the requirement to
allow streaming uploads of deltas as they are generated.  To walk down
the revlog that way, the size of each delta would have to precede the
delta, so as to indicate where the delta ends -- but the size isn't
known until the delta has been completely generated.  The alternative
is some sort of end-of-delta marker, which must then be escaped
wherever it appears in the delta data (requiring escaping the escapes
that appear, etc.).

Instead, I choose to rely more heavily on the index.  The index
provides the number of the base revision, which allows its index
record to be quickly found.  Then by walking down the index, I can
collect the set of revlog offsets I need to retrieve in order to
construct the file.
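
A sketch of that reconstruction, using the record layout assumed
above.  fetch_immutable and apply_delta are hypothetical stand-ins for
fetching a read cap's contents and applying a type-appropriate delta,
and I assume the offset/size fields address each entry's content
directly:

  def reconstruct(index_records, revlog, rev):
      """Rebuild revision `rev` from its base (full) revision plus the
      intervening deltas, locating every entry via the index alone."""
      base = index_records[rev][1]        # base revision number field
      data = None
      for r in range(base, rev + 1):
          _hash, _base, offset, size, _db_rev = index_records[r]
          revlog.seek(offset)
          content = revlog.read(size)
          if r == base:
              data = fetch_immutable(content)    # content is a read cap
          else:
              data = apply_delta(data, content)  # type dispatch omitted
      return data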

The index file also has a header containing the read cap of its revlog
and the index format version.

The backupdb revision history will be stored in exactly the same way.

Unfortunately, the revlogs and index files require some things from
the grid that, as far as I can tell, Tahoe doesn't yet support.
Specifically:

1.  Medium-size Distributed Mutable Files.  MDMF will allow for fast
    appends to both revlog and index, but it isn't implemented yet.

2.  Random access.  It shouldn't be necessary to download the whole of
    the index and revlog files before extracting the pieces actually
    required.


Workarounds
===========

To work around the current Tahoe limitations, I plan to turn the
single-file revlogs into multi-file directories.  The index will still
be a single file -- it should be small enough that completely
rewriting it each time isn't a problem -- but the revlog will become a
series of files with names of the form <version>.<type>.<hash>.  The
index fields that describe the revision offset and size will collapse
into a single version number field.

Similarly, the two write caps in the backupdb entry will collapse into
a single write cap for this dirnode.
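
Under that scheme, building and parsing the child names might look
like this (the separators and the hash encoding are placeholders, not
settled):

  def revision_name(version, entry_type, file_hash):
      """Build a child name of the form <version>.<type>.<hash>."""
      return "%d.%d.%s" % (version, entry_type, file_hash.hex())

  def parse_revision_name(name):
      version, entry_type, hexhash = name.split(".")
      return int(version), int(entry_type), bytes.fromhex(hexhash)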



I'll follow this up in a couple of days with another e-mail detailing
the process.  Hopefully by then I'll have some running demo code, too.

Thanks,


   Shawn.



[1] It won't actually be an instant-in-time snapshot, of course,
unless the system is inactive or other tools like ZFS or LVM are used
to ensure a static, unchanging image.  Since there's nothing I can
really do about that, I ignore it.

[2] I have plans to allow "chained" backupdbs for interesting and
useful reasons which I'll ignore here.  In that situation, a directory
entry in the backupdb actually will have a read or write cap stored in
it, pointing to another backupdb which holds the data about that
subdirectory.  In that case, nothing within that subdirectory will
be listed in the "parent" backupdb.

