[tahoe-dev] Backup plans, part 1
Shawn Willden
shawn-tahoe at willden.org
Mon Jan 26 08:42:42 PST 2009
Prototyping uncovered the need for a few changes:
On Thursday 22 January 2009 09:44:52 pm Shawn Willden wrote:
> The same general mechanisms used to handle versioning of individual
> files will be used to handle backupdb versioning.
Since both the previous and current versions of the backupdb are available
during backup, I won't use the rdiff delta method. Also, since I'm
processing both of them anyway, it's most convenient to use an ad-hoc, sort
of patch-like delta format that expresses additions, changes and removals of
whole records.
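Computing such a delta is just a single merge pass over the two sorted
record lists. A sketch of what I have in mind (the record layout here is
illustrative, not the real format):

    # Sketch: compute a record-based delta between two backupdb versions.
    # Both inputs are lists of (pathname, record) pairs, each sorted by
    # pathname.

    def compute_delta(old, new):
        """Yield ('add'|'change'|'remove', pathname, record) tuples."""
        i = j = 0
        while i < len(old) and j < len(new):
            old_path, old_rec = old[i]
            new_path, new_rec = new[j]
            if old_path == new_path:
                if old_rec != new_rec:
                    yield ('change', new_path, new_rec)
                i += 1
                j += 1
            elif old_path < new_path:
                yield ('remove', old_path, None)
                i += 1
            else:
                yield ('add', new_path, new_rec)
                j += 1
        for path, rec in old[i:]:       # anything left in old was removed
            yield ('remove', path, None)
        for path, rec in new[j:]:       # anything left in new was added
            yield ('add', path, rec)

Since both inputs are sorted, the output comes out sorted as well, for
free.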
The change records will naturally end up sorted by pathname, which helps in
another way -- each delta will be efficiently searchable, even when large.
This means that rather than downloading a full backupdb revision plus a
series of deltas, applying the deltas to bring the backupdb up to the
desired version (HEAD, usually), and then searching for a file, it will be
possible to search each full revision or delta separately, in sequence, to
find the latest information about a particular file. This avoids the need to
actually apply the changes, and it works particularly nicely if random access
to grid-based files is implemented.
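For example, a lookup could probe the newest delta first and fall back
toward the full revision, binary-searching each one. A sketch (record
layout again illustrative):

    # Sketch: find the latest record for a pathname without applying
    # deltas. 'snapshots' is a list of sorted record lists, newest
    # first: deltas first, then the full revision they are based on.
    # Records are (pathname, op, record); a full revision uses op 'add'.

    def find_record(records, pathname):
        """Binary-search one sorted record list; return a hit or None."""
        lo, hi = 0, len(records)
        while lo < hi:
            mid = (lo + hi) // 2
            if records[mid][0] < pathname:
                lo = mid + 1
            else:
                hi = mid
        if lo < len(records) and records[lo][0] == pathname:
            return records[lo]
        return None

    def lookup(pathname, snapshots):
        for records in snapshots:       # newest delta -> full revision
            hit = find_record(records, pathname)
            if hit is not None:
                path, op, record = hit
                if op == 'remove':
                    return None         # deleted as of this delta
                return record           # most recent add/change wins
        return None                     # never backed up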
> The backupdb will be in JSON format and contain the following for each
> file in the backed-up set:
>
> o Complete pathname
> o File hash (base 64, 16 octets)
> o Platform-dependent metadata, not including filename/path
> o Write cap of the backup file revlog (see below)
> o Write cap of the backup file index (see below)
> o Librsync signature
It turns out that this creates files that are simply too big to work with
conveniently for backups of large file systems (and I've only got a little
over 2TB!). In particular, it's nice to be able to mmap() the backupdb, and
that's not generally possible with large files -- 32-bit platforms have a
hard limit at 2 GiB, and few of them handle mmap'ing of large files well
even below that.
To make the backupdb file smaller, I'm going to separate librsync signatures
into other files. Without those, backupdb records are < 100 bytes each (on
Linux, at least -- we'll see what happens when we throw ACLs and resource
forks into the metadata).
The signatures will be placed in separate files, up to 64**n (i.e.,
2**(6*n)) of them, selected based on the first n characters of the base-64
file hash (see the sketch after the list below). n will default to 1, but
can be increased if needed (no idea on the heuristics of that, but I
suspect we won't care for a long time). The signature files will contain:
o File hash (base 64, 16 octets)
o Librsync signature (base 64, variable length)
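Selecting the right signature file is then just a prefix lookup on the
hash. A sketch (the 'sigs-<prefix>' naming is made up for illustration;
note that standard base-64 uses '/', so something like the URL-safe
alphabet would be needed for filenames):

    import base64
    import hashlib

    # Sketch: pick the signature file ("bucket") for a given file hash.
    # Uses the URL-safe base-64 alphabet so prefixes are filename-safe.

    N = 1  # hash characters used to select a bucket (64**N buckets)

    def signature_file_for(file_hash_b64, n=N):
        return 'sigs-%s' % file_hash_b64[:n]

    # A 16-octet hash encodes to 22 base-64 characters (padding stripped):
    digest = hashlib.sha256(b'example signature bytes').digest()[:16]
    file_hash = base64.urlsafe_b64encode(digest).rstrip(b'=').decode('ascii')
    print(signature_file_for(file_hash))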
Versioning of the signature files will be done with record-based deltas.
A related issue that cropped up is that the backup code ends up hashing each
file twice: once with the librsync rolling hash, and then again with SHA-256
to generate the file hash. To reduce this duplication of effort, the "file
hash" will actually be the SHA-256 hash of the librsync signature.
> The most common use of the backupdb is to generate a new backupdb (and
> incidentally back up some files). To make that efficient, I'll take care
> to ensure that although the entries don't have to be in any particular
> order, all entries with a common prefix (aka common parent directory)
> are contiguous.
I missed something obvious here: prefix-contiguity by itself still leaves
particular entries hard to find, and it makes streaming comparison difficult
to implement (potentially requiring an arbitrary amount of memory). Sorting
the entries provides prefix-contiguity and allows efficient lookups and merge
comparisons. Also, it's very cheap to do: just do an in-order traversal of
the filesystem tree, as sketched below.
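Concretely, a generator along these lines would do it. (Note it yields a
component-wise order, which differs from a flat string sort on pathnames
but is deterministic and prefix-contiguous, which is all the merge
comparison needs.)

    import os

    # Sketch: emit pathnames in a deterministic, prefix-contiguous order
    # by sorting each directory's entries as we descend. No global sort
    # or unbounded buffering is needed.

    def walk_sorted(root):
        for name in sorted(os.listdir(root)):
            path = os.path.join(root, name)
            yield path
            if os.path.isdir(path) and not os.path.islink(path):
                for sub in walk_sorted(path):
                    yield sub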
Shawn.