[tahoe-dev] Thinking about building a P2P backup system
Brian Warner
warner-tahoe at allmydata.com
Fri Jan 9 12:33:39 PST 2009
On Fri, 9 Jan 2009 08:15:56 -0700
Shawn Willden <shawn-tahoe at willden.org> wrote:
> The problem with using reverse deltas is they're even more
> bandwidth-intensive than full versions. If you have plaintext on both ends
Yeah, this notion of "both ends" is tricky: you have to download the file to
get the plaintext, which requires N/k (e.g. 1/3rd) the bandwidth of uploading
the whole thing. Or you can keep a copy locally (doubling your local storage
needs) so that you have something to diff against.
The 'revlog' format that I'd like to use for our eventual "LDMF" (Large
Distributed Mutable Files) design uses forward deltas and an append-only data
structure, but writes out a full copy of the file whenever the amount of data
you'd need to read to construct the latest version gets bigger than twice the
size of the file. I think the Mercurial folks picked this 2x thing as a
reasonable tradeoff between speed and storage.
For our mutable files, this also gives us versioning: we create a read-cap
that contains both a pointer to the file overall (i.e. the revlog) and a
secure identifier for the particular revision that we care about. The
question of how far back to keep the deltas becomes a matter of retention
policy and GC, with the obvious limitation that the oldest thing in the
revlog should be a full copy.
The format I've been sketching out would allow for efficient insertions and
deletions of arbitrary byte spans, which would be pretty wild. The standard
POSIX file io API doesn't give any way to express this, but an application
that's speaking directly to Tahoe could take advantage of it. Imagine being
able to insert a byte in the beginning of your file in constant time.
> But if the remote store is encrypted, the delta can't be applied. All you
> can do is store it as a forward delta. To get reverse deltas, you have to
> upload a full copy of the new version, plus the delta.
>
> That's the primary reason why I suggested maybe Tahoe should optionally
> allow not encrypting the files, to facilitate reverse delta versioning.
> Zooko doesn't want to go there, and I respect his reasoning.
The limitation is actually a combination of encryption and erasure-coding
(which, unless you're using the degenerate k=1 replication case, involves
splitting the data up among multiple servers). Both make it hard or
impossible to apply rsync-like algorithms on the server side.
> If librsync deltas are reversible (need to check)
Incidentally, for deltas to be reversible, they have to contain roughly twice
the information as an irreversible delta. Darcs does this (because the
patches may need to be commuted arbitrarily), the result is that deleting a
file takes as much patch space as adding one.
cheers,
-Brian
More information about the tahoe-dev
mailing list