[tahoe-dev] Thinking about building a P2P backup system

Shawn Willden shawn-tahoe at willden.org
Wed Jan 7 12:37:30 PST 2009


Howdy,

Per the subject line, I've decided to build a backup system.  While
doing some research on what's already out there, a friend pointed me to
your project, and I have to say it looks really cool.  Your approach has
a lot in common with what I've been thinking about, although it's not
identical.  In most ways, yours is much more sophisticated.

Now I'm trying to decide whether I should continue with my previous
plans to build a solution more-or-less from scratch (actually on the
bones of DIBS), enhance your work to add the features I want (ideally
with your acceptance and support), or just steal some of your code.

Based on what I understand from reading your architecture documentation,
there are some aspects of Tahoe that I really like.  The architecture is
exceptionally clean, with nice sharp lines between grid, filesystem, and
backup solution.  The grid itself seems like a very nice design.  I
particularly like the method of consistently permuting the peer list on
a per-file basis and then walking as far down the permuted list as
necessary to find peers to hold the shares.  That has several really
nice properties.  The decision to separate file metadata from content is
very nice as well, and I'm sure there's lots of other excellent stuff
under the hood.
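
(To make sure I've understood the permutation trick, here's roughly how
I picture it in Python -- the names are mine, not Tahoe's:

    import hashlib

    def permuted_peers(storage_index, peer_ids):
        # storage_index and each peer id are bytes.  Every client that
        # knows the same peer list computes the same ordering for a
        # given file, so a reader walks the list in the same order the
        # writer did and finds the shares near the front.
        return sorted(peer_ids, key=lambda peer:
                      hashlib.sha256(peer + storage_index).digest())

A writer walks down this list until enough peers have accepted shares;
a reader walks the same list and should hit the shares early.)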

However, there are a few things I want in a P2P backup solution that
you're not (AFAICT) working towards.  Let me describe them and invite
your comments on how much you'd welcome such features.

First, I want to be able to use the "backup" system to also accomplish a
sort of targeted filesharing for a subset of the backed-up files.  The
exemplar use case is my digital photo album.  I'd like to back it up to
others' computers to keep it safe, but I'd also like some specific
people (e.g. my mom) to have high-speed access to it.  In the Tahoe
model I could push the image directory onto the grid and give her
read-only access to it, but for a relatively small network of home
machines (most of which have sucky upstream bandwidth), that would not
achieve the "high-speed" goal.

Further, for this application, the RS coding, encryption, hash trees,
etc. are all pretty unnecessary (probably -- more below), and avoiding
them allows easy use of rsync or similar protocols to efficiently update
the backup when changes occur.  Well, rsync doesn't help with efficient
updating of photos, since they're already compressed and rarely modified
in place, but with many other file types it would.

One approach, then, would be to use the Tahoe peering infrastructure and
authentication infrastructure (assuming Tahoe has such; I haven't read
everything yet) to identify and authenticate the targeted peers, and
then ignore the grid, etc., and use a more traditional remote backup
solution to transfer the shared files.  With that approach the sharing
solution could be completely separated from the rest of Tahoe.  It would
just use the peer database.

However, I see value in using the grid as well, in case there aren't
enough targeted peers to provide adequate reliability.  So perhaps
closer integration would make sense to allow the client to automatically
determine how many shares to push onto the grid in addition to the
targeted sharing (does Tahoe easily accommodate per-file values for k
and n?  Or does it actually make sense to alter them to account for the
existence of full backups?  I haven't thought this through).
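
To make that concrete, here's the sort of back-of-the-envelope
calculation I have in mind, assuming independent failures (which is
generous) and made-up names:

    from math import comb

    def p_recoverable(k, n, p_share, full_copies, p_peer):
        # P(at least k of the n grid shares are reachable)
        p_grid = sum(comb(n, i) * p_share**i * (1 - p_share)**(n - i)
                     for i in range(k, n + 1))
        # The file is recoverable if the grid suffices OR any targeted
        # peer still holds its full copy.
        return 1 - (1 - p_grid) * (1 - p_peer)**full_copies

    # Two targeted full copies at 90% uptime plus a 3-of-6 grid spread
    # already gives better than 99.99% availability:
    print(p_recoverable(3, 6, 0.9, 2, 0.9))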

Next, I want my backup system to be as close to configurationless as
possible in its default installation.  Ideally, the default configuration
should require nothing more than running an executable and typing in an
introducer URL and perhaps an e-mail address and system nickname.  That
means that by default, it should perform a complete system backup.

But there's not much sense in actually backing up common files to the
network.  For example, Windows system files will be available from
pretty much every peer on the network.  I'd like to have a mechanism to
notice such commonly-available files and avoid storing them at all.

I realize that any mechanism that allows such common-file discovery
entails a minor privacy risk.  Essentially peers will have to publish
hashes of their files, either directly or implicitly through what they
attempt to store, and that will allow other peers to know what files
they possess.  That privacy risk could even escalate to a security risk
if it divulges the existence of software with known remote-exploit
vulnerabilities.  However, I think that risk is acceptable in a small,
closed backup network.

As I understand the Tahoe architecture, something very close to this
falls out almost for free.  If the file encryption key is derived from
the file contents and the server index generation is deterministic (I'm
not sure what you mean by a "tagged hash"),
then the second and later clients attempting to store a given file will
find it already stored.  That's not quite as good as not storing it at
all, but it's very close.
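
In other words, I'm imagining something like the following (the tag
strings are invented, and I'm guessing that a "tagged hash" just means
a hash with a fixed purpose-specific prefix):

    import hashlib

    def convergent_key(contents):
        # Key derived from the plaintext itself, so identical files
        # encrypt to identical ciphertexts on every client.
        return hashlib.sha256(b"convergence:" + contents).digest()

    def storage_index(key):
        # Index derived from the key under a different tag, so servers
        # can recognize duplicates without ever learning the key.
        return hashlib.sha256(b"storage-index:" + key).digest()[:16]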

To improve this, storage servers could index their local files and note
when a request to store a share for a file they possess arrives.  Then,
instead of agreeing to store the share, they should respond with a
message indicating that they already have it (or they could even
indicate that they have the whole file, which the client could take as
an indication that it doesn't need to store as many shares).
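
Sketching the server side of that (hypothetical names and responses,
obviously):

    def handle_store_request(index, my_shares, my_local_files):
        # my_local_files indexes files this server happens to possess
        # itself (e.g. its own Windows system files).
        if index in my_local_files:
            # The client can count a whole local file as worth more
            # than one share when deciding how many more to push.
            return "have-entire-file"
        if index in my_shares:
            return "have-share"
        return "send-share"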

Next, I want incremental backups and versioning, and I want them to be
done bandwidth-efficiently.  My plan is to use rdiff (actually librsync)
and a little local storage on the client to store rdiff signatures of
every backed-up file.  Then, when the client notices that a file has
changed, it can compute an rdiff delta and push that to the backup
network, more or less as a new file to be stored, though obviously it's
necessary to maintain the linkage between the series of forward deltas
and the original full backup of the file.  At some point it may make
sense to retrieve the full file and all deltas and regenerate them as a
current backup plus a set of reverse deltas, so in general a versioned
file will consist of a series of forward deltas, a full copy, and a
series of reverse deltas.
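
Concretely, I'd drive the rdiff tool that ships with librsync (or call
librsync directly), something like:

    import subprocess

    def save_signature(current, sig):
        # Kept in local client storage, so the old version never has
        # to be fetched back just to compute a delta.
        subprocess.check_call(["rdiff", "signature", current, sig])

    def make_delta(sig, changed, delta):
        # The delta, not the whole file, is what gets pushed to the
        # backup network as the next version.
        subprocess.check_call(["rdiff", "delta", sig, changed, delta])

    def apply_delta(base, delta, restored):
        # Restore: replay forward deltas from the full copy (or
        # reverse deltas back from the current one).
        subprocess.check_call(["rdiff", "patch", base, delta, restored])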

I'm not sure what you are considering along those lines; your thinking
may be considerably more sophisticated than mine.

Finally, I want ultimately to be able to do a full-system restore.  I
don't think this would affect how Tahoe works in the slightest.  I'm
envisioning creating a Linux LiveCD that boots, prompts for an
introducer URL and account and authentication information, then formats
the drive (a partition thereof, anyway) and retrieves and installs all
of the files from the grid to completely reproduce the functional
system.  Ultimately, it may even attempt to deal with HAL issues for
Windows machines, but the primary goal would be a 100% functional
restore to the original hardware.

Anyway, those are the major areas where I see things need to be added to
Tahoe to achieve my goals.

Comments?  Would those sorts of features be welcome additions to your
work?  Would you prefer that I just go away and do my own thing?

Thanks,

	Shawn.


