[tahoe-dev] Thinking about building a P2P backup system

Shawn Willden shawn-tahoe at willden.org
Wed Jan 7 16:07:29 PST 2009


On Wednesday 07 January 2009 03:09:21 pm zooko wrote:
> So, if you're planning to contribute patches, bug
> reports, documentation, etc., then I'm delighted!

Well, assuming I can get my head around the codebase sufficiently in the 
snatches of time I have available, I absolutely want to contribute all of the 
above.

> Have you tried it?  It might be just fine for sharing photos.  I use
> Tahoe to share photos, but I use the public test grid instead of a
> private grid, and so I'm using many servers located in a co-lo plus a
> handful of random servers operated by Tahoe hackers or curious
> users.  It seems to work fine.

I imagine you have a pretty fast network connection yourself, too.  Not 
to put too much emphasis on my particular case, but I shoot with a moderately 
high-end DSLR so my image files tend to be large, and most of my family has 
low-end DSL connections.  At 1 Mbps, it takes at least 40 seconds to download 
a 5 MB image file, which would be painfully slow for browsing through 
pictures -- and that's if the pipe can be filled.  If the images are coming 
from a handful of 256 kbps connections where Tahoe is bandwidth-capped to use 
no more than 100 kbps in order to keep some bandwidth available for other 
stuff (does Tahoe have bandwidth limiting?  If not, it probably needs it), 
then the aggregate data stream may be no more than 400-500 kbps.
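
To make the arithmetic above concrete, here is a minimal Python sketch of the 
back-of-the-envelope estimate.  The function names are made up for 
illustration, it assumes decimal units (1 MB = 8,000,000 bits), and it ignores 
protocol overhead and erasure-coding expansion, so real transfers would be 
somewhat slower.

```python
def download_seconds(size_megabytes, rate_kbps):
    """Best-case time to pull a file through a pipe of the given rate."""
    bits = size_megabytes * 8_000_000      # assumption: decimal megabytes
    return bits / (rate_kbps * 1000)


def aggregate_kbps(peers, cap_kbps, uplink_kbps=256):
    """Combined upload rate of `peers` home connections.

    Each peer contributes the smaller of its uplink and its bandwidth cap.
    """
    return peers * min(cap_kbps, uplink_kbps)


if __name__ == "__main__":
    print(download_seconds(5, 1000))       # 40.0 (the 1 Mbps case)
    for peers in (4, 5):
        rate = aggregate_kbps(peers, cap_kbps=100)
        print(peers, rate, download_seconds(5, rate))
```

With four or five capped peers the same 5 MB image takes 80 to 100 seconds in 
the best case, which is the concern for photo browsing.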

And let's not even talk about HD video.

> As far as I know, we are doing adequately well on that goal.  A few
> times people have asked to have the option to turn off the
> encryption, and in each case I asked them to please measure the
> performance and tell me if the encryption is causing a performance
> problem or another kind of usability problem.

I'd be shocked if encryption were a performance problem.  Crypto stuff has 
been my day job for over a decade, so I'm well aware of how blisteringly fast 
AES is, and RSA isn't too bad as long as you're not doing too much of it 
(especially if you're doing mostly public key ops, not private key).  I'd 
expect a lot bigger performance issue from the erasure coding (BTW: ever 
considered Tornado coding instead of Reed-Solomon?).

However, my real concern isn't CPU usage, particularly since the heavy lifting 
happens during storage, not retrieval.  I'm thinking about bandwidth, both 
being able to rsync changes -- important because most home users' net 
connections are very asymmetric -- and to avoid hitting the network at all in 
the "Mom browsing my photos" case.

I'm talking from theory here, not measurements, but I think I can predict 
pretty well what the performance of the sort of network I'm thinking about 
would be.

> I want Tahoe to offer the user (human or computer) more control and
> more knowledge about which shares go to which storage server.

Okay, so here's a possibility.  If I can ensure that K shares are stored on my 
mom's machine, and if Tahoe is clever enough to use those shares when she's 
browsing those files (doesn't seem difficult), rather than pulling from the 
network, then perhaps browsing my photos will be fast enough.  The RS 
reconstruction and the decryption shouldn't be a big deal, and neither should 
applying a short sequence of forward deltas.  Some performance testing is in 
order.
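
A rough sketch of the "prefer local shares" idea, in Python.  This is not 
Tahoe's actual download code or API; `pick_shares`, `needs_network` and the 
`share_locations` mapping are hypothetical names used only to show the 
selection logic: take locally held shares first, and go to the network only 
if fewer than K are available locally.

```python
def pick_shares(share_locations, k, local_server):
    """Choose k shares to fetch, preferring ones held by local_server.

    share_locations maps share number -> set of server ids holding it
    (a hypothetical structure, not a Tahoe data type).
    Returns a list of (share_number, server_id) pairs.
    """
    local = sorted(n for n, servers in share_locations.items()
                   if local_server in servers)
    local_set = set(local)
    remote = sorted(n for n, servers in share_locations.items()
                    if servers and n not in local_set)

    chosen = (local + remote)[:k]
    if len(chosen) < k:
        raise ValueError("fewer than k shares are reachable")

    plan = []
    for n in chosen:
        if n in local_set:
            plan.append((n, local_server))
        else:
            # Arbitrary but deterministic choice among remote holders.
            plan.append((n, sorted(share_locations[n])[0]))
    return plan


def needs_network(plan, local_server):
    """True if any chosen share must come from another machine."""
    return any(server != local_server for _, server in plan)
```

If K shares of every photo are placed on the browsing machine, 
`needs_network` is false for every file and the only remaining cost is 
decoding and decryption.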

> Yes, that's what it currently does (if you chose to share your "added
> convergence secret" with all clients on the backup network).

Cool.  That's probably good enough that the added optimization of avoiding the 
storage of common files completely isn't worth the effort.
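
For readers unfamiliar with the idea, here is a minimal sketch of why a shared 
convergence secret lets identical files converge on the same stored data.  It 
shows the general convergent-encryption technique using Python's standard 
library only; the key derivation below is an illustrative assumption, not 
Tahoe's actual derivation, and `convergent_key` is a made-up name.

```python
import hashlib
import hmac


def convergent_key(plaintext: bytes, convergence_secret: bytes) -> bytes:
    """Derive an encryption key from the file content and a shared secret.

    Assumption for illustration: HMAC-SHA256 over the content hash,
    truncated to 16 bytes (an AES-128-sized key).
    """
    content_hash = hashlib.sha256(plaintext).digest()
    mac = hmac.new(convergence_secret, content_hash, hashlib.sha256)
    return mac.digest()[:16]


if __name__ == "__main__":
    photo = b"the same bytes on two different machines"
    shared = b"family-grid-secret"

    # Two clients with the same secret derive the same key, so the
    # encrypted file is identical and only needs to be stored once.
    print(convergent_key(photo, shared) == convergent_key(photo, shared))

    # A client with a different secret derives a different key, so its
    # copy does not converge with (or reveal anything about) the others.
    print(convergent_key(photo, shared) == convergent_key(photo, b"other"))
```

The trade-off is that anyone holding the same secret can confirm whether a 
given file is already stored, which is why the secret is shared only within 
the backup network.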

> > To improve this, storage servers could index their local files and
> > note when a request to store a share for a file they possess arrives.

> By the way, the GNUnet project offers that feature, so you should
> check them out.

Thanks, I'll take a look.

> > Next, I want incremental backups and versioning, and I want them to
> > be done bandwidth-efficiently.
>
> Have you seen the duplicity plugin that Francois Deppierraz posted?
> Maybe that does exactly what you want.  :-)

I'll look, but if it works at the tarball level like duplicity, then no, it's 
not what I want.

> I would prefer if you used Tahoe and contribute patches, and if it
> turns out that there is some behavior that you really want and that
> seems too troublesome to me to risk including it in my codebase, then
> I would prefer that you copy the Tahoe darcs repository and develop
> your own branch.

Okay.  I grabbed the darcs repo (dang is that sloowww! Anybody for switching 
to git? ;-)) and I'll start from there.

I haven't had a chance to look through the code much yet.  Is there an 
overview document somewhere that covers the structure?

Thanks,

	Shawn.

