[tahoe-dev] Thinking about building a P2P backup system
Shawn Willden
shawn-tahoe at willden.org
Wed Jan 7 16:07:29 PST 2009
On Wednesday 07 January 2009 03:09:21 pm zooko wrote:
> So, if you're planning to contribute patches, bug
> reports, documentation, etc., then I'm delighted!
Well, assuming I can get my head around the codebase sufficiently in the
snatches of time I have available, I absolutely want to contribute all of the
above.
> Have you tried it? It might be just fine for sharing photos. I use
> Tahoe to share photos, but I use the public test grid instead of a
> private grid, and so I'm using many servers located in a co-lo plus a
> handful of random servers operated by Tahoe hackers or curious
> users. It seems to work fine.
I imagine you have a pretty fast network connection yourself, too. Not
to put too much emphasis on my particular case, but I shoot with a moderately
high-end DSLR so my image files tend to be large, and most of my family has
low-end DSL connections. At 1 Mbps, it takes at least 40 seconds to download
a 5 MB image file, which would be painfully slow for browsing through
pictures -- and that's if the pipe can be filled. If the images are coming
from a handful of 256 kbps connections where Tahoe is bandwidth-capped to use
no more than 100 kbps in order to keep some bandwidth available for other
stuff (does Tahoe have bandwidth limiting? If not, it probably needs it),
then the aggregate data stream may be no more than 400-500 kbps.
And let's not even talk about HD video.
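To put rough numbers on the bandwidth argument above, here is a back-of-the-envelope sketch. It assumes decimal units (1 MB = 8000 kbit), a fully used pipe, and that "a handful" means four or five peers each capped at 100 kbps; the function names are made up for illustration and have nothing to do with Tahoe's code.

```python
def download_seconds(size_megabytes, link_kbps):
    """Seconds to transfer a file, assuming 1 MB = 8000 kbit and a full pipe."""
    return size_megabytes * 8000 / link_kbps


def aggregate_kbps(peer_count, cap_kbps):
    """Combined rate when each peer uploads at its capped rate (assumption)."""
    return peer_count * cap_kbps


image_mb = 5

# A single 1 Mbps DSL line: 40 seconds for one 5 MB image.
print(download_seconds(image_mb, 1000))

# Four or five peers capped at 100 kbps each: 400-500 kbps in aggregate,
# which works out to 100 and 80 seconds per image respectively.
for peers in (4, 5):
    rate = aggregate_kbps(peers, 100)
    print(peers, rate, download_seconds(image_mb, rate))
```

Either way, each image takes on the order of a minute or more, which is why pulling shares over the network does not look workable for casual photo browsing.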
> As far as I know, we are doing adequately well on that goal. A few
> times people have asked to have the option to turn off the
> encryption, and in each case I asked them to please measure the
> performance and tell me if the encryption is causing a performance
> problem or another kind of usability problem.
I'd be shocked if encryption were a performance problem. Crypto stuff has
been my day job for over a decade, so I'm well aware of how blisteringly fast
AES is, and RSA isn't too bad as long as you're not doing too much of it
(especially if you're doing mostly public key ops, not private key). I'd
expect a lot bigger performance issue from the erasure coding (BTW: ever
considered Tornado coding instead of Reed-Solomon?).
However, my real concern isn't CPU usage, particularly since the heavy lifting
happens during storage, not retrieval. I'm thinking about bandwidth, both
being able to rsync changes -- important because most home users' net
connections are very asymmetric -- and to avoid hitting the network at all in
the "Mom browsing my photos" case.
I'm talking from theory here, not measurements, but I think I can predict
pretty well what the performance of the sort of network I'm thinking about
would be.
> I want Tahoe to offer the user (human or computer) more control and
> more knowledge about which shares go to which storage server.
Okay, so here's a possibility. If I can ensure that K shares are stored on my
mom's machine, and if Tahoe is clever enough to use those shares when she's
browsing those files (doesn't seem difficult), rather than pulling from the
network, then perhaps browsing my photos will be fast enough. The RS
reconstruction and the decryption shouldn't be a big deal, and neither should
applying a short sequence of forward deltas. Some performance testing is in
order.
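The "use the local shares first" idea could look something like the sketch below. This is purely illustrative and is not how Tahoe's downloader is actually structured: `pick_shares`, the share map layout, and the server names are all hypothetical. The only thing it assumes about the erasure coding is that any K distinct shares are enough to reconstruct the file.

```python
def pick_shares(share_locations, k, local_server):
    """Choose k distinct shares, preferring ones held by local_server.

    share_locations maps a share number to the set of server ids holding
    that share (a hypothetical layout). Returns a list of
    (share_number, server_id) pairs.
    """
    local = []
    remote = []
    for share_num, servers in sorted(share_locations.items()):
        if local_server in servers:
            local.append((share_num, local_server))
        elif servers:
            # Any remote holder will do for this sketch; pick one
            # deterministically.
            remote.append((share_num, min(servers)))
    chosen = (local + remote)[:k]
    if len(chosen) < k:
        raise ValueError("fewer than k shares are reachable")
    return chosen


# With K = 3 shares already on Mom's machine, nothing touches the network.
locations = {
    0: {"mom"},
    1: {"mom", "colo1"},
    2: {"colo1"},
    3: {"mom"},
    4: {"colo2"},
}
for share_num, server in pick_shares(locations, 3, "mom"):
    print(share_num, server)
```

If fewer than K shares are local, the same function falls back to remote holders for the remainder, so the worst case is no slower than today.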
> Yes, that's what it currently does (if you chose to share your "added
> convergence secret" with all clients on the backup network).
Cool. That's probably good enough that the added optimization of avoiding the
storage of common files completely isn't worth the effort.
> > To improve this, storage servers could index their local files and
> > note when a request to store a share for a file they possess arrives.
> By the way, the GNUnet project offers that feature, so you should
> check them out.
Thanks, I'll take a look.
> > Next, I want incremental backups and versioning, and I want them to
> > be done bandwidth-efficiently.
> Have you seen the duplicity plugin that Francois Deppierraz posted?
> Maybe that does exactly what you want. :-)
I'll look, but if it works at the tarball level like duplicity, then no, it's
not what I want.
> I would prefer if you used Tahoe and contributed patches, and if it
> turns out that there is some behavior that you really want and that
> seems too troublesome to me to risk including it in my codebase, then
> I would prefer that you copy the Tahoe darcs repository and develop
> your own branch.
Okay. I grabbed the darcs repo (dang is that sloowww! Anybody for switching
to git? ;-)) and I'll start from there.
I haven't had a chance to look through the code much yet. Is there an
overview document somewhere that covers the structure?