[tahoe-dev] Thinking about building a P2P backup system

zooko zooko at zooko.com
Thu Jan 8 08:17:40 PST 2009


On Jan 7, 2009, at 17:07, Shawn Willden wrote:

> If the images are coming from a handful of 256 kbps connections  
> where Tahoe is bandwidth-capped to use no more than 100 kbps in  
> order to keep some bandwidth available for other stuff (does Tahoe  
> have bandwidth limiting?  If not, it probably needs it), then the  
> aggregate data stream may be no more than 400-500 kbps.

No, it doesn't, and yes, it probably does need it:

http://allmydata.org/trac/tahoe/ticket/224 # bandwidth throttling
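
There isn't one yet, but here is a minimal sketch, in Python, of the  
kind of token-bucket throttle that ticket calls for -- hypothetical  
code, nothing like this exists in Tahoe today:

    import time

    class TokenBucket:
        """Hypothetical rate limiter: caps average throughput at
        `rate` bytes per second while allowing bursts of up to
        `capacity` bytes."""

        def __init__(self, rate, capacity):
            self.rate = rate            # refill rate, bytes per second
            self.capacity = capacity   # maximum burst size, bytes
            self.tokens = capacity
            self.last = time.time()

        def consume(self, nbytes):
            """Block until nbytes worth of tokens are available,
            then spend them."""
            while True:
                now = time.time()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= nbytes:
                    self.tokens -= nbytes
                    return
                time.sleep((nbytes - self.tokens) / self.rate)

    # 100 kbps is 12500 bytes/second, as in your example; call
    # consume() before sending each chunk.
    bucket = TokenBucket(rate=12500, capacity=12500)
    bucket.consume(4096)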

What if the data is coming from ten different connections, each of  
which runs at about 100 kbps -- an aggregate of roughly 1 Mbps?  Do  
you think that might be sufficiently high bandwidth for photo sharing?

> And let's not even talk about HD video.

Oh but I like HD video!

http://testgrid.allmydata.org:3567/uri/URI%3ADIR2%3Adjrdkfawoqihigoett4g6auz6a%3Ajx5mplfpwexnoqff7y5e4zjus4lidm76dcuarpct7cckorh2dpgq/video/dirac_codec-use_VLC_to_play

> I'd expect a lot bigger performance issue from the erasure coding  
> (BTW: ever considered Tornado coding instead of Reed-Solomon?).

Yes!  In fact, allmydata.com once had a commercial license from  
Digital Fountain to use their patented erasure codes and their  
proprietary implementation.  That erasure code was used in the  
previous generation of allmydata.com's commercial product.  When we  
decided to open source the next generation, which became Tahoe and  
Allmydata v3.0, we also investigated whether we could make do with an  
unpatented, open source erasure code, and discovered that Rizzo's  
classic Reed-Solomon implementation worked great -- which became zfec.

Also, zfec and Tahoe cheat as much as possible -- the best  
optimization is always to cheat and not do the work at all.  The  
first K shares that zfec creates are actually just the content of  
the file, i.e., what you would get from running the unix "split"  
command.  Tahoe tries to download those first K shares (what we call  
"primary shares") first, and the more of them it gets, the less math  
it has to do to reconstruct the file.  If it gets all K of the  
primary shares, it does no math at all when downloading -- it just  
"unsplits".  :-)
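
Here is a toy illustration of that trick (just a sketch, not Tahoe's  
actual code):

    def split_into_primary_shares(data, k):
        """Like the unix "split" command: cut data into k consecutive
        pieces.  These are the k "primary shares"; zfec computes the
        remaining M-K "secondary shares" as erasure-code math over
        them."""
        size = -(-len(data) // k)   # ceiling division
        return [data[i * size:(i + 1) * size] for i in range(k)]

    def unsplit(primary_shares):
        """With all k primary shares in hand, reconstruction is just
        concatenation -- no erasure-code math at all."""
        return b"".join(primary_shares)

    data = b"some file contents to be erasure-coded"
    assert unsplit(split_into_primary_shares(data, k=3)) == data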

By the way, there is a paper coming out in FAST '09 about the  
performance of open source erasure code implementations; it measures  
zfec among others.

> Okay, so here's a possibility.  If I can ensure that K shares are  
> stored on my mom's machine, and if Tahoe is clever enough to use  
> those shares when she's browsing those files (doesn't seem  
> difficult), rather than pulling from the network, then perhaps  
> browsing my photos will be fast enough.  The RS reconstruction and  
> the decryption shouldn't be a big deal, and neither should applying  
> a short sequence of forward deltas.  Some performance testing is in  
> order.

Yes!  Performance testing!  We have some automated performance  
measurements here:

http://allmydata.org/trac/tahoe/wiki/Performance

It is telling me that my patches from last week, which refactored  
immutable files and made them more robust against bizarre patterns  
of corruption, have slowed down upload and download speeds.  :-(
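
Regarding the RS reconstruction cost: if you want a quick local  
number on your own machine, a micro-benchmark along these lines is a  
starting point.  It assumes zfec's Python API (zfec.Encoder /  
zfec.Decoder and their encode/decode signatures), so double-check  
against the zfec docs before trusting it:

    import os, time, zfec

    k, m = 3, 10
    data = os.urandom(3 * 10**6)          # ~3 MB of random test data
    size = len(data) // k
    primary = [data[i * size:(i + 1) * size] for i in range(k)]

    # Ask the encoder for just the secondary shares (ids k..m-1).
    secondary = zfec.Encoder(k, m).encode(primary, list(range(k, m)))

    # Worst case: reconstruct from k secondary shares (maximum math).
    start = time.time()
    recovered = zfec.Decoder(k, m).decode(secondary[:k],
                                          list(range(k, 2 * k)))
    elapsed = time.time() - start
    assert b"".join(recovered) == data
    print("decoded %d bytes in %.3f seconds" % (len(data), elapsed))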

By the way, one unfortunate thing about the way we are using rrdtool  
to keep those performance graphs is that we lose any history more  
than one year old.  :-(

> Cool.  That's probably good enough that the added optimization of  
> avoiding the storage of common files completely isn't worth the  
> effort.

Have you seen this thread?  It might be a good project for you, as it  
is self-contained, requires minimal changes to the tahoe core itself,  
and is closely related to your idea about good backup:

http://allmydata.org/pipermail/tahoe-dev/2008-September/000809.html

> Okay.  I grabbed the darcs repo (dang is that sloowww! Anybody for  
> switching to git? ;-)) and I'll start from there.

I updated the instructions on  
http://allmydata.org/trac/tahoe/wiki/Dev to suggest using darcs-v2  
and to warn that using darcs-v1 will take tens of minutes for the  
initial get.  I would entertain the idea of switching to git -- even  
though I love darcs, contribute to it, and use it all the time --  
solely in order to be more friendly toward potential contributors  
who love git.

> I haven't had a chance to look through the code much yet.  Is there  
> an overview document somewhere that covers the structure?

Start here:

http://allmydata.org/trac/tahoe/wiki/Doc

Then update the wiki and/or submit patches making it easier for the  
next person who starts there to find what they are looking for.  :-)

Regards,

Zooko
---
Tahoe, the Least-Authority Filesystem -- http://allmydata.org
store your data: $10/month -- http://allmydata.com/?tracking=zsig

