[tahoe-dev] Thinking about building a P2P backup system
zooko
zooko at zooko.com
Thu Jan 8 08:17:40 PST 2009
On Jan 7, 2009, at 17:07, Shawn Willden wrote:
> If the images are coming from a handful of 256 kbps connections
> where Tahoe is bandwidth-capped to use no more than 100 kbps in
> order to keep some bandwidth available for other stuff (does Tahoe
> have bandwidth limiting? If not, it probably needs it), then the
> aggregate data stream may be no more than 400-500 kbps.
No, it doesn't, and yes, it probably does:
http://allmydata.org/trac/tahoe/ticket/224 # bandwidth throttling
What if the data is coming from ten different connections, each of
which runs at about 100 kbps? Do you think that might be
sufficient bandwidth for photo sharing?
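Since there's no throttle in Tahoe yet (see ticket #224 above), here is a
toy sketch of the usual shape such a limiter takes -- a token bucket. The
class and method names are made up for illustration; this is not Tahoe code:

```python
import time

class TokenBucket:
    """Hypothetical token-bucket limiter, not anything in Tahoe.

    Caps sustained throughput at `rate` bytes/sec while allowing
    bursts of up to `capacity` bytes.
    """
    def __init__(self, rate, capacity):
        self.rate = float(rate)          # refill rate, bytes per second
        self.capacity = float(capacity)  # maximum burst size, bytes
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def delay_for(self, nbytes):
        """Seconds the caller should wait before sending nbytes."""
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last call.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= nbytes
        if self.tokens >= 0:
            return 0.0
        # Ran a deficit: wait until the refill covers it.
        return -self.tokens / self.rate

# A 100 kbps cap is 12500 bytes/sec:
bucket = TokenBucket(rate=12500, capacity=12500)
print(bucket.delay_for(12500))  # first burst goes straight through
print(bucket.delay_for(12500))  # the next one must wait about a second
```

An uploader would sleep for `delay_for(len(chunk))` seconds before each
write, which keeps the long-run average at the configured rate.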
> And let's not even talk about HD video.
Oh but I like HD video!
http://testgrid.allmydata.org:3567/uri/URI%3ADIR2%3Adjrdkfawoqihigoett4g6auz6a%3Ajx5mplfpwexnoqff7y5e4zjus4lidm76dcuarpct7cckorh2dpgq/video/dirac_codec-use_VLC_to_play
> I'd expect a lot bigger performance issue from the erasure coding
> (BTW: ever considered Tornado coding instead of Reed-Solomon?).
Yes! In fact, allmydata.com once had a commercial license from
Digital Fountain to use their patented erasure codes and their
proprietary implementation. That erasure code was used in the
previous generation of allmydata.com's commercial product. When we
decided to open source the next generation, which became Tahoe and
Allmydata v3.0, we also investigated whether we could make do with an
unpatented, open source erasure code, and discovered that Rizzo's
classic Reed-Solomon implementation worked great -- which became zfec.
Also, zfec and Tahoe cheat as much as possible. The best
optimization is always to cheat and not do the work at all. The
first K shares that zfec creates are actually just the content of the
file -- i.e., what you would get from running the unix "split"
command. Tahoe tries to download those K shares (what we call
"primary shares") first, and the more of them it gets, the less math
it has to do to reconstruct the file. If it gets all K of the
primary shares then it does no math at all when downloading -- it
just "unsplits". :-)
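That "cheating" is a property of systematic codes. A minimal pure-Python
sketch of the idea follows -- this is not the real zfec API, and the
function names are made up; real zfec also produces M-K secondary shares
via Reed-Solomon and records the true file length:

```python
def make_primary_shares(data, k):
    """Split data into k equal-length primary shares, like unix `split`.

    (Toy illustration only; real encoding also emits secondary shares.)
    """
    share_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(share_len * k, b"\x00")
    return [padded[i * share_len:(i + 1) * share_len] for i in range(k)]

def unsplit(shares, original_len):
    """With all k primary shares in hand, decoding is just concatenation."""
    return b"".join(shares)[:original_len]

data = b"some photo bytes, say"
shares = make_primary_shares(data, 3)
assert unsplit(shares, len(data)) == data  # no Reed-Solomon math needed
```

Only when one or more primary shares are missing does the decoder have to
fall back to actual erasure-decoding arithmetic over the secondary shares.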
By the way, there is a paper coming out in FAST '09 about the
performance of open source software erasure codes that measures zfec
among others.
> Okay, so here's a possibility. If I can ensure that K shares are
> stored on my mom's machine, and if Tahoe is clever enough to use
> those shares when she's browsing those files (doesn't seem
> difficult), rather than pulling from the network, then perhaps
> browsing my photos will be fast enough. The RS reconstruction and
> the decryption shouldn't be a big deal, and neither should applying
> a short sequence of forward deltas. Some performance testing is in
> order.
Yes! Performance testing! We have some automated performance
measurements here:
http://allmydata.org/trac/tahoe/wiki/Performance
It is telling me that my patches to refactor immutable files last
week and make them more robust against bizarre patterns of corruption
have slowed down the upload and download speeds. :-(
By the way, one unfortunate thing about the way that we are using
rrdtool to keep those performance graphs is that we're losing history
which is more than one year old. :-(
> Cool. That's probably good enough that the added optimization of
> avoiding the storage of common files completely isn't worth the
> effort.
Have you seen this thread? It might be a good project for you, as it
is self-contained, requires minimal changes to the tahoe core itself,
and is closely related to your idea about good backup:
http://allmydata.org/pipermail/tahoe-dev/2008-September/000809.html
> Okay. I grabbed the darcs repo (dang is that sloowww! Anybody for
> switching to git? ;-)) and I'll start from there.
I updated the instructions on http://allmydata.org/trac/tahoe/wiki/Dev
to suggest using darcs-v2 and to warn that using darcs-v1 will
take tens of minutes for the initial get. I would entertain the idea
of switching to git, even though I love darcs and contribute to darcs
and use it all the time, solely in order to be more friendly toward
potential contributors who love git.
> I haven't had a chance to look through the code much yet. Is there
> an overview document somewhere that covers the structure?
Start here:
http://allmydata.org/trac/tahoe/wiki/Doc
Then update the wiki and/or submit patches making it easier for the
next person who starts there to find what they are looking for. :-)
Regards,
Zooko
---
Tahoe, the Least-Authority Filesystem -- http://allmydata.org
store your data: $10/month -- http://allmydata.com/?tracking=zsig