[tahoe-dev] the small nodes problem

Brian Warner warner at lothar.com
Sun Dec 27 22:33:37 PST 2009


Jody Harris wrote:
> I am forwarding this discussion per David-Sarah's suggestion. I was
> not sure it was pertinent to the list at the time I initiated the
> conversation based on a statement made in another thread.

David-Sarah is right.. tahoe-dev is a great place for questions like
these!

>> My reason for asking is that I am trying to pull together a
>> tahoe-lafs grid, and I have one participant who insists on creating
>> "discrete" nodes (1-4GB) in virtual machines. He convinced that this
>> is the best method. I have attempted reason, but it has failed. I did
>> NOT see the implication you pointed out in your message.

Yup. Each node is considered an equal participant in the
server-selection process. Really tiny nodes are likely to be filled
quickly, after which they just sort of get in the way. The downloader
always has to ask them if they have a share or not (because that might
be the one file they're holding). The current implementation uploader
will ask them if they'll hold a share (even though they'll probably say
"no"), so they'll add to the number of round trips that are spent before
the upload really starts and slow things down overall. Eventually, when
we rewrite the uploader, we should be able to improve this to skip over
known-full servers and avoid spending those RTTs.
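
(As a purely illustrative sketch, not the current Tahoe code: the
improvement amounts to remembering which servers refused a share and
leaving them out of later selections. The class and method names below
are made up.)

  class ServerSelector:
      """Hypothetical peer-selection helper that skips known-full servers."""

      def __init__(self, servers):
          self.servers = list(servers)
          self.known_full = set()

      def candidates(self):
          # only ask servers that have not already refused a share
          return [s for s in self.servers if s not in self.known_full]

      def record_refusal(self, server):
          # a server that said "no" is probably full; stop spending an
          # RTT on it for every future upload
          self.known_full.add(server)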

It's not that anything will really break if you have both small nodes
and large ones, it's just that things will degrade slightly. And if your
traffic rate is low, and a 4GB machine takes months to fill, then hey,
it's useful to have around. Make a guess as to your overall upload rate,
then measure your servers in units of time. My DSL line can upload about
4GB per day, with 3-of-10 encoding that represents about 400MB per
server per day, so I could fill a 4GB server in about ten days..
probably not as useful as I'd like. A 40GB server that would take me
three months to fill would seem pretty big.
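
(To make the arithmetic concrete, here's a rough sketch using the
numbers above; the function name and parameters are just for
illustration, and it assumes shares are spread evenly across N servers
with your upstream link as the bottleneck.)

  def days_to_fill(server_capacity_gb, upstream_gb_per_day, total_shares_n):
      # each of the N servers receives roughly 1/N of the share traffic
      gb_per_server_per_day = upstream_gb_per_day / total_shares_n
      return server_capacity_gb / gb_per_server_per_day

  days_to_fill(4, 4, 10)    # ~10 days to fill a 4GB server
  days_to_fill(40, 4, 10)   # ~100 days (about three months) for 40GB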

>> If this is true, it should be documented somewhere. If my
>> extrapolation of your statement is correct, then it could be
>> beneficial to the grid in the extreme long-run to remove smaller
>> contributions to the grid (40-200GB) at some point in the grid's
>> life, if 1-4 TB storage nodes became the nominal storage node size --
>> depending on the usage of the grid and other factors, of course.

You're right, we should find a way to document it. We kind of need a
document that explains what sorts of grids we think will work well, and
which might capture some of the experience we've had so far with grids
of various shapes and sizes.

I guess we've not gotten around to it because we don't really know how
to address it. There are actually several use-cases we don't really know
how to handle:

 * large numbers of nodes: like thousands or millions. We can handle a
   few hundred just fine, but haven't tried a thousand yet.
 * high node churn: nodes coming and going quickly. Our experience so
   far is with servers that stay up for at least a couple of days at a time.
 * low node reliability. We've experienced a few hard drive failures.
 * high diversity of node bandwidth
 * high diversity of node capacity (disk size)

Some projects have taken on these sorts of tasks: a lot of the DHTs
(Chord, Kademlia, Bamboo, Gnutella, Freenet) are built to handle
millions of nodes. And we've got a few ideas about how to go in that
direction. But for the most part we've made a choice to narrow Tahoe's
scope in the interests of getting it done and getting it working for our
initial set of use cases: friendnets and allmydata.com .

Handling large numbers of nodes will probably involve a log(N) overlay
network, to reduce the number of simultaneous connections that must be
maintained. It will probably also necessitate improvements in the way we
keep track of which pieces of data are on which servers, and if we can
do that efficiently, we can probably handle capacity-diversity pretty
well.

Diversity of bandwidth is trickier.. taking advantage of a node that's
behind a dialup line is tough. That would involve some clever (and
patient) scheduling, in which you tell your client that you want a
couple of files eventually, and it then decides the best order to pull
them down, and tells you when you can expect to have them (perhaps
several days from now). A system like that might use the slow nodes
purely for backup copies: putting 'k' shares on fast nodes and stashing
the rest of the redundancy behind the slow links.
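
(A minimal sketch of that placement idea, assuming we already know
which servers are fast and which are slow; this is not existing Tahoe
behavior, and the names are hypothetical.)

  def place_shares(fast_servers, slow_servers, k, n):
      # put the first k shares (enough to download) on fast servers,
      # and the remaining n-k shares of redundancy behind the slow links
      ordered = list(fast_servers)[:k] + list(slow_servers) + list(fast_servers)[k:]
      return list(enumerate(ordered[:n]))

  place_shares(["fastA", "fastB", "fastC"], ["dialup1", "dialup2"], k=3, n=5)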

Diversity of reliability and node churn are the two hardest problems.
There's a paper by Blake and Rodrigues ("High Availability, Scalable
Storage, Dynamic Peer Networks: Pick Two") that argues it simply can't
be done. The problem is that when nodes disappear at random, you have to
assume they won't come back, so you have to repair quickly, and you can
easily end up spending all of your bandwidth on repair and have nothing
left over for real traffic. I personally believe there are economic
incentive schemes that could help here (to encourage people to stick
around longer, like if they lost money for disappearing suddenly), but
MojoNation/Mnet were going in that direction, and their digital credit
schemes were among the first bits of complexity that were discarded when
Tahoe (and its predecessor) were started.
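
(A back-of-the-envelope version of the Blake/Rodrigues argument: if
nodes leave for good after an average lifetime T, roughly the entire
stored dataset has to be regenerated once per T. The numbers below are
made up just to show the shape of the problem.)

  def repair_gb_per_day(stored_gb, mean_node_lifetime_days):
      # data lost per day must be re-uploaded to keep redundancy intact
      return stored_gb / mean_node_lifetime_days

  repair_gb_per_day(1000, 90)   # stable servers: ~11GB/day of repair
  repair_gb_per_day(1000, 2)    # high churn: ~500GB/day, which can easily
                                # exceed the bandwidth left for real traffic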


So, in short, your grid will probably work better if it has servers that
will take months to fill instead of days, and has dozens or maybe
hundreds of them instead of thousands. In other words, it should look
somewhat like the grids we already have experience with :-).

cheers,
 -Brian


