[tahoe-dev] detecting weak uploads and share rebalancing Re: Share rebalancing
Zooko Wilcox-O'Hearn
zooko at zooko.com
Mon Oct 19 14:36:44 PDT 2009
N.B. there are four different arguments about ticket #302 in this
letter; don't mix them up.
On Monday, 2009-10-19, at 13:34, Brian Warner wrote:
> .. and we'd suffer from the "lumpy distribution" problems that are
> discussed in ticket #302, where servers get unequal load depending
> upon where they sit in the ring, and where servers who become full
> can dump inordinate amounts of traffic on the poor node just
> clockwise from them.
... [reordering quotes from your message]
> I still believe that permuted-ring gives us better overall
> behavior. I'm willing to be proved wrong, though :).
1. I wrote a simulation which convinced me that this is wrong --
that both share placement algorithms show an indistinguishable (and
highly varying) pattern of servers filling up. However, the results
I posted to tahoe-dev were confusing and hard to follow, and you
seem to have overlooked them. I see that I never linked to them
from #302 either; I should dig that letter out of the tahoe-dev
archives and link to it from #302. Here it is:
http://allmydata.org/pipermail/tahoe-dev/2008-July/000676.html
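For reference, here is a minimal sketch of the kind of experiment I
mean (not the original script -- the server count, capacity, and
encoding parameters below are made up): place shares of many random
files onto equal-capacity servers under each placement strategy and
record the order in which servers fill up.

    # Minimal sketch, not the original script: server count, capacity,
    # and encoding parameters are made up; one share per server, and
    # shares that don't fit spill over to the next server in the order.
    import hashlib, os

    NUM_SERVERS = 20
    CAPACITY = 500      # shares per server (arbitrary)
    N = 10              # shares per file

    def sha(*parts):
        return hashlib.sha256(b"".join(parts)).digest()

    server_ids = sorted(os.urandom(20) for _ in range(NUM_SERVERS))

    def permuted_ring(si):
        # permute-per-fileid: order servers by hash(peerid + storage index)
        return sorted(server_ids, key=lambda sid: sha(sid, si))

    def si_offset_ring(si):
        # plain ring: fixed server order, start at the SI's position
        start = sum(1 for sid in server_ids if sid < si) % NUM_SERVERS
        return server_ids[start:] + server_ids[:start]

    def fill_order(placement):
        used = dict.fromkeys(server_ids, 0)
        order = []
        while len(order) < NUM_SERVERS:
            placed = 0
            for sid in placement(os.urandom(20)):  # a random storage index
                if used[sid] < CAPACITY:
                    used[sid] += 1
                    placed += 1
                    if used[sid] == CAPACITY:
                        order.append(sid)
                    if placed == N:
                        break
        return [server_ids.index(sid) for sid in order]

    print("permuted ring: ", fill_order(permuted_ring))
    print("SI-offset ring:", fill_order(si_offset_ring))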
> And, an attacker who took out several neighboring-in-id-space
> servers would kill or seriously damage several files (if they took
> out N-k+1 consecutive servers in a non-trivially utilized grid,
> they'd be guaranteed to kill some files).
[snip more interesting consequences of placement strategy on an
attacker who attacks many servers]
2. Neat! I hadn't thought of this malicious case before. Perhaps
you could add a link from #302 to your letter describing it.
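For concreteness, here is a back-of-the-envelope sketch of my own
(not from Brian's letter) of the "N-k+1 consecutive servers" claim,
assuming one share per server and shares laid out clockwise from the
storage index on a fixed ring:

    # Sketch only: k-of-N encoding, one share per server, a file's N
    # shares placed on consecutive servers clockwise from its starting
    # position; negative positions stand for servers just
    # counter-clockwise of the destroyed span (i.e. ring wraparound).
    k, N = 3, 10
    destroyed = set(range(0, N - k + 1))      # N-k+1 = 8 consecutive servers
    for start in range(-(k - 1), 1):          # the k starting positions at or
        placed_on = set(range(start, start + N))  # just before the span's start
        survivors = len(placed_on - destroyed)
        print("start=%2d: %d of %d shares survive (need %d)"
              % (start, survivors, N, k))
    # each such file keeps only k-1 = 2 shares and is unrecoverable; with
    # permuted placement the same attack is not *guaranteed* to kill any
    # particular file, which is Brian's point.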
3. We were already familiar with designs like Chord (and Chord File
System) and Kademlia when you (Brian) came up with the "permute per
fileid" trick. It should be seen as an (arguable) improvement on
those designs. However, Chord and Kademlia have been deployed
successfully, sometimes on a massive scale -- e.g. Cassandra DB [1]
and Vuze [2] -- and load-balancing is an issue for them too. This
suggests
that either this phenomenon isn't a problem in many situations in
practice (which would be consistent with my simulation -- argument 1)
or that the designers of Cassandra DB and Vuze ought to think about
adopting the permute-per-fileid trick (or both).
In fact, Cassandra's unique appeal among "post-relational" (a.k.a.
"nosql") databases is that it supports range queries, and the way it
does so relies upon the "natural" Chord ordering. If you're familiar
with Chord, you can think of Cassandra as being a lot like Chord plus
the added feature of *not* running the keys through a secure hash to
load-balance them. This is an "improvement" on Chord in the opposite
direction of our improvement! :-) It makes the load-balancing
properties much *worse* than the standard Chord load-balancing.
If anyone actually uses Cassandra in this mode, that demonstrates
that the sort of "balancing tools" discussed in e.g. [3] are
workable for at least some people.
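As a toy illustration of that trade-off (my own construction, not
Cassandra code): bucketing keys by their natural order keeps
lexically adjacent keys together, which is what makes range queries
possible, but a clustered keyspace then piles up on a few nodes;
hashing the keys first evens out the load and destroys the ordering.

    # Toy illustration, not Cassandra code: four hypothetical nodes and
    # a clustered keyspace (every key shares the "user:" prefix).
    import hashlib
    from collections import Counter

    NUM_NODES = 4
    keys = ["user:%05d" % i for i in range(10000)]

    def natural_node(key):
        # order-preserving: partition by the key's own first byte, so
        # lexically adjacent keys stay together -- range queries work,
        # but the load is as skewed as the keyspace is
        return ord(key[0]) * NUM_NODES // 256

    def hashed_node(key):
        # Chord/Tahoe-style: hash the key first, so the load evens out
        # but adjacent keys scatter across nodes
        return hashlib.sha256(key.encode()).digest()[0] * NUM_NODES // 256

    print("natural order:", Counter(natural_node(k) for k in keys))
    print("hashed       :", Counter(hashed_node(k) for k in keys))
    # natural order puts every key on a single node; hashing spreads them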
4. This thread started because Shawn Willden needed to do some
mucking about with his shares, and the permute-per-fileid feature
makes that mucking harder. This is a real live example of my
argument (in e.g.
http://allmydata.org/pipermail/tahoe-dev/2008-July/000672.html )
that the simpler placement strategy can help people administer
their Tahoe-LAFS deployments. I need to link to this thread from
#302 and cite it as an example.
Of these four arguments, I think argument 4 is the most important.
I think the next steps here are to document the arguments better on
ticket #302 and also to create a new separate ticket which is all
about making per-file-permutation optional (leaving #302 as the
repository of arguments about which is better in what situations,
whether the default should be changed, etc.).
Regards,
Zooko
tickets mentioned in this email:
http://allmydata.org/trac/tahoe/ticket/302 # stop permuting peerlist,
use SI as offset into ring instead?
[1] http://wiki.apache.org/cassandra/PoweredBy # says Cassandra is
used for inbox search at Facebook, which is up to 40 TB of data
across 120 machines in two separate data centers
[2] http://vanish.cs.washington.edu/pubs/usenixsec09-geambasu.pdf #
says that Vuze has a million nodes at a time, spanning the globe
[3] http://allmydata.org/trac/tahoe/ticket/302#comment:7