[tahoe-dev] stop permuting the peerlist per storage index? (was: repairer work)

zooko zooko at zooko.com
Sat Jul 12 13:11:40 PDT 2008


On Jul 11, 2008, at 15:56 PM, Brian Warner wrote:

 > If we had a list of all the shares that were present on that drive
 > before it died (a non-trivial but solvable task), then we could
 > initiate repair work on that list. We could even maintain a big
 > database of which share was stored where, and use that to count the
 > remaining shares: then rather than repairing upon the first scratch,
 > we could wait until the file was actually in danger to do the repair
 > (say, 5 shares left). This "repair threshold" could be tuned to
 > balance the amortization rate of repair work (repairs per unit time)
 > against the desired reliability (probability that you'd lose the
 > remaining necessary shares before repair finished). This technique
 > could cut our repair work by a factor of five relative to the
 > instant-repair model.

This reminds me of the persistent issue of "should we stop permuting
peerlist, use storage index as offset into ring instead?".

I recently re-read the ticket (#302), and I continue to suspect that
the kind of load-distribution which the permuted peerlist adds might
turn out to be unnecessary or even actively detrimental to operations.

There are several potential advantages to the simpler technique of
keeping the peers in a single global ring, ordered by their node ids,
and letting the storage index serve as the offset into that ring.
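
For concreteness, here is a rough Python sketch of how the two
selection orders differ.  The function names and the hash details are
mine, not the actual Tahoe code, and it assumes node ids and storage
indexes are byte strings drawn from the same flat space:

    import bisect, hashlib

    def ring_order(server_ids, storage_index):
        # simple ring: sort the servers once by node id, then start
        # reading at the first node id >= the storage index and wrap
        ring = sorted(server_ids)
        start = bisect.bisect_left(ring, storage_index)
        return ring[start:] + ring[:start]

    def permuted_order(server_ids, storage_index):
        # permuted peerlist: every storage index sees the servers in
        # its own pseudo-random order
        key = lambda sid: hashlib.sha1(storage_index + sid).digest()
        return sorted(server_ids, key=key)

With the ring, two nearby storage indexes see nearly the same list of
servers; with the permutation, each storage index sees an unrelated
list.  That difference is what the rest of this message is about.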

Please keep in mind that I am not asserting that people *will* need to
do the following things, only that they *might*, and that if Tahoe
v1.2 permutes the peerlist for each storage index then it will
*prevent* people from having the option of doing these things later:

1.  It's simpler and easier to explain.

2.  You can use the list of shares stored on your ring-neighbors as a
"free database" of which shares were likely to have been stored on
you.  This is a way to implement the "list of all the shares that
were present on that drive before it died", or the "big database of
which share was stored where" that Brian mentioned, without having to
start collecting that information now.  It might not turn out to be
the best way to implement those two features, but then again it
might.  If Tahoe v1.2 uses a simple ring of peers, we get this for
free; if Tahoe v1.2 permutes the peerlist, this information will not
be available later.

3.  You can use the placement of servers on the ring for load
management.  Suppose some servers are filling up too rapidly, and you
want traffic to go to them less often, but you don't want to actually
put them into read-only mode.  You can do this ("you" being a Tahoe
operator, not a Tahoe developer) by placing newly deployed, empty
servers before them in the ring.

4.  You can use the placement of servers on the ring for catastrophe
decorrelation.  Suppose you have three colos, forty-five storage
servers, and 3-out-of-12 encoding.  You (again as a Tahoe operator,
not a Tahoe developer) can arrange for optimal placement of your
files by making the sequence of servers around the ring alternate:
colo1, colo2, colo3, colo1, colo2, colo3, and so on.  This placement
is optimal in the sense that each colo ends up holding four of the
twelve shares, so the failure of any two of the colos cannot destroy
any file, and any single-share failure can be repaired using only
intra-colo bandwidth (a toy check of this arithmetic follows below).
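
To see why that alternating placement works out, here is a toy check
in Python.  It assumes every file's twelve shares land on twelve
consecutive ring positions, which ignores full or offline servers:

    # 45 servers on the ring, colos assigned round-robin,
    # 3-of-12 encoding
    NUM_SERVERS, NUM_COLOS, N, K = 45, 3, 12, 3
    colo_of = [i % NUM_COLOS for i in range(NUM_SERVERS)]

    for start in range(NUM_SERVERS):
        window = [colo_of[(start + j) % NUM_SERVERS] for j in range(N)]
        shares_per_colo = [window.count(c) for c in range(NUM_COLOS)]
        # each colo holds 4 of the 12 shares, so losing any two colos
        # still leaves 4 >= K=3 shares, and a colo's own 4 shares are
        # enough to regenerate any one of them using intra-colo
        # bandwidth only
        assert shares_per_colo == [N // NUM_COLOS] * NUM_COLOS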


So these are some interesting reasons why it might be good for Tahoe
v1.2 to use a simple ring (and perhaps also to make it easier for
people to select the location of a server on the ring).  I am
particularly interested in the fact that #2 and #3 can be done by a
Tahoe operator without changing any Tahoe source code or requiring any
help from a Tahoe developer.  (By the way, I can think of some
potential problems with using these techniques, and some potential
solutions to those problems.  We won't really know whether we want to
use these techniques until later, at which point we may be handicapped
if the version of Tahoe at that time still uses permuted peerlists.)


Now what is the reason to use the permuted peerlist instead of the
simple ring?  It is that the permuted peerlist has a subtly more even
load-distribution property in the case of several servers being maxed
out at once.  (Brian: please correct me if that sentence isn't
accurate.)  The precise effect of this subtly more even distribution
is unclear to me, and I rather suspect that it would be either
negligible or actually detrimental in practice.

So far, allmydata.com *has* experienced both having lots of maxed-out
servers and deploying lots of new servers into the mix.  As far
as I can tell, the permuted peerlist feature has been neither a help
nor a hindrance to allmydata.com in this operation.

I'll bet that a simple simulation could clarify for me and for others
what effects the permuted peerlist has on load distribution.  Perhaps
I'll cook one up in Python and post it to this list to serve as an
aid to reasoning.
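
For what it's worth, the following is the sort of quick-and-dirty
simulation I have in mind.  The server ids, the hash, and all of the
parameters are made up, and it only measures how evenly new shares
land when some fraction of the servers are already full:

    import hashlib, os, random

    def permuted_order(server_ids, storage_index):
        key = lambda sid: hashlib.sha1(storage_index + sid).digest()
        return sorted(server_ids, key=key)

    def ring_order(server_ids, storage_index):
        ring = sorted(server_ids)
        start = len([sid for sid in ring if sid < storage_index])
        return ring[start:] + ring[:start]

    def simulate(order_fn, num_servers=100, num_full=20,
                 num_files=10000, n=10):
        servers = [os.urandom(20) for _ in range(num_servers)]
        full = set(random.sample(servers, num_full))  # maxed-out servers
        load = dict((sid, 0) for sid in servers)
        for _ in range(num_files):
            storage_index = os.urandom(16)
            order = order_fn(servers, storage_index)
            writable = [sid for sid in order if sid not in full]
            for sid in writable[:n]:  # first n writable servers get a share
                load[sid] += 1
        counts = sorted(load[sid] for sid in servers if sid not in full)
        return counts[0], counts[len(counts) // 2], counts[-1]

    print("permuted min/median/max:", simulate(permuted_order))
    print("ring     min/median/max:", simulate(ring_order))

My guess is that the permuted order will show a somewhat tighter
spread, and the interesting question is whether that tightness is
worth the operational flexibility described above.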

Regards,

Zooko

tickets mentioned in this e-mail:

http://allmydata.org/trac/tahoe/ticket/302 # stop permuting peerlist, use SI as offset into ring instead?


