[tahoe-dev] [tahoe-lafs] #1170: does new-downloader perform badly for certain situations (such as today's Test Grid)?

Sat Aug 14 17:42:08 UTC 2010

On 8/14/10 6:15 AM, Greg Troxel wrote:
> 
>>  attachment:1.8.0c2-r4698-run-106-down-0.html and every request-block
>>  in it (for three different shares) went to the same server --
>>  nszizgf5 -- which was the first server to respond to the DYHB
>>  (barely) and which happened to be the only server that had three
>>  shares. So at least for that run, Brian's idea that fetching blocks
>>  of different shares from the same server is a significant slowdown
>>  seems to be true.
> 
> It's pretty well connected; I get 15 Mb/s to MIT and only 9 Mb/s to a
> machine on FiOS.

Yeah, that's why I want the downloader to do more (at least in the long
run) than merely try for diversity. Getting all three shares from Greg's
fast machine ought to give you a faster download than getting even one
share from a much slower machine (say, one more than 3x slower).
Diversity as the sole criterion only gives ideal performance when all
the servers can return shares at the same rate.
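
To make that arithmetic concrete, here is a toy model. It is only a
sketch: it assumes each server's bandwidth is the bottleneck, that
requests to different servers run in parallel, and that latency and
client-side limits can be ignored. The function name, server names, and
rates are made up for illustration and are not measurements from the
testgrid.

```python
from collections import Counter


def download_time(assignment, rates, share_size):
    """Estimate the time to fetch one share per entry in `assignment`.

    assignment: list of server names, one entry per share fetched
    rates:      dict mapping server name -> bytes per second (hypothetical)
    share_size: bytes per share

    Servers work in parallel, so the slowest server sets the total time.
    A server asked for n shares needs n * share_size / rate seconds.
    """
    shares_per_server = Counter(assignment)
    return max(
        count * share_size / rates[server]
        for server, count in shares_per_server.items()
    )


# Hypothetical rates: one fast server, two slower ones.
uneven = {"fast": 6.0, "slow1": 1.0, "slow2": 3.0}
all_from_fast = download_time(["fast", "fast", "fast"], uneven, 12)
diverse_uneven = download_time(["fast", "slow1", "slow2"], uneven, 12)

# With equal rates, spreading the shares out wins instead.
even = {"a": 6.0, "b": 6.0, "c": 6.0}
one_server_even = download_time(["a", "a", "a"], even, 12)
diverse_even = download_time(["a", "b", "c"], even, 12)
```

In this model the all-from-fast plan beats the diverse plan whenever some
server in the diverse plan is more than 3x slower than the fast one, and
diversity wins when the rates are equal.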

But Zooko has seen repeatable longish-term slowdowns, and has seen them
correlated with non-diverse server selection. So the next step of the
experiment is to change the downloader to try for diversity and see how
that affects its observed behavior.
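
For illustration only, "try for diversity" could be sketched as a greedy
pick that always prefers the server currently serving the fewest shares.
This is not the patch for #1170 and not the downloader's real data
structures: `pick_diverse`, the share numbers, and the servers other than
nszizgf5 are hypothetical.

```python
def pick_diverse(shares_by_server, k):
    """Choose up to k distinct shares, spreading them across servers.

    shares_by_server: dict mapping server name -> set of share numbers
                      that server reported holding
    Returns a dict mapping share number -> server chosen to supply it.
    """
    chosen = {}                                   # share number -> server
    load = {server: 0 for server in shares_by_server}
    while len(chosen) < k:
        # Every (server, share) pair whose share we do not have yet,
        # keyed so that the least-loaded server sorts first.
        candidates = [
            (load[server], server, sharenum)
            for server, sharenums in shares_by_server.items()
            for sharenum in sharenums
            if sharenum not in chosen
        ]
        if not candidates:
            break                                 # fewer than k shares exist
        _, server, sharenum = min(candidates)
        chosen[sharenum] = server
        load[server] += 1
    return chosen


# One server holds three shares; two hypothetical others hold one each.
responses = {"nszizgf5": {0, 1, 2}, "serverB": {1}, "serverC": {2}}
plan = pick_diverse(responses, 3)
```

A greedy pick like this is not always optimal (a proper matching would
sometimes find a more even spread), and it ignores server speed entirely,
which is the limitation discussed above.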

My plan is to write up a preliminary patch (today or tomorrow), add it
to #1170, and have Zooko run more experiments. It may take me another
day or two to get tests written, so I'll post a not-yet-unit-tested
patch first.

> As an aside, at some point I'm likely to take my machines out of the
> testgrid. Given how small the testgrid is, that's likely to cause
> trouble, but I'm taking it at face value that it's a *test*grid, and
> assuming the ensuing recovery problems that cause unusual code paths
> to be exercised will be viewed as a positive testing feature.

Yup. Folks who keep data they care about on the testgrid should be doing
a deep-check-and-repair probably weekly, and should manually check that
they can connect to enough servers to tolerate the loss of a few. Tahoe
is still in the phase where it works best on stable grids, but tolerance
to instability is still part of the design, and should be exercised
every now and then :).

cheers,
 -Brian