[volunteergrid2-l] Fw: apparent correlation of server node response times

Brian Warner warner at lothar.com
Mon Mar 26 21:24:09 UTC 2012


On 3/26/12 1:17 PM, Johannes Nix wrote:

> What had called my attention was that not only some nodes take
> occasionally a long time for map-update, but that also if one node
> takes a time of several seconds, it seems way more frequent that other
> server nodes do too. Response times between server nodes are
> correlated.

A few possibilities come to mind:

1: some code doesn't parallelize well, e.g. immutable upload's
   server-selection logic asks one server at a time, waiting for it to
   respond before sending the next request. If the request-sent time
   isn't being recorded correctly, it might look like the later servers
   are being slow, when in fact it's just one of them. (I don't think
   this is the problem, though.)

2: the client is really busy sending a lot of data, so the outbound
   requests are all stalled the same way. Or maybe most of the requests
   get sent, then a big bunch of responses are processed, then the last
   two requests make it out, somehow making it look like multiple
   servers are being slow when in fact it's the client's problem.

3: if multiple servers are having the NAT-induced silent connection loss
   problem, then a little trick in the Tahoe reconnection logic might
   confuse matters: each time we establish a connection, we restart the
   reconnection timers on all the other connections. The intention was
   to recover well from a laptop sleep: you plug back into the network
   (which userspace doesn't learn about), some time later one of your
   Reconnectors finally fires and establishes one server connection,
   then all the others are immediately triggered and reconnect too.
   (Before we did that, the client would spend a long time in a state
   where only one server was connected, which is obviously bad for
   reliability.)

   So maybe you've got two lame servers, and there's a blocking
   operation that won't complete until it has heard from both of them.
   The first one finally hits its keepalive timeout, hangs up, and
   reconnects, and that triggers the second one to reconnect too.
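
A toy model of the timer trick in #3 may make the effect easier to
see. This is only a sketch, not the actual Tahoe or Foolscap code: the
function name, the server names, and the delays are all made up, and it
assumes every reconnection attempt succeeds instantly.

```python
def connection_times(retry_delays, reset_on_connect=True):
    """Return {server: seconds until its connection is re-established}.

    retry_delays maps a (hypothetical) server name to the number of
    seconds until that server's own reconnection timer would fire.
    Assumption: every attempt succeeds as soon as it is made.
    """
    first = min(retry_delays.values())
    times = {}
    for name, delay in retry_delays.items():
        # With the trick enabled, the first successful connection
        # immediately triggers all the other reconnection attempts.
        times[name] = first if reset_on_connect else delay
    return times


# Two lame servers whose timers would fire 40s and 900s from now.
delays = {"server-a": 40.0, "server-b": 900.0}
print(connection_times(delays))
print(connection_times(delays, reset_on_connect=False))
```

With the trick enabled, both servers come back at the same moment, so
an operation that was blocked on both of them completes right then, and
both look equally slow.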

You might look at the "Connected Since" column on the Welcome page,
checking up on it every 15 or 30 minutes. You're looking for a set of
servers with connections that are never more than maybe 30 minutes old.
If there are multiple such servers, and they're the same ones that are
consistently "slow", that suggests problem #3. Also, if some slow
operation finally finishes at, say, 2:15pm, and immediately afterwards
the Connected Since column shows the slow servers as being connected
since 2:15pm, that's a strong piece of evidence.
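
If you write the Connected Since values down by hand each time you
check, the comparison can be sketched roughly like this. The data
layout, the function names, and the thresholds are assumptions made for
illustration; nothing here reads the Welcome page itself.

```python
from datetime import datetime, timedelta


def never_older_than(samples, limit=timedelta(minutes=30), min_samples=3):
    """Return the servers whose connection was never older than `limit`.

    samples maps a server name to a list of (observed_at, connected_since)
    pairs noted down from the Welcome page. Servers with fewer than
    min_samples observations are skipped, since a single young
    connection proves nothing.
    """
    suspects = set()
    for server, observations in samples.items():
        if len(observations) < min_samples:
            continue
        if all(seen - since <= limit for seen, since in observations):
            suspects.add(server)
    return suspects


def reconnected_near(connected_since, finished_at,
                     tolerance=timedelta(minutes=1)):
    """True if a server reconnected within `tolerance` of the moment
    the slow operation finished."""
    return abs(connected_since - finished_at) <= tolerance


t0 = datetime(2012, 3, 26, 14, 0)
m = timedelta(minutes=1)
samples = {
    # Reconnects again and again: ages of 10, 5, and 10 minutes.
    "lame": [(t0, t0 - 10 * m), (t0 + 30 * m, t0 + 25 * m),
             (t0 + 60 * m, t0 + 50 * m)],
    # Connected two hours before the first check and never dropped.
    "steady": [(t0, t0 - 120 * m), (t0 + 30 * m, t0 - 120 * m),
               (t0 + 60 * m, t0 - 120 * m)],
}
print(never_older_than(samples))
```

Any server that this flags and that is also on your list of slow ones
is a candidate for #3.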


cheers,
 -Brian

