[tahoe-dev] connection timeouts on VG2

Greg Troxel gdt at ir.bbn.com
Mon Mar 26 11:54:13 UTC 2012


After actually reading all of your note, a few more comments:

  If the keepalive is set to 120s, then it seems simply broken (and
  likely easy to fix) to have the keepalive happen at 240s.  120s as a
  keepalive timer should mean that a keepalive is sent 120s after the
  last recorded transmission.
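
  To illustrate the semantics I mean, here is a minimal sketch (not
  foolscap's actual implementation; the class and method names are
  made up):

      import time

      class Keepalive:
          """Send a ping once the link has been quiet for `interval` seconds."""

          def __init__(self, interval=120):
              self.interval = interval
              self.last_tx = time.monotonic()

          def record_transmission(self):
              # call on every outbound frame, pings included
              self.last_tx = time.monotonic()

          def seconds_until_ping(self):
              # the next ping is due `interval` after the last transmission,
              # not at the next tick of a free-running 120s timer (which is
              # how you end up sending at 240s)
              return max(0.0, self.last_tx + self.interval - time.monotonic())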

  It would be nice if tahoe nodes scoreboarded these timeouts (in
  sqlite or whatever), and if a node that notices timeouts to many
  peers ratcheted down its timeout automatically.  A node that notices
  timeouts to one peer should perhaps lower the timeout for that one
  peer only.  (I concur with your notion of scoreboarding for
  reporting, as well, but am making a further suggestion.)
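
  Concretely, the scoreboard could be a single sqlite table; this is
  just a sketch, with made-up names and schema:

      import sqlite3
      import time

      db = sqlite3.connect("stalls.db")
      db.execute("CREATE TABLE IF NOT EXISTS stalls (peer TEXT, stamp REAL)")

      def record_stall(peer):
          # call whenever an RPC to `peer` times out
          db.execute("INSERT INTO stalls VALUES (?, ?)", (peer, time.time()))
          db.commit()

      def stalls_within(peer, window):
          # how many stalls this peer has had in the last `window` seconds
          cutoff = time.time() - window
          (n,) = db.execute(
              "SELECT COUNT(*) FROM stalls WHERE peer = ? AND stamp > ?",
              (peer, cutoff)).fetchone()
          return n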

  You said that the only way we know connections have failed is to wait
  for TCP to give up.  I don't think that's true; not hearing back in
  10s is a pretty big clue.  It occurs to me that opening a new
  connection, retrying the RPC on it, and immediately closing the old
  one is not so bad; at 100x the RTT, surely the old one is messed up
  somehow.  This feels a little marginal with respect to
  congestion-control norms, but I would think that if it really is 100x
  or 256x the RTT, it would be socially acceptable.
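
  Something along these lines, where the connection and RPC helpers
  are hypothetical stand-ins for whatever foolscap actually provides:

      def call_with_failover(peer, rpc, rtt):
          conn = peer.current_connection()
          try:
              # 100x the measured RTT is far beyond plausible queueing delay
              return rpc(conn, timeout=100 * rtt)
          except TimeoutError:
              fresh = peer.open_new_connection()
              conn.close()  # drop the stalled connection immediately
              return rpc(fresh, timeout=100 * rtt)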

  I hate to cater to broken devices, but having a 270s keepalive (after
  fixing it to mean what it says) might be reasonable, since I suspect
  a 5m NAT state timeout is pretty common.  On the other hand, with
  adaptive keepalive setting based on lossage, not having any
  keepalives becomes a more reasonable going-in position.

  So on balance I favor either no keepalive or a several hour default
  keepalive, switching to 270s for any peer that has a stall, with the
  270s being sticky for a week.  Optionally a node can further put 270s
  on all peers, if at least half the peers have had stalls within a few
  hours.
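
  As a sketch of that policy (the thresholds are the ones above;
  last_stall() is assumed to come from the scoreboard, returning the
  time of the peer's most recent stall, or 0 if there has been none):

      import time

      DEFAULT = 6 * 3600       # a several-hour default keepalive
      LOWERED = 270            # just under a presumed 5m NAT timeout
      STICKY = 7 * 24 * 3600   # a stall keeps the low setting for a week

      def keepalive_for(peer, all_peers):
          now = time.time()
          if now - last_stall(peer) < STICKY:
              return LOWERED
          # optional: if at least half of all peers have stalled within
          # the last few hours, put 270s on everyone
          recent = [p for p in all_peers if now - last_stall(p) < 3 * 3600]
          if all_peers and len(recent) >= len(all_peers) / 2:
              return LOWERED
          return DEFAULT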

  It's interesting to consider not keeping connections open, but
  instead opening them on demand and closing them after, say, 1m of
  idle time.  There's a slight latency hit, but it should avoid a lot
  of issues and result in far fewer standing connections.  This sort
  of implies that only the forward direction is used for RPC
  origination, but I think that's a good thing, because backwards RPCs
  hide connectivity problems: while things work better on average,
  they are harder to debug.
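
  A sketch of such an on-demand pool (the connect argument is a
  hypothetical peer-to-connection function, not an existing API):

      import time

      IDLE_LIMIT = 60  # close connections idle for one minute

      class OnDemandPool:
          def __init__(self, connect):
              self.connect = connect  # function: peer -> connection
              self.conns = {}         # peer -> (connection, last_used)

          def get(self, peer):
              # reuse an open connection, else pay the latency hit now
              entry = self.conns.get(peer)
              conn = entry[0] if entry else self.connect(peer)
              self.conns[peer] = (conn, time.monotonic())
              return conn

          def reap_idle(self):
              # run periodically, e.g. every 10s
              now = time.monotonic()
              for peer, (conn, last) in list(self.conns.items()):
                  if now - last > IDLE_LIMIT:
                      conn.close()
                      del self.conns[peer]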

  How's IPv6 coming?


