[tahoe-dev] connection timeouts on VG2
Greg Troxel
gdt at ir.bbn.com
Mon Mar 26 11:54:13 UTC 2012
After actually reading all of your note, a few more comments:
If the keepalive is set to 120s, then it seems simply broken (and
likely easy to fix) to have the keepalive happen at 240s. 120s as a
keepalive timer should mean that a keepalive is sent 120s after the
last recorded transmission.
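To be concrete about the semantics I mean, in made-up Python (not
foolscap's actual code; all the names are invented):

    import time

    KEEPALIVE = 120  # seconds

    class Connection:
        def __init__(self):
            self.last_sent = time.time()

        def record_transmission(self):
            # call whenever any bytes go out on this connection
            self.last_sent = time.time()

        def check_keepalive(self):
            # call periodically, say once a second
            if time.time() - self.last_sent >= KEEPALIVE:
                self.send_ping()            # counts as a transmission
                self.record_transmission()

        def send_ping(self):
            pass  # stand-in for an actual keepalive message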
It would be nice if tahoe nodes scoreboarded these timeouts (in sqlite
or whatever), and if a node that notices timeouts to many peers
ratcheted down its timeout automatically. A node that notices timeouts
to one peer should perhaps lower the timeout for just that one peer. (I
concur with your notion of scoreboarding for reporting, as well, but
am making a further suggestion.)
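Roughly what I have in mind, as an untested sketch (the sqlite table
and function names are invented for illustration, not anything that
exists in tahoe today):

    import sqlite3, time

    db = sqlite3.connect("stalls.db")
    db.execute("CREATE TABLE IF NOT EXISTS stalls"
               " (peerid TEXT, stamp REAL)")

    def record_stall(peerid):
        # call when an RPC to this peer times out / the link wedges
        db.execute("INSERT INTO stalls VALUES (?, ?)",
                   (peerid, time.time()))
        db.commit()

    def recent_stalls(peerid, window=7 * 24 * 3600):
        row = db.execute("SELECT COUNT(*) FROM stalls"
                         " WHERE peerid = ? AND stamp > ?",
                         (peerid, time.time() - window)).fetchone()
        return row[0]

A node would then consult recent_stalls() when choosing a keepalive or
timeout for a peer.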
You said that the only way we know connections have failed is to wait
for TCP to give up. I don't think that's true; not hearing back in
10s is a pretty big clue. It occurs to me that opening a new
connection, retrying the RPC on it, and immediately closing the old
one is not so bad; at 100x the RTT, surely the old one is messed up
somehow. This feels a little marginal on meeting the
congestion-control norms, but I would think that if it really is 100
or 256x the RTT, it would be socially acceptable.
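In blocking pseudocode the retry-on-a-fresh-connection idea looks
something like this (connect() and conn.call() are hypothetical
helpers, not real foolscap APIs; treat it as a sketch only):

    RETRY_AFTER = 10.0  # seconds; roughly 100x a plausible RTT

    def robust_call(connect, method, *args):
        conn = connect()
        try:
            return conn.call(method, *args, timeout=RETRY_AFTER)
        except TimeoutError:
            # the old connection is almost certainly wedged; drop it
            # and try once more on a brand-new connection
            conn.close()
            conn = connect()
            return conn.call(method, *args)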
I hate to cater to broken devices, but having a 270s keepalive (after
fixing it to mean what it says) might be reasonable, since I suspect a
5m idle timeout is pretty common in such devices. On the other hand,
with adaptive keepalive
setting based on lossage, not having any keepalives becomes a more
reasonable going-in position.
So on balance I favor either no keepalive or a several hour default
keepalive, switching to 270s for any peer that has a stall, with the
270s being sticky for a week. Optionally a node can further put 270s
on all peers, if at least half the peers have had stalls within a few
hours.
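Expressed as a policy function (the 270s value, the one-week
stickiness, and the half-the-peers threshold are the numbers above;
everything else, including the names, is made up for illustration):

    import time

    WEEK = 7 * 24 * 3600

    def keepalive_for(peer, last_stall, all_peers, default=None):
        # last_stall maps peerid -> time of that peer's latest stall;
        # default=None means "no keepalive" (or use several hours)
        now = time.time()
        if now - last_stall.get(peer, 0) < WEEK:
            return 270                   # sticky for a week
        recent = sum(1 for t in last_stall.values()
                     if now - t < 3 * 3600)
        if all_peers and recent >= len(all_peers) / 2.0:
            return 270                   # optional global escalation
        return default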
It's interesting to consider not keeping connections open, but instead
opening them on demand and closing them after, say, 1m of idle time.
There's a slight latency hit, but it should avoid a lot of issues and
result in a lot fewer standing connections. This sort of implies
that only the forward direction is used for RPC origination, but I
think that's a good thing, because backwards RPCs hide connectivity
problems: things may work better on average, but failures are harder
to debug.
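The idle-close bookkeeping is small; something along these lines,
where connect/call/close are placeholders rather than real APIs:

    import time

    IDLE_LIMIT = 60  # seconds

    class OnDemandConnection:
        def __init__(self, connect):
            self._connect = connect   # callable that dials the peer
            self._conn = None
            self._last_used = 0

        def call(self, method, *args):
            if self._conn is None:
                self._conn = self._connect()  # open only when needed
            self._last_used = time.time()
            return self._conn.call(method, *args)

        def reap_if_idle(self):
            # run periodically; close the connection after ~1m idle
            if (self._conn and
                    time.time() - self._last_used > IDLE_LIMIT):
                self._conn.close()
                self._conn = None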
How's IPv6 coming?