[tahoe-dev] connection timeouts on VG2

Brian Warner warner at lothar.com
Mon Mar 26 21:05:28 UTC 2012


On 3/26/12 4:54 AM, Greg Troxel wrote:

>   If the keepalive is set to 120s, then it seems simply broken (and
>   likely easy to fix) to have the keepalive happen at 240s. 120s as a
>   keepalive timer should mean that a keepalive is sent 120s after the
>   last recorded transmission.

Yeah, I have to admit it's kind of weird. It's because I designed that
code to minimize all forms of overhead:

 * don't send any extra network traffic unless it's really necessary
 * don't replace/reset the timer very frequently

The code starts a periodic timer that fires every self.keepaliveTimeout
(120s). Every time data is received on the connection, a timestamp is
updated: self.dataLastReceivedAt = time.time(). When the timer fires,
it checks how old the timestamp is, and sends an extra PING if
time.time() - self.dataLastReceivedAt > self.keepaliveTimeout. The
timer is restarted in any case.

The result is that the idle time before a PING is sent will be somewhere
between 1*keepaliveTimeout and 2*keepaliveTimeout.
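
Roughly, in Twisted terms, the current scheme looks something like the
following sketch (simplified and illustrative, not the actual Foolscap
code; sendPING() stands in for whatever the real protocol sends):

  import time
  from twisted.internet import task

  class KeepaliveTracker:
      keepaliveTimeout = 120  # seconds

      def startKeepalive(self):
          self.dataLastReceivedAt = time.time()
          # periodic timer: fires every keepaliveTimeout seconds, never reset
          self._timer = task.LoopingCall(self._checkKeepalive)
          self._timer.start(self.keepaliveTimeout, now=False)

      def noteDataReceived(self):
          # called for every chunk of inbound data: one time.time() per chunk
          self.dataLastReceivedAt = time.time()

      def _checkKeepalive(self):
          # if the connection has been idle for more than a full period,
          # nudge it; LoopingCall restarts the period for us either way
          if time.time() - self.dataLastReceivedAt > self.keepaliveTimeout:
              self.sendPING()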

It may be time to revise that code. I can think of four approaches:

 1: the existing design

 2: existing design but set a boolean flag for data-received instead of
 updating a timestamp. Same behavior, but fewer system calls. Loses the
 scalar information about how "fresh" a connection is, which is
 interesting from an ops point-of-view. Also requires the companion
 "disconnectTimeout" to be more tightly integrated.

 3: reset the timer to .keepaliveTimeout each time data is received, and
 send a PING if it ever expires. Just as many system calls (since Twisted's
 DelayedCall.reset() must call time.time() to get the absolute
 expiration time), and more timer shuffling, but more deterministic
 behavior.

 4: a hybrid in which a short periodic timer (maybe 10 seconds) fires,
 checks the boolean flag, then resets the longer (120s) timer if any
 data was received during the window. That would still minimize overhead
 during busy periods, but reduce the variability down to something like
 (1*keepaliveTimeout .. 1*keepaliveTimeout+10)
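
A rough sketch of what option 4 could look like on top of Twisted
(illustrative only: sendPING() and the attribute names are placeholders,
not existing Foolscap code):

  from twisted.internet import reactor, task

  class HybridKeepalive:
      keepaliveTimeout = 120  # seconds of idle time before a PING
      checkInterval = 10      # short window for the cheap flag check

      def startKeepalive(self):
          self._dataSeen = False
          self._pingTimer = reactor.callLater(self.keepaliveTimeout,
                                              self._pingTimerExpired)
          self._checker = task.LoopingCall(self._checkWindow)
          self._checker.start(self.checkInterval, now=False)

      def noteDataReceived(self):
          # per-packet cost is just a flag assignment, no time.time() call
          self._dataSeen = True

      def _checkWindow(self):
          # at most one DelayedCall.reset() per 10s window, however busy
          # the connection is
          if self._dataSeen:
              self._dataSeen = False
              self._pingTimer.reset(self.keepaliveTimeout)

      def _pingTimerExpired(self):
          self.sendPING()
          self._pingTimer = reactor.callLater(self.keepaliveTimeout,
                                              self._pingTimerExpired)

The per-packet cost is a single boolean write, and the reset() (with its
time.time() call) happens at most once per 10-second window.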

I went with option 1 back in 2007 because I thought timers were
expensive (in particular Twisted has to re-sort the timer list each time
you change it). But I've never tested it to see how much overhead it
really adds. Tahoe doesn't use too many timers, so re-sorting the list
for each arrival might not be such a big deal.

>   It would be nice if tahoe nodes scoreboard these timeouts (sqlite or
>   whatever), and for a node that is noticing timeouts to many peers to
>   ratchet down its timeout automatically.  A node that notices timeouts
>   to one peer should perhaps lower the timeout for that one peer.  (I
>   concur with your notion of scoreboarding for reporting, as well, but
>   am making a further suggestion.)

Yeah, I like that a lot. Maybe that should be in Foolscap itself, as
part of the Reconnector logic: when the connection is frequently
dropped, but consistently gets reestablished quickly, that's a sign that
a faster heartbeat might keep things alive. Something like halving the
keepalive time whenever the previous connection lasted less than half
the previous keepalive duration, with some reasonable floor.
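
In pseudocode, the halving rule might look like this (purely
illustrative; none of these names exist in Foolscap, and the floor value
is a guess):

  def adjusted_keepalive(previous_keepalive, previous_connection_duration,
                         floor=30.0):
      # halve the keepalive when the last connection died quickly,
      # but never go below some reasonable floor
      if previous_connection_duration < previous_keepalive / 2:
          return max(previous_keepalive / 2.0, floor)
      return previous_keepalive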

>   You said that the only way we know connections have failed is to
>   wait for TCP to give up. I don't think that's true; not hearing back
>   in 10s is a pretty big clue.

Yeah, unless the response is stuck behind a couple MB of data coming
back from the other 15 servers we're currently talking to, or the
video-chat session with which your roommate is hogging the downlink :).

>   It occurs to me that opening a new connection, retrying the RPC on
>   it, and closing the old one immediately is not so bad; since at
>   100x RTT surely the old one is messed up somehow.

Right. The thing that I've never been able to work out is how to set
that threshold. It showed up in the new immutable downloader: I've got
an event type ("OVERDUE") and code that reacts to it by preferring
alternate servers, but no code to emit that event, since I don't know
what a good threshold ought to be. I know it needs to be adaptive, not
hard-wired.

The issue is muddied by all the layers of the TCP stack. If the problem
is down at the NAT level (router, wireless, etc), then 10x or 100x the
TCP RTT time would probably be safe: if we see no TCP ACK in that time,
conclude that the connection is lost. But how can we measure that? (both
the RTT time, and the elapsed time-since-ACK).

The closest we can get (with a single connection) to measuring the RTT
is the response time to a short Foolscap RPC message, but that's going
to be delayed by head-of-line blocking (in both directions) and other
traffic. Maybe use the shortest RPC-level RTT recorded over the last
hour, and hope that it includes at least one uncontested call?
It'd be nice if sockets had a "report your low-level RTT time" method.
As well as a "report how long you've had unACKed data" method: we can't
measure that from userspace either. The closest we can get is
.dataLastReceivedAt (which is why I went for the syscall-heavy
approach-1 above, so this information would be available to
applications). This isn't bad, but is still quantized by the TCP rx
buffers.
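
A minimal sketch of that "shortest RPC-level RTT in the last hour" idea
(illustrative names, not existing Tahoe or Foolscap code):

  import time
  from collections import deque

  class MinRTTTracker:
      def __init__(self, window=3600.0):
          self.window = window    # remember samples for the last hour
          self.samples = deque()  # (timestamp, rtt) pairs, oldest first

      def add_sample(self, rtt):
          now = time.time()
          self.samples.append((now, rtt))
          self._expire(now)

      def min_rtt(self):
          # smallest RTT seen in the window, hopefully including at
          # least one uncontested call
          self._expire(time.time())
          if not self.samples:
              return None
          return min(rtt for (_, rtt) in self.samples)

      def _expire(self, now):
          while self.samples and self.samples[0][0] < now - self.window:
              self.samples.popleft()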

Overall, the issue is one of resource optimization tradeoffs. To
minimize network traffic, never ever send a duplicate packet: just wait
patiently until TCP gives up. Or, to react quickly to lost connections
and speed downloads, send lots of duplicate packets, even if the extra
traffic slows down other users or you accidentally give up on a
live-but-slow connection.

>   So on balance I favor either no keepalive or a several hour default
>   keepalive, switching to 270s for any peer that has a stall, with the
>   270s being sticky for a week.  Optionally a node can further put 270s
>   on all peers, if at least half the peers have had stalls within a few
>   hours.

Nice!
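
That policy could be a fairly small piece of code (a sketch of the
suggestion above with made-up names; a real version would persist the
scoreboard in sqlite or similar):

  import time

  class StallScoreboard:
      STALL_KEEPALIVE = 270       # seconds, applied after a stall
      STICKY_FOR = 7 * 24 * 3600  # a stalled peer keeps 270s for a week
      RECENT_WINDOW = 3 * 3600    # "within a few hours"

      def __init__(self):
          self.last_stall = {}    # peerid -> timestamp of most recent stall

      def note_stall(self, peerid):
          self.last_stall[peerid] = time.time()

      def keepalive_for(self, peerid, total_peers):
          now = time.time()
          if now - self.last_stall.get(peerid, 0) < self.STICKY_FOR:
              return self.STALL_KEEPALIVE
          # escalation: if at least half of all peers stalled within a
          # few hours, apply 270s everywhere
          recent = [t for t in self.last_stall.values()
                    if now - t < self.RECENT_WINDOW]
          if total_peers and len(recent) * 2 >= total_peers:
              return self.STALL_KEEPALIVE
          return None  # None == no keepalive (or a multi-hour default)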

>   It's interesting to consider not keeping connections open, but
>   instead opening them on demand and closing after say 1m of idle
>   time. There's a slight latency hit, but it should avoid a lot of
>   issues and result in a lot fewer standing connections.

Yeah, I suspect that when we get to tahoe-over-HTTP, it'll end up
working that way. Sort of lowest-common-denominator, but that's
practicality-over-purity for you. For data-transfer purposes, you really
want to leave that connection nailed up (otherwise you lose a lot of
bandwidth waiting for TCP to ratchet up the window size), but of course
that gets you right back in the same boat.

>   This sort of implies that only the forward direction is used for RPC
>   origination, but I think that's a good thing, because backwards RPCs
>   hide connectivity problems - while things work better on average
>   they are harder to debug.

I've benefitted from the backwards-RPC thing: one server living behind
NAT was able to establish outbound connections to a public-IP client
(which pretended to be a server with no space available), allowing that
client to use the server normally. But I agree that it's surprising and
confusing, so I'd be willing to give it up. Especially if we could make
some progress on Relay or STUN or UPnP or something.

>   How's IPv6 coming?

Still waiting on Twisted, but I heard them making great progress at the
recent PyCon sprints, so I think things are looking good. I don't think
Foolscap has too much to do once Twisted handles it (code to call
connectTCP6 or however they spell it, maybe some changes to
address-autodetection), ditto for Tahoe (some web renderers probably
need updating).

cheers,
 -Brian

