[tahoe-dev] connection timeouts on VG2

Mon Mar 26 06:02:48 UTC 2012

With the help of the volunteergrid2 crew, I've been using a packet
sniffer to investigate reports of unusual multi-minute delays in Tahoe
operations.

One thing I found was a server connection that was dropping and
reconnecting about once every 16 minutes. I noticed this by looking at
the "Since" column on the welcome page's server list: it shows a
timestamp of a few seconds after node reboot for most servers, but that
one server showed a fairly recent timestamp, changing every once in a
while.

I believe this is due to a NAT/router entry timing out on the server's
side (because I didn't observe this with any of the other servers),
after which inbound packets on the now-stale TCP connection are silently
dropped. When this happens, the only way for the two endpoints to
discover the connection is no longer viable is to have TCP declare a
timeout, which only occurs when an unacknowledged TCP message has been
retransmitted too many times, which can take several minutes.

(As an aside, this behavior of NAT appliances is really annoying. It has
good intentions, namely protecting the appliance's limited memory and
table space against lazy/broken TCP stacks and laptops which are
abruptly disconnected without properly terminating their TCP
connections. But the overall effect is to create a world in which
short-lived HTTP connections are the only thing that actually works,
which is inefficient and limiting.)

Foolscap has a built-in keepalive PING/PONG-sending timer to help with
two things: the first goal is to convince the firewall/NAT/router box
that this TCP socket really is in use, and the second is to speed up
discovery of a silently-lost connection. Foolscap's default timer is 4
minutes, which means (due to the low-overhead way it's implemented) you
can expect it to send a PING after about 8 minutes of idle time. And
indeed, I observed a TCP message sent at T+8min, which was not ACKed, so
my client's TCP stack retransmitted it at exponentially-increasing
intervals until the interval topped out at 64 seconds. At 448s after the
first unACKed PING, TCP gave up on the connection, causing Foolscap to
reconnect (which completed in about two seconds).
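
To see why a 4-minute timer can mean roughly 8 minutes of idle time
before a PING, here is a minimal Python model of a periodic-check
keepalive. It is only a sketch under an assumption: the timer fires at
fixed intervals and sends a PING only if nothing has arrived for at
least one full interval. The function names are hypothetical and this is
not Foolscap's actual code.

```python
def first_ping_at(timer, last_data_at):
    """Return the time at which the first PING is sent.

    Assumed model (not Foolscap's real implementation): a check fires at
    timer, 2*timer, 3*timer, ... and sends a PING only when at least
    `timer` seconds have passed since data was last received.
    """
    fire = timer
    while fire - last_data_at < timer:
        fire += timer
    return fire


def idle_before_ping(timer, last_data_at):
    """Seconds of silence on the connection before the first PING."""
    return first_ping_at(timer, last_data_at) - last_data_at


if __name__ == "__main__":
    # Data arriving just after a check means almost two full intervals
    # of silence pass before the PING goes out.
    print(idle_before_ping(240, 1))
    # Data arriving exactly at a check boundary is the best case.
    print(idle_before_ping(240, 0))
```

Under this model the idle time before a PING falls between one and two
timer intervals, which is consistent with the PING observed at about
T+8min.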

So at some point during the first 8 minutes of my TCP connection to that
server, I think their router decided that this connection was idle and
should be dropped. If we guess that this happened at T+5min, and given
that Foolscap and TCP don't get wise to the problem until about
T+15min, then we've got a lame connection about 2/3rds of the time:
Tahoe thinks it has a connection, but any attempt to actually use
it will result in a stall of up to 10 minutes followed by a
disconnection.
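
The arithmetic behind those figures can be checked with a few lines of
Python. The inputs are the numbers quoted above (PING at about T+8min,
448s until TCP gives up, about 2s to reconnect); the T+5min drop time is
the guess from the text, not a measurement, and the function name is
made up for this sketch.

```python
def lame_connection_stats(ping_at=8 * 60, tcp_give_up=448,
                          reconnect=2, nat_drop_at=5 * 60):
    """Return (cycle, lame, fraction), with times in seconds.

    cycle:    time from one successful connect to the next
    lame:     time within a cycle when the NAT entry is already gone
              but neither endpoint has noticed yet
    fraction: lame / cycle
    """
    detected_at = ping_at + tcp_give_up   # TCP declares the timeout
    cycle = detected_at + reconnect       # Foolscap has reconnected
    lame = detected_at - nat_drop_at      # also the worst-case stall
    return cycle, lame, lame / cycle


if __name__ == "__main__":
    cycle, lame, fraction = lame_connection_stats()
    print("cycle: %.1f min" % (cycle / 60))
    print("lame window (worst-case stall): %.1f min" % (lame / 60))
    print("fraction of time lame: %.2f" % fraction)
```

With these inputs the cycle comes out at about 15.5 minutes, which lines
up with the roughly 16-minute reconnect interval seen on the welcome
page.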

I was able to fix this by setting timeout.keepalive=120 in the [node]
section of my tahoe.cfg. That sets the timer to 2 minutes, so a PING
happens within 4 minutes of
idle time. After doing that, I had several hours of uninterrupted
connection to the server. I'm guessing that the router is configured for
5-minute "idle" connection timeouts, so every 4 minutes is enough to
keep the entry alive. Note that either end can set this value and
achieve the same first goal of keeping the entry alive.
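
As a rough illustration, the Python sketch below parses a minimal
tahoe.cfg fragment with that setting and checks it against a router's
idle timeout. The fragment shows only the one key from this message (a
real tahoe.cfg has more in it), the helper names are hypothetical, and
the 300-second router timeout is the guess from above rather than a
known value.

```python
import configparser

# Minimal fragment for illustration; a real tahoe.cfg contains more.
SAMPLE_TAHOE_CFG = """\
[node]
timeout.keepalive = 120
"""


def keepalive_from_cfg(cfg_text):
    """Read timeout.keepalive (seconds) from the [node] section."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser.getint("node", "timeout.keepalive")


def keeps_nat_entry_alive(keepalive, nat_idle_timeout):
    """True if a PING is due before the NAT entry can expire.

    Uses the rule of thumb from the text: a PING goes out within about
    twice the keepalive timer, so twice the timer must stay below the
    NAT box's idle-connection timeout.
    """
    return 2 * keepalive < nat_idle_timeout


if __name__ == "__main__":
    keepalive = keepalive_from_cfg(SAMPLE_TAHOE_CFG)
    # 300s is a guessed router timeout, not a measured one.
    print(keepalive, keeps_nat_entry_alive(keepalive, 300))
```

The same check with the default 240-second timer fails against a
5-minute router timeout, which matches the drops described earlier.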

I haven't yet investigated how this affects the various operations. I
know the (new) immutable downloader should be mostly immune to this: a
server that responds slowly to the DYHB query will be ignored in favor
of faster servers (although a silent-close between the DYHB phase and
the main download will still cause stalls until we implement the
"impatience" timer). I think immutable uploads will stall until the
connection is closed, then will continue properly. I'm guessing that
mutable retrieve/publish will stall as well. These stalls are all things
we want to fix, of course: in some cases it's fairly straightforward,
in others it becomes a deep question about the CAP theorem.

So in summary:

CONDITION NAME: NAT-Induced Silent Close Stalls (NISCS)

SYMPTOMS: Tahoe operations hang for 1-16 minutes before
          completing, sometimes with an error condition,
          sometimes with success.

ROOT CAUSE: NAT boxes with short idle timeouts, usually on the client
            side but can also occur in routers/firewall boxes on the
            server side. Complicated by incomplete Tahoe code for
            detecting slow servers and switching to alternates.

FIXES: 1: fix NAT/firewall configuration to <begin rant>STOP BLOODY
          SECOND GUESSING TCP<end rant> ahem increase the "idle
          connection" timeout to at least 10 minutes
       2: set [node]timeout.keepalive=120 (or less than half the
          NAT/firewall box's idle-connection timeout) on all clients
       3: set [node]timeout.keepalive=120 (or less than half the
          NAT/firewall's idle-connection timeout) on the client or
          server behind the impatient NAT/firewall box

IMPROVED DIAGNOSTICS:

  I'm thinking of adding a little SQLite database to record per-server
  connection/disconnection events. Then the Welcome page could show a
  warning for any server that has been offline at all in the last hour,
  and a details page could show the complete history. Continuous uptime
  is just as important a server virtue as capacity and bandwidth, so
  it'd be nice to surface it somewhere.
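
  One possible shape for such a database, sketched in Python with the
  standard sqlite3 module. Everything here is hypothetical: the table
  name, the columns, and the helper functions are invented for
  illustration and do not exist in Tahoe. As a simplification, the
  warning query only looks for disconnect events inside the window, so
  a server that dropped earlier and never came back is not flagged.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the hypothetical connection-event database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS connection_events ("
        " server_id TEXT NOT NULL,"
        " event TEXT NOT NULL"
        "  CHECK (event IN ('connected', 'disconnected')),"
        " timestamp REAL NOT NULL)"
    )
    return db


def record_event(db, server_id, event, when):
    """Append one connect/disconnect event for a server."""
    db.execute(
        "INSERT INTO connection_events VALUES (?, ?, ?)",
        (server_id, event, when),
    )
    db.commit()


def servers_to_warn_about(db, now, window=3600):
    """Servers with at least one disconnect in the last `window` seconds."""
    rows = db.execute(
        "SELECT DISTINCT server_id FROM connection_events"
        " WHERE event = 'disconnected' AND timestamp >= ?"
        " ORDER BY server_id",
        (now - window,),
    )
    return [row[0] for row in rows]


def history(db, server_id):
    """Complete event history for one server, oldest first."""
    rows = db.execute(
        "SELECT event, timestamp FROM connection_events"
        " WHERE server_id = ? ORDER BY timestamp",
        (server_id,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    db = open_db()
    record_event(db, "server-a", "connected", 0)
    record_event(db, "server-b", "connected", 0)
    record_event(db, "server-b", "disconnected", 930)
    record_event(db, "server-b", "connected", 932)
    print(servers_to_warn_about(db, now=1000))
    print(history(db, "server-b"))
```

  The warning list would feed the Welcome page and history() the
  per-server details page.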

BTW: This doesn't explain all the problems that have been seen, so I'm
still investigating.

cheers,
 -Brian