[tahoe-lafs-trac-stream] [Tahoe-LAFS] #2787: intermittent "Address Already In Use" error during tests
Tahoe-LAFS
trac at tahoe-lafs.org
Wed May 23 13:39:52 UTC 2018
#2787: intermittent "Address Already In Use" error during tests
------------------------------+--------------------
Reporter: warner              | Owner: warner
Type: defect                  | Status: closed
Priority: normal              | Milestone: soon
Component: code-network       | Version: 1.11.0
Resolution: wontfix           | Keywords:
Launchpad Bug:                |
------------------------------+--------------------
Changes (by exarkun):
* status: new => closed
* resolution: => wontfix
Comment:
It's not possible to fix this inside `allocate_tcp_port` itself. So I'm
planning to close this ticket. Instead, we'll have a ticket for each test
which can fail this way and they'll have to be fixed one by one.
The reason we cannot fix this inside `allocate_tcp_port` is that the
approach it is a component of suffers from an unavoidable race
condition. `allocate_tcp_port` tries to figure out a specific TCP port
number which _will not be in use at a later point in time_. Since there
is no part of the system which allows the port number to be reserved or
otherwise kept out of use *except by the one piece of code we intend*, it
cannot actually know whether any port number it selects will satisfy this
requirement.
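The race can be reproduced deliberately. The following is a minimal Python 3 sketch, not the actual `allocate_tcp_port` implementation: `pick_free_port` and `demonstrate_race` are hypothetical names, and the "interloper" socket stands in for whatever unrelated code happens to bind the port first.

```python
import errno
import socket


def pick_free_port():
    """Hypothetical stand-in for a port-number allocator: ask the kernel
    for an unused port, then release it and return only the number."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", 0))   # port 0: the kernel picks a free port
        return s.getsockname()[1]
    finally:
        s.close()                  # from here on, nothing reserves the port


def demonstrate_race():
    """Return True if the intended user of the port loses the race."""
    port = pick_free_port()

    # Some unrelated code grabs the same port in the meantime.
    interloper = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    interloper.bind(("127.0.0.1", port))
    interloper.listen(1)

    # The code the port was "allocated" for now tries to bind it.
    intended = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        intended.bind(("127.0.0.1", port))
    except OSError as e:
        return e.errno == errno.EADDRINUSE
    else:
        return False
    finally:
        intended.close()
        interloper.close()
```

Nothing in `pick_free_port` can prevent the interloper from binding, which is why the allocator alone cannot close the window.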
In practice, it does succeed with high probability. However, due to the
large number of cases in which it is used (many times per test suite run
and the test suite itself is run many times), even this high probability
of success is not good enough. I will make an incredibly naive estimate
that there are 2^15^ ports available for "random" assignment and that the
chance of an unrelated intermediate assignment being made is about 1 in 2
(I suspect some tests themselves trigger an unrelated intermediate port
assignment). The chance of a collision is therefore 1 in 2^16^ (around a
thousandth of a percent). If there are 100 users of `allocate_tcp_port`
in the test suite then the chance of a collision anywhere in the test
suite is 100 in 2^16^. There are about 15 different CI runners of the
test suite. So the chance of a failure on any of them for one build set
is 15 * 100 in 2^16^. The test suite is run for every pull request and
every master revision. If there is one PR merged a day, that is 14 build
sets a week (7 PR builds plus 7 master builds), so the chance of a
failure in a week is at least 14 * 15 * 100 in 2^16^ which reduces to
around 32%, easily high enough to be disruptive to development.
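The arithmetic can be checked with a few lines of Python. All inputs are the rough estimates from the comment, not measurements, and the variable names are only illustrative. The second figure treats each use of the allocator as an independent trial instead of simply summing the per-trial chances.

```python
# Figures are the rough estimates from the comment above, not measurements.
p_single = 1 / 2**16      # chance that one allocation collides
users = 100               # users of allocate_tcp_port in the test suite
runners = 15              # CI runners per build set
build_sets = 14           # 7 PR builds + 7 master builds per week

trials = users * runners * build_sets

# Simple sum of the per-trial chances (what the comment computes).
summed = trials * p_single

# Treating the trials as independent events instead.
independent = 1 - (1 - p_single) ** trials

print("summed:      {:.1%}".format(summed))
print("independent: {:.1%}".format(independent))
```

The summed figure is the 32% quoted above; the independent-trials figure comes out somewhat lower, at roughly 27%. Either way the conclusion is the same.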
There are several possible general fixes for this issue.
1. Add retry logic. If a test randomly allocates a port and then
discovers it cannot bind that port, just try the whole process over again.
A small number of retries should be able to drive the failure rate down
dramatically (the chance of success of each try should be independent; if
the chance of failure of 1 try is a thousandth of a percent, the chance of
failure of 3 tries is the cube of that - under a billionth of a percent).
This solution is conceptually simple but the implementation might not be
so. Detecting the failure (asynchronously, often across process
boundaries) and backing up to a point where a retry may be made will
probably take a lot of effort.
2. Switch to pre-allocated sockets. Note that `allocate_tcp_port` is
really trying to allocate a TCP port number. If it allocated a bound TCP
socket (perhaps marked as listening) and this socket were handed to
application code, there is no possibility for a collision in the
application code because there is no longer any need to bind there. There
is still the possibility for a collision inside the allocation function
but it is much reduced compared to the current situation and it is much
more amenable to the addition of retry logic. The most likely downside to
this approach is lack of support for the underlying operation on Windows.
3. Switch to UNIX sockets. It's much easier to avoid collisions with UNIX
sockets. When using TCP we are working with only 2^15^ possible values,
they are assigned roughly randomly, and we compete with all other users of
the system for them. When using UNIX, we have at least 255^108^ possible
values, we can allocate them with structure that inherently avoids self-
collision, and we need not compete with anyone else on the system.
However, UNIX sockets are not necessarily compatible with all of the
components which need to accept connections (for example, their "socket
name" necessarily differs from TCP/IPv4; and being inherently private,
there is less support in tools like HTTP clients for accessing them).
4. Reverse the allocation relationship. Let the application code randomly
allocate a port number. Arrange for the test code to somehow learn of the
allocated value. As with option (2), this dramatically reduces the
possibility for a collision and makes it significantly easier to add retry
logic at the point where that collision may occur. In contrast to (2), it
may require implementation of this allocation and retry logic at multiple
code sites. There is also the matter of conveying the allocated port
number back to the test code which probably also requires several
different implementations.
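For option (1), the easy case looks something like the Python 3 sketch below, where allocation and bind happen in the same function and the same process. `listen_with_retry` is a hypothetical helper, and `allocate_port` is any callable returning a port number (standing in for `allocate_tcp_port`). The hard part described above, detecting the failure asynchronously or across process boundaries, is not addressed here.

```python
import errno
import socket


def listen_with_retry(allocate_port, attempts=3):
    """Allocate a port number and bind it, starting over on a collision.

    allocate_port is any callable returning a port number; here it stands
    in for allocate_tcp_port.
    """
    for attempt in range(attempts):
        port = allocate_port()
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("127.0.0.1", port))
            s.listen(5)
            return s
        except OSError as e:
            s.close()
            # Only a collision is worth retrying; give up on the last attempt.
            if e.errno != errno.EADDRINUSE or attempt == attempts - 1:
                raise
    raise ValueError("attempts must be at least 1")
```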
Considering all of these, (2) is my preference. However, there is the
matter of Windows support to contend with in that case.
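A minimal sketch of option (2) using only the Python 3 standard library follows. `allocate_listening_socket` and `serve_one` are hypothetical names; `serve_one` is a toy stand-in for application code. In Twisted the hand-off would presumably go through `IReactorSocket.adoptStreamPort`, but that is an assumption about how it would be wired up, not something shown here.

```python
import socket
import threading


def allocate_listening_socket(interface="127.0.0.1"):
    """Hypothetical replacement for a port-number allocator: return a socket
    that is already bound and listening, so the port stays reserved."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((interface, 0))     # port 0: the kernel picks a free port
    s.listen(5)
    return s


def serve_one(listener):
    """Toy application code: accepts on the socket it is given, never binds."""
    conn, _ = listener.accept()
    with conn:
        conn.sendall(b"hello")


listener = allocate_listening_socket()
port = listener.getsockname()[1]       # the test learns the port up front
t = threading.Thread(target=serve_one, args=(listener,))
t.start()

# The test connects to the port; the server closes after sending,
# so reading to end-of-file returns the whole reply.
with socket.create_connection(("127.0.0.1", port)) as client:
    with client.makefile("rb") as f:
        reply = f.read()

t.join()
listener.close()
```

Because the socket is never closed between allocation and use, there is no window in which another process can take the port. The only bind happens inside the allocation function.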
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/2787#comment:1>
Tahoe-LAFS <https://Tahoe-LAFS.org>
secure decentralized storage