[tahoe-lafs-trac-stream] [Tahoe-LAFS] #2787: intermittent "Address Already In Use" error during tests

Wed May 23 13:39:52 UTC 2018

#2787: intermittent "Address Already In Use" error during tests
------------------------------+--------------------
     Reporter:  warner        |      Owner:  warner
         Type:  defect        |     Status:  closed
     Priority:  normal        |  Milestone:  soon
    Component:  code-network  |    Version:  1.11.0
   Resolution:  wontfix       |   Keywords:
Launchpad Bug:                |
------------------------------+--------------------
Changes (by exarkun):

 * status:  new => closed
 * resolution:   => wontfix

Comment:

 It's not possible to fix this inside `allocate_tcp_port` itself.  So I'm
 planning to close this ticket.  Instead, we'll have a ticket for each test
 which can fail this way and they'll have to be fixed one by one.

 The reason we cannot fix this inside `allocate_tcp_port` is that the
 approach it is a component of suffers from an unavoidable race
 condition.  `allocate_tcp_port` tries to figure out a specific TCP port
 number which _will not be in use at a later point in time_.  Since no
 part of the system allows the port number to be reserved or otherwise
 kept out of use *except by the one piece of code we intend*, it cannot
 actually know whether any port number it selects will satisfy this
 requirement.
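
 The race can be reproduced in a few lines.  This is only a sketch:
 `pick_port_number` is a hypothetical stand-in written for illustration,
 not the actual `allocate_tcp_port` code, and the "intruder" socket plays
 the role of whatever unrelated program happens to grab the port first.

```python
import socket


def pick_port_number():
    """Hypothetical allocator: returns a port number that is free right now."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", 0))   # port 0: the kernel picks a free port
        return s.getsockname()[1]
    finally:
        s.close()                  # the port stops being held here


port = pick_port_number()

# Race window: nothing reserves `port` between the close() above and the
# bind() that the code under test performs later.  Simulate another
# program taking the port in that window.
intruder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
intruder.bind(("127.0.0.1", port))
intruder.listen(1)

app = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    app.bind(("127.0.0.1", port))
    collided = False
except OSError:
    collided = True                # "Address already in use"
finally:
    app.close()
    intruder.close()
```

 The allocator did nothing wrong at the moment it ran; the failure comes
 entirely from the gap between choosing the number and using it.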

 In practice, it does succeed with high probability.  However, due to the
 large number of cases in which it is used (many times per test suite run
 and the test suite itself is run many times), even this high probability
 of success is not good enough.  I will make an incredibly naive estimate
 that there are 2^15^ ports available for "random" assignment and that the
 chance of an unrelated intermediate assignment being made is about 1 in 2
 (I suspect some tests themselves trigger an unrelated intermediate port
 assignment).  The chance of a collision is therefore 1 in 2^16^ (around a
 thousandth of a percent).  If there are 100 users of `allocate_tcp_port`
 in the test suite then the chance of a collision anywhere in the test
 suite is 100 in 2^16^.  There are about 15 different CI runners of the
 test suite.  So the chance of a failure on any of them for one build set
 is 15 * 100 in 2^16^.  The test suite is run for every pull request and
 every master revision.  If there is one PR merged a day, that is about 14
 build sets a week (7 PR runs plus 7 master runs), so the chance of a
 failure in a week is at least 14 * 15 * 100 in 2^16^, which works out to
 around 32%.  Quite easily high enough to be disruptive to development.
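
 The arithmetic can be checked directly.  The sketch below uses only the
 figures given above, taking the 14 to be 7 pull-request runs plus 7
 master runs in a week.  The variable names are mine.  The 32% figure
 simply adds the per-use probabilities; treating every use as an
 independent trial gives a somewhat lower number that supports the same
 conclusion.

```python
ports = 2 ** 15            # ports assumed available for "random" assignment
p_intermediate = 1 / 2     # chance of an unrelated intermediate assignment
p_single = p_intermediate / ports      # 1 in 2**16 per use

uses_per_run = 100         # users of allocate_tcp_port in the test suite
ci_runners = 15
build_sets_per_week = 14   # assumed: 7 PR runs + 7 master runs

trials = build_sets_per_week * ci_runners * uses_per_run   # 21000

# Estimate used in the comment: sum the per-use probabilities.
linear_estimate = trials * p_single

# Alternative, assuming each use is an independent trial.
independent_estimate = 1 - (1 - p_single) ** trials

print(f"summed:      {linear_estimate:.1%}")
print(f"independent: {independent_estimate:.1%}")
```

 Either way, roughly one week in three or four would see at least one
 spurious failure somewhere.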

 There are several possible general fixes for this issue.

 1. Add retry logic.  If a test randomly allocates a port and then
 discovers it cannot bind that port, just try the whole process over again.
 A small number of retries should be able to drive the failure rate down
 dramatically (the chance of success of each try should be independent; if
 the chance of failure of 1 try is a thousandth of a percent, the chance of
 failure of 3 tries is the cube of that - under a billionth of a percent).
 This solution is conceptually simple but the implementation might not be
 so.  Detecting the failure (asynchronously, often across process
 boundaries) and backing up to a point where a retry may be made will
 probably take a lot of effort.

 2. Switch to pre-allocated sockets.  Note that `allocate_tcp_port` is
 really trying to allocate a TCP port number.  If it allocated a bound TCP
 socket (perhaps marked as listening) and this socket were handed to
 application code, there is no possibility for a collision in the
 application code because there is no longer any need to bind there.  There
 is still the possibility for a collision inside the allocation function
 but it is much reduced compared to the current situation and it is much
 more amenable to the addition of retry logic.  The most likely downside to
 this approach is lack of support for the underlying operation on Windows.

 3. Switch to UNIX sockets.  It's much easier to avoid collisions with UNIX
 sockets.  When using TCP we are working with only 2^15^ possible values,
 they are assigned roughly randomly, and we compete with all other users of
 the system for them.  With UNIX sockets, we have at least 255^108^
 possible values, we can allocate them with structure that inherently
 avoids self-collision, and we need not compete with anyone else on the
 system.
 However, UNIX sockets are not necessarily compatible with all of the
 components which need to accept connections (for example, their "socket
 name" necessarily differs from TCP/IPv4; and being inherently private,
 there is less support in tools like HTTP clients for accessing them).

 4. Reverse the allocation relationship.  Let the application code randomly
 allocate a port number.  Arrange for the test code to somehow learn of the
 allocated value.  As with option (2), this dramatically reduces the
 possibility for a collision and makes it significantly easier to add retry
 logic at the point where that collision may occur.  In contrast to (2), it
 may require implementation of this allocation and retry logic at multiple
 code sites.  There is also the matter of conveying the allocated port
 number back to the test code which probably also requires several
 different implementations.
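
 The retry logic mentioned in options (1), (2) and (4) is simple when the
 bind happens in the same process that can react to its failure.  A
 minimal sketch of that easy case follows; `bind_with_retry` and its
 `pick_port` argument are hypothetical names, and the hard part of
 option (1), detecting the failure asynchronously or in another process,
 is not addressed here.

```python
import errno
import socket


def bind_with_retry(pick_port, attempts=3):
    """Bind a listening socket, picking a fresh port after each collision."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        port = pick_port()
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("127.0.0.1", port))
        except OSError as e:
            sock.close()
            if e.errno != errno.EADDRINUSE:
                raise              # only collisions are worth retrying
            last_error = e
            continue
        sock.listen(5)
        return sock
    raise last_error


# Demonstration: the first candidate is deliberately occupied; the second
# is 0, which asks the kernel for any free port.
blocker = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
blocker.bind(("127.0.0.1", 0))
blocker.listen(1)
busy_port = blocker.getsockname()[1]

candidates = iter([busy_port, 0])
listener = bind_with_retry(lambda: next(candidates))
chosen_port = listener.getsockname()[1]

listener.close()
blocker.close()
```

 The loop retries only on "address already in use" and re-raises anything
 else, so unrelated failures are not hidden.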

 Considering all of these, (2) is my preference.  However, there is the
 matter of Windows support to contend with in that case.
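
 For illustration, option (2) might look roughly like the following with
 plain standard-library sockets.  `allocate_listening_socket` is a
 hypothetical helper, not an existing Tahoe-LAFS function.  In a
 Twisted-based application the hand-off would presumably go through
 something like `IReactorSocket.adoptStreamPort`; that is my assumption
 and is not shown here.

```python
import socket


def allocate_listening_socket(host="127.0.0.1"):
    """Return a socket that is already bound and listening.

    The kernel chooses the port (port 0), and the socket is never closed
    between allocation and use, so there is no window for a collision.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, 0))
        sock.listen(5)
    except OSError:
        sock.close()
        raise
    return sock


# Test code allocates the socket and reads the port number from it.
listener = allocate_listening_socket()
port = listener.getsockname()[1]

# The application would be handed `listener` itself instead of `port`,
# so it only ever calls accept(), never bind().
client = socket.create_connection(("127.0.0.1", port), timeout=5)
conn, _peer = listener.accept()
client.sendall(b"ping")
received = conn.recv(4)

conn.close()
client.close()
listener.close()
```

 The test still learns the port number, so it can build URLs or connection
 strings, but the number is backed by a socket that stays open for as long
 as it is needed.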

--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/2787#comment:1>
Tahoe-LAFS <https://Tahoe-LAFS.org>
secure decentralized storage