Context Navigation

regression coincident with iputil fixes, on FreeBSD and Slackware

Reported by:	zooko	Owned by:	daira
Priority:	normal	Milestone:	1.10.1
Component:	code-network	Version:	1.10.0
Keywords:	regression portability iputil blocks-release	Cc:
Launchpad Bug:

Description (last modified by zooko)

Our buildbot shows that when this batch of patches went in:

[f77ec470d75f4b8fb81b1abca4ee3b73f1ad8b22], [8e31d66cd0b0821ccaa2c7c259e7d6f262ad4738], [6a445d73bc5253ec4ae0dec70af02e33bc869cf6], [d85a75d7f689cb55ecddb319dc2057f38e4db87a]

Then the FreeBSD and Slackware builders went from green to red. See this historical waterfall view:

https://tahoe-lafs.org/buildbot-tahoe-lafs/waterfall?builder=Markus+slackware64+current&builder=Randy+FreeBSD-amd64&reload=none&last_time=1372000000

Out of those four patches, I don't believe that [d85a75d7f689cb55ecddb319dc2057f38e4db87a] is relevant, but the other three could be. Especially note that the error message on the FreeBSD and Slackware builders is "twisted.trial.unittest.FailTest: Errors snooped, terminating early", and that [f77ec470d75f4b8fb81b1abca4ee3b73f1ad8b22] adds a call to log.err. If that call to log.err is getting called, then that would cause the tests to fail in this way.

Change History (20)

comment:1 Changed at 2013-07-16T15:56:00Z by zooko

Description modified (diff)

comment:2 Changed at 2013-07-17T15:10:13Z by daira

I'd already tried changing the log.err to log.msg, so it is more complicated than that.

comment:3 Changed at 2013-07-17T15:11:15Z by daira

Status changed from new to assigned

comment:4 Changed at 2013-08-28T15:37:20Z by zooko

Milestone changed from soon to 1.11.0

comment:5 Changed at 2013-08-28T16:03:47Z by daira

Keywords blocks-release added

comment:6 Changed at 2013-12-02T14:00:07Z by zooko

#2121 may be a duplicate of this on markberger's development system.

comment:7 Changed at 2014-09-02T16:43:51Z by daira

Owner changed from daira to warner
Status changed from assigned to new

warner will get in touch with the buildslave owners and markberger.

comment:8 Changed at 2014-09-02T18:17:48Z by warner

Status changed from new to assigned

I think I might have reproduced this locally (at least using the stock OS-X 10.9 python-2.7.5 gives me an "Address already in use" error during test_system.SystemTest.test_filesystem, which the Homebrew python-2.7.8 does not).

I'll investigate. I might back out the [f77ec4] patch (the one that avoids dropping the Deferred).. I vaguely remember thinking about the differences between the normal tahoe run environment and the unit test environment that could be significant.

Version 0, edited at 2014-09-02T18:17:48Z by warner (next)

comment:9 follow-up: ↓ 11 Changed at 2014-09-03T04:52:50Z by warner

.. and now I can't reproduce it, because using the OS-X python results in an "error: Installed distribution Twisted 12.2.0 conflicts with requirement twisted>=13.0". I know OS-X is providing that 12.2.0 Twisted, but I don't know what's insisting upon >=13.0 (I don't think it's tahoe). Did I mention that I really hate these sorts of packaging bugs?

comment:10 Changed at 2014-09-03T05:31:55Z by warner

Oh, and now I can, by editing _auto_deps.py to require twisted >=13.0 . (maybe this causes twisted to be installed into support/lib/ before whatever other mysterious thing gets installed that demands >=13.0 but is unable to install it).

The cluster of failures I'm getting includes:

close failed in file object destructor: IO Error: [Errno 9] Bad file descriptor. This appears to be happening in `node.Node._setup_tub > fileutil.write_atomically`, just after calling iputil.get_local_addresses_async(), as it tries to write the kernel-assigned port number to disk. In one case, this caused an exception to be caught by the node-startup-time Deferred chain, which then bails (os.abort) via the Node._startService failed, aborting path. In another case, this didn't actually flunk the test.

twisted.internet.error.CannotListenError: Couldn't listen on any:64198: [Errno 48] Address already in use, in SystemTestMixin.bounce_client as it tries to start up a new Client service, just after shutting down the old one (and waiting for the disownServiceParent deferred to fire, then waiting an extra 1.0 seconds for good measure). The arbitrary 1.0 second stall already smells funny (I left a note there blaming windows, but I'm seeing this problem on OS-X too).

CannotListenError, on any:0, with [Errno 9] Bad file descriptor, in `get_local_addresses_async > get_local_ip_for > listenUDP > startListening > _bindSocket`, again with the "close failed in file object destructor" message, and triggering the `Node._startService failed, aborting` path.

exceptions.OSError: [Errno 9] Bad file descriptor on a call to os.urandom() inside Foolscap. I think this might be collateral damage due to --rterrors or the bail-on-failure stuff, when the test fails, but part of the code charges on ahead without realizing it, and then you've got one thread closing all fds in preparation for shutdown, and a different thread trying to use those fds.

sometimes combinations of these errors

I also see "Malformed file descriptor found. Preening lists." in the logs, which happens when select() gets an error (ValueError, TypeError, or our old friend Bad File Descriptor).

Exceptions that occur during object destructors are always screwy (actually anything that happens inside a destructor call is screwy). They're a concurrency hazard that's worse than threads: at least with threads you can pretend to fix the problem with locks. But it seems like *something* is triggering a bunch of Bad File Descriptor errors in random places.

Hm, I know Twisted has had, at various times, a feature to close extra fds when preparing for a fork(), and I think that code got simplified or changed recently (last two years?) to take advantage of some feature that lets you mark fds for automatic closing instead of manually calling os.close() on them. Maybe something in a newer version of python? I'm wondering if the thread that gets started when Twisted's DNS resolver is created (the one on which the blocking gethostbyname() is called), or the fork/exec that might happen when iputil.py spawns off ifconfig, is causing existing fds to be killed, wreaking all sorts of havoc.

Last edited at 2014-09-10T00:43:40Z by warner (previous) (diff)

comment:11 in reply to: ↑ 9 Changed at 2014-09-04T13:06:56Z by zooko

Replying to warner:

.. and now I can't reproduce it, because using the OS-X python results in an "error: Installed distribution Twisted 12.2.0 conflicts with requirement twisted>=13.0". I know OS-X is providing that 12.2.0 Twisted, but I don't know what's insisting upon >=13.0 (I don't think it's tahoe). Did I mention that I really hate these sorts of packaging bugs?

FWIW, this is #2249. Everyone hates these sorts of packaging bugs. :-(

comment:12 Changed at 2014-09-05T04:57:59Z by warner

FYI, I'm not getting the failures on linux (using python-2.7.5 nor 2.7.8). Bummer, because I've got more tools to investigate things there (I was thinking of strace'ing the process and watching for the close() calls).

comment:13 Changed at 2014-09-11T00:45:10Z by warner

I've got a fix: https://github.com/tahoe-lafs/tahoe-lafs/pull/108

The core issue is that python-2.7.4, 2.7.5, 2.7.6, and 2.7.7 suffer from python bug 18851 (http://bugs.python.org/issue18851), in which the stdlib 'subprocess' module closes most/all (unrelated) file descriptors when subprocess.call() fails the exec(), such as when the executable being invoked does not actually exist. There appears to be some randomness involved: it happened at different times during my tests. This was fixed in python-2.7.8. Tahoe's iputil.py uses subprocess.call on many different "ifconfig"-type executables, most of which don't exist on any given platform (git commit 8e31d66). This results in a lot of random file-descriptor closing, which (at least during unit tests) tends to clobber important things like Tub TCP sockets. This seems to be the root cause behind #2121, in which normal code tries to close already-closed sockets, gets an EBADF ("not a socket") during node setup, and bails with os.abort(). Since different platforms have different ifconfigs, some platforms will experience more failed execs than others, so this bug could easily behave differently on linux vs freebsd, as well as working normally on python-2.7.8 or 2.7.4.

My proposed fix is to switch to the 'subprocess32' module from PyPI, which is a backport of the newer 'subprocess' from python3's stdlib. In python issue 18851, Gregory P. Smith recommends subprocess32 for all python2 users who would normally use the stdlib version, since it does not suffer from this bug (and has other bugfixes too). I'm pretty sure this is the fix for #2121. We could also fix it by requiring python >= 2.7.8, but that would rule out development and deployment on the current OS-X release 10.9 (which ships with python-2.7.5), as well as other platforms. Using subprocess32 seems like the easiest fix.

I don't know what exactly this ticket (#2023) is doing, since we no longer have the logfiles that prompted it (note to future bug-reporters: please copy the relevant part of the logs into the ticket, rather than relying upon long-term access to buildbot logs). But since some of us apparently believed that #2121 might be a duplicate, and since it's plausible that this problem could appear on FreeBSD and Slackware from that era and not on linux or other python versions, I'm willing to bet that this change will fix #2023 too.

comment:14 Changed at 2014-09-11T00:45:26Z by warner

Keywords review-needed added
Owner changed from warner to daira
Status changed from assigned to new

comment:15 Changed at 2014-09-11T21:09:12Z by warner

Incidentally, I was able to induce the EBADF failures on linux by shuffling the order of _unix_commands in iputil.py, to put /usr/sbin/ifconfig and /usr/etc/ifconfig first. The normal ordering puts /bin/ip first, which is present on linux and thus short-circuits the function, so it never tries to run a program that isn't there. /usr/sbin/ifconfig doesn't exist on linux (it lives in /sbin), so this triggers the subprocess bug.

comment:16 Changed at 2014-09-11T21:12:30Z by warner

Keywords review-needed removed
Owner changed from daira to warner
Status changed from new to assigned

Daira and I decided that it'd be slightly easier to fix this by dodging the bug: modify iputil.py (and other code that uses subprocess) to check for the existence of the executable before allowing subprocess.Popen to try and run it. This seems to work, and avoids adding a new dependency. I'll file a PR with this change in a moment. Hopefully this will tide us over until 1.11 is released, and we can switch to a whole new way of managing local IP addresses (probably by having the Introducer tell us ala #50 or #2152, or requiring server admins to configure them explicitly).

Last edited at 2014-09-22T20:02:34Z by warner (previous) (diff)

comment:17 Changed at 2014-09-12T04:28:07Z by warner

Keywords needs-review added
Owner changed from warner to daira
Status changed from assigned to new

I've got a new patch up: https://github.com/tahoe-lafs/tahoe-lafs/pull/109 , which uses os.path.isfile() instead of subprocess32.

comment:18 Changed at 2014-09-12T07:08:57Z by warner

Yay: the Slackware buildslave passes with this patch, and was failing with EBADF before:

https://tahoe-lafs.org/buildbot-tahoe-lafs/builders/Markus%20slackware64%20stable/builds/17

comment:19 Changed at 2014-09-13T09:44:06Z by daira

Status changed from new to assigned

Reviewing.

comment:20 Changed at 2014-09-13T18:07:09Z by warner

Keywords needs-review removed
Resolution set to fixed
Status changed from assigned to closed

Thanks!

Landed in [d27a57cb49], merged in [d91170a7acd], although for some reason the auto-notify hook didn't see fit to add a comment here.

Note: See TracTickets for help on using tickets.

Download in other formats: