[tahoe-lafs-trac-stream] [Tahoe-LAFS] #2023: regression coincident with iputil fixes, on FreeBSD and Slackware
Tahoe-LAFS
trac at tahoe-lafs.org
Wed Sep 3 05:31:55 UTC 2014
#2023: regression coincident with iputil fixes, on FreeBSD and Slackware
-------------------------+-------------------------------------------------
Reporter: zooko | Owner: warner
Type: defect | Status: assigned
Priority: normal | Milestone: 1.11.0
Component: code-network | Version: 1.10.0
Resolution: | Keywords: regression portability iputil blocks-release
Launchpad Bug: |
-------------------------+-------------------------------------------------
Comment (by warner):
Oh, and now I can, by editing `_auto_deps.py` to require twisted >=13.0.
(Maybe this causes Twisted to be installed into support/lib/ before
whatever other mysterious thing gets installed that demands >=13.0 but is
unable to install it.)
The cluster of failures I'm getting includes:
* `close failed in file object destructor: IOError: [Errno 9] Bad file
descriptor`. This appears to be happening in `node.Node._setup_tub >
fileutil.write_atomically`, just after calling
`iputil.get_local_addresses_async()`, as it tries to write the
kernel-assigned port number to disk. In one case, this caused an
exception to be caught by the node-startup-time Deferred chain, which
then bails (os.abort) via the `Node._startService failed, aborting`
path. In another case, this didn't actually flunk the test.
* `twisted.internet.error.CannotListenError: Couldn't listen on
any:64198: [Errno 48] Address already in use`, in
`SystemTestMixin.bounce_client` as it tries to start up a new Client
service, just after shutting down the old one (and waiting for the
`disownServiceParent` deferred to fire, then waiting an extra 1.0
seconds for good measure). The arbitrary 1.0 second stall already
smells funny (I left a note there blaming Windows, but I'm seeing this
problem on OS X too).
* `CannotListenError`, on `any:0`, with `[Errno 9] Bad file descriptor`,
in `get_local_addresses_async > get_local_ip_for > listenUDP >
startListening > _bindSocket`, again with the "close failed in file
object destructor" message, and triggering the `Node._startService
failed, aborting` path.
* `exceptions.OSError: [Errno 9] Bad file descriptor` on a call to
`os.urandom()` inside Foolscap. I think this might be collateral
damage due to --rterrors or the bail-on-failure stuff, when the test
fails, but part of the code charges on ahead without realizing it, and
then you've got one thread closing all fds in preparation for
shutdown, and a different thread trying to use those fds.
* sometimes combinations of these errors
* I also see "Malformed file descriptor found. Preening lists." in the
logs, which happens when `select()` gets an error (`ValueError`,
`TypeError`, or our old friend Bad File Descriptor).
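For anyone trying to reproduce these symptoms in isolation, here is a minimal sketch (Python 3, POSIX only) of how each errno in the list can be provoked on purpose. The function names are hypothetical and none of this is Tahoe or Twisted code; it only assumes that something closes an fd out from under its owner, which is the suspicion here, not a confirmed cause. The last function binds the same port twice at once, which is merely the simplest way to get "Address already in use", not the restart sequence in `bounce_client`.

```python
import errno
import os
import select
import socket


def write_after_fd_closed():
    """Close a pipe's fd behind its file object's back, then write to it."""
    read_fd, write_fd = os.pipe()
    # Unbuffered, so write() goes straight to the (now invalid) descriptor.
    f = os.fdopen(write_fd, "wb", buffering=0)
    os.close(write_fd)  # stands in for whatever is closing fds elsewhere
    try:
        f.write(b"12345\n")
        return None
    except OSError as e:
        return e.errno
    finally:
        try:
            f.close()  # fails with EBADF as well; swallow it here
        except OSError:
            pass
        os.close(read_fd)


def select_on_closed_fd():
    """Hand select() a descriptor number that has already been closed."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)
    os.close(write_fd)
    try:
        select.select([read_fd], [], [], 0)
        return None
    except OSError as e:
        return e.errno


def bind_same_port_twice():
    """Bind a second TCP socket to a port that is still being listened on."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))
            return None
        except OSError as e:
            return e.errno
    finally:
        first.close()
        second.close()
```

Comparing against `errno.EBADF` and `errno.EADDRINUSE` rather than the raw numbers keeps the check portable, since the numeric value of EADDRINUSE differs between platforms (the 48 above is the BSD/OS X value).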
Exceptions that occur during object destructors are always screwy
(actually anything that happens inside a destructor call is screwy).
They're a concurrency hazard that's worse than threads: at least with
threads you can pretend to fix the problem with locks. But it seems like
*something* is triggering a bunch of Bad File Descriptor errors in random
places.
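To illustrate why destructor-time exceptions are so hard to reason about, here is a small sketch, assuming CPython 3.8 or later (for `sys.unraisablehook`). The class and function names are hypothetical. It shows that an exception raised inside `__del__` never reaches the surrounding `try`/`except`; the interpreter only reports it out-of-band, which is analogous to the "close failed in file object destructor" message printed by Python 2.

```python
import sys


class Noisy:
    """Hypothetical object whose destructor fails like a bad close()."""

    def __del__(self):
        raise OSError(9, "Bad file descriptor")


def destructor_error_is_swallowed():
    seen = []
    old_hook = sys.unraisablehook
    # Record only the exception type; keeping the hook argument itself
    # around is discouraged by the Python documentation.
    sys.unraisablehook = lambda unraisable: seen.append(unraisable.exc_type)
    try:
        obj = Noisy()
        try:
            del obj  # CPython runs __del__ here, when the refcount hits zero
            caught = False
        except OSError:
            caught = True
    finally:
        sys.unraisablehook = old_hook
    return caught, seen
```

The caller never sees the `OSError`, so the code "charges on ahead" exactly as if nothing had gone wrong.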
Hm, I know Twisted has had, at various times, a feature to close extra fds
when preparing for a fork(), and I think that code got simplified or
changed recently (last two years?) to take advantage of some feature that
lets you mark fds for automatic closing instead of manually calling
os.close() on them. Maybe something in a newer version of Python? I'm
wondering if the thread that gets started when Twisted's DNS resolver is
created (the one on which the blocking `gethostbyname()` is called), or
the fork/exec that might happen when iputil.py spawns off ifconfig, is
causing existing fds to be killed, wreaking all sorts of havoc.
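For reference, the POSIX mechanism for "mark an fd for automatic closing" is the `FD_CLOEXEC` flag, which makes the kernel close the descriptor in the child at `exec()` time. The sketch below (Python 3, POSIX only, hypothetical helper names) shows how the flag is set and inspected; whether Twisted or Python actually relies on this in the code paths above is an assumption that would need checking.

```python
import fcntl
import os


def set_cloexec(fd):
    """Mark fd so the kernel closes it automatically on exec()."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def is_cloexec(fd):
    """Report whether fd currently carries the close-on-exec flag."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
```

The flag only affects the process image after `exec()`. It does not close anything in the parent, so by itself it should not invalidate fds that the parent is still using.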
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/2023#comment:10>
Tahoe-LAFS <https://Tahoe-LAFS.org>
secure decentralized storage