#3038 closed defect (fixed)

Many Python 2.7 macOS builds fail with "Too many open files"

Reported by: exarkun Owned by: exarkun
Priority: normal Milestone: undecided
Component: unknown Version: 1.12.1
Keywords: Cc:
Launchpad Bug:

Description

An example from a run on Travis (but note buildbot macOS fails the same way):

===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: foolscap.tokens.RemoteException: <RemoteException around '[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
  File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/twisted/internet/defer.py", line 501, in errback
    self._startRunCallbacks(fail)
  File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
    self._runCallbacks()
  File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/twisted/internet/defer.py", lin
-- TRACEBACK ELIDED --
/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/foolscap/broker.py", line 579, in <lambda>
    
  File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/foolscap/broker.py", line 614, in _doCall
    
  File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/foolscap/referenceable.py", line 59, in doRemoteCall
    
  File "/Users/travis/build/tahoe-lafs/tahoe-lafs/src/allmydata/storage/server.py", line 300, in remote_allocate_buckets
    
  File "/Users/travis/build/tahoe-lafs/tahoe-lafs/src/allmydata/storage/immutable.py", line 209, in __init__
    self._sharefile = ShareFile(incominghome, create=True, max_size=max_size)
  File "/Users/travis/build/tahoe-lafs/tahoe-lafs/src/allmydata/storage/immutable.py", line 53, in __init__
    f = open(self.home, 'wb')
exceptions.IOError: [Errno 24] Too many open files: u'/Users/travis/build/tahoe-lafs/tahoe-lafs/_trial_temp/system/SystemTest/test_upload_and_download_convergent/client2/storage/shares/incoming/v7/v7tjbaysgdl3br6ymaek73z5sa/5'
]'>
])
]'>
allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: foolscap.tokens.RemoteException: <RemoteException around '[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
  File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/foolscap/call.py", line 679, in _done
    
  File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/foolscap/call.py", line 60, in complete
    
  File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/twisted/internet/defer.py", line 460, in callback
    self._startRunCallbacks(result)
  File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/Users/travis/build/tahoe-lafs/tahoe-lafs/.tox/py27/lib/python2.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/travis/build/tahoe-lafs/tahoe-lafs/src/allmydata/immutable/offloaded.py", line 345, in _start_reading
    
exceptions.IOError: [Errno 24] Too many open files: u'/Users/travis/build/tahoe-lafs/tahoe-lafs/_trial_temp/system/SystemTest/test_upload_and_download_random_key/client0/helper/CHK_incoming/hozcimqmlsb7gxruhdbi3gaz4a'
]'>
allmydata.test.test_system.SystemTest.test_upload_and_download_random_key
-------------------------------------------------------------------------------

This seems to have begun around the middle of April.

Change History (14)

comment:1 Changed at 2019-05-01T11:43:36Z by exarkun

  • Owner set to exarkun

comment:2 Changed at 2019-05-01T12:42:09Z by exarkun

Reproducible on the macOS machine (macstadium) I have, hooray.

comment:3 Changed at 2019-05-01T12:42:31Z by exarkun

allmydata.test.test_system.SystemTest?.test_upload_and_download_convergent doesn't fail in isolation.

comment:4 Changed at 2019-05-01T12:45:54Z by exarkun

allmydata.test.test_system.SystemTest? doesn't fail in isolation.

comment:5 Changed at 2019-05-01T12:49:56Z by exarkun

allmydata.test.test_system doesn't fail in isolation

comment:6 Changed at 2019-05-01T12:59:07Z by exarkun

It's not 100% deterministic when running the whole test suite. Sometimes the tests pass.

comment:7 Changed at 2019-05-01T13:08:45Z by exarkun

Note the max open files limit on macOS defaults to 256 and using lsof to inspect the test process when it gets to test_system shows that it is awfully close yo this (lsof says max open fd number is 255! didn't read through all the other fds to see if there are gaps).

comment:8 Changed at 2019-05-01T13:23:55Z by exarkun

It seems more reproducible using tox than running trial directory. Two tox runs both encountered it, two direct trial runs both passed.

Also, using some grep magic with lsof, I see that allmydata.test.test_system.SystemTest??.test_upload_and_download_convergent seems to want to open up around 150 fds. So if there are already ~100 open then it's going to run into the macOS limit.

comment:9 Changed at 2019-05-01T14:41:02Z by exarkun

I guess ... maybe related ... allmydata.test.test_client leaks file descriptors.

comment:10 Changed at 2019-05-01T14:47:03Z by exarkun

It looks like the leaked descriptors are the dup'd descriptors passed to AdoptedStreamServerEndpoint in _foolscapEndpointForPortNumber.

comment:11 Changed at 2019-05-01T14:53:53Z by exarkun

If AdoptedStreamServerEndpoint is never told to listen then it will never convert its integer file descriptor into a listening port. If the tub that is passed to listenOnUnused is never started then the endpoint will never be told to listen.

A bunch of tests seem to create a client but never start it. In this case, the tub is never started and the descriptor leaks.

comment:12 Changed at 2019-05-01T17:08:35Z by exarkun

I tried making a wrapper for the endpoint with a del that closes the file descriptor if it needs to be closed (if listen hasn't been called). This didn't help because apparently the garbage collector doesn't run frequently enough to clean anything up before file descriptors run out (this makes sense, you only have to leak a hundred or so file descriptors to cause a problem and the GC only wants to run after a lot more has happened than that, I expect (CPython GC is driven by allication/deallocation from Python memory arenas, not file descriptors)).

I expect a weakref-based solution to have the same problem since weakrefs are driven by the CPython GC.

Possibly this issue can be worked around by inserting an explicit gc.collect() call into this port allocation code path. This would tie collection frequency to port allocation frequency which should let us keep on top of the stale resources.

comment:13 Changed at 2019-05-13T10:33:14Z by exarkun

I added some gc.collect calls at carefully chosen places/times. Perhaps this will work. Waiting on CI (while also trying to sift through and fix all the _other_ failures we currently have).

comment:14 Changed at 2019-05-14T10:13:52Z by GitHub <noreply@…>

  • Resolution set to fixed
  • Status changed from new to closed

In bee18e1/trunk:

Merge pull request #610 from tahoe-lafs/3038.stop-leaking-file-descriptors

Stop leaking file descriptors

Fixes: ticket:3038

Note: See TracTickets for help on using tickets.