#2342 new defect

Too many open files

Reported by: zooko Owned by:
Priority: normal Milestone: undecided
Component: code Version: 1.10.0
Keywords: Cc:
Launchpad Bug:

Description

I'm working with a client company that says they have hundreds of thousands of people inconvenienced by their Tahoe-LAFS installation failing. Initial investigation reveals this in the twistd.log of their gateway node:

2014-11-20 13:33:36+0530 [HTTPChannel,143180,192.168.51.230] Unhandled Error
        Traceback (most recent call last):
          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/python/log.py", line 88, in cal
lWithLogger
            
          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/python/log.py", line 73, in cal
lWithContext
            
          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/python/context.py", line 118, i
n callWithContext
            
          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/python/context.py", line 81, in
 callWithContext
            
        --- <exception caught here> ---
          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/posixbase.py", line 61
4, in _doReadOrWrite
            
          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/tcp.py", line 215, in 
doRead
            
          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/tcp.py", line 221, in 
_dataReceived
            
          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/protocols/basic.py", line 581, 
in dataReceived
            
          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/web/http.py", line 1609, in lin
eReceived
            
HeadersReceived

          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/web/http.py", line 694, in gotLength

          File "/usr/lib64/python2.6/tempfile.py", line 475, in TemporaryFile

          File "/usr/lib64/python2.6/tempfile.py", line 228, in _mkstemp_inner

        exceptions.OSError: [Errno 24] Too many open files: '/GFS2_tahoe/.tahoe-filestore1/tmp/tmpmKDgf6'

Change History (5)

comment:1 Changed at 2014-11-27T16:53:59Z by zooko

Here's a branch to tweak the increase_rlimits() code and to printout what it does for diagnostics: https://github.com/zooko/tahoe-lafs/blob/2342-Too-many-open-files/src/allmydata/util/iputil.py

comment:2 Changed at 2014-11-28T17:39:53Z by zooko

My consulting client (codenamed "WAG") reported this result on their RHEL server:

# python iputil.py v110
HELLO WORLD 000
HELLO WORLD 011
012 getrlimit(RLIMIT_NOFILE) before setrlimit: (1024, 4096)

Also this tweep ran it on CentOS and reported that with the patch (https://github.com/zooko/tahoe-lafs/blob/2342-Too-many-open-files/src/allmydata/util/iputil.py) the limit was raised to 4096:

https://twitter.com/brouhaha/status/538085487622107136

CentOS 6.5: before setrlimit: (1024, 4096) ... after setrlimit(4096, 4096): (4096, 4096)

comment:3 Changed at 2014-11-28T17:48:30Z by zooko

See also #1794, #812, and #1278. I currently believe the underlying problem has to do with bad handling of corrupted shares (per comment:7:ticket:812).

comment:4 Changed at 2014-11-29T01:55:06Z by warner

Seems like a decent theory.. you might add a timed loop that counts/logs the number of allmydata.immutable.upload.Uploader instances, and/or upload.FileHandle+subclasses (specifically FileName). If an upload gets wedged and stops making any progress, it will hold a filehandle open forever, and eventually an open() will fail like that.

You might also lsof the client in question and see what filehandles it has open: if it's this problem, there'll be a lot of /tmp files in the list, recently opened by HTTP uploads but not yet pushed out to the grid.

comment:5 Changed at 2014-11-29T17:20:56Z by warner

Also, it might be appropriate to add a failsafe timer to Uploaders, something that fires once every minute or five minutes, and checks to see if any progress has been made, and self-destructs if not. We don't like heuristics, but sometimes they're a good hedge against weird unpredictable things happening.

Note: See TracTickets for help on using tickets.