Opened at 2014-11-27T16:11:00Z
Last modified at 2014-11-29T17:20:56Z
#2342 new defect
Too many open files
Reported by: | zooko | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | undecided |
Component: | code | Version: | 1.10.0 |
Keywords: | | Cc: | |
Launchpad Bug: | | | |
Description
I'm working with a client company that says they have hundreds of thousands of people inconvenienced by their Tahoe-LAFS installation failing. Initial investigation reveals this in the twistd.log of their gateway node:
2014-11-20 13:33:36+0530 [HTTPChannel,143180,192.168.51.230] Unhandled Error
Traceback (most recent call last):
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/python/log.py", line 88, in callWithLogger
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/python/log.py", line 73, in callWithContext
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/python/context.py", line 118, in callWithContext
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/python/context.py", line 81, in callWithContext
  --- <exception caught here> ---
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/posixbase.py", line 614, in _doReadOrWrite
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/tcp.py", line 215, in doRead
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/tcp.py", line 221, in _dataReceived
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/protocols/basic.py", line 581, in dataReceived
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/web/http.py", line 1609, in lineReceived
  HeadersReceived
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/web/http.py", line 694, in gotLength
  File "/usr/lib64/python2.6/tempfile.py", line 475, in TemporaryFile
  File "/usr/lib64/python2.6/tempfile.py", line 228, in _mkstemp_inner
exceptions.OSError: [Errno 24] Too many open files: '/GFS2_tahoe/.tahoe-filestore1/tmp/tmpmKDgf6'
Change History (5)
comment:1 Changed at 2014-11-27T16:53:59Z by zooko
comment:2 Changed at 2014-11-28T17:39:53Z by zooko
My consulting client (codenamed "WAG") reported this result on their RHEL server:
# python iputil.py
v110 HELLO WORLD 000 HELLO WORLD 011 012
getrlimit(RLIMIT_NOFILE) before setrlimit: (1024, 4096)
Also, this Twitter user ran it on CentOS and reported that with the patch (https://github.com/zooko/tahoe-lafs/blob/2342-Too-many-open-files/src/allmydata/util/iputil.py) the limit was raised to 4096:
https://twitter.com/brouhaha/status/538085487622107136
CentOS 6.5:
before setrlimit: (1024, 4096)
...
after setrlimit(4096, 4096): (4096, 4096)
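The rlimit bump being tested above can be sketched with the stdlib resource module. This is an illustration of the idea (raise the soft RLIMIT_NOFILE up to the hard limit), not the actual iputil.py code; the function name is made up:

```python
import resource

def increase_nofile_limit():
    """Raise the soft RLIMIT_NOFILE to the hard limit, mirroring the
    idea behind the branch's increase_rlimits() (a sketch, not the
    actual Tahoe-LAFS code)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("getrlimit(RLIMIT_NOFILE) before setrlimit:", (soft, hard))
    if soft != hard:
        # A non-root process may raise its soft limit up to the hard limit.
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("after setrlimit:", (soft, hard))
    return soft, hard
```

With a (1024, 4096) starting point as in the reports above, this would leave the process at (4096, 4096).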
comment:3 Changed at 2014-11-28T17:48:30Z by zooko
See also #1794, #812, and #1278. I currently believe the underlying problem has to do with bad handling of corrupted shares (per comment:7:ticket:812).
comment:4 Changed at 2014-11-29T01:55:06Z by warner
Seems like a decent theory. You might add a timed loop that counts/logs the number of allmydata.immutable.upload.Uploader instances, and/or upload.FileHandle and its subclasses (specifically FileName). If an upload gets wedged and stops making any progress, it will hold a filehandle open forever, and eventually an open() will fail like that.
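The instance-counting loop suggested above could be sketched via the garbage collector. This is a hypothetical diagnostic, not Tahoe code (count_instances and start_instance_logger are made-up names); in a Twisted process a LoopingCall would be more idiomatic than a threading.Timer:

```python
import gc
import threading

def count_instances(cls):
    """Count live instances of cls (and its subclasses) by scanning
    the garbage collector's tracked objects."""
    return sum(1 for obj in gc.get_objects() if isinstance(obj, cls))

def start_instance_logger(cls, interval=60.0, log=print):
    """Periodically log how many instances of cls are alive, e.g.
    start_instance_logger(Uploader) in the gateway for diagnostics."""
    def tick():
        log("%d live %s instances" % (count_instances(cls), cls.__name__))
        t = threading.Timer(interval, tick)
        t.daemon = True  # don't keep the process alive just for logging
        t.start()
    tick()
```

If the stuck-upload theory is right, the logged count would climb steadily instead of returning to a baseline between uploads.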
You might also run lsof against the client in question and see what filehandles it has open: if it's this problem, there will be a lot of /tmp files in the list, recently opened by HTTP uploads but not yet pushed out to the grid.
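For a quick check without lsof, on Linux you can count a process's open file descriptors straight from /proc. This sketch uses the current shell's own PID as a stand-in; for the gateway, substitute its PID:

```shell
# With lsof installed, the suggested check would look like:
#   lsof -p <gateway-pid> | grep '/tmp/'
# A dependency-free fd count on Linux, using this shell as a stand-in:
pid=$$
ls "/proc/$pid/fd" | wc -l
```

Sampling that count every few minutes would show whether it climbs toward the rlimit.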
comment:5 Changed at 2014-11-29T17:20:56Z by warner
Also, it might be appropriate to add a failsafe timer to Uploaders: something that fires every minute or five minutes, checks whether any progress has been made, and self-destructs if not. We don't like heuristics, but sometimes they're a good hedge against weird, unpredictable things happening.
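The failsafe timer described above could be sketched as a small watchdog. This is a hypothetical illustration (ProgressWatchdog is a made-up name, and a real Uploader integration would use Twisted's reactor.callLater rather than threading.Timer):

```python
import threading

class ProgressWatchdog:
    """Failsafe timer sketch: poll a progress counter every `interval`
    seconds and invoke `abort` if no progress was made since the last
    check (the upload is presumed wedged)."""

    def __init__(self, get_progress, abort, interval=60.0):
        self._get_progress = get_progress  # e.g. bytes pushed so far
        self._abort = abort                # e.g. the Uploader's self-destruct
        self._interval = interval
        self._last = get_progress()
        self._timer = None

    def start(self):
        self._timer = threading.Timer(self._interval, self._check)
        self._timer.daemon = True
        self._timer.start()

    def _check(self):
        current = self._get_progress()
        if current == self._last:
            self._abort()  # no progress since last tick: give up
        else:
            self._last = current
            self.start()   # re-arm for the next interval

    def stop(self):
        if self._timer is not None:
            self._timer.cancel()
```

Aborting a wedged upload this way would close its temp filehandle and release the descriptor, keeping the process under its RLIMIT_NOFILE.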
Here's a branch that tweaks the increase_rlimits() code and prints out what it does, for diagnostics: https://github.com/zooko/tahoe-lafs/blob/2342-Too-many-open-files/src/allmydata/util/iputil.py