Opened at 2010-12-10T20:48:02Z
Last modified at 2014-12-04T15:04:33Z
#1278 new defect
gateway won't serve any page; variety of interesting error messages in twistd.log
Reported by: | zooko | Owned by: | somebody |
---|---|---|---|
Priority: | major | Milestone: | soon |
Component: | code | Version: | 1.8.0 |
Keywords: | reliability availability error hang heisenbug wui anti-censorship test-needed | Cc: | |
Launchpad Bug: |
Description (last modified by zooko)
Some servers on the public test grid are misbehaving -- I can tell because updating my blog takes about ten minutes sometimes. Having updated my blog a few times this morning, my gateway got into a broken state where it will not even load the recent uploads and downloads page...
Oh, wait, now it works again. Hm, I will attach twistd.log, and here are a few of the interesting messages:
2010-12-08 22:48:44-0800 [HTTPChannel,8,127.0.0.1] Unhandled Error Traceback (most recent call last): Failure: allmydata.immutable.downloader.common.DownloadStopped: our Consumer called stopProducing() 2010-12-10 10:21:11-0800 [-] Unhandled Error Traceback (most recent call last): File "/Library/Python/2.6/site-packages/allmydata/immutable/upload.py", line 546, in _got_response return self._loop() File "/Library/Python/2.6/site-packages/allmydata/immutable/upload.py", line 435, in _loop self.contacted_peers2) File "/Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py", line 320, in addBoth callbackKeywords=kw, errbackKeywords=kw) File "/Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py", line 286, in addCallbacks self._runCallbacks() --- <exception caught here> --- File "/Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py", line 542, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/Library/Python/2.6/site-packages/allmydata/immutable/upload.py", line 546, in _got_response return self._loop() File "/Library/Python/2.6/site-packages/allmydata/immutable/upload.py", line 458, in _loop return self._failed(msg) File "/Library/Python/2.6/site-packages/allmydata/immutable/upload.py", line 561, in _failed raise UploadUnhappinessError(msg) allmydata.interfaces.UploadUnhappinessError: peer selection failed for <Tahoe2PeerSelector for upload kcaaq>: shares could be placed on only 4 server(s) such that any 3 of them have enough share s to recover the file, but we were asked to place shares on at least 5 such servers. (placed 9 shares out of 10 total (1 homeless), want to place shares on at least 5 servers such that any 3 of them hav e enough shares to recover the file, sent 13 queries to 11 peers, 2 queries placed some shares, 11 placed none (of which 4 placed none due to the server being full and 7 placed none due to an error)) (l ast failure (from <PeerTracker for peer nszizgf5 and SI kcaaq>) was: [Failure instance: Traceback: <class 'foolscap.ipb.DeadReferenceError'>: Calling Stale Broker /Library/Python/2.6/site-packages/allmydata/immutable/upload.py:546:_got_response /Library/Python/2.6/site-packages/allmydata/immutable/upload.py:433:_loop /Library/Python/2.6/site-packages/allmydata/immutable/upload.py:119:query /Library/Python/2.6/site-packages/foolscap-0.5.1_-py2.6.egg/foolscap/referenceable.py:408:callRemote --- <exception caught here> --- /Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py:133:maybeDeferred /Library/Python/2.6/site-packages/foolscap-0.5.1_-py2.6.egg/foolscap/referenceable.py:447:_callRemote /Library/Python/2.6/site-packages/foolscap-0.5.1_-py2.6.egg/foolscap/broker.py:468:newRequestID ]) 2010-12-10 10:21:12-0800 [-] Exception rendering error page: 2010-12-10 10:21:12-0800 [-] Unhandled Error Traceback (most recent call last): File "/Library/Python/2.6/site-packages/foolscap-0.5.1_-py2.6.egg/foolscap/call.py", line 95, in fail self.deferred.errback(why) File "/Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py", line 388, in errback self._startRunCallbacks(fail) File "/Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py", line 455, in _startRunCallbacks self._runCallbacks() File "/Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py", line 542, in _runCallbacks current.result = callback(current.result, *args, **kw) --- <exception caught here> --- File "/Library/Python/2.6/site-packages/Nevow-0.10.0-py2.6.egg/nevow/appserver.py", line 82, in processingFailed handler.renderHTTP_exception(ctx, reason) File "/Library/Python/2.6/site-packages/allmydata/web/common.py", line 278, in renderHTTP_exception return super.renderHTTP_exception(self, ctx, f) File "/Library/Python/2.6/site-packages/Nevow-0.10.0-py2.6.egg/nevow/appserver.py", line 61, in renderHTTP_exception request.finishRequest( False ) File "/Library/Python/2.6/site-packages/Nevow-0.10.0-py2.6.egg/nevow/appserver.py", line 176, in finishRequest server.Request.finish(self) File "/Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/web/http.py", line 900, in finish "Request.finish called on a request after its connection was lost; " exceptions.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this. 2010-12-10 10:21:12-0800 [-] 'Original exception:'
2010-12-10 11:26:14-0800 [-] Unhandled Error Traceback (most recent call last): File "/Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py", line 286, in addCallbacks self._runCallbacks() File "/Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py", line 542, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/Library/Python/2.6/site-packages/allmydata/immutable/upload.py", line 694, in <lambda> self._read_encrypted(length, ciphertext, hash_only, d2)) File "/Library/Python/2.6/site-packages/allmydata/immutable/upload.py", line 709, in _read_encrypted d = defer.maybeDeferred(self.original.read, size) --- <exception caught here> --- File "/Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py", line 133, in maybeDeferred result = f(*args, **kw) File "/Library/Python/2.6/site-packages/allmydata/immutable/upload.py", line 1362, in read return defer.succeed([self._filehandle.read(length)]) exceptions.ValueError: I/O operation on closed file
There is one flog incident file. I'll attach it.
Attachments (2)
Change History (13)
Changed at 2010-12-10T20:49:55Z by zooko
Changed at 2010-12-10T20:51:02Z by zooko
comment:1 Changed at 2010-12-16T01:11:02Z by davidsarah
- Keywords anti-censorship added
comment:2 Changed at 2011-02-21T06:00:09Z by davidsarah
- Keywords test-needed added
- Milestone changed from undecided to 1.9.0
comment:3 Changed at 2011-02-21T06:14:31Z by warner
Hm, the #1366 patch would probably address the second weird message. But I don't think it could have been causing the web server to be unavailable forever. So yeah, let's leave this ticket open a bit longer until we understand what was going on.
comment:4 Changed at 2011-10-13T17:05:50Z by warner
- Milestone changed from 1.9.0 to 1.10.0
not making it into 1.9
comment:5 Changed at 2012-01-22T00:17:27Z by nejucomo
comment:6 Changed at 2014-11-28T15:50:09Z by zooko
- Description modified (diff)
I have a consulting client (codename "WAG") who is reporting an issue with very similar symptoms. They say that they have hundreds of thousands of people inconvenienced by this failure, so it is urgent for them to figure out what happened.
comment:7 follow-ups: ↓ 8 ↓ 9 Changed at 2014-12-01T17:38:36Z by daira
Zooko, what's the evidence that this is the same issue? In my experience, two problem reports are only likely to be the same issue if they result in similar error reporting/tracebacks (in this case "exceptions.ValueError?: I/O operation on closed file", for example).
Note that #1336 already fixed the "exceptions.RuntimeError?: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this." issue.
comment:8 in reply to: ↑ 7 ; follow-up: ↓ 11 Changed at 2014-12-03T21:10:05Z by zooko
Replying to daira:
Zooko, what's the evidence that this is the same issue?
First of all the observed behavior of the gateway is similar — goes slower and slower and then "locks ups" entirely, refusing to display even its own Welcome Page. But, like you say, actual stack traces are a more reliable indicator of a common cause, and we have them…
WAG got these messages, in an incident report file named incident-2014-11-18--08-59-32Z-nchatry.flog.bz2:
06:20:45.186 [19723]: UNUSUAL error during GET FAILURE: [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last): File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 575, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/data/allmydata-tahoe-1.10.0/src/allmydata/mutable/retrieve.py", line 222, in <lambda> self._pause_deferred.addCallback(lambda ignored: d.callback(res)) File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 380, in callback self._startRunCallbacks(result) File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 488, in _startRunCallbacks self._runCallbacks() --- <exception caught here> --- File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 575, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/data/allmydata-tahoe-1.10.0/src/allmydata/mutable/retrieve.py", line 228, in _check_for_stopped raise DownloadStopped("our Consumer called stopProducing()") allmydata.interfaces.DownloadStopped: our Consumer called stopProducing() ]
and:
08:59:33.278 [28364487]: Unhandled Error Traceback (most recent call last): Failure: exceptions.RuntimeError: Producer was not unregistered for /file/URI:MDMF:CENSORED:CENSORED FAILURE: [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last): Failure: exceptions.RuntimeError: Producer was not unregistered for /file/URI:MDMF:CENSORED:CENSORED ]
And these messages in an incident report file named incident-2014-10-16--14-09-34Z-twxj47a.flog.bz2:
2014-10-16_14:09:15.274918Z [11791490]: {'from-twisted':True, 'incarnation':('\xd4\xf0\xeb\x82\x9b\xa3\x05$', None), 'level':20, 'message':'Unhandled Error\nTraceback (most recent call last):\n File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-...0.0/src/allmydata/web/filenode.py", line 319, in replace_my_contents\n req.content.seek(0)\nexceptions.ValueError: I/O operation on closed file\n', 'system':'-', 'why':None, ...} FAILURE: [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last): File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 380, in callback self._startRunCallbacks(result) File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 488, in _startRunCallbacks self._runCallbacks() File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 575, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Nevow-0.11.1-py2.6.egg/nevow/appserver.py", line 163, in gotPageContext pageContext.tag.renderHTTP, pageContext --- <exception caught here> --- File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 137, in maybeDeferred result = f(*args, **kw) File "/data/allmydata-tahoe-1.10.0/src/allmydata/web/common.py", line 352, in renderHTTP return m(ctx) File "/data/allmydata-tahoe-1.10.0/src/allmydata/web/filenode.py", line 245, in render_PUT return self.replace_my_contents(req) File "/data/allmydata-tahoe-1.10.0/src/allmydata/web/filenode.py", line 319, in replace_my_contents req.content.seek(0) exceptions.ValueError: I/O operation on closed file ]
and:
2014-10-16_14:09:15.281716Z [11791492]: {'from-twisted':True, 'incarnation':('\xd4\xf0\xeb\x82\x9b\xa3\x05$', None), 'level':20, 'message':'Unhandled Error\nTraceback (most recent call last):\n File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Nevow-0.11.1-py2.6.eg...ost; "\nexceptions.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this.\n', 'system':'-', 'why':None, ...} FAILURE: [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last): File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Nevow-0.11.1-py2.6.egg/nevow/appserver.py", line 167, in gotPageContext processingFailed, self, pageContext File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 315, in addErrback errbackKeywords=kw) File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 293, in addCallbacks self._runCallbacks() File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86 -- TRACEBACK ELIDED -- -packages/Nevow-0.11.1-py2.6.egg/nevow/appserver.py", line 82, in processingFailed handler.renderHTTP_exception(ctx, reason) File "/data/allmydata-tahoe-1.10.0/src/allmydata/web/common.py", line 329, in renderHTTP_exception return super.renderHTTP_exception(self, ctx, f) File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Nevow-0.11.1-py2.6.egg/nevow/appserver.py", line 61, in renderHTTP_exception request.finishRequest( False ) File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Nevow-0.11.1-py2.6.egg/nevow/appserver.py", line 176, in finishRequest server.Request.finish(self) File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/web/server.py", line 231, in finish return http.Request.finish(self) File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/web/http.py", line 885, in finish "Request.finish called on a request after its connection was lost; " exceptions.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this. ]
comment:9 in reply to: ↑ 7 Changed at 2014-12-03T21:10:40Z by zooko
comment:10 Changed at 2014-12-03T21:23:25Z by zooko
comment:11 in reply to: ↑ 8 Changed at 2014-12-04T15:04:33Z by daira
Replying to zooko:
Replying to daira:
Zooko, what's the evidence that this is the same issue?
First of all the observed behavior of the gateway is similar — goes slower and slower and then "locks ups" entirely, refusing to display even its own Welcome Page. But, like you say, actual stack traces are a more reliable indicator of a common cause, and we have them…
OK, I'm convinced.
Possibly fixed by the patch on #1366. However, to close this ticket we should have a reproducible test.