#1278 new defect

gateway won't serve any page; variety of interesting error messages in twistd.log

Reported by: zooko Owned by: somebody
Priority: major Milestone: soon
Component: code Version: 1.8.0
Keywords: reliability availability error hang heisenbug wui anti-censorship test-needed Cc:
Launchpad Bug:

Description (last modified by zooko)

Some servers on the public test grid are misbehaving -- I can tell because updating my blog takes about ten minutes sometimes. Having updated my blog a few times this morning, my gateway got into a broken state where it will not even load the recent uploads and downloads page...

Oh, wait, now it works again. Hm, I will attach twistd.log, and here are a few of the interesting messages:

2010-12-08 22:48:44-0800 [HTTPChannel,8,127.0.0.1] Unhandled Error
        Traceback (most recent call last):
        Failure: allmydata.immutable.downloader.common.DownloadStopped: our Consumer called stopProducing()
        
2010-12-10 10:21:11-0800 [-] Unhandled Error
        Traceback (most recent call last):
          File "/Library/Python/2.6/site-packages/allmydata/immutable/upload.py", line 546, in _got_response
            return self._loop()
          File "/Library/Python/2.6/site-packages/allmydata/immutable/upload.py", line 435, in _loop
            self.contacted_peers2)
          File "/Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py", line 320, in addBoth
            callbackKeywords=kw, errbackKeywords=kw)
          File "/Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py", line 286, in addCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "/Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py", line 542, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/Library/Python/2.6/site-packages/allmydata/immutable/upload.py", line 546, in _got_response
            return self._loop()
          File "/Library/Python/2.6/site-packages/allmydata/immutable/upload.py", line 458, in _loop
            return self._failed(msg)
          File "/Library/Python/2.6/site-packages/allmydata/immutable/upload.py", line 561, in _failed
            raise UploadUnhappinessError(msg)
        allmydata.interfaces.UploadUnhappinessError: peer selection failed for <Tahoe2PeerSelector for upload kcaaq>: shares could be placed on only 4 server(s) such that any 3 of them have enough share
s to recover the file, but we were asked to place shares on at least 5 such servers. (placed 9 shares out of 10 total (1 homeless), want to place shares on at least 5 servers such that any 3 of them hav
e enough shares to recover the file, sent 13 queries to 11 peers, 2 queries placed some shares, 11 placed none (of which 4 placed none due to the server being full and 7 placed none due to an error)) (l
ast failure (from <PeerTracker for peer nszizgf5 and SI kcaaq>) was: [Failure instance: Traceback: <class 'foolscap.ipb.DeadReferenceError'>: Calling Stale Broker
        /Library/Python/2.6/site-packages/allmydata/immutable/upload.py:546:_got_response
        /Library/Python/2.6/site-packages/allmydata/immutable/upload.py:433:_loop
        /Library/Python/2.6/site-packages/allmydata/immutable/upload.py:119:query
        /Library/Python/2.6/site-packages/foolscap-0.5.1_-py2.6.egg/foolscap/referenceable.py:408:callRemote
        --- <exception caught here> ---
        /Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py:133:maybeDeferred
        /Library/Python/2.6/site-packages/foolscap-0.5.1_-py2.6.egg/foolscap/referenceable.py:447:_callRemote
        /Library/Python/2.6/site-packages/foolscap-0.5.1_-py2.6.egg/foolscap/broker.py:468:newRequestID
        ])
        
2010-12-10 10:21:12-0800 [-] Exception rendering error page:
2010-12-10 10:21:12-0800 [-] Unhandled Error
        Traceback (most recent call last):
          File "/Library/Python/2.6/site-packages/foolscap-0.5.1_-py2.6.egg/foolscap/call.py", line 95, in fail
            self.deferred.errback(why)
          File "/Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py", line 388, in errback
            self._startRunCallbacks(fail)
          File "/Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py", line 455, in _startRunCallbacks
            self._runCallbacks()
          File "/Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py", line 542, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
        --- <exception caught here> ---
          File "/Library/Python/2.6/site-packages/Nevow-0.10.0-py2.6.egg/nevow/appserver.py", line 82, in processingFailed
            handler.renderHTTP_exception(ctx, reason)
          File "/Library/Python/2.6/site-packages/allmydata/web/common.py", line 278, in renderHTTP_exception
            return super.renderHTTP_exception(self, ctx, f)
          File "/Library/Python/2.6/site-packages/Nevow-0.10.0-py2.6.egg/nevow/appserver.py", line 61, in renderHTTP_exception
            request.finishRequest( False )
          File "/Library/Python/2.6/site-packages/Nevow-0.10.0-py2.6.egg/nevow/appserver.py", line 176, in finishRequest
            server.Request.finish(self)
          File "/Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/web/http.py", line 900, in finish
            "Request.finish called on a request after its connection was lost; "
        exceptions.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this.
        
2010-12-10 10:21:12-0800 [-] 'Original exception:'
2010-12-10 11:26:14-0800 [-] Unhandled Error
        Traceback (most recent call last):
          File "/Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py", line 286, in addCallbacks
            self._runCallbacks()
          File "/Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py", line 542, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/Library/Python/2.6/site-packages/allmydata/immutable/upload.py", line 694, in <lambda>
            self._read_encrypted(length, ciphertext, hash_only, d2))
          File "/Library/Python/2.6/site-packages/allmydata/immutable/upload.py", line 709, in _read_encrypted
            d = defer.maybeDeferred(self.original.read, size)
        --- <exception caught here> ---
          File "/Library/Python/2.6/site-packages/Twisted-10.2.0_r30364-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py", line 133, in maybeDeferred
            result = f(*args, **kw)
          File "/Library/Python/2.6/site-packages/allmydata/immutable/upload.py", line 1362, in read
            return defer.succeed([self._filehandle.read(length)])
        exceptions.ValueError: I/O operation on closed file
        

There is one flog incident file. I'll attach it.

Attachments (2)

twistd.log (507.1 KB) - added by zooko at 2010-12-10T20:49:55Z.
27Z-cgqrfqa.flog.bz2 (43.3 KB) - added by zooko at 2010-12-10T20:51:02Z.

Download all attachments as: .zip

Change History (13)

Changed at 2010-12-10T20:49:55Z by zooko

Changed at 2010-12-10T20:51:02Z by zooko

comment:1 Changed at 2010-12-16T01:11:02Z by davidsarah

  • Keywords anti-censorship added

comment:2 Changed at 2011-02-21T06:00:09Z by davidsarah

  • Keywords test-needed added
  • Milestone changed from undecided to 1.9.0

Possibly fixed by #1366. However, to close this ticket we should have a reproducible test.

Version 0, edited at 2011-02-21T06:00:09Z by davidsarah (next)

comment:3 Changed at 2011-02-21T06:14:31Z by warner

Hm, the #1366 patch would probably address the second weird message. But I don't think it could have been causing the web server to be unavailable forever. So yeah, let's leave this ticket open a bit longer until we understand what was going on.

comment:4 Changed at 2011-10-13T17:05:50Z by warner

  • Milestone changed from 1.9.0 to 1.10.0

not making it into 1.9

comment:5 Changed at 2012-01-22T00:17:27Z by nejucomo

I've seen similar behavior with respect to the second exception, but suspect my symptoms are distinct from #1366, so I filed #1664.

I have not seen the UploadUnhappinessError. I am using an LAE account where there N = K = 1.

comment:6 Changed at 2014-11-28T15:50:09Z by zooko

  • Description modified (diff)

I have a consulting client (codename "WAG") who is reporting an issue with very similar symptoms. They say that they have hundreds of thousands of people inconvenienced by this failure, so it is urgent for them to figure out what happened.

comment:7 follow-ups: Changed at 2014-12-01T17:38:36Z by daira

Zooko, what's the evidence that this is the same issue? In my experience, two problem reports are only likely to be the same issue if they result in similar error reporting/tracebacks (in this case "exceptions.ValueError: I/O operation on closed file", for example).

Note that #1366 already fixed the "exceptions.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this." issue.

Last edited at 2014-12-01T17:40:16Z by daira (previous) (diff)

comment:8 in reply to: ↑ 7 ; follow-up: Changed at 2014-12-03T21:10:05Z by zooko

Replying to daira:

Zooko, what's the evidence that this is the same issue?

First of all the observed behavior of the gateway is similar — goes slower and slower and then "locks ups" entirely, refusing to display even its own Welcome Page. But, like you say, actual stack traces are a more reliable indicator of a common cause, and we have them…

WAG got these messages, in an incident report file named incident-2014-11-18--08-59-32Z-nchatry.flog.bz2:

06:20:45.186 [19723]: UNUSUAL error during GET FAILURE:
[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 575, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/data/allmydata-tahoe-1.10.0/src/allmydata/mutable/retrieve.py", line 222, in <lambda>
    self._pause_deferred.addCallback(lambda ignored: d.callback(res))
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 380, in callback
    self._startRunCallbacks(result)
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 488, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 575, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/data/allmydata-tahoe-1.10.0/src/allmydata/mutable/retrieve.py", line 228, in _check_for_stopped
    raise DownloadStopped("our Consumer called stopProducing()")
allmydata.interfaces.DownloadStopped: our Consumer called stopProducing()
]

and:

08:59:33.278 [28364487]: Unhandled Error Traceback (most recent call last): Failure: exceptions.RuntimeError: Producer was not unregistered for /file/URI:MDMF:CENSORED:CENSORED FAILURE:
[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
Failure: exceptions.RuntimeError: Producer was not unregistered for /file/URI:MDMF:CENSORED:CENSORED
]

And these messages in an incident report file named incident-2014-10-16--14-09-34Z-twxj47a.flog.bz2:

2014-10-16_14:09:15.274918Z [11791490]: {'from-twisted':True, 'incarnation':('\xd4\xf0\xeb\x82\x9b\xa3\x05$', None), 'level':20, 'message':'Unhandled Error\nTraceback (most recent call last):\n File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-...0.0/src/allmydata/web/filenode.py", line 319, in replace_my_contents\n req.content.seek(0)\nexceptions.ValueError: I/O operation on closed file\n', 'system':'-', 'why':None, ...} FAILURE:
[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 380, in callback
    self._startRunCallbacks(result)
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 488, in _startRunCallbacks
    self._runCallbacks()
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 575, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Nevow-0.11.1-py2.6.egg/nevow/appserver.py", line 163, in gotPageContext
    pageContext.tag.renderHTTP, pageContext
--- <exception caught here> ---
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 137, in maybeDeferred
    result = f(*args, **kw)
  File "/data/allmydata-tahoe-1.10.0/src/allmydata/web/common.py", line 352, in renderHTTP
    return m(ctx)
  File "/data/allmydata-tahoe-1.10.0/src/allmydata/web/filenode.py", line 245, in render_PUT
    return self.replace_my_contents(req)
  File "/data/allmydata-tahoe-1.10.0/src/allmydata/web/filenode.py", line 319, in replace_my_contents
    req.content.seek(0)
exceptions.ValueError: I/O operation on closed file
]

and:

2014-10-16_14:09:15.281716Z [11791492]: {'from-twisted':True, 'incarnation':('\xd4\xf0\xeb\x82\x9b\xa3\x05$', None), 'level':20, 'message':'Unhandled Error\nTraceback (most recent call last):\n File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Nevow-0.11.1-py2.6.eg...ost; "\nexceptions.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this.\n', 'system':'-', 'why':None, ...} FAILURE:
[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Nevow-0.11.1-py2.6.egg/nevow/appserver.py", line 167, in gotPageContext
    processingFailed, self, pageContext
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 315, in addErrback
    errbackKeywords=kw)
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 293, in addCallbacks
    self._runCallbacks()
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86

-- TRACEBACK ELIDED --

-packages/Nevow-0.11.1-py2.6.egg/nevow/appserver.py", line 82, in processingFailed
    handler.renderHTTP_exception(ctx, reason)
  File "/data/allmydata-tahoe-1.10.0/src/allmydata/web/common.py", line 329, in renderHTTP_exception
    return super.renderHTTP_exception(self, ctx, f)
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Nevow-0.11.1-py2.6.egg/nevow/appserver.py", line 61, in renderHTTP_exception
    request.finishRequest( False )
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Nevow-0.11.1-py2.6.egg/nevow/appserver.py", line 176, in finishRequest
    server.Request.finish(self)
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/web/server.py", line 231, in finish
    return http.Request.finish(self)
  File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/web/http.py", line 885, in finish
    "Request.finish called on a request after its connection was lost; "
exceptions.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this.
]

comment:9 in reply to: ↑ 7 Changed at 2014-12-03T21:10:40Z by zooko

Replying to daira:

Note that #1366 already fixed the "exceptions.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this." issue.

Hm… could that issue have been causing a file descriptor leak?

comment:10 Changed at 2014-12-03T21:23:25Z by zooko

#1664 was similar to #1366, but is still open, because for some reason nejucomo suspected that they were actually different issues, so that the fix to #1366 wouldn't fix #1664. Could #1664 be affecting WAG and could it cause a file descriptor leak? (I doubt it.)

comment:11 in reply to: ↑ 8 Changed at 2014-12-04T15:04:33Z by daira

Replying to zooko:

Replying to daira:

Zooko, what's the evidence that this is the same issue?

First of all the observed behavior of the gateway is similar — goes slower and slower and then "locks ups" entirely, refusing to display even its own Welcome Page. But, like you say, actual stack traces are a more reliable indicator of a common cause, and we have them…

OK, I'm convinced.

Note: See TracTickets for help on using tickets.