#1768 new defect

intermittent test_system failure — at Version 3

Reported by: warner Owned by:
Priority: normal Milestone: undecided
Component: code Version: 1.9.1
Keywords: test windows heisenbug Cc:
Launchpad Bug:

Description (last modified by zooko)

The windows buildslave reported a failure in test_system today, in b2dcbbb6, which went away upon a rebuild.

https://tahoe-lafs.org/buildbot-tahoe-lafs/builders/FreeStorm%20WinXP-x86%20py2.6/builds/103

Relevant console output:

allmydata.test.test_system.SystemTest.test_mutable ... [OK]
(4.703 secs)
allmydata.test.test_system.SystemTest.test_upload_and_download_convergent ... [<twisted.python.failure.Failure <class 'foolscap.ipb.DeadReferenceError'>>]
Traceback (most recent call last):
  File "c:\python26\lib\site-packages\twisted\internet\defer.py", line 134, in maybeDeferred
    result = f(*args, **kw)
  File "c:\buildbot_tahoe\freestorm_winxp-x86_py2_6\build\src\allmydata\util\pollmixin.py", line 49, in _poll
    self.fail("Errors snooped, terminating early")
twisted.trial.unittest.FailTest: Errors snooped, terminating early
[FAILURE]Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x494acb0 [0.875s] called=0 cancelled=0 Reconnector._timer_expired()>
[ERROR]Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: [7xfh] replaced by a new connection
[ERROR]Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=ncie) (during method=RIUploadHelper.tahoe.allmydata.com:upload)
[ERROR]Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=ncie) (during method=RIHelper.tahoe.allmydata.com:get_version)
[ERROR]Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=ncie) (during method=RIStorageServer.tahoe.allmydata.com:get_version)
[ERROR]Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=ncie) (during method=RIStorageServer.tahoe.allmydata.com:get_version)
[ERROR]Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=ncie) (during method=RIStorageServer.tahoe.allmydata.com:get_version)
[ERROR]
(6.500 secs)
allmydata.test.test_system.SystemTest.test_upload_and_download_random_key ... [OK]
(10.938 secs)

...

===============================================================================
[FAIL]
Traceback (most recent call last):
  File "c:\python26\lib\site-packages\twisted\internet\defer.py", line 134, in maybeDeferred
    result = f(*args, **kw)
  File "c:\buildbot_tahoe\freestorm_winxp-x86_py2_6\build\src\allmydata\util\pollmixin.py", line 49, in _poll
    self.fail("Errors snooped, terminating early")
twisted.trial.unittest.FailTest: Errors snooped, terminating early

allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x494acb0 [0.875s] called=0 cancelled=0 Reconnector._timer_expired()>

allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: [7xfh] replaced by a new connection

allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=ncie) (during method=RIUploadHelper.tahoe.allmydata.com:upload)

allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=ncie) (during method=RIHelper.tahoe.allmydata.com:get_version)

allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=ncie) (during method=RIStorageServer.tahoe.allmydata.com:get_version)

allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
-------------------------------------------------------------------------------

and corresponding test.log contents:

2012-06-14 19:38:27.801Z [-] (UDP Port 1279 Closed)
2012-06-14 19:38:27.801Z [-] Stopping protocol <twisted.internet.protocol.DatagramProtocol instance at 0x035D34E0>
2012-06-14 19:38:28.410Z [-] Unhandled Error
	Traceback (most recent call last):
	Failure: foolscap.ipb.DeadReferenceError: [7xfh] replaced by a new connection
	
2012-06-14 19:38:28.441Z [-] Unhandled error in Deferred:
2012-06-14 19:38:28.441Z [-] Unhandled Error
	Traceback (most recent call last):
	Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=ncie) (during method=RIUploadHelper.tahoe.allmydata.com:upload)
	
2012-06-14 19:38:28.504Z [-] Unhandled Error
	Traceback (most recent call last):
	Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=ncie) (during method=RIHelper.tahoe.allmydata.com:get_version)
	
2012-06-14 19:38:28.504Z [-] Unhandled Error
	Traceback (most recent call last):
	Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=ncie) (during method=RIStorageServer.tahoe.allmydata.com:get_version)
	
2012-06-14 19:38:28.504Z [-] Unhandled Error
	Traceback (most recent call last):
	Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=ncie) (during method=RIStorageServer.tahoe.allmydata.com:get_version)
	
2012-06-14 19:38:28.519Z [-] Unhandled Error
	Traceback (most recent call last):
	Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=ncie) (during method=RIStorageServer.tahoe.allmydata.com:get_version)
	
2012-06-14 19:38:28.519Z [-] (TCP Port 1274 Closed)
2012-06-14 19:38:28.519Z [-] Stopping factory <nevow.appserver.NevowSite instance at 0x03DAB8F0>
2012-06-14 19:38:28.519Z [-] (TCP Port 1190 Closed)

The "Unhandled Error" (as opposed to the usual "Unhandled Error in Deferred") means that something called twisted's log.err() with a Failure or Exception, but nothing else (in particular not a why= argument). This is a pretty common (perhaps unfortunate) pattern in our code: tahoe has roughly 47 such calls, and Foolscap has about 29. So we don't really know what caught the exception (and therefore what ought to be handling it differently). The exception itself was a DeadReferenceError, caused by a connection being replaced while a remote call was outstanding. And most of the remote calls were get_version, which we do immediately after a getReference or connectTo completes (with one extra RIUploadHelper upload() call). The Dirty Reactor Error indicates that something failed to shut down a Tub in the error path.

I'm not really sure what's going on here. It feels like the client is being bounced while it's in the middle of making some connections, as if we're not waiting for the client to finish starting up before we shut it down again. It also feels like there are some log.err() calls that should be replaced with the less-serious log.msg(), since they don't really justify flunking the tests. (Any log.err will flunk a test, unless the test specifically checks for and discards them).

Throwing a log.err at the end of a Deferred chain when you don't really care about it anymore is pretty convenient and safe (you won't be losing information about unexpected errors). But it's still a sort of "XXX this point is never reached" sort of marker. Having them get invoked during tests might suggest that control is actually reaching those points.

Change History (3)

comment:1 Changed at 2013-03-19T23:54:21Z by davidsarah

Another probable instance of this error on the same buildslave (build 170):

[FAIL]
Traceback (most recent call last):
  File "c:\python26\lib\site-packages\twisted\internet\defer.py", line 134, in maybeDeferred
    result = f(*args, **kw)
  File "c:\buildbot_tahoe\freestorm_winxp-x86_py2_6\build\src\allmydata\util\pollmixin.py", line 49, in _poll
    self.fail("Errors snooped, terminating early")
twisted.trial.unittest.FailTest: Errors snooped, terminating early

allmydata.test.test_system.SystemTest.test_upload_and_download_random_key
===============================================================================
[ERROR]
Traceback (most recent call last):
  File "c:\python26\lib\site-packages\twisted\internet\defer.py", line 134, in maybeDeferred
    result = f(*args, **kw)
  File "C:\buildbot_tahoe\FreeStorm_WinXP-x86_py2_6\build\support\Lib\site-packages\foolscap-0.6.4-py2.6.egg\foolscap\referenceable.py", line 455, in _callRemote
    reqID = broker.newRequestID()
  File "C:\buildbot_tahoe\FreeStorm_WinXP-x86_py2_6\build\support\Lib\site-packages\foolscap-0.6.4-py2.6.egg\foolscap\broker.py", line 477, in newRequestID
    raise DeadReferenceError("Calling Stale Broker")
foolscap.ipb.DeadReferenceError: Calling Stale Broker

allmydata.test.test_system.SystemTest.test_upload_and_download_random_key
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=ncie) (during method=RIUploadHelper.tahoe.allmydata.com:upload)

allmydata.test.test_system.SystemTest.test_upload_and_download_random_key
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=iwlr) (during method=RIStorageServer.tahoe.allmydata.com:get_version)

allmydata.test.test_system.SystemTest.test_upload_and_download_random_key
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=z3g3) (during method=RIStorageServer.tahoe.allmydata.com:get_version)

allmydata.test.test_system.SystemTest.test_upload_and_download_random_key
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=ncie) (during method=RIStorageServer.tahoe.allmydata.com:get_version)

allmydata.test.test_system.SystemTest.test_upload_and_download_random_key
===============================================================================
[ERROR]
Traceback (most recent call last):
  File "c:\python26\lib\site-packages\twisted\python\log.py", line 84, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "c:\python26\lib\site-packages\twisted\python\log.py", line 69, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "c:\python26\lib\site-packages\twisted\python\context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "c:\python26\lib\site-packages\twisted\python\context.py", line 81, in callWithContext
    return func(*args,**kw)
  File "c:\python26\lib\site-packages\twisted\internet\selectreactor.py", line 155, in _doReadOrWrite
    self._disconnectSelectable(selectable, why, method=="doRead")
  File "c:\python26\lib\site-packages\twisted\internet\posixbase.py", line 263, in _disconnectSelectable
    selectable.connectionLost(f)
  File "c:\python26\lib\site-packages\twisted\internet\tcp.py", line 433, in connectionLost
    Connection.connectionLost(self, reason)
  File "c:\python26\lib\site-packages\twisted\internet\tcp.py", line 277, in connectionLost
    protocol.connectionLost(reason)
  File "c:\python26\lib\site-packages\twisted\protocols\tls.py", line 460, in connectionLost
    ProtocolWrapper.connectionLost(self, reason)
  File "c:\python26\lib\site-packages\twisted\protocols\policies.py", line 123, in connectionLost
    self.wrappedProtocol.connectionLost(reason)
  File "C:\buildbot_tahoe\FreeStorm_WinXP-x86_py2_6\build\support\Lib\site-packages\foolscap-0.6.4-py2.6.egg\foolscap\broker.py", line 253, in connectionLost
    self.tub.connectionDropped(self)
  File "C:\buildbot_tahoe\FreeStorm_WinXP-x86_py2_6\build\support\Lib\site-packages\foolscap-0.6.4-py2.6.egg\foolscap\pb.py", line 1070, in connectionDropped
    self._allConnectionsAreDisconnected.fire(self)
  File "C:\buildbot_tahoe\FreeStorm_WinXP-x86_py2_6\build\support\Lib\site-packages\foolscap-0.6.4-py2.6.egg\foolscap\observer.py", line 42, in fire
    assert not self._fired
exceptions.AssertionError: 

allmydata.test.test_system.SystemTest.test_upload_and_download_random_key
allmydata.test.test_system.SystemTest.test_upload_and_download_random_key
-------------------------------------------------------------------------------
Last edited at 2013-03-19T23:54:34Z by davidsarah (previous) (diff)

comment:2 Changed at 2013-03-20T00:01:40Z by davidsarah

This may be related to or a duplicate of #1757. (Alternatively, if they are distinct bugs, then comment:1 could be an instance of #1757 instead of this bug.)

comment:3 Changed at 2013-06-25T16:20:07Z by zooko

  • Description modified (diff)

Hm that stack trace in the final [ERROR] in comment:1 shows that connectionDropped is getting fired twice:

http://foolscap.lothar.com/trac/browser/foolscap/pb.py?annotate=blame&rev=900c2b4a7aaf9370a76d97a59ebe8943d2dac353#L1070

alternate hosting of same file:

https://github.com/warner/foolscap/blob/900c2b4a7aaf9370a76d97a59ebe8943d2dac353/foolscap/pb.py#l1070

This reminds me of #653, in which we determined that most likely there is a bug in foolscap which causes notifyOnDisconnect to not get called sometimes when it should get called. Putting these two together, it makes me think that foolscap can currently err on both sides — double-invoking a "respond to disconnect" event sometimes (but we've seen this only on Windows) and zero-invoking it other times.

As I've mentioned before, I really think "responding to disconnect events" is a losing game. To me it smells like "clean shutdown logic" in an app. It tends to be buggy, it is labor-intensive to implement and debug it, and it can never be 100% correct (because shutdowns of an app are sometimes hard and because disconnections of a network are sometimes undetectable).

So my preference for "crash-only programming" (in which you don't expend engineering effort trying to design and implement the "clean shutdown" case) is perhaps related to my preference for "crash-only networking", in which you assume that your application won't get a reliable notification of disconnect, and you don't expend engineering effort trying to deliver one.

So I would be somewhat more interested in removing "respond to disconnect events" features from foolscap and from Tahoe-LAFS (see #816, #1975) than in debugging this. However, I'm definitely not very happy with the current situation, where the unit tests sometimes spuriously (??) fail on Windows due to this probable bug in foolscap.

Last edited at 2013-07-03T16:55:17Z by zooko (previous) (diff)
Note: See TracTickets for help on using tickets.