[tahoe-lafs-trac-stream] [tahoe-lafs] #1768: intermittent test_system failure
tahoe-lafs
trac at tahoe-lafs.org
Tue Jun 25 16:20:08 UTC 2013
#1768: intermittent test_system failure
------------------------+-----------------------
Reporter: warner | Owner:
Type: defect | Status: new
Priority: normal | Milestone: undecided
Component: code | Version: 1.9.1
Resolution: | Keywords: test
Launchpad Bug: |
------------------------+-----------------------
Old description:
> The windows buildslave reported a failure in test_system today, in
> [changeset:b2dcbbb6], which went away upon a rebuild.
>
> https://tahoe-lafs.org/buildbot-tahoe-
> lafs/builders/FreeStorm%20WinXP-x86%20py2.6/builds/103
>
> Relevant console output:
> {{{
> allmydata.test.test_system.SystemTest.test_mutable ... [OK]
> (4.703 secs)
> allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
> ... [<twisted.python.failure.Failure <class
> 'foolscap.ipb.DeadReferenceError'>>]
> Traceback (most recent call last):
> File "c:\python26\lib\site-packages\twisted\internet\defer.py", line
> 134, in maybeDeferred
> result = f(*args, **kw)
> File "c:\buildbot_tahoe\freestorm_winxp-
> x86_py2_6\build\src\allmydata\util\pollmixin.py", line 49, in _poll
> self.fail("Errors snooped, terminating early")
> twisted.trial.unittest.FailTest: Errors snooped, terminating early
> [FAILURE]Traceback (most recent call last):
> Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was
> unclean.
> DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to
> debug)
> <DelayedCall 0x494acb0 [0.875s] called=0 cancelled=0
> Reconnector._timer_expired()>
> [ERROR]Traceback (most recent call last):
> Failure: foolscap.ipb.DeadReferenceError: [7xfh] replaced by a new
> connection
> [ERROR]Traceback (most recent call last):
> Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
> tubid=ncie) (during method=RIUploadHelper.tahoe.allmydata.com:upload)
> [ERROR]Traceback (most recent call last):
> Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
> tubid=ncie) (during method=RIHelper.tahoe.allmydata.com:get_version)
> [ERROR]Traceback (most recent call last):
> Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
> tubid=ncie) (during
> method=RIStorageServer.tahoe.allmydata.com:get_version)
> [ERROR]Traceback (most recent call last):
> Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
> tubid=ncie) (during
> method=RIStorageServer.tahoe.allmydata.com:get_version)
> [ERROR]Traceback (most recent call last):
> Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
> tubid=ncie) (during
> method=RIStorageServer.tahoe.allmydata.com:get_version)
> [ERROR]
> (6.500 secs)
> allmydata.test.test_system.SystemTest.test_upload_and_download_random_key
> ... [OK]
> (10.938 secs)
>
> ...
>
> ===============================================================================
> [FAIL]
> Traceback (most recent call last):
> File "c:\python26\lib\site-packages\twisted\internet\defer.py", line
> 134, in maybeDeferred
> result = f(*args, **kw)
> File "c:\buildbot_tahoe\freestorm_winxp-
> x86_py2_6\build\src\allmydata\util\pollmixin.py", line 49, in _poll
> self.fail("Errors snooped, terminating early")
> twisted.trial.unittest.FailTest: Errors snooped, terminating early
>
> allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
> ===============================================================================
> [ERROR]
> Traceback (most recent call last):
> Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was
> unclean.
> DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to
> debug)
> <DelayedCall 0x494acb0 [0.875s] called=0 cancelled=0
> Reconnector._timer_expired()>
>
> allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
> ===============================================================================
> [ERROR]
> Traceback (most recent call last):
> Failure: foolscap.ipb.DeadReferenceError: [7xfh] replaced by a new
> connection
>
> allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
> ===============================================================================
> [ERROR]
> Traceback (most recent call last):
> Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
> tubid=ncie) (during method=RIUploadHelper.tahoe.allmydata.com:upload)
>
> allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
> ===============================================================================
> [ERROR]
> Traceback (most recent call last):
> Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
> tubid=ncie) (during method=RIHelper.tahoe.allmydata.com:get_version)
>
> allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
> ===============================================================================
> [ERROR]
> Traceback (most recent call last):
> Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
> tubid=ncie) (during
> method=RIStorageServer.tahoe.allmydata.com:get_version)
>
> allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
> allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
> allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
> -------------------------------------------------------------------------------
> }}}
>
> and corresponding test.log contents:
>
> {{{
> 2012-06-14 19:38:27.801Z [-] (UDP Port 1279 Closed)
> 2012-06-14 19:38:27.801Z [-] Stopping protocol
> <twisted.internet.protocol.DatagramProtocol instance at 0x035D34E0>
> 2012-06-14 19:38:28.410Z [-] Unhandled Error
> Traceback (most recent call last):
> Failure: foolscap.ipb.DeadReferenceError: [7xfh] replaced by a
> new connection
>
> 2012-06-14 19:38:28.441Z [-] Unhandled error in Deferred:
> 2012-06-14 19:38:28.441Z [-] Unhandled Error
> Traceback (most recent call last):
> Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
> tubid=ncie) (during method=RIUploadHelper.tahoe.allmydata.com:upload)
>
> 2012-06-14 19:38:28.504Z [-] Unhandled Error
> Traceback (most recent call last):
> Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
> tubid=ncie) (during method=RIHelper.tahoe.allmydata.com:get_version)
>
> 2012-06-14 19:38:28.504Z [-] Unhandled Error
> Traceback (most recent call last):
> Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
> tubid=ncie) (during
> method=RIStorageServer.tahoe.allmydata.com:get_version)
>
> 2012-06-14 19:38:28.504Z [-] Unhandled Error
> Traceback (most recent call last):
> Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
> tubid=ncie) (during
> method=RIStorageServer.tahoe.allmydata.com:get_version)
>
> 2012-06-14 19:38:28.519Z [-] Unhandled Error
> Traceback (most recent call last):
> Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
> tubid=ncie) (during
> method=RIStorageServer.tahoe.allmydata.com:get_version)
>
> 2012-06-14 19:38:28.519Z [-] (TCP Port 1274 Closed)
> 2012-06-14 19:38:28.519Z [-] Stopping factory <nevow.appserver.NevowSite
> instance at 0x03DAB8F0>
> 2012-06-14 19:38:28.519Z [-] (TCP Port 1190 Closed)
> }}}
>
> The "Unhandled Error" (as opposed to the usual "Unhandled Error in
> Deferred") means that something called twisted's {{{log.err()}}}
> with a Failure or Exception, but nothing else (in particular not a
> why= argument). This is a pretty common (perhaps unfortunate)
> pattern in our code: tahoe has roughly 47 such calls, and Foolscap
> has about 29. So we don't really know what caught the exception
> (and therefore what ought to be handling it differently). The
> exception itself was a !DeadReferenceError, caused by a connection
> being replaced while a remote call was outstanding. And most of the
> remote calls were {{{get_version}}}, which we do immediately after
> a getReference or connectTo completes (with one extra
> {{{RIUploadHelper upload()}}} call). The Dirty Reactor Error
> indicates that something failed to shut down a Tub in the error
> path.
>
> I'm not really sure what's going on here. It feels like the client
> is being bounced while it's in the middle of making some
> connections, as if we're not waiting for the client to finish
> starting up before we shut it down again. It also feels like there
> are some {{{log.err()}}} calls that should be replaced with the
> less-serious {{{log.msg()}}}, since they don't really justify
> flunking the tests. (Any log.err will flunk a test, unless the test
> specifically checks for and discards them).
>
> Throwing a log.err at the end of a Deferred chain when you don't
> really care about it anymore is pretty convenient and safe (you
> won't be losing information about unexpected errors). But it's
> still a sort of "XXX this point is never reached" sort of marker.
> Having them get invoked during tests might suggest that control is
> actually reaching those points.
New description:
The windows buildslave reported a failure in test_system today, in
[changeset:b2dcbbb6], which went away upon a rebuild.
https://tahoe-lafs.org/buildbot-tahoe-
lafs/builders/FreeStorm%20WinXP-x86%20py2.6/builds/103
Relevant console output:
{{{
allmydata.test.test_system.SystemTest.test_mutable ... [OK]
(4.703 secs)
allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
... [<twisted.python.failure.Failure <class
'foolscap.ipb.DeadReferenceError'>>]
Traceback (most recent call last):
File "c:\python26\lib\site-packages\twisted\internet\defer.py", line
134, in maybeDeferred
result = f(*args, **kw)
File "c:\buildbot_tahoe\freestorm_winxp-
x86_py2_6\build\src\allmydata\util\pollmixin.py", line 49, in _poll
self.fail("Errors snooped, terminating early")
twisted.trial.unittest.FailTest: Errors snooped, terminating early
[FAILURE]Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was
unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to
debug)
<DelayedCall 0x494acb0 [0.875s] called=0 cancelled=0
Reconnector._timer_expired()>
[ERROR]Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: [7xfh] replaced by a new
connection
[ERROR]Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
tubid=ncie) (during method=RIUploadHelper.tahoe.allmydata.com:upload)
[ERROR]Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
tubid=ncie) (during method=RIHelper.tahoe.allmydata.com:get_version)
[ERROR]Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
tubid=ncie) (during
method=RIStorageServer.tahoe.allmydata.com:get_version)
[ERROR]Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
tubid=ncie) (during
method=RIStorageServer.tahoe.allmydata.com:get_version)
[ERROR]Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
tubid=ncie) (during
method=RIStorageServer.tahoe.allmydata.com:get_version)
[ERROR]
(6.500 secs)
allmydata.test.test_system.SystemTest.test_upload_and_download_random_key
... [OK]
(10.938 secs)
...
===============================================================================
[FAIL]
Traceback (most recent call last):
File "c:\python26\lib\site-packages\twisted\internet\defer.py", line
134, in maybeDeferred
result = f(*args, **kw)
File "c:\buildbot_tahoe\freestorm_winxp-
x86_py2_6\build\src\allmydata\util\pollmixin.py", line 49, in _poll
self.fail("Errors snooped, terminating early")
twisted.trial.unittest.FailTest: Errors snooped, terminating early
allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was
unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to
debug)
<DelayedCall 0x494acb0 [0.875s] called=0 cancelled=0
Reconnector._timer_expired()>
allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: [7xfh] replaced by a new
connection
allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
tubid=ncie) (during method=RIUploadHelper.tahoe.allmydata.com:upload)
allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
tubid=ncie) (during method=RIHelper.tahoe.allmydata.com:get_version)
allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
tubid=ncie) (during
method=RIStorageServer.tahoe.allmydata.com:get_version)
allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
allmydata.test.test_system.SystemTest.test_upload_and_download_convergent
-------------------------------------------------------------------------------
}}}
and corresponding test.log contents:
{{{
2012-06-14 19:38:27.801Z [-] (UDP Port 1279 Closed)
2012-06-14 19:38:27.801Z [-] Stopping protocol
<twisted.internet.protocol.DatagramProtocol instance at 0x035D34E0>
2012-06-14 19:38:28.410Z [-] Unhandled Error
Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: [7xfh] replaced by a new
connection
2012-06-14 19:38:28.441Z [-] Unhandled error in Deferred:
2012-06-14 19:38:28.441Z [-] Unhandled Error
Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
tubid=ncie) (during method=RIUploadHelper.tahoe.allmydata.com:upload)
2012-06-14 19:38:28.504Z [-] Unhandled Error
Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
tubid=ncie) (during method=RIHelper.tahoe.allmydata.com:get_version)
2012-06-14 19:38:28.504Z [-] Unhandled Error
Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
tubid=ncie) (during
method=RIStorageServer.tahoe.allmydata.com:get_version)
2012-06-14 19:38:28.504Z [-] Unhandled Error
Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
tubid=ncie) (during
method=RIStorageServer.tahoe.allmydata.com:get_version)
2012-06-14 19:38:28.519Z [-] Unhandled Error
Traceback (most recent call last):
Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to
tubid=ncie) (during
method=RIStorageServer.tahoe.allmydata.com:get_version)
2012-06-14 19:38:28.519Z [-] (TCP Port 1274 Closed)
2012-06-14 19:38:28.519Z [-] Stopping factory <nevow.appserver.NevowSite
instance at 0x03DAB8F0>
2012-06-14 19:38:28.519Z [-] (TCP Port 1190 Closed)
}}}
The "Unhandled Error" (as opposed to the usual "Unhandled Error in
Deferred") means that something called twisted's {{{log.err()}}}
with a Failure or Exception, but nothing else (in particular not a
why= argument). This is a pretty common (perhaps unfortunate)
pattern in our code: tahoe has roughly 47 such calls, and Foolscap
has about 29. So we don't really know what caught the exception
(and therefore what ought to be handling it differently). The
exception itself was a !DeadReferenceError, caused by a connection
being replaced while a remote call was outstanding. And most of the
remote calls were {{{get_version}}}, which we do immediately after
a getReference or connectTo completes (with one extra
{{{RIUploadHelper upload()}}} call). The Dirty Reactor Error
indicates that something failed to shut down a Tub in the error
path.
I'm not really sure what's going on here. It feels like the client
is being bounced while it's in the middle of making some
connections, as if we're not waiting for the client to finish
starting up before we shut it down again. It also feels like there
are some {{{log.err()}}} calls that should be replaced with the
less-serious {{{log.msg()}}}, since they don't really justify
flunking the tests. (Any log.err will flunk a test, unless the test
specifically checks for and discards them).
Throwing a log.err at the end of a Deferred chain when you don't
really care about it anymore is pretty convenient and safe (you
won't be losing information about unexpected errors). But it's
still a sort of "XXX this point is never reached" sort of marker.
Having them get invoked during tests might suggest that control is
actually reaching those points.
--
Comment (by zooko):
Hm that stack trace in the final {{{[ERROR]}}} in comment:1 shows that
{{{connectionDropped}}} is getting fired twice:
http://foolscap.lothar.com/trac/browser/foolscap/pb.py?annotate=blame&rev=900c2b4a7aaf9370a76d97a59ebe8943d2dac353#L1070
alternate hosting of same file:
https://github.com/warner/foolscap/blob/900c2b4a7aaf9370a76d97a59ebe8943d2dac353/foolscap/pb.py#l1070
This reminds me of #653, in which we determined that most likely there is
a bug in foolscap which causes {{{notifyOnDisconnect}}} to ''not'' get
called sometimes when it should get called. Putting these two together, it
makes me think that foolscap can currently err on both sides — double-
invoking a "respond to disconnect" event sometimes (but we've seen this
only on Windows) and zero-invoking it other times.
As I've mentioned before, I really think "responding to disconnect events"
is a losing game. To me it smells deal like "clean shutdown logic" in an
app. It tends to be buggy, it is labor-intensive to implement and debug
it, and it can never be 100% correct (because shutdowns of an app are
sometimes hard and because disconnections of a network are sometimes
undetectable).
So my preference for "crash-only programming" in which you don't expend
engineering effort trying to design and implement for handling the "clean
shutdown" case is perhaps related to my preference for "crash-only
networking", in which you assume that your application won't get a
reliable notification of disconnect, and you don't expend engineering
effort trying to deliver one.
So I would be somewhat more interested in removing "respond to disconnect
events" features from foolscap and from Tahoe-LAFS (see #816, #1975) than
in debugging this. However, I'm definitely not very happy with the current
situation, where the unit tests sometimes spuriously (??) fail on Windows
due to this probable bug in foolscap.
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1768#comment:3>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage
More information about the tahoe-lafs-trac-stream
mailing list