Opened at 2013-07-06T15:16:55Z
Closed at 2020-01-13T20:36:24Z
#2017 closed defect (wontfix)
non-deterministic test hang on OpenBSD
Reported by: | zooko | Owned by: | sickness |
---|---|---|---|
Priority: | normal | Milestone: | soon |
Component: | code | Version: | 1.10.0 |
Keywords: | iputil heisenbug openbsd test hang | Cc: | |
Launchpad Bug: |
Description (last modified by daira)
sickness's OpenBSD buildslave showed a test timeout:
=============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs allmydata.test.test_runner.RunNode.test_client_no_noise =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean. DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug) <DelayedCall 0x816eb82c [0.00169348716736s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0x7ff29ed4>, 1373030506.664452), **{})()> allmydata.test.test_runner.RunNode.test_client_no_noise ------------------------------------------------------------------------------- Ran 1139 tests in 1784.336s FAILED (skips=15, expectedFailures=3, errors=2, successes=1120)
Rerunning the tests with the exact same build (using Buildbot's "force rebuild" feature) resulted in success:
In that run (build number 28), those tests took only a few seconds:
19.917 seconds: allmydata.test.test_runner.RunNode.test_client
13.758 seconds: allmydata.test.test_runner.RunNode.test_client_no_noise
So there is a non-deterministic bug that exhibits on sickness's buildslave which causes those two tests to hang.
Change History (15)
comment:1 Changed at 2013-07-06T15:17:35Z by zooko
- Description modified (diff)
comment:2 in reply to: ↑ description Changed at 2013-07-07T08:45:11Z by sickness
did run the tests as requested (rsyncing the build subdir of the buildbot in /tmp/someotherdir because I don't know how to properly checkout trunk) and here's the results:
[...] =============================================================================== [TODO] Reason: 'Fix this after 1.7.1 release.' Traceback (most recent call last): File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response return self._loop() File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop return self._failed("%s (%s)" % (failmsg, self._get_progress_message())) File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed raise UploadUnhappinessError(msg) allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 3 server(s) such that any 2 of them have enough shares to recover the file, but we were asked to place shares on at least 4 such servers. (placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them hav e enough shares to recover the file, sent 4 queries to 4 servers, 4 queries placed some shares, 0 plac ed none (of which 0 placed none due to the server being full and 0 placed none due to an error)) =============================================================================== [SKIPPED] A non-ASCII argument/output could not be encoded on this platform. allmydata.test.test_runner.BinTahoe.test_unicode_arguments_and_output =============================================================================== [TODO] Reason: "this isn't fixed yet" Traceback (most recent call last): File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response return self._loop() File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop return self._failed("%s (%s)" % (failmsg, self._get_progress_message())) File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed raise UploadUnhappinessError(msg) allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 2 server(s) such that any 2 of them have enough shares to recover the file, but we were asked to place shares on at least 3 such servers. (placed all 3 shares, want to place shares on at least 3 servers such that any 2 of them hav e enough shares to recover the file, sent 3 queries to 3 servers, 1 queries placed some shares, 2 plac ed none (of which 2 placed none due to the server being full and 0 placed none due to an error)) allmydata.test.test_upload.EncodingParameters.test_problem_layout_comment_187 allmydata.test.test_upload.EncodingParameters.test_problem_layout_ticket_1124 =============================================================================== [TODO] Reason: 'Invent a smarter uploader that uploads successfully in this case.' Traceback (most recent call last): File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response return self._loop() File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop return self._failed("%s (%s)" % (failmsg, self._get_progress_message())) File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed raise UploadUnhappinessError(msg) allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 3 server(s) such that any 2 of them have enough shares to recover the file, but we were asked to place shares on at least 4 such servers. (placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them have enough shares to recover the file, sent 4 queries to 4 servers, 3 queries placed some shares, 1 placed none (of which 1 placed none due to the server being full and 0 placed none due to an error)) allmydata.test.test_upload.EncodingParameters.test_problem_layout_ticket_1128 =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs allmydata.test.test_runner.RunNode.test_client_no_noise =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean. DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug) <DelayedCall 0xb89705ec [0.00525784492493s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0xb84a09cc>, 1373110342.373227), **{})()> allmydata.test.test_runner.RunNode.test_client_no_noise ------------------------------------------------------------------------------- Ran 1139 tests in 2020.513s FAILED (skips=15, expectedFailures=3, errors=2, successes=1120
comment:3 Changed at 2013-07-07T15:06:45Z by sickness
so then I've run:
git checkout d85a75d7f689cb55ecddb319dc2057f38e4db87a python setup.py build && ./bin/tahoe --version ./bin/tahoe debug trial --until-failure allmydata.test.test_runner.RunNode.test_client_no_noise and it failed way earlier than the previous branch with this: =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs allmydata.test.test_runner.RunNode.test_client_no_noise =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean. DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug) <DelayedCall 0x873c5e2c [0.000361204147339s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0x873c648c>, 1373184489.75621), **{})()> allmydata.test.test_runner.RunNode.test_client_no_noise ------------------------------------------------------------------------------- Ran 1 tests in 240.027s
comment:4 Changed at 2013-07-07T15:31:11Z by daira
- Keywords test hang added
comment:5 Changed at 2013-07-07T15:34:34Z by daira
If it were the iputil patches, why does it only affect test_client_no_noise? Many other tests depend on the iputil address-finding code.
comment:6 Changed at 2013-07-07T19:58:06Z by daira
Actually sickness' result in comment:3 proves that it wasn't the iputil patches. (d85a75d7f689cb55ecddb319dc2057f38e4db87a/trunk was before those patches.)
comment:7 Changed at 2013-07-09T15:44:44Z by zooko
- Keywords regression added
- Milestone changed from undecided to 1.11.0
comment:9 Changed at 2013-08-28T15:43:53Z by daira
- Description modified (diff)
comment:10 Changed at 2013-08-28T15:51:57Z by daira
- Description modified (diff)
comment:11 Changed at 2013-08-28T15:53:38Z by zooko
sickness: we don't know how to proceed with this. It seems like there might be an old bug, since, per daira's comment:6, d85a75d7f689cb55ecddb319dc2057f38e4db87a/trunk was before the recent iputil patches. So, could you do some investigation of what version(s) of tahoe-lafs exhibit this failure on your !OpenBSD system and report back? You might just start with the Tahoe-LAFS v1.10 release [f9af0633d8da426cbcaed3ff05ab6d7128148bb0] and see if your system could stably run this test on that version. Thanks!
comment:12 Changed at 2013-08-29T07:16:37Z by sickness
# ./bin/tahoe --version allmydata-tahoe: 1.10.0 foolscap: 0.6.4 pycryptopp: 0.6.0.1206569328141510525648634803928199668821045408958 zfec: 1.4.24 Twisted: 13.0.0 Nevow: 0.10.0 zope.interface: unknown python: 2.7.3 platform: OpenBSD-5.2-i386-32bit-ELF pyOpenSSL: 0.13 simplejson: 3.3.0 pycrypto: 2.6 pyasn1: 0.1.7 mock: 1.0.1 setuptools: 0.6c16dev4 # ./bin/tahoe debug trial --until-failure allmydata.test.test_runner.RunNode.test_client_no_noise Test Pass 1 allmydata.test.test_runner RunNode test_client_no_noise ... [OK] ------------------------------------------------------------------------------- Ran 1 tests in 4.987s PASSED (successes=1) Test Pass 2 allmydata.test.test_runner RunNode test_client_no_noise ... [OK] ------------------------------------------------------------------------------- Ran 1 tests in 5.011s PASSED (successes=1) Test Pass 3 allmydata.test.test_runner RunNode test_client_no_noise ... [ERROR] [ERROR] =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs allmydata.test.test_runner.RunNode.test_client_no_noise =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean. DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug) <DelayedCall 0x83fa9dac [0.00449967384338s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0x83fc83e4>, 1377735752.361727), **{})()> allmydata.test.test_runner.RunNode.test_client_no_noise ------------------------------------------------------------------------------- Ran 1 tests in 240.022s
comment:13 Changed at 2013-08-29T08:45:11Z by sickness
I have to add that after the previous test exits, a process seems to still hang: tahoe --quiet start test_runner/RunNode/test_client_no_noise/c1
and it seems to sit there forever until I manually kill it :/
comment:14 Changed at 2013-08-29T11:41:04Z by zooko
Hm. Could you try editing the source code, i.e. the code of trunk/src/allmydata/test/test_runner.py, and add these lines somewhere near the top:
from twisted.internet import base base.DelayedCall.debug = True
And then re-run that experiment?
comment:15 Changed at 2020-01-13T20:36:24Z by exarkun
- Resolution set to wontfix
- Status changed from new to closed
The OpenBSD builder is gone. This could be re-opened and fixed by a dedicated OpenBSD maintainer but won't be otherwise.