Opened at 2013-07-06T15:16:55Z
Last modified at 2020-01-13T20:36:24Z
#2017 closed defect
non-deterministic test hang on OpenBSD — at Version 10
Reported by: | zooko | Owned by: | sickness |
---|---|---|---|
Priority: | normal | Milestone: | soon |
Component: | code | Version: | 1.10.0 |
Keywords: | iputil heisenbug openbsd test hang | Cc: | |
Launchpad Bug: |
Description (last modified by daira)
sickness's OpenBSD buildslave showed a test timeout:
=============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs allmydata.test.test_runner.RunNode.test_client_no_noise =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean. DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug) <DelayedCall 0x816eb82c [0.00169348716736s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0x7ff29ed4>, 1373030506.664452), **{})()> allmydata.test.test_runner.RunNode.test_client_no_noise ------------------------------------------------------------------------------- Ran 1139 tests in 1784.336s FAILED (skips=15, expectedFailures=3, errors=2, successes=1120)
Rerunning the tests with the exact same build (using Buildbot's "force rebuild" feature) resulted in success:
In that run (build number 28), those tests took only a few seconds:
19.917 seconds: allmydata.test.test_runner.RunNode.test_client
13.758 seconds: allmydata.test.test_runner.RunNode.test_client_no_noise
So there is a non-deterministic bug that exhibits on sickness's buildslave which causes those two tests to hang.
Change History (10)
comment:1 Changed at 2013-07-06T15:17:35Z by zooko
- Description modified (diff)
comment:2 in reply to: ↑ description Changed at 2013-07-07T08:45:11Z by sickness
did run the tests as requested (rsyncing the build subdir of the buildbot in /tmp/someotherdir because I don't know how to properly checkout trunk) and here's the results:
[...] =============================================================================== [TODO] Reason: 'Fix this after 1.7.1 release.' Traceback (most recent call last): File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response return self._loop() File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop return self._failed("%s (%s)" % (failmsg, self._get_progress_message())) File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed raise UploadUnhappinessError(msg) allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 3 server(s) such that any 2 of them have enough shares to recover the file, but we were asked to place shares on at least 4 such servers. (placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them hav e enough shares to recover the file, sent 4 queries to 4 servers, 4 queries placed some shares, 0 plac ed none (of which 0 placed none due to the server being full and 0 placed none due to an error)) =============================================================================== [SKIPPED] A non-ASCII argument/output could not be encoded on this platform. allmydata.test.test_runner.BinTahoe.test_unicode_arguments_and_output =============================================================================== [TODO] Reason: "this isn't fixed yet" Traceback (most recent call last): File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response return self._loop() File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop return self._failed("%s (%s)" % (failmsg, self._get_progress_message())) File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed raise UploadUnhappinessError(msg) allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 2 server(s) such that any 2 of them have enough shares to recover the file, but we were asked to place shares on at least 3 such servers. (placed all 3 shares, want to place shares on at least 3 servers such that any 2 of them hav e enough shares to recover the file, sent 3 queries to 3 servers, 1 queries placed some shares, 2 plac ed none (of which 2 placed none due to the server being full and 0 placed none due to an error)) allmydata.test.test_upload.EncodingParameters.test_problem_layout_comment_187 allmydata.test.test_upload.EncodingParameters.test_problem_layout_ticket_1124 =============================================================================== [TODO] Reason: 'Invent a smarter uploader that uploads successfully in this case.' Traceback (most recent call last): File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response return self._loop() File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop return self._failed("%s (%s)" % (failmsg, self._get_progress_message())) File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed raise UploadUnhappinessError(msg) allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 3 server(s) such that any 2 of them have enough shares to recover the file, but we were asked to place shares on at least 4 such servers. (placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them have enough shares to recover the file, sent 4 queries to 4 servers, 3 queries placed some shares, 1 placed none (of which 1 placed none due to the server being full and 0 placed none due to an error)) allmydata.test.test_upload.EncodingParameters.test_problem_layout_ticket_1128 =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs allmydata.test.test_runner.RunNode.test_client_no_noise =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean. DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug) <DelayedCall 0xb89705ec [0.00525784492493s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0xb84a09cc>, 1373110342.373227), **{})()> allmydata.test.test_runner.RunNode.test_client_no_noise ------------------------------------------------------------------------------- Ran 1139 tests in 2020.513s FAILED (skips=15, expectedFailures=3, errors=2, successes=1120
comment:3 Changed at 2013-07-07T15:06:45Z by sickness
so then I've run:
git checkout d85a75d7f689cb55ecddb319dc2057f38e4db87a python setup.py build && ./bin/tahoe --version ./bin/tahoe debug trial --until-failure allmydata.test.test_runner.RunNode.test_client_no_noise and it failed way earlier than the previous branch with this: =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs allmydata.test.test_runner.RunNode.test_client_no_noise =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean. DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug) <DelayedCall 0x873c5e2c [0.000361204147339s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0x873c648c>, 1373184489.75621), **{})()> allmydata.test.test_runner.RunNode.test_client_no_noise ------------------------------------------------------------------------------- Ran 1 tests in 240.027s
comment:4 Changed at 2013-07-07T15:31:11Z by daira
- Keywords test hang added
comment:5 Changed at 2013-07-07T15:34:34Z by daira
If it were the iputil patches, why does it only affect test_client_no_noise? Many other tests depend on the iputil address-finding code.
comment:6 Changed at 2013-07-07T19:58:06Z by daira
Actually sickness' result in comment:3 proves that it wasn't the iputil patches. (d85a75d7f689cb55ecddb319dc2057f38e4db87a/trunk was before those patches.)
comment:7 Changed at 2013-07-09T15:44:44Z by zooko
- Keywords regression added
- Milestone changed from undecided to 1.11.0
comment:9 Changed at 2013-08-28T15:43:53Z by daira
- Description modified (diff)
comment:10 Changed at 2013-08-28T15:51:57Z by daira
- Description modified (diff)
Questions:
I suspect those iputil patches of causing this hang.
sickness: could you please run the unit tests from the current trunk version repeatedly with trial's --until-failure option? ./bin/tahoe debug trial --until-failure allmydata.test (See HowToWriteTests for more options.) If you can reliably reproduce the problem, then would you use git to rewind to before those patches and see if that makes the problem go away? Thanks!