#2017 closed defect (wontfix)

non-deterministic test hang on OpenBSD

Reported by: zooko Owned by: sickness
Priority: normal Milestone: soon
Component: code Version: 1.10.0
Keywords: iputil heisenbug openbsd test hang Cc:
Launchpad Bug:

Description (last modified by daira)

sickness's OpenBSD buildslave showed a test timeout:

===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs

allmydata.test.test_runner.RunNode.test_client_no_noise
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x816eb82c [0.00169348716736s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0x7ff29ed4>, 1373030506.664452), **{})()>

allmydata.test.test_runner.RunNode.test_client_no_noise
-------------------------------------------------------------------------------
Ran 1139 tests in 1784.336s

FAILED (skips=15, expectedFailures=3, errors=2, successes=1120)

(from https://tahoe-lafs.org/buildbot-tahoe-lafs/builders/sickness%20OpenBSD%205.0%20x86%20py2.7/builds/27)

Rerunning the tests with the exact same build (using Buildbot's "force rebuild" feature) resulted in success:

https://tahoe-lafs.org/buildbot-tahoe-lafs/builders/sickness%20OpenBSD%205.0%20x86%20py2.7/builds/28

In that run (build number 28), those tests took only a few seconds:

 19.917 seconds: allmydata.test.test_runner.RunNode.test_client
 13.758 seconds: allmydata.test.test_runner.RunNode.test_client_no_noise

(from https://tahoe-lafs.org/buildbot-tahoe-lafs/builders/sickness%20OpenBSD%205.0%20x86%20py2.7/builds/28/steps/test/logs/timings)

So there is a non-deterministic bug that exhibits on sickness's buildslave which causes those two tests to hang.

Change History (15)

comment:1 Changed at 2013-07-06T15:17:35Z by zooko

  • Description modified (diff)

Questions:

  1. Does this happen on any other buildslaves?
  1. Did this ever happen before the recent patches which changed the behavior of iputil — [b0883807361830c609dff1677c3cb34fd64d3ebb], [f97b8e5e1df75284aa9b89dd830f8728040eab67], [08590b1f6a880d51751fdcacea6a007ebc568f2e], [16b245563db2f6ca71b9332b06debbe3e1d734b4], [b31a4f6e870cb56efa40c785a868a944b964e8b9], [a493ee0bb641175ecf918e28fce4d25df15994b6], [6104950ed8a7a356eed2218f2df958d074022eea], [f77ec470d75f4b8fb81b1abca4ee3b73f1ad8b22], [8e31d66cd0b0821ccaa2c7c259e7d6f262ad4738], [6a445d73bc5253ec4ae0dec70af02e33bc869cf6]?

I suspect those iputil patches of causing this hang.

sickness: could you please run the unit tests from the current trunk version repeatedly with trial's --until-failure option? ./bin/tahoe debug trial --until-failure allmydata.test (See HowToWriteTests for more options.) If you can reliably reproduce the problem, then would you use git to rewind to before those patches and see if that makes the problem go away? Thanks!

Last edited at 2013-08-28T15:52:23Z by daira (previous) (diff)

comment:2 in reply to: ↑ description Changed at 2013-07-07T08:45:11Z by sickness

did run the tests as requested (rsyncing the build subdir of the buildbot in /tmp/someotherdir because I don't know how to properly checkout trunk) and here's the results:

[...]

===============================================================================
[TODO]   
Reason: 'Fix this after 1.7.1 release.'
Traceback (most recent call last):
  File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response
    return self._loop()
  File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop
    return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
  File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 3 server(s) such that any 
2 of them have enough shares to recover the file, but we were asked to place shares on at least 4 such
 servers. (placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them hav
e enough shares to recover the file, sent 4 queries to 4 servers, 4 queries placed some shares, 0 plac
ed none (of which 0 placed none due to the server being full and 0 placed none due to an error))

===============================================================================
[SKIPPED]
A non-ASCII argument/output could not be encoded on this platform.

allmydata.test.test_runner.BinTahoe.test_unicode_arguments_and_output
===============================================================================
[TODO]   
Reason: "this isn't fixed yet"
Traceback (most recent call last):
  File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response
    return self._loop()
  File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop
    return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
  File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 2 server(s) such that any 
2 of them have enough shares to recover the file, but we were asked to place shares on at least 3 such
 servers. (placed all 3 shares, want to place shares on at least 3 servers such that any 2 of them hav
e enough shares to recover the file, sent 3 queries to 3 servers, 1 queries placed some shares, 2 plac
ed none (of which 2 placed none due to the server being full and 0 placed none due to an error))

allmydata.test.test_upload.EncodingParameters.test_problem_layout_comment_187
allmydata.test.test_upload.EncodingParameters.test_problem_layout_ticket_1124
===============================================================================
[TODO]   
Reason: 'Invent a smarter uploader that uploads successfully in this case.'
Traceback (most recent call last):
  File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response
    return self._loop()
  File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop
    return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
  File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 3 server(s) such that any 2 of them have enough shares to recover the file, but we were asked to place shares on at least 4 such servers. (placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them have enough shares to recover the file, sent 4 queries to 4 servers, 3 queries placed some shares, 1 placed none (of which 1 placed none due to the server being full and 0 placed none due to an error))

allmydata.test.test_upload.EncodingParameters.test_problem_layout_ticket_1128
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs

allmydata.test.test_runner.RunNode.test_client_no_noise
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0xb89705ec [0.00525784492493s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0xb84a09cc>, 1373110342.373227), **{})()>

allmydata.test.test_runner.RunNode.test_client_no_noise
-------------------------------------------------------------------------------
Ran 1139 tests in 2020.513s

FAILED (skips=15, expectedFailures=3, errors=2, successes=1120
Last edited at 2013-07-07T12:47:50Z by zooko (previous) (diff)

comment:3 Changed at 2013-07-07T15:06:45Z by sickness

so then I've run: git checkout d85a75d7f689cb55ecddb319dc2057f38e4db87a python setup.py build && ./bin/tahoe --version ./bin/tahoe debug trial --until-failure allmydata.test.test_runner.RunNode?.test_client_no_noise and it failed way earlier than the previous branch with this: =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.internet.defer.TimeoutError?: <allmydata.test.test_runner.RunNode? testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs

allmydata.test.test_runner.RunNode?.test_client_no_noise =============================================================================== [ERROR] Traceback (most recent call last): Failure: twisted.trial.util.DirtyReactorAggregateError?: Reactor was unclean. DelayedCalls?: (set twisted.internet.base.DelayedCall?.debug = True to debug) <DelayedCall? 0x873c5e2c [0.000361204147339s] called=0 cancelled=0 LoopingCall?<0.01>(RunNode?._poll, *(<function _node_has_started at 0x873c648c>, 1373184489.75621), {})()>

allmydata.test.test_runner.RunNode?.test_client_no_noise


Ran 1 tests in 240.027s

Version 0, edited at 2013-07-07T15:06:45Z by sickness (next)

comment:4 Changed at 2013-07-07T15:31:11Z by daira

  • Keywords test hang added

comment:5 Changed at 2013-07-07T15:34:34Z by daira

If it were the iputil patches, why does it only affect test_client_no_noise? Many other tests depend on the iputil address-finding code.

comment:6 Changed at 2013-07-07T19:58:06Z by daira

Actually sickness' result in comment:3 proves that it wasn't the iputil patches. (d85a75d7f689cb55ecddb319dc2057f38e4db87a/trunk was before those patches.)

comment:7 Changed at 2013-07-09T15:44:44Z by zooko

  • Keywords regression added
  • Milestone changed from undecided to 1.11.0

comment:8 Changed at 2013-08-28T15:42:13Z by zooko

  • Keywords regression removed

sickness:

comment:9 Changed at 2013-08-28T15:43:53Z by daira

  • Description modified (diff)

comment:10 Changed at 2013-08-28T15:51:57Z by daira

  • Description modified (diff)

comment:11 Changed at 2013-08-28T15:53:38Z by zooko

sickness: we don't know how to proceed with this. It seems like there might be an old bug, since, per daira's comment:6, d85a75d7f689cb55ecddb319dc2057f38e4db87a/trunk was before the recent iputil patches. So, could you do some investigation of what version(s) of tahoe-lafs exhibit this failure on your !OpenBSD system and report back? You might just start with the Tahoe-LAFS v1.10 release [f9af0633d8da426cbcaed3ff05ab6d7128148bb0] and see if your system could stably run this test on that version. Thanks!

comment:12 Changed at 2013-08-29T07:16:37Z by sickness

# ./bin/tahoe --version                                                                                                                                            
allmydata-tahoe: 1.10.0
foolscap: 0.6.4
pycryptopp: 0.6.0.1206569328141510525648634803928199668821045408958
zfec: 1.4.24
Twisted: 13.0.0
Nevow: 0.10.0
zope.interface: unknown
python: 2.7.3
platform: OpenBSD-5.2-i386-32bit-ELF
pyOpenSSL: 0.13
simplejson: 3.3.0
pycrypto: 2.6
pyasn1: 0.1.7
mock: 1.0.1
setuptools: 0.6c16dev4

# ./bin/tahoe debug trial --until-failure allmydata.test.test_runner.RunNode.test_client_no_noise
Test Pass 1
allmydata.test.test_runner
  RunNode
    test_client_no_noise ...                                               [OK]

-------------------------------------------------------------------------------
Ran 1 tests in 4.987s

PASSED (successes=1)
Test Pass 2
allmydata.test.test_runner
  RunNode
    test_client_no_noise ...                                               [OK]

-------------------------------------------------------------------------------
Ran 1 tests in 5.011s

PASSED (successes=1)
Test Pass 3
allmydata.test.test_runner
  RunNode
    test_client_no_noise ...                                            [ERROR]
                                           [ERROR]

===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs

allmydata.test.test_runner.RunNode.test_client_no_noise
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x83fa9dac [0.00449967384338s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0x83fc83e4>, 1377735752.361727), **{})()>

allmydata.test.test_runner.RunNode.test_client_no_noise
-------------------------------------------------------------------------------
Ran 1 tests in 240.022s

Last edited at 2013-08-29T11:35:45Z by zooko (previous) (diff)

comment:13 Changed at 2013-08-29T08:45:11Z by sickness

I have to add that after the previous test exits, a process seems to still hang: tahoe --quiet start test_runner/RunNode/test_client_no_noise/c1

and it seems to sit there forever until I manually kill it :/

comment:14 Changed at 2013-08-29T11:41:04Z by zooko

Hm. Could you try editing the source code, i.e. the code of trunk/src/allmydata/test/test_runner.py, and add these lines somewhere near the top:

from twisted.internet import base
base.DelayedCall.debug = True

And then re-run that experiment?

comment:15 Changed at 2020-01-13T20:36:24Z by exarkun

  • Resolution set to wontfix
  • Status changed from new to closed

The OpenBSD builder is gone. This could be re-opened and fixed by a dedicated OpenBSD maintainer but won't be otherwise.

Note: See TracTickets for help on using tickets.