#2017 closed defect (wontfix)

non-deterministic test hang on OpenBSD

Reported by: zooko Owned by: sickness
Priority: normal Milestone: soon
Component: code Version: 1.10.0
Keywords: iputil heisenbug openbsd test hang Cc:
Launchpad Bug:

Description (last modified by daira)

sickness's OpenBSD buildslave showed a test timeout:

===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs

allmydata.test.test_runner.RunNode.test_client_no_noise
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x816eb82c [0.00169348716736s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0x7ff29ed4>, 1373030506.664452), **{})()>

allmydata.test.test_runner.RunNode.test_client_no_noise
-------------------------------------------------------------------------------
Ran 1139 tests in 1784.336s

FAILED (skips=15, expectedFailures=3, errors=2, successes=1120)

(from https://tahoe-lafs.org/buildbot-tahoe-lafs/builders/sickness%20OpenBSD%205.0%20x86%20py2.7/builds/27)

Rerunning the tests with the exact same build (using Buildbot's "force rebuild" feature) resulted in success:

https://tahoe-lafs.org/buildbot-tahoe-lafs/builders/sickness%20OpenBSD%205.0%20x86%20py2.7/builds/28

In that run (build number 28), those tests took only a few seconds:

 19.917 seconds: allmydata.test.test_runner.RunNode.test_client
 13.758 seconds: allmydata.test.test_runner.RunNode.test_client_no_noise

(from https://tahoe-lafs.org/buildbot-tahoe-lafs/builders/sickness%20OpenBSD%205.0%20x86%20py2.7/builds/28/steps/test/logs/timings)

So there is a non-deterministic bug that exhibits on sickness's buildslave which causes those two tests to hang.

Change History (15)

comment:1 Changed at 2013-07-06T15:17:35Z by zooko

  • Description modified (diff)
Version 0, edited at 2013-07-06T15:17:35Z by zooko (next)

comment:2 in reply to: ↑ description Changed at 2013-07-07T08:45:11Z by sickness

did run the tests as requested (rsyncing the build subdir of the buildbot in /tmp/someotherdir because I don't know how to properly checkout trunk) and here's the results:

[...]

===============================================================================
[TODO]   
Reason: 'Fix this after 1.7.1 release.'
Traceback (most recent call last):
  File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response
    return self._loop()
  File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop
    return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
  File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 3 server(s) such that any 
2 of them have enough shares to recover the file, but we were asked to place shares on at least 4 such
 servers. (placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them hav
e enough shares to recover the file, sent 4 queries to 4 servers, 4 queries placed some shares, 0 plac
ed none (of which 0 placed none due to the server being full and 0 placed none due to an error))

===============================================================================
[SKIPPED]
A non-ASCII argument/output could not be encoded on this platform.

allmydata.test.test_runner.BinTahoe.test_unicode_arguments_and_output
===============================================================================
[TODO]   
Reason: "this isn't fixed yet"
Traceback (most recent call last):
  File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response
    return self._loop()
  File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop
    return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
  File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 2 server(s) such that any 
2 of them have enough shares to recover the file, but we were asked to place shares on at least 3 such
 servers. (placed all 3 shares, want to place shares on at least 3 servers such that any 2 of them hav
e enough shares to recover the file, sent 3 queries to 3 servers, 1 queries placed some shares, 2 plac
ed none (of which 2 placed none due to the server being full and 0 placed none due to an error))

allmydata.test.test_upload.EncodingParameters.test_problem_layout_comment_187
allmydata.test.test_upload.EncodingParameters.test_problem_layout_ticket_1124
===============================================================================
[TODO]   
Reason: 'Invent a smarter uploader that uploads successfully in this case.'
Traceback (most recent call last):
  File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response
    return self._loop()
  File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop
    return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
  File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 3 server(s) such that any 2 of them have enough shares to recover the file, but we were asked to place shares on at least 4 such servers. (placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them have enough shares to recover the file, sent 4 queries to 4 servers, 3 queries placed some shares, 1 placed none (of which 1 placed none due to the server being full and 0 placed none due to an error))

allmydata.test.test_upload.EncodingParameters.test_problem_layout_ticket_1128
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs

allmydata.test.test_runner.RunNode.test_client_no_noise
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0xb89705ec [0.00525784492493s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0xb84a09cc>, 1373110342.373227), **{})()>

allmydata.test.test_runner.RunNode.test_client_no_noise
-------------------------------------------------------------------------------
Ran 1139 tests in 2020.513s

FAILED (skips=15, expectedFailures=3, errors=2, successes=1120
Last edited at 2013-07-07T12:47:50Z by zooko (previous) (diff)

comment:3 Changed at 2013-07-07T15:06:45Z by sickness

so then I've run:

git checkout d85a75d7f689cb55ecddb319dc2057f38e4db87a
python setup.py build && ./bin/tahoe --version
./bin/tahoe debug trial --until-failure allmydata.test.test_runner.RunNode.test_client_no_noise
and it failed way earlier than the previous branch with this:
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs

allmydata.test.test_runner.RunNode.test_client_no_noise
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x873c5e2c [0.000361204147339s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0x873c648c>, 1373184489.75621), **{})()>

allmydata.test.test_runner.RunNode.test_client_no_noise
-------------------------------------------------------------------------------
Ran 1 tests in 240.027s
Last edited at 2013-08-28T15:37:52Z by zooko (previous) (diff)

comment:4 Changed at 2013-07-07T15:31:11Z by daira

  • Keywords test hang added

comment:5 Changed at 2013-07-07T15:34:34Z by daira

If it were the iputil patches, why does it only affect test_client_no_noise? Many other tests depend on the iputil address-finding code.

comment:6 Changed at 2013-07-07T19:58:06Z by daira

Actually sickness' result in comment:3 proves that it wasn't the iputil patches. (d85a75d7f689cb55ecddb319dc2057f38e4db87a/trunk was before those patches.)

comment:7 Changed at 2013-07-09T15:44:44Z by zooko

  • Keywords regression added
  • Milestone changed from undecided to 1.11.0

comment:8 Changed at 2013-08-28T15:42:13Z by zooko

  • Keywords regression removed

sickness:

comment:9 Changed at 2013-08-28T15:43:53Z by daira

  • Description modified (diff)

comment:10 Changed at 2013-08-28T15:51:57Z by daira

  • Description modified (diff)

comment:11 Changed at 2013-08-28T15:53:38Z by zooko

sickness: we don't know how to proceed with this. It seems like there might be an old bug, since, per daira's comment:6, d85a75d7f689cb55ecddb319dc2057f38e4db87a/trunk was before the recent iputil patches. So, could you do some investigation of what version(s) of tahoe-lafs exhibit this failure on your !OpenBSD system and report back? You might just start with the Tahoe-LAFS v1.10 release [f9af0633d8da426cbcaed3ff05ab6d7128148bb0] and see if your system could stably run this test on that version. Thanks!

comment:12 Changed at 2013-08-29T07:16:37Z by sickness

# ./bin/tahoe --version                                                                                                                                            
allmydata-tahoe: 1.10.0
foolscap: 0.6.4
pycryptopp: 0.6.0.1206569328141510525648634803928199668821045408958
zfec: 1.4.24
Twisted: 13.0.0
Nevow: 0.10.0
zope.interface: unknown
python: 2.7.3
platform: OpenBSD-5.2-i386-32bit-ELF
pyOpenSSL: 0.13
simplejson: 3.3.0
pycrypto: 2.6
pyasn1: 0.1.7
mock: 1.0.1
setuptools: 0.6c16dev4

# ./bin/tahoe debug trial --until-failure allmydata.test.test_runner.RunNode.test_client_no_noise
Test Pass 1
allmydata.test.test_runner
  RunNode
    test_client_no_noise ...                                               [OK]

-------------------------------------------------------------------------------
Ran 1 tests in 4.987s

PASSED (successes=1)
Test Pass 2
allmydata.test.test_runner
  RunNode
    test_client_no_noise ...                                               [OK]

-------------------------------------------------------------------------------
Ran 1 tests in 5.011s

PASSED (successes=1)
Test Pass 3
allmydata.test.test_runner
  RunNode
    test_client_no_noise ...                                            [ERROR]
                                           [ERROR]

===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs

allmydata.test.test_runner.RunNode.test_client_no_noise
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x83fa9dac [0.00449967384338s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0x83fc83e4>, 1377735752.361727), **{})()>

allmydata.test.test_runner.RunNode.test_client_no_noise
-------------------------------------------------------------------------------
Ran 1 tests in 240.022s

Last edited at 2013-08-29T11:35:45Z by zooko (previous) (diff)

comment:13 Changed at 2013-08-29T08:45:11Z by sickness

I have to add that after the previous test exits, a process seems to still hang: tahoe --quiet start test_runner/RunNode/test_client_no_noise/c1

and it seems to sit there forever until I manually kill it :/

comment:14 Changed at 2013-08-29T11:41:04Z by zooko

Hm. Could you try editing the source code, i.e. the code of trunk/src/allmydata/test/test_runner.py, and add these lines somewhere near the top:

from twisted.internet import base
base.DelayedCall.debug = True

And then re-run that experiment?

comment:15 Changed at 2020-01-13T20:36:24Z by exarkun

  • Resolution set to wontfix
  • Status changed from new to closed

The OpenBSD builder is gone. This could be re-opened and fixed by a dedicated OpenBSD maintainer but won't be otherwise.

Note: See TracTickets for help on using tickets.