[tahoe-lafs-trac-stream] [Tahoe-LAFS] #3025: Sometimes the test suite hangs in/around allmydata.test.test_runner.RunNode.test_introducer

Tahoe-LAFS trac at tahoe-lafs.org
Thu May 2 14:44:17 UTC 2019


#3025: Sometimes the test suite hangs in/around
allmydata.test.test_runner.RunNode.test_introducer
-------------------------+-----------------------
     Reporter:  exarkun  |      Owner:
         Type:  defect   |     Status:  new
     Priority:  normal   |  Milestone:  undecided
    Component:  unknown  |    Version:  1.12.1
   Resolution:           |   Keywords:
Launchpad Bug:           |
-------------------------+-----------------------

Comment (by exarkun):

 I managed to catch this live on CI for inspection (strangely, the
 branch for https://tahoe-lafs.org/trac/tahoe-lafs/ticket/3038 produces
 this failure much more regularly).

 The `ps` output looked like this:

 {{{
 USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
 nobody       6  0.1  0.0 123848 18336 ?        Ssl  13:32   0:00 /bin/circleci-agent --config /.circleci-runner-config.json --task-data /.circleci-task-data --outerServerUrl https://circleci-internal-outer-build-agent:5500 _internal step --sshAdvertiseAddr=3.95.66.234:64535 --sshBindPort=54782
 nobody      88  0.0  0.0  21340  3600 pts/1    Ss+  13:32   0:00  \_ /bin/bash /tmp/project/.circleci/run-tests.sh /tmp/venv /tmp/project /tmp/artifacts coverage
 nobody      90  0.0  0.0  64456 22140 pts/1    S+   13:32   0:00  |   \_ /tmp/venv/bin/python2.7 /tmp/venv/bin/tox -c /tmp/project/tox.ini --workdir /tmp/tahoe-lafs.tox -e coverage
 nobody     110 47.8  0.3 620428 282520 pts/1   S+   13:32   4:03  |       \_ /tmp/tahoe-lafs.tox/coverage/bin/python /tmp/tahoe-lafs.tox/coverage/bin/coverage run --branch -m twisted.trial --reporter=subunitv2-file --rterrors allmydata
 nobody     431  0.3  0.1 212320 103772 pts/1   S+   13:37   0:00  |           \_ /tmp/tahoe-lafs.tox/coverage/bin/python -m allmydata.scripts.runner --quiet restart test_runner/RunNode/test_client/c1
 nobody     441  0.0  0.0  21472  3680 pts/2    Ss   13:41   0:00  \_ /bin/bash
 nobody     444  0.0  0.0  37364  3304 pts/2    R+   13:41   0:00      \_ ps fauxwww
 nobody       1  0.0  0.0   4628   856 pts/0    Ss+  13:32   0:00 /bin/sh
 nobody     429  0.0  0.0      0     0 ?        Zs   13:37   0:00 [python] <defunct>
 nobody     430  0.3  0.0      0     0 ?        Z    13:37   0:00 [python] <defunct>
 }}}

 `test_runner/RunNode/test_client/c1/logs/twistd.log` contains:

 {{{
 2019-05-02T13:37:53+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 19.2.0 (/tmp/tahoe-lafs.tox/coverage/bin/python 2.7.15) starting up.
 2019-05-02T13:37:53+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
 2019-05-02T13:37:54+0000 [-] Foolscap logging initialized
 2019-05-02T13:37:54+0000 [-] Note to developers: twistd.log does not receive very much.
 2019-05-02T13:37:54+0000 [-] Use 'flogtool tail -c NODEDIR/private/logport.furl' instead
 2019-05-02T13:37:54+0000 [-] and read docs/logging.rst
 2019-05-02T13:37:54+0000 [-] NevowSite starting on 41427
 2019-05-02T13:37:54+0000 [nevow.appserver.NevowSite#info] Starting factory <nevow.appserver.NevowSite instance at 0x7f7a7fd8e098>
 2019-05-02T13:37:54+0000 [-] My pid: 430
 2019-05-02T13:37:54+0000 [-] Listener starting on 35327
 2019-05-02T13:37:54+0000 [foolscap.pb.Listener#info] Starting factory <Listener at 0x7f7a7fee1a10 on CleanupEndpoint(_wrapped=<twisted.internet.endpoints.AdoptedStreamServerEndpoint object at 0x7f7a7fee1990>, _fd=14, _listened=True) with tub q7kkkqqglfxfgfcae2fhnq2wzmm6wrno>
 2019-05-02T13:37:54+0000 [-] Listener starting on 34521
 2019-05-02T13:37:54+0000 [foolscap.pb.Listener#info] Starting factory <Listener at 0x7f7a7fee1490 on <twisted.internet.endpoints.TCP4ServerEndpoint object at 0x7f7a7fee14d0> with tub lytxba7m4nb3jfcjqmhcc77wu5m5in5n>
 2019-05-02T13:37:54+0000 [-] Listener starting on 38229
 2019-05-02T13:37:54+0000 [foolscap.pb.Listener#info] Starting factory <Listener at 0x7f7a7fee1690 on CleanupEndpoint(_wrapped=<twisted.internet.endpoints.AdoptedStreamServerEndpoint object at 0x7f7a7fee1350>, _fd=13, _listened=True) with tub jguyasayigt7a3igreb7h7sdxcsnqhkw>
 2019-05-02T13:37:54+0000 [-] client running
 }}}

 Note that this log reports the pid of the tahoe process for that
 directory as 430, which the ps listing shows as a zombie.  I wasn't
 able to find a log that told me what pid 429 was, but possibly it is
 the parent of the daemonized 430.
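
 For reference, twistd daemonizes with the classic double fork.  Here
 is a minimal sketch of that pattern (simplified; not Tahoe's or
 twistd's actual code), annotated with where unreaped zombies can come
 from:

 {{{
 # Minimal sketch of the classic double-fork daemonization pattern
 # (simplified; not Tahoe's or twistd's actual code).
 import os

 def daemonize():
     pid = os.fork()
     if pid > 0:
         # Original process: must reap the intermediate child here,
         # or it lingers as a zombie (possibly pid 429 above).
         os.waitpid(pid, 0)
         return
     os.setsid()  # new session, detach from the controlling terminal
     if os.fork() > 0:
         # Intermediate process exits immediately; the grandchild is
         # re-parented to PID 1, which must reap it when it dies
         # (possibly the fate of pid 430 above).
         os._exit(0)
     # ... the grandchild continues here as the daemon ...
 }}}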

 I also noticed that as a little time passed, more zombies appeared:

 {{{
 nobody     429  0.0  0.0      0     0 ?        Zs   13:37   0:00 [python] <defunct>
 nobody     430  0.0  0.0      0     0 ?        Z    13:37   0:00 [python] <defunct>
 nobody     431  0.0  0.0      0     0 ?        Z    13:37   0:00 [python] <defunct>
 nobody     468  0.0  0.0      0     0 ?        Zs   13:41   0:00 [python] <defunct>
 nobody     469  0.0  0.0      0     0 ?        Z    13:41   0:00 [python] <defunct>
 nobody     470  0.0  0.0      0     0 ?        Z    13:41   0:00 [python] <defunct>
 nobody     514  0.0  0.0      0     0 ?        Zs   13:45   0:00 [python] <defunct>
 nobody     515  0.0  0.0      0     0 ?        Z    13:45   0:00 [python] <defunct>
 nobody     516  0.0  0.0      0     0 ?        Z    13:45   0:01 [python] <defunct>
 }}}
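
 For context: a zombie sticks around until some process wait()s on it.
 These daemons were re-parented to PID 1 when their parents exited, so
 PID 1 (here apparently /bin/sh, per the ps listing) is the process
 that would have to reap them.  A minimal sketch of the reaping loop a
 container init is normally expected to run:

 {{{
 # Sketch of the reaping loop a container's PID 1 is normally
 # expected to run.  If PID 1 never wait()s, every orphan that
 # exits stays in the process table as <defunct>, as seen above.
 import errno
 import os

 def reap_children():
     while True:
         try:
             # Block until any child exits; collecting its exit
             # status removes the zombie's process-table entry.
             os.wait()
         except OSError as e:
             if e.errno == errno.ECHILD:
                 break  # no children right now; a real init loops
             raise
 }}}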

 Grepping the node logs shows which pid each node directory reported:

 {{{
 nobody@9a341290331a:/tmp/project/_trial_temp$ grep 'My pid' ./ -r
 ./test.log:2019-05-02 13:35:05.396Z [-] My pid: 110
 ./test.log:2019-05-02 13:35:09.600Z [-] My pid: 110
 ./test_runner/RunNode/test_client_no_noise/c1/logs/twistd.log:2019-05-02T13:41:54+0000 [-] My pid: 469
 ./test_runner/RunNode/test_introducer/c1/logs/twistd.log:2019-05-02T13:45:54+0000 [-] My pid: 515
 ./test_runner/RunNode/test_client/c1/logs/twistd.log:2019-05-02T13:37:54+0000 [-] My pid: 430
 }}}

 469, 515, and 430 are all zombies.

 This suggests to me that the test suite isn't actually hanging; it is
 just going too slowly to satisfy the CI runner.  It also suggests that
 the CI environment itself is part of the problem: why are these
 processes zombies?  It is probably the CI environment's fault that
 they are not being reaped.  That said, why doesn't `python -m
 allmydata.scripts.runner --quiet restart
 test_runner/RunNode/test_client/c1` return once the process is dead?
 Maybe it is waiting for the process to be reaped as well, I suppose.
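
 One guess about why `restart` doesn't return (I have not verified
 that this is how allmydata.scripts.runner actually waits): if it
 polls the old pid with signal 0, a zombie keeps that poll succeeding
 forever, because the pid stays in the process table until it is
 reaped:

 {{{
 # A guess at the restart hang (assumption: not verified against
 # allmydata.scripts.runner).  os.kill(pid, 0) only checks that the
 # pid exists, and it still succeeds for a zombie, so a
 # poll-until-dead loop like this never returns until someone reaps
 # the process.
 import errno
 import os
 import time

 def wait_for_exit(pid, poll_interval=0.1):
     while True:
         try:
             os.kill(pid, 0)  # succeeds even when pid is a zombie
         except OSError as e:
             if e.errno == errno.ESRCH:
                 return  # pid is really gone (exited and reaped)
             raise
         time.sleep(poll_interval)
 }}}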

--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/3025#comment:3>
Tahoe-LAFS <https://Tahoe-LAFS.org>
secure decentralized storage


More information about the tahoe-lafs-trac-stream mailing list