#3025 closed defect (fixed)

Sometimes the test suite hangs in/around allmydata.test.test_runner.RunNode.test_introducer

Reported by: exarkun
Owned by: GitHub <noreply@…>
Priority: normal
Milestone: undecided
Component: unknown
Version: 1.12.1
Keywords: review-needed
Cc:
Launchpad Bug:

Description

This is similar to https://tahoe-lafs.org/trac/tahoe-lafs/ticket/3021

Now, though, instead of hanging on the 2nd RunNode test method, it hangs on the 4th.

Change History (7)

comment:1 Changed at 2019-04-09T13:41:43Z by exarkun

With some logging added I observed:

2019-04-09 13:19:32.572Z [-] --> allmydata.test.test_runner.RunNode.test_introducer <--
2019-04-09 13:19:33.693Z [-] create-introducer: ('', '', 0)
2019-04-09 13:19:36.825Z [-] start: ('', '', 0)

and then a hang. So the hang happens after "tahoe start" succeeds but before "tahoe restart" succeeds.
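
For reference, those markers are temporary debugging instrumentation, not normal test output. A minimal sketch of that sort of instrumentation, assuming a helper that returns a (stdout, stderr, returncode) triple from the CLI; the names and exact log calls here are illustrative, not the actual patch:

from twisted.python import log

def mark_test_start(test_case):
    # Produces e.g. "--> allmydata.test.test_runner.RunNode.test_introducer <--"
    log.msg("--> %s <--" % (test_case.id(),))

def mark_cli_result(label, result):
    # `result` is assumed to be the (stdout, stderr, returncode) triple from
    # the CLI helper, producing e.g. "create-introducer: ('', '', 0)"
    log.msg("%s: %r" % (label, result))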

comment:2 Changed at 2019-04-10T13:42:37Z by exarkun

With some more logging added, a failed CI run had this to say:

2019-04-09 16:19:12.155Z [-] --> allmydata.test.test_runner.RunNode.test_baddir <--
2019-04-09 16:19:15.241Z [-] Main loop terminated.
2019-04-09 16:19:15.242Z [-] --> allmydata.test.test_runner.RunNode.test_client <--
2019-04-09 16:19:19.160Z [-] Polling <function _node_has_started at 0x7f6d596cd410>
2019-04-09 16:19:19.160Z [-] Polling completed successfully
2019-04-09 16:21:15.256Z [-] Main loop terminated.
2019-04-09 16:21:15.257Z [-] --> allmydata.test.test_runner.RunNode.test_client_no_noise <--
2019-04-09 16:21:19.093Z [-] Polling <function _node_has_started at 0x7f6d5aa02e60>
2019-04-09 16:21:19.094Z [-] Polling completed successfully

This is a hang in the _third_ test in the test case, not the fourth. Unfortunately the logging in that test is not as comprehensive (heh) as in test_introducer, so all I can really surmise is that the hang happened after "tahoe start" completed and before "tahoe stop" completed.

This is somewhat similar to the test_introducer behavior, I suppose, since in that case it's "tahoe restart" that doesn't complete and "restart" implies "stop".
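
As an aside, the "Polling <function _node_has_started ...>" lines above come from a poll-until-true helper. The sketch below shows the general shape of such a helper, assuming Twisted; it is not the suite's actual implementation:

from twisted.internet import reactor
from twisted.internet.task import deferLater

def poll(predicate, interval=0.1):
    # Fires the returned Deferred once predicate() is true, re-checking every
    # `interval` seconds.  If the condition never becomes true the Deferred
    # never fires, which is exactly what a hang at this point looks like.
    if predicate():
        return deferLater(reactor, 0, lambda: None)
    return deferLater(reactor, interval, poll, predicate, interval)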

comment:3 Changed at 2019-05-02T14:44:17Z by exarkun

I managed to catch this live on CI for inspection (strangely the branch for https://tahoe-lafs.org/trac/tahoe-lafs/ticket/3038 produces this failure much more regularly).

ps output looked like this:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
nobody       6  0.1  0.0 123848 18336 ?        Ssl  13:32   0:00 /bin/circleci-agent --config /.circleci-runner-config.json --task-data /.circleci-task-data --outerServerUrl https://circleci-internal-outer-build-agent:5500 _internal step --sshAdvertiseAddr=3.95.66.234:64535 --sshBindPort=54782
nobody      88  0.0  0.0  21340  3600 pts/1    Ss+  13:32   0:00  \_ /bin/bash /tmp/project/.circleci/run-tests.sh /tmp/venv /tmp/project /tmp/artifacts coverage
nobody      90  0.0  0.0  64456 22140 pts/1    S+   13:32   0:00  |   \_ /tmp/venv/bin/python2.7 /tmp/venv/bin/tox -c /tmp/project/tox.ini --workdir /tmp/tahoe-lafs.tox -e coverage
nobody     110 47.8  0.3 620428 282520 pts/1   S+   13:32   4:03  |       \_ /tmp/tahoe-lafs.tox/coverage/bin/python /tmp/tahoe-lafs.tox/coverage/bin/coverage run --branch -m twisted.trial --reporter=subunitv2-file --rterrors allmydata
nobody     431  0.3  0.1 212320 103772 pts/1   S+   13:37   0:00  |           \_ /tmp/tahoe-lafs.tox/coverage/bin/python -m allmydata.scripts.runner --quiet restart test_runner/RunNode/test_client/c1
nobody     441  0.0  0.0  21472  3680 pts/2    Ss   13:41   0:00  \_ /bin/bash
nobody     444  0.0  0.0  37364  3304 pts/2    R+   13:41   0:00      \_ ps fauxwww
nobody       1  0.0  0.0   4628   856 pts/0    Ss+  13:32   0:00 /bin/sh
nobody     429  0.0  0.0      0     0 ?        Zs   13:37   0:00 [python] <defunct>
nobody     430  0.3  0.0      0     0 ?        Z    13:37   0:00 [python] <defunct>

test_runner/RunNode/test_client/c1/logs/twistd.log contains:

2019-05-02T13:37:53+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 19.2.0 (/tmp/tahoe-lafs.tox/coverage/bin/python 2.7.15) starting up.
2019-05-02T13:37:53+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2019-05-02T13:37:54+0000 [-] Foolscap logging initialized
2019-05-02T13:37:54+0000 [-] Note to developers: twistd.log does not receive very much.
2019-05-02T13:37:54+0000 [-] Use 'flogtool tail -c NODEDIR/private/logport.furl' instead
2019-05-02T13:37:54+0000 [-] and read docs/logging.rst
2019-05-02T13:37:54+0000 [-] NevowSite starting on 41427
2019-05-02T13:37:54+0000 [nevow.appserver.NevowSite#info] Starting factory <nevow.appserver.NevowSite instance at 0x7f7a7fd8e098>
2019-05-02T13:37:54+0000 [-] My pid: 430
2019-05-02T13:37:54+0000 [-] Listener starting on 35327
2019-05-02T13:37:54+0000 [foolscap.pb.Listener#info] Starting factory <Listener at 0x7f7a7fee1a10 on CleanupEndpoint(_wrapped=<twisted.internet.endpoints.AdoptedStreamServerEndpoint object at 0x7f7a7fee1990>, _fd=14, _listened=True) with tub q7kkkqqglfxfgfcae2fhnq2wzmm6wrno>
2019-05-02T13:37:54+0000 [-] Listener starting on 34521
2019-05-02T13:37:54+0000 [foolscap.pb.Listener#info] Starting factory <Listener at 0x7f7a7fee1490 on <twisted.internet.endpoints.TCP4ServerEndpoint object at 0x7f7a7fee14d0> with tub lytxba7m4nb3jfcjqmhcc77wu5m5in5n>
2019-05-02T13:37:54+0000 [-] Listener starting on 38229
2019-05-02T13:37:54+0000 [foolscap.pb.Listener#info] Starting factory <Listener at 0x7f7a7fee1690 on CleanupEndpoint(_wrapped=<twisted.internet.endpoints.AdoptedStreamServerEndpoint object at 0x7f7a7fee1350>, _fd=13, _listened=True) with tub jguyasayigt7a3igreb7h7sdxcsnqhkw>
2019-05-02T13:37:54+0000 [-] client running

Note this reports the pid of the tahoe process for that directory as 430, which the ps listing shows as a zombie. I wasn't able to find a log that told me what pid 429 was, but possibly it is the parent of the daemonized 430.

I also noticed that as a little time passed, more zombies appeared:

nobody     429  0.0  0.0      0     0 ?        Zs   13:37   0:00 [python] <defunct>
nobody     430  0.0  0.0      0     0 ?        Z    13:37   0:00 [python] <defunct>
nobody     431  0.0  0.0      0     0 ?        Z    13:37   0:00 [python] <defunct>
nobody     468  0.0  0.0      0     0 ?        Zs   13:41   0:00 [python] <defunct>
nobody     469  0.0  0.0      0     0 ?        Z    13:41   0:00 [python] <defunct>
nobody     470  0.0  0.0      0     0 ?        Z    13:41   0:00 [python] <defunct>
nobody     514  0.0  0.0      0     0 ?        Zs   13:45   0:00 [python] <defunct>
nobody     515  0.0  0.0      0     0 ?        Z    13:45   0:00 [python] <defunct>
nobody     516  0.0  0.0      0     0 ?        Z    13:45   0:01 [python] <defunct>
nobody@9a341290331a:/tmp/project/_trial_temp$ grep 'My pid' ./ -r
./test.log:2019-05-02 13:35:05.396Z [-] My pid: 110
./test.log:2019-05-02 13:35:09.600Z [-] My pid: 110
./test_runner/RunNode/test_client_no_noise/c1/logs/twistd.log:2019-05-02T13:41:54+0000 [-] My pid: 469
./test_runner/RunNode/test_introducer/c1/logs/twistd.log:2019-05-02T13:45:54+0000 [-] My pid: 515
./test_runner/RunNode/test_client/c1/logs/twistd.log:2019-05-02T13:37:54+0000 [-] My pid: 430

469, 515, and 430 are all zombies.

This suggests to me that the test suite isn't actually hanging; it's just going too slowly to satisfy the CI runner. It also suggests to me that the CI environment itself is tied up with the problem: why are these processes zombies? It is probably the CI environment's fault that they are not being reaped. That said, why doesn't "python -m allmydata.scripts.runner --quiet restart test_runner/RunNode/test_client/c1" return when the process is dead? Maybe it is waiting for it to be reaped as well, I suppose.
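
One plausible mechanism, assuming the stop/restart path polls the pidfile's PID for liveness (an assumption about that code, not a reading of it): a check based on signal 0 cannot tell a zombie from a running process, because the PID stays occupied until something reaps it. A minimal sketch of such a loop:

import errno
import os
import time

def wait_for_exit(pid, timeout=120.0, interval=0.5):
    # Poll until `pid` no longer exists.  A zombie still "exists" for this
    # check, so if nothing ever reaps it this loop runs until the timeout
    # (or forever, if there is no timeout).
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            os.kill(pid, 0)          # succeeds for running *and* zombie processes
        except OSError as e:
            if e.errno == errno.ESRCH:
                return True          # PID released: the process was reaped
            raise
        time.sleep(interval)
    return False                     # still present, possibly as a zombie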

comment:4 Changed at 2019-05-02T14:49:07Z by exarkun

Note:

$ ps -p 1
  PID TTY          TIME CMD
    1 pts/0    00:00:00 sh
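
A quick hypothetical diagnostic (not something that was run for this ticket) to confirm that this PID 1 never reaps orphans: double-fork a short-lived grandchild so it gets reparented to PID 1, then watch its state in /proc.

import os

def orphan_a_grandchild():
    # Double-fork so the grandchild is reparented to PID 1 before being reaped.
    pid = os.fork()
    if pid == 0:                              # intermediate child
        grandchild = os.fork()
        if grandchild == 0:
            os._exit(0)                       # grandchild exits immediately
        os.write(1, ("grandchild pid: %d\n" % grandchild).encode("ascii"))
        os._exit(0)                           # exit without waiting for it
    os.waitpid(pid, 0)                        # reap only the intermediate child

def state_of(pid):
    # The third field of /proc/<pid>/stat is the state; "Z" means zombie.
    # With a reaping init the entry disappears shortly; with /bin/sh as PID 1
    # it stays "Z" indefinitely.
    with open("/proc/%d/stat" % pid) as f:
        return f.read().split()[2]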

comment:5 Changed at 2019-05-03T15:51:02Z by exarkun

So, my interpretation here is that "tahoe start" doesn't work in the CI environment because it performs standard *NIX daemonization. This relies on having a correctly functioning init process, which will eventually reap the zombie child process. A number of our CI environments don't have a correctly functioning init process, so the zombies linger forever. "tahoe restart" and "tahoe stop" can't deal with this situation: they wait forever for the zombies to get reaped. This causes the test suite to time out.
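
For context, "standard *NIX daemonization" is roughly the double fork sketched below (an illustration, not the actual twistd code). After it runs, the daemon's original parent is gone, so only PID 1 is in a position to reap it, which is presumably how the 429/430 zombie pair above came about.

import os

def daemonize():
    if os.fork() > 0:
        os._exit(0)       # original parent exits immediately
    os.setsid()           # become a session leader, detach from the terminal
    if os.fork() > 0:
        os._exit(0)       # intermediate child exits; the grandchild is the daemon
    os.chdir("/")
    os.umask(0)
    # ... stdio gets redirected to /dev/null and the daemon carries on.  Both
    # of its ancestors have exited, so only PID 1 can reap it when it dies.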

One possible fix is to give the CI environments a correctly functioning init process. That's probably a good idea.

Another possible fix is to stop using "tahoe start" in these tests and use the non-daemonizing "tahoe run" instead. We know that we're trying to switch from "start" to "run" so this seems like a useful step in two ways.

We should probably do both eventually, but for now I'm going to do the latter.
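
For the record, the shape of that change is roughly the following sketch. It is an illustration, not the actual diff; in particular the positional basedir argument to "run" is assumed to mirror the "restart <basedir>" invocation visible in the ps output above.

import signal
import subprocess
import sys

def start_node(basedir):
    # "tahoe run" stays in the foreground, so the test process is the parent:
    # no daemonization, no pidfile, and no reliance on init.
    return subprocess.Popen(
        [sys.executable, "-m", "allmydata.scripts.runner", "run", basedir])

def stop_node(process):
    process.send_signal(signal.SIGTERM)
    process.wait()        # the test reaps its own child, so no zombie is left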

comment:6 Changed at 2019-05-14T06:17:44Z by exarkun

  • Keywords review-needed added

comment:7 Changed at 2019-05-14T08:13:08Z by GitHub <noreply@…>

  • Owner set to GitHub <noreply@…>
  • Resolution set to fixed
  • Status changed from new to closed

In 3193611/trunk:

Merge pull request #611 from tahoe-lafs/3025.fix-test_runner-hangs

Replace "tahoe start" with "tahoe run" in some of test_runner to fix CI hangs

Fixes: ticket:3025
