[tahoe-lafs-trac-stream] [Tahoe-LAFS] #3025: Sometimes the test suite hangs in/around allmydata.test.test_runner.RunNode.test_introducer
Tahoe-LAFS
trac at tahoe-lafs.org
Thu May 2 14:44:17 UTC 2019
#3025: Sometimes the test suite hangs in/around
allmydata.test.test_runner.RunNode.test_introducer
-------------------------+-----------------------
     Reporter:  exarkun  |      Owner:
         Type:  defect   |     Status:  new
     Priority:  normal   |  Milestone:  undecided
    Component:  unknown  |    Version:  1.12.1
   Resolution:           |   Keywords:
Launchpad Bug:           |
-------------------------+-----------------------
Comment (by exarkun):
I managed to catch this live on CI for inspection (strangely, the branch
for https://tahoe-lafs.org/trac/tahoe-lafs/ticket/3038 produces this
failure much more regularly).
The ps output looked like this:
{{{
USER       PID %CPU %MEM    VSZ    RSS TTY      STAT START   TIME COMMAND
nobody       6  0.1  0.0 123848  18336 ?        Ssl  13:32   0:00 /bin/circleci-agent --config /.circleci-runner-config.json --task-data /.circleci-task-data --outerServerUrl https://circleci-internal-outer-build-agent:5500 _internal step --sshAdvertiseAddr=3.95.66.234:64535 --sshBindPort=54782
nobody      88  0.0  0.0  21340   3600 pts/1    Ss+  13:32   0:00  \_ /bin/bash /tmp/project/.circleci/run-tests.sh /tmp/venv /tmp/project /tmp/artifacts coverage
nobody      90  0.0  0.0  64456  22140 pts/1    S+   13:32   0:00  |   \_ /tmp/venv/bin/python2.7 /tmp/venv/bin/tox -c /tmp/project/tox.ini --workdir /tmp/tahoe-lafs.tox -e coverage
nobody     110 47.8  0.3 620428 282520 pts/1    S+   13:32   4:03  |       \_ /tmp/tahoe-lafs.tox/coverage/bin/python /tmp/tahoe-lafs.tox/coverage/bin/coverage run --branch -m twisted.trial --reporter=subunitv2-file --rterrors allmydata
nobody     431  0.3  0.1 212320 103772 pts/1    S+   13:37   0:00  |           \_ /tmp/tahoe-lafs.tox/coverage/bin/python -m allmydata.scripts.runner --quiet restart test_runner/RunNode/test_client/c1
nobody     441  0.0  0.0  21472   3680 pts/2    Ss   13:41   0:00  \_ /bin/bash
nobody     444  0.0  0.0  37364   3304 pts/2    R+   13:41   0:00      \_ ps fauxwww
nobody       1  0.0  0.0   4628    856 pts/0    Ss+  13:32   0:00 /bin/sh
nobody     429  0.0  0.0      0      0 ?        Zs   13:37   0:00 [python] <defunct>
nobody     430  0.3  0.0      0      0 ?        Z    13:37   0:00 [python] <defunct>
}}}
`test_runner/RunNode/test_client/c1/logs/twistd.log` contains:
{{{
2019-05-02T13:37:53+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 19.2.0 (/tmp/tahoe-lafs.tox/coverage/bin/python 2.7.15) starting up.
2019-05-02T13:37:53+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2019-05-02T13:37:54+0000 [-] Foolscap logging initialized
2019-05-02T13:37:54+0000 [-] Note to developers: twistd.log does not receive very much.
2019-05-02T13:37:54+0000 [-] Use 'flogtool tail -c NODEDIR/private/logport.furl' instead
2019-05-02T13:37:54+0000 [-] and read docs/logging.rst
2019-05-02T13:37:54+0000 [-] NevowSite starting on 41427
2019-05-02T13:37:54+0000 [nevow.appserver.NevowSite#info] Starting factory <nevow.appserver.NevowSite instance at 0x7f7a7fd8e098>
2019-05-02T13:37:54+0000 [-] My pid: 430
2019-05-02T13:37:54+0000 [-] Listener starting on 35327
2019-05-02T13:37:54+0000 [foolscap.pb.Listener#info] Starting factory <Listener at 0x7f7a7fee1a10 on CleanupEndpoint(_wrapped=<twisted.internet.endpoints.AdoptedStreamServerEndpoint object at 0x7f7a7fee1990>, _fd=14, _listened=True) with tub q7kkkqqglfxfgfcae2fhnq2wzmm6wrno>
2019-05-02T13:37:54+0000 [-] Listener starting on 34521
2019-05-02T13:37:54+0000 [foolscap.pb.Listener#info] Starting factory <Listener at 0x7f7a7fee1490 on <twisted.internet.endpoints.TCP4ServerEndpoint object at 0x7f7a7fee14d0> with tub lytxba7m4nb3jfcjqmhcc77wu5m5in5n>
2019-05-02T13:37:54+0000 [-] Listener starting on 38229
2019-05-02T13:37:54+0000 [foolscap.pb.Listener#info] Starting factory <Listener at 0x7f7a7fee1690 on CleanupEndpoint(_wrapped=<twisted.internet.endpoints.AdoptedStreamServerEndpoint object at 0x7f7a7fee1350>, _fd=13, _listened=True) with tub jguyasayigt7a3igreb7h7sdxcsnqhkw>
2019-05-02T13:37:54+0000 [-] client running
}}}
Note this reports the pid of the tahoe process for that directory as 430,
which the ps listing shows as a zombie. I wasn't able to find a log that
told me who pid 429 was, but plausibly it is the intermediate parent from
the fork that daemonized 430 (a sketch of that pattern is just below).
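As I understand it, twistd detaches on UNIX with the classic double fork.
Here is my own minimal sketch of that pattern (not twistd's actual code),
but it would account for two adjacent python pids: a short-lived session
leader (which would explain 429's `Zs` state) and the daemon itself (430):
{{{
import os

def daemonize():
    # First fork: the original process (e.g. the runner) carries on while
    # the intermediate child continues toward daemonhood.
    if os.fork() > 0:
        return
    os.setsid()      # new session, detached from the controlling tty
    # Second fork: the intermediate session leader exits immediately.  If
    # its parent never wait()s for it, it lingers as a session-leader
    # zombie -- "Zs", like pid 429 above.
    if os.fork() > 0:
        os._exit(0)
    # The grandchild is the daemon; this would be the process that logs
    # "My pid: 430".
}}}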
I also noticed that as a little time passed, more zombies appeared:
{{{
nobody     429  0.0  0.0      0      0 ?        Zs   13:37   0:00 [python] <defunct>
nobody     430  0.0  0.0      0      0 ?        Z    13:37   0:00 [python] <defunct>
nobody     431  0.0  0.0      0      0 ?        Z    13:37   0:00 [python] <defunct>
nobody     468  0.0  0.0      0      0 ?        Zs   13:41   0:00 [python] <defunct>
nobody     469  0.0  0.0      0      0 ?        Z    13:41   0:00 [python] <defunct>
nobody     470  0.0  0.0      0      0 ?        Z    13:41   0:00 [python] <defunct>
nobody     514  0.0  0.0      0      0 ?        Zs   13:45   0:00 [python] <defunct>
nobody     515  0.0  0.0      0      0 ?        Z    13:45   0:00 [python] <defunct>
nobody     516  0.0  0.0      0      0 ?        Z    13:45   0:01 [python] <defunct>
}}}
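Something I should have run while the container was still up: even for a
zombie, `/proc/<pid>/stat` still exists, and its fourth field is the parent
pid, i.e. exactly who is failing to reap it. A hypothetical helper
(`zombie_parent` is my name for it; the pid list is just the zombies above):
{{{
def zombie_parent(pid):
    # Fields of /proc/<pid>/stat: pid, (comm), state, ppid, ...
    # The entry survives until the process is reaped, so this works for
    # zombies too.  (Naive split; fine here since comm has no spaces.)
    with open("/proc/%d/stat" % pid) as f:
        fields = f.read().split()
    return fields[2], int(fields[3])

for pid in (429, 430, 431, 468, 469, 470, 514, 515, 516):
    try:
        state, ppid = zombie_parent(pid)
        print("%d %s parent: %d" % (pid, state, ppid))
    except IOError:   # already reaped and gone
        pass
}}}
Grepping the node logs, meanwhile, ties each pid back to a test case: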
{{{
nobody@9a341290331a:/tmp/project/_trial_temp$ grep 'My pid' ./ -r
./test.log:2019-05-02 13:35:05.396Z [-] My pid: 110
./test.log:2019-05-02 13:35:09.600Z [-] My pid: 110
./test_runner/RunNode/test_client_no_noise/c1/logs/twistd.log:2019-05-02T13:41:54+0000 [-] My pid: 469
./test_runner/RunNode/test_introducer/c1/logs/twistd.log:2019-05-02T13:45:54+0000 [-] My pid: 515
./test_runner/RunNode/test_client/c1/logs/twistd.log:2019-05-02T13:37:54+0000 [-] My pid: 430
}}}
469, 515, and 430 are all zombies.
This suggests to me that the test suite isn't actually hanging; it's just
going too slowly to satisfy the CI runner. It also suggests to me that
the CI environment itself is tied up with the problem: why are these
processes zombies? It is probably the CI environment's fault that they
are not being reaped. That said, why doesn't `python -m
allmydata.scripts.runner --quiet restart
test_runner/RunNode/test_client/c1` return when the process is dead?
Maybe it is waiting for it to be reaped as well, I suppose. Two sketches
below, one for each question.
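On the reaping question: PID 1 in the ps listing is a bare /bin/sh, and a
shell presumably never wait()s for orphans that get reparented onto it the
way a real init does. A minimal sketch (my code, nothing from the
container) of the reaping an init has to do for `<defunct>` entries to
disappear:
{{{
import os
import signal

def reap_children(signum, frame):
    # Collect every exited child without blocking.  A PID 1 doing this
    # would make the [python] <defunct> entries above go away.
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except OSError:   # ECHILD: no children left at all
            return
        if pid == 0:      # children exist but none have exited yet
            return

signal.signal(signal.SIGCHLD, reap_children)
}}}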
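And on the restart question: I haven't verified that the runner's stop
path actually works this way, but if its wait-for-death step is a signal-0
poll like this hypothetical one, a zombie defeats it, because signal 0
succeeds for as long as the process table entry exists:
{{{
import os
import time

def wait_for_exit(pid):
    while True:
        try:
            os.kill(pid, 0)   # signal 0: existence check only
        except OSError:       # ESRCH: pid is really gone (i.e. reaped)
            return
        # A zombie still has a process table entry, so for pid 430 this
        # loop spins until some ancestor reaps it -- here, never.
        time.sleep(0.1)
}}}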
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/3025#comment:3>
Tahoe-LAFS <https://Tahoe-LAFS.org>
secure decentralized storage