[tahoe-lafs-trac-stream] [Tahoe-LAFS] #1336: improve the mechanism that causes test nodes to exit even if not successfully stopped

Tahoe-LAFS trac at tahoe-lafs.org
Sun Aug 17 15:05:57 UTC 2014


#1336: improve the mechanism that causes test nodes to exit even if not
successfully stopped
----------------------------+--------------------------
     Reporter:  davidsarah  |      Owner:  daira
         Type:  defect      |     Status:  assigned
     Priority:  major       |  Milestone:  undecided
    Component:  code        |    Version:  1.8.1
   Resolution:              |   Keywords:  cleanup test
Launchpad Bug:              |
----------------------------+--------------------------
Changes (by daira):

 * owner:  davidsarah => daira
 * status:  new => assigned


Old description:

> [source:src/allmydata/test/test_runner.py] includes some tests (in the
> !RunNode class) for whether node processes can be successfully started
> and stopped. If stopping the node fails, we don't want the node process
> to be left running. (On Windows the process would hold open file handles
> that prevent the _trial_test directory from being deleted, interfering
> with subsequent test runs -- although currently these tests don't work on
> Windows anyway, as discussed below.)
>
> Currently this is done by writing a file, with the poorly-chosen name
> "suicide_prevention_hotline", in the node directory. If a node sees this
> file at startup, it will set a 1-second
> [http://twistedmatrix.com/documents/10.2.0/api/twisted.application.internet.TimerService.html
> periodic timer] ([source:src/allmydata/client.py#L154]). Each time the
> timer fires, the node process exits if the file's mtime is more than 120
> seconds in the past or the file no longer exists
> ([source:src/allmydata/client.py#L440]).
>
> There are several problems with this mechanism:
>
> * On slow machines, the node process may exit before the test has had a
> chance to stop it, causing a spurious test failure. This seems to be
> happening on the '!FranXois lenny-armv5tel' buildbot
> ([http://tahoe-lafs.org/buildbot/builders/FranXois%20lenny-armv5tel/builds/438/steps/test/logs/stdio example]).
> * There is no way to distinguish an exit due to this cause from the
> process being killed or exiting for another reason.
> * The name of the file is based on a very poor choice of metaphor, that
> is both unpleasant and misleading. (The existence of the file doesn't
> prevent the node from exiting, as the name might imply.)
>
> In addition, the tests of starting nodes don't work on Windows, because
> twistd doesn't daemonize or write the pid file on that platform. While
> that isn't directly due to this mechanism, it would be nice to redesign
> these tests in a way that does work on Windows (if we're not going to
> change the Windows behaviour to be more like Unix).

New description:

 [source:src/allmydata/test/test_runner.py] includes some tests (in the
 !RunNode class) for whether node processes can be successfully started and
 stopped. If stopping the node fails, we don't want the node process to be
 left running. (On Windows the process would hold open file handles that
 prevent the _trial_test directory from being deleted, interfering with
 subsequent test runs -- although currently these tests don't work on
 Windows anyway, as discussed below.)

 Currently this is done by writing a file, ~~with the poorly-chosen name
 "suicide_prevention_hotline"~~ called "exit_trigger", in the node
 directory. If a node sees this file at startup, it will set a 1-second
 [http://twistedmatrix.com/documents/10.2.0/api/twisted.application.internet.TimerService.html
 periodic timer] ([source:src/allmydata/client.py#L161]). Each time the
 timer fires, the node process exits if the file's mtime is more than 120
 seconds in the past or the file no longer exists
 ([source:src/allmydata/client.py#L498]).
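
 The check described above can be sketched roughly as follows. This is an
 illustrative reconstruction from the description, not the code in
 client.py: the function name `should_exit`, its parameters, and the
 constant name `EXIT_TRIGGER_TIMEOUT` are hypothetical. Only the 120-second
 limit and the two exit conditions come from the ticket text.

```python
import os
import time

# Staleness limit quoted in the ticket description (seconds).
# The constant name is an assumption made for this sketch.
EXIT_TRIGGER_TIMEOUT = 120


def should_exit(trigger_path, now=None, timeout=EXIT_TRIGGER_TIMEOUT):
    """Return True if a node watching trigger_path should exit.

    Hypothetical helper: exits are requested when the file is gone or
    when its mtime is more than `timeout` seconds before `now`.
    """
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(trigger_path).st_mtime
    except OSError:
        # The file no longer exists (or cannot be stat'ed): exit.
        return True
    # The file exists: exit only if it has not been touched recently.
    return now - mtime > timeout
```

 In the node, a function like this would presumably be called once per
 second by the periodic timer, and a test that wants to keep the node alive
 would have to keep refreshing the file's mtime. That dependency on the
 test keeping up is what makes the first problem listed below possible on
 slow machines.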

 There are several problems with this mechanism:

 * On slow machines, the node process may exit before the test has had a
 chance to stop it, causing a spurious test failure. This seems to be
 happening on the '!FranXois lenny-armv5tel' buildbot
 ([http://tahoe-lafs.org/buildbot/builders/FranXois%20lenny-armv5tel/builds/438/steps/test/logs/stdio example]).
 * There is no way to distinguish an exit due to this cause from the
 process being killed or exiting for another reason.
 * ~~The name of the file is based on a very poor choice of metaphor, that
 is both unpleasant and misleading. (The existence of the file doesn't
 prevent the node from exiting, as the name might imply.)~~

 In addition, the tests of starting nodes don't work on Windows, because
 twistd doesn't daemonize or write the pid file on that platform. While
 that isn't directly due to this mechanism, it would be nice to redesign
 these tests in a way that does work on Windows (if we're not going to
 change the Windows behaviour to be more like Unix).

--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1336#comment:2>
Tahoe-LAFS <https://Tahoe-LAFS.org>
secure decentralized storage

