#250 closed defect (fixed)

memcheck-64 fails sporadically

Reported by: zooko Owned by: warner
Priority: major Milestone: 1.4.1
Component: operational Version: 0.7.0
Keywords: Cc:
Launchpad Bug:

Description

Brian knows a little bit more about this. There's some sort of race condition between shutting down old test runs and starting new ones, or something like that.

Change History (8)

comment:1 Changed at 2008-05-09T01:22:00Z by warner

we fixed one possible source of failures: the pre-determined webport. This should fix failures that say "address already in use" in the nodelog. Let's watch the buildbot and see if any new failures show up.
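For context, the usual way to avoid a pre-determined port colliding across runs is to let the kernel assign one by binding to port 0. This is a general sketch of that technique, not the actual Tahoe change:

```python
import socket

def pick_free_port():
    # Bind to port 0 so the kernel assigns an unused ephemeral port,
    # then read back which port it chose.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    port = s.getsockname()[1]
    s.close()
    return port

port = pick_free_port()
```

Note the small remaining race: another process could grab the port between `close()` and the real listener starting, which is why listening directly on port 0 (and asking the listener for its port afterwards) is safer than pick-then-bind.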

comment:2 Changed at 2008-05-31T01:21:24Z by zooko

  • Resolution set to fixed
  • Status changed from new to closed

I think this has been fixed.

comment:3 Changed at 2008-05-31T01:21:29Z by zooko

  • Milestone changed from undecided to 1.1.0

comment:4 Changed at 2008-07-14T16:09:21Z by zooko

  • Resolution fixed deleted
  • Status changed from closed to reopened

This just happened on a different builder:

http://allmydata.org/buildbot/builders/feisty2.5/builds/1557/steps/test/logs/stdio

allmydata.test.test_client.Run.test_reloadable ... Traceback (most recent call last):
  File "/home/buildslave/tahoe/feisty2.5/build/src/allmydata/test/test_client.py", line 194, in _restart
    c2.setServiceParent(self.sparent)
  File "/usr/lib/python2.5/site-packages/twisted/application/service.py", line 148, in setServiceParent
    self.parent.addService(self)
  File "/usr/lib/python2.5/site-packages/twisted/application/service.py", line 259, in addService
    service.privilegedStartService()
  File "/usr/lib/python2.5/site-packages/twisted/application/service.py", line 228, in privilegedStartService
    service.privilegedStartService()
  File "/usr/lib/python2.5/site-packages/twisted/application/service.py", line 228, in privilegedStartService
    service.privilegedStartService()
  File "/usr/lib/python2.5/site-packages/twisted/application/internet.py", line 68, in privilegedStartService
    self._port = self._getPort()
  File "/usr/lib/python2.5/site-packages/twisted/application/internet.py", line 86, in _getPort
    return getattr(reactor, 'listen'+self.method)(*self.args, **self.kwargs)
  File "/usr/lib/python2.5/site-packages/twisted/internet/posixbase.py", line 467, in listenTCP
    p.startListening()
  File "/usr/lib/python2.5/site-packages/twisted/internet/tcp.py", line 733, in startListening
    raise CannotListenError, (self.interface, self.port, le)
twisted.internet.error.CannotListenError: Couldn't listen on any:43755: (98, 'Address already in use').

Is it possible that this fault happens whenever the same port number is chosen at random by two successive tests?
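The "Address already in use" error in the traceback is exactly what a second bind to an occupied port produces. A minimal standalone reproduction (modern Python, not Tahoe-specific):

```python
import errno
import socket

# First listener takes a port (port 0 = kernel picks a free one).
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind(("127.0.0.1", 0))
s1.listen(1)
port = s1.getsockname()[1]

# A second bind to the same address/port fails with EADDRINUSE
# (errno 98 on Linux), the same condition Twisted wraps in
# CannotListenError above.
s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s2.bind(("127.0.0.1", port))
    raised = None
except OSError as e:
    raised = e.errno
finally:
    s2.close()
    s1.close()
```

So yes: if two successive tests pick the same port at random while the first listener (or its not-yet-torn-down successor) still holds it, this is the failure you'd see.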

comment:5 Changed at 2008-07-14T19:12:47Z by warner

There are comments in the test with more detail. The issue is some absolute timeouts that were not easy to get rid of; the problem is most likely the old instance not completely shutting down before the new one is started up.

source:/src/allmydata/test/test_client.py@2712#L166 has details:

def test_reloadable(self):
    basedir = "test_client.Run.test_reloadable"
    os.mkdir(basedir)
    dummy = "pb://wl74cyahejagspqgy4x5ukrvfnevlknt@127.0.0.1:58889/bogus"
    open(os.path.join(basedir, "introducer.furl"), "w").write(dummy)
    c1 = client.Client(basedir)
    c1.setServiceParent(self.sparent)

    # delay to let the service start up completely. I'm not entirely sure
    # this is necessary.
    d = self.stall(delay=2.0)
    d.addCallback(lambda res: c1.disownServiceParent())
    # the cygwin buildslave seems to need more time to let the old
    # service completely shut down. When delay=0.1, I saw this test fail,
    # probably due to the logport trying to reclaim the old socket
    # number. This suggests that either we're dropping a Deferred
    # somewhere in the shutdown sequence, or that cygwin is just cranky.
    d.addCallback(self.stall, delay=2.0)
    def _restart(res):
        # TODO: pause for slightly over one second, to let
        # Client._check_hotline poll the file once. That will exercise
        # another few lines. Then add another test in which we don't
        # update the file at all, and watch to see the node shutdown. (to
        # do this, use a modified node which overrides Node.shutdown(),
        # also change _check_hotline to use it instead of a raw
        # reactor.stop, also instrument the shutdown event in an
        # attribute that we can check)
        c2 = client.Client(basedir)
        c2.setServiceParent(self.sparent)
        return c2.disownServiceParent()
    d.addCallback(_restart)
    return d
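The fixed `stall(delay=2.0)` calls above are papering over the shutdown race. One alternative to sleeping a fixed interval is to wait until the old listener has actually released its port before restarting; in Twisted itself, `IListeningPort.stopListening()` returns a Deferred one can chain on for the same purpose. A hypothetical polling helper (not the fix applied in Tahoe) could look like:

```python
import errno
import socket
import time

def wait_for_port_release(port, timeout=5.0, interval=0.1):
    # Poll by attempting to bind the port ourselves; once the bind
    # succeeds, the old listener has fully released it. Returns False
    # if the port is still held when the timeout expires.
    deadline = time.time() + timeout
    while True:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("127.0.0.1", port))
            return True
        except OSError as e:
            if e.errno != errno.EADDRINUSE:
                raise
            if time.time() >= deadline:
                return False
            time.sleep(interval)
        finally:
            s.close()
```

This turns the guessed 2-second delay into an explicit condition, so a slow builder (like the cygwin slave mentioned in the comments) waits exactly as long as it needs to.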

comment:6 Changed at 2008-07-14T22:30:58Z by warner

  • Milestone changed from 1.1.0 to 1.1.1

comment:7 Changed at 2009-03-28T19:43:06Z by zooko

Hm.. This ticket was last touched 9 months ago. I haven't been seeing this failure in practice recently, as far as I recall. Close this as fixed?

comment:8 Changed at 2009-04-08T02:18:08Z by warner

  • Resolution set to fixed
  • Status changed from reopened to closed

I don't remember seeing this failure for a while either. I think it's safe to close.. feel free to reopen if it appears again.
