[tahoe-dev] [tahoe-lafs] #846: allmydata.test.test_system.SystemTest.test_mutable sometimes hangs on a slow machine

tahoe-lafs trac at allmydata.org
Sat Nov 28 14:03:42 PST 2009


#846: allmydata.test.test_system.SystemTest.test_mutable sometimes hangs on a
slow machine
----------------------+-----------------------------------------------------
 Reporter:  zooko     |           Owner:  francois
     Type:  defect    |          Status:  new     
 Priority:  major     |       Milestone:  1.6.0   
Component:  unknown   |         Version:  1.5.0   
 Keywords:  test ARM  |   Launchpad_bug:          
----------------------+-----------------------------------------------------
 On François's lenny-armv5tel box,
 {{{allmydata.test.test_system.SystemTest.test_mutable}}} sometimes stops
 making progress and is then timed out after 3600 seconds, e.g.:
 http://allmydata.org/buildbot/builders/François lenny-armv5tel/builds/16
 and many more.  In the cases where that test does pass, it takes only a
 couple of hundred seconds, e.g. 227 seconds in:
 http://allmydata.org/buildbot/builders/François lenny-armv5tel/builds/8/steps/test/logs/stdio
 (In that same passing test run, other tests took longer than 227 seconds -- see
 http://allmydata.org/buildbot/builders/François lenny-armv5tel/builds/8/steps/test/logs/timings .)

 Brian looked at the test.log files from passing and failing runs and said
 that they contain little information, but that one difference stands out:
 the time between the beginning of the test case and the first message
 from Node startup.  In the passing cases that he saw, this gap was about
 1 second, e.g. test start {{{2009-11-20 18:08:54.346Z [-] -->
 allmydata.test.test_system.SystemTest.test_mutable <--}}} and Node startup
 {{{2009-11-20 18:08:55.475Z [-] foolscap.pb.Listener starting on 35403}}}.
 In the failing cases it was about 5 seconds, e.g. test start
 {{{2009-11-28 13:36:48.970Z [-] -->
 allmydata.test.test_system.SystemTest.test_mutable <--}}} and Node startup
 {{{2009-11-28 13:36:53.516Z [-] foolscap.pb.Listener starting on 55397}}}.
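
 The gap between the test-start line and the first Listener line can be
 computed directly from the timestamps quoted above.  The following is
 only a rough sketch: the helper names {{{parse_log_time}}} and
 {{{startup_gap}}} are made up for illustration and are not part of
 Tahoe-LAFS or foolscap, and it assumes every log line begins with a
 timestamp in the format shown in the excerpts.

```python
from datetime import datetime


def parse_log_time(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmmZ' timestamp of a log line.

    Hypothetical helper: assumes the first two whitespace-separated
    fields of the line are the date and the time, as in the excerpts.
    """
    stamp = " ".join(line.split()[:2])
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%fZ")


def startup_gap(test_start_line, listener_line):
    """Return the seconds elapsed between the two given log lines."""
    delta = parse_log_time(listener_line) - parse_log_time(test_start_line)
    return delta.total_seconds()


# Log lines copied from the passing and failing examples in this ticket.
passing = startup_gap(
    "2009-11-20 18:08:54.346Z [-] --> "
    "allmydata.test.test_system.SystemTest.test_mutable <--",
    "2009-11-20 18:08:55.475Z [-] foolscap.pb.Listener starting on 35403",
)
failing = startup_gap(
    "2009-11-28 13:36:48.970Z [-] --> "
    "allmydata.test.test_system.SystemTest.test_mutable <--",
    "2009-11-28 13:36:53.516Z [-] foolscap.pb.Listener starting on 55397",
)

print(f"passing gap: {passing:.3f} s")
print(f"failing gap: {failing:.3f} s")
```

 For the two examples quoted here this gives gaps of roughly 1.1 and 4.5
 seconds.  Running the same calculation over more test.log files might
 show whether the gap reliably separates passing runs from hanging ones.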

 So there could be some sort of race condition: if the Node takes several
 seconds to start up (perhaps while waiting to bind to a TCP port or
 something), then some other part of the test gets confused by having won
 a race that it didn't expect to win.

 Hm, I wonder if I could simulate that on a fast computer by inserting some
 sort of 10s delay before allowing Node startup to complete...

 The next step is to make this failure reproducible.  François, could you
 please run just this one test, such as with {{{trial --reporter=verbose
 --until-failure allmydata.test.test_system.SystemTest.test_mutable}}} and
 see if you can tell when it passes vs. when it fails?  (Maybe it has to do
 with other processes loading the CPU?)  Note that your PYTHONPATH
 determines which version of Tahoe-LAFS that command line imports and
 tests.
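
 One way to check which copy of the code a given PYTHONPATH would pick up
 is to ask Python where it would load the {{{allmydata}}} package from.
 The snippet below is only a sketch: {{{locate_package}}} is a
 hypothetical helper name, and it uses {{{importlib.util.find_spec}}}
 from the modern Python standard library rather than anything specific
 to Tahoe-LAFS or trial.

```python
import importlib.util


def locate_package(name):
    """Return the file a top-level package would be imported from.

    Hypothetical helper: returns None if the package cannot be found
    on the current import path (sys.path, which includes PYTHONPATH).
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# Prints the path of the allmydata package that would be imported,
# or None if no such package is on the import path.
print(locate_package("allmydata"))
```

 Running this with the same PYTHONPATH as the trial command shows which
 source tree would be tested.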

 François: I'd like to get this fixed so that ARM can be a supported
 platform for the upcoming v1.6 release, so if you ''can't'' do this soon
 then please either give me or Brian an ssh account on your box or just say
 "Can't work on this now" so that we can think of some alternative
 strategies.  Thanks!

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/846>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid

