#1381 closed defect (fixed)

EINTR from communication with subprocess in allmydata/util/iputil.py _query

Reported by: davidsarah Owned by: zooko
Priority: major Milestone: 1.10.1
Component: code-network Version: 1.8.2
Keywords: iputil heisenbug review-needed Cc:
Launchpad Bug:

Description (last modified by zooko)

Reported by 'sickness' on IRC:

#   Run
#     test_loadable ...                                                      [OK]
#     test_reloadable ... Node._startService failed, aborting
# [Failure instance: Traceback: <type 'exceptions.OSError'>: [Errno 4] Interrupted system call
# /usr/lib/python2.6/threading.py:497:__bootstrap
# /usr/lib/python2.6/threading.py:525:__bootstrap_inner
# /usr/lib/python2.6/threading.py:477:run
# --- <exception caught here> ---
# /usr/lib/python2.6/vendor-packages/twisted/python/threadpool.py:210:_worker
# /usr/lib/python2.6/vendor-packages/twisted/python/context.py:59:callWithContext
# /usr/lib/python2.6/vendor-packages/twisted/python/context.py:37:callWithContext
# /home/righdieg/allmydata-tahoe-1.8.2/src/allmydata/util/iputil.py:222:_synchronously_find_addresses_via_config
# /home/righdieg/allmydata-tahoe-1.8.2/src/allmydata/util/iputil.py:237:_query
# /usr/lib/python2.6/subprocess.py:689:communicate
# /usr/lib/python2.6/subprocess.py:1233:_communicate
# /usr/lib/python2.6/subprocess.py:1157:wait
# ]
# calling os.abort()

Possibly related: http://bugs.python.org/issue1068268 . It may be that the patch for that bug wasn't complete enough. EINTR failures are usually not very reproducible, but the fix is just to repeat the query until it works (or fails with a different error).

Change History (18)

comment:1 follow-up: Changed at 2011-03-22T21:28:16Z by sickness

The OS is opensolaris snv134 64bit

$ uname -a

SunOS MYWORKPC 5.11 snv_134 i86pc i386 i86pc Solaris

$ psrinfo -pv

The physical processor has 2 virtual processors (0 1)

x86 (GenuineIntel 1067A family 6 model 23 step 10 clock 2800 MHz)

Pentium(r) Dual-Core CPU E6300 @ 2.80GHz

$ isainfo -x

amd64: ssse3 cx16 mon sse3 sse2 sse fxsr mmx cmov amd_sysc cx8 tsc fpu

i386: ssse3 ahf cx16 mon sse3 sse2 sse fxsr mmx cmov sep cx8 tsc fpu

And this is the tahoe version:

$ allmydata-tahoe-1.8.2/bin/tahoe --version

allmydata-tahoe: 1.8.2,

foolscap: 0.6.1,

pycryptopp: 0.5.29,

zfec: 1.4.22,

Twisted: 8.2.0,

Nevow: 0.10.0,

zope.interface: unknown,

python: 2.6.4,

platform: SunOS-5.11-i86pc-i386-32bit-ELF,

pyOpenSSL: 0.11,

simplejson: 2.0.9,

pycrypto: 2.3,

pyasn1: unknown,

mock: 0.7.0,

sqlite3: 2.4.1 [sqlite 3.6.17],

setuptools: 0.6c16dev3

comment:2 in reply to: ↑ 1 Changed at 2011-03-23T01:26:42Z by davidsarah

Replying to sickness:

python: 2.6.4,

Hmm, that should have had the backported fix for http://bugs.python.org/issue1068268 . Oh well, we would need to work around it for earlier Pythons anyway.

comment:3 Changed at 2011-05-28T22:09:17Z by davidsarah

  • Keywords heisenbug added

comment:4 follow-up: Changed at 2011-05-29T04:32:59Z by zooko

Should we work around this by catching OSError with errno==4 and retrying the subprocess?

comment:5 in reply to: ↑ 4 Changed at 2011-05-29T15:33:32Z by davidsarah

Replying to zooko:

Should we work around this by catching OSError with errno==4 and retrying the subprocess?

Yes, I believe so. We probably shouldn't retry forever, so let's retry 10 times. The try/except should cover lines 236 and 237 of iputil.py.

BTW, rather than 4 we should use errno.EINTR (I think this is defined on all platforms, even though EINTR is only really relevant on Unix).

Should _query return [] (i.e. no addresses) if the subprocess fails? Oh, I see that issue is #854 ('what to do when you can't find any IP address for yourself').
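
For illustration, the bounded retry described above could look roughly like the following. This is only a sketch and not the code in iputil.py: the name query_with_retry and the max_attempts parameter are made up for this example, and it simply returns the subprocess's stdout rather than parsing addresses.

```python
import errno
import subprocess


def query_with_retry(args, max_attempts=10):
    """Run args and return its stdout, retrying if interrupted by EINTR.

    Hypothetical helper for illustration; not the actual iputil._query.
    """
    for attempt in range(max_attempts):
        try:
            p = subprocess.Popen(args, stdout=subprocess.PIPE,
                                 stderr=subprocess.PIPE)
            (output, _err) = p.communicate()
            return output
        except OSError as e:
            # Retry only on EINTR. Any other error, or EINTR on the
            # final attempt, is re-raised to the caller.
            if e.errno != errno.EINTR or attempt + 1 == max_attempts:
                raise
```

Comparing against errno.EINTR rather than the literal 4 keeps the check readable and independent of the platform's numbering. What the caller should do when the final attempt still fails is left open here, as in the question above.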

comment:6 Changed at 2011-08-14T00:09:40Z by davidsarah

  • Milestone changed from 1.9.0 to 1.10.0

comment:7 Changed at 2011-08-14T00:09:58Z by davidsarah

  • Status changed from new to assigned

comment:8 Changed at 2013-05-27T17:29:35Z by zooko

  • Description modified (diff)

See #1988

comment:9 follow-up: Changed at 2013-05-27T20:50:38Z by daira

This is a separate bug to #1988, though. The correct fix is to retry.

comment:10 in reply to: ↑ 9 Changed at 2013-05-29T17:26:28Z by zooko

Replying to daira:

This is a separate bug to #1988, though. The correct fix is to retry.

Agreed.

comment:11 follow-up: Changed at 2013-05-30T18:36:06Z by daira

  • Keywords review-needed added
  • Owner davidsarah deleted
  • Status changed from assigned to new

comment:12 in reply to: ↑ 11 Changed at 2013-06-06T21:10:05Z by zooko

  • Keywords review-needed removed
  • Owner set to daira

Replying to daira:

Review needed for https://github.com/daira/tahoe-lafs/commits/refactor-address-finding.

Not ready yet, tests fail.

comment:13 Changed at 2013-06-14T23:52:37Z by daira

Oops, I accidentally committed the patch for this while committing the reviewed fix for #1717. Sorry :-(

I'll fix the tests next.

comment:14 Changed at 2013-06-25T18:15:57Z by Daira Hopwood <david-sarah@…>

In a493ee0bb641175ecf918e28fce4d25df15994b6/trunk:

iputil.py: add tests for recent changes. refs #1381, #1988, #982, #1064, #1536, #1935, #898, #1707, #1918

Signed-off-by: Daira Hopwood <david-sarah@…>

comment:15 Changed at 2013-06-27T02:09:40Z by daira

  • Owner changed from daira to zooko

The tests are fixed, but this still needs review. The relevant patches for the bugfix are 6a445d73 and 6104950e. It is tested by simulating an EINTR on the first call to subprocess.Popen, in each of the new test_list_async_mock_* tests.
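
As a rough illustration of that testing approach (not the actual test_list_async_mock_* code), a test can replace subprocess.Popen with a wrapper that raises EINTR on the first call and delegates to the real Popen afterwards. The query helper below is a hypothetical stand-in for the code under test, and using unittest.mock is an assumption of this sketch.

```python
import errno
import subprocess
import sys
from unittest import mock  # assumption: standard-library mock


def query(args, max_attempts=10):
    # Hypothetical stand-in for the code under test: retry on EINTR only.
    for attempt in range(max_attempts):
        try:
            p = subprocess.Popen(args, stdout=subprocess.PIPE)
            return p.communicate()[0]
        except OSError as e:
            if e.errno != errno.EINTR or attempt + 1 == max_attempts:
                raise


def check_retries_after_eintr():
    real_popen = subprocess.Popen
    calls = []

    def flaky_popen(*args, **kwargs):
        # Fail with EINTR on the first call, then behave normally.
        calls.append(args)
        if len(calls) == 1:
            raise OSError(errno.EINTR, "Interrupted system call")
        return real_popen(*args, **kwargs)

    with mock.patch("subprocess.Popen", side_effect=flaky_popen):
        output = query([sys.executable, "-c", "print('ok')"])
    # Number of Popen calls made, and the output of the successful one.
    return len(calls), output.strip()
```

A test would then assert that Popen was called twice and that the output of the second, successful call was returned.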

comment:16 Changed at 2013-06-27T16:55:44Z by daira

  • Keywords review-needed added

comment:17 Changed at 2013-07-12T17:32:41Z by markberger

+1

comment:18 Changed at 2013-07-17T13:00:09Z by daira

  • Resolution set to fixed
  • Status changed from new to closed