#807 closed defect (cannot reproduce)

exceptions.OverflowError: join() result is too long for a Python string

Reported by: zooko Owned by: warner
Priority: major Milestone: undecided
Component: code Version: 1.3.0
Keywords: logging foolscap Cc: warner
Launchpad Bug:

Description

Here is the complete contents from one "Log opened." to the next of tahoebs5.allmydata.com:/home/amduser/public/bs5c1/logs/twistd.log:

2009/07/15 06:30 -0700 [-] Log opened.
2009/07/15 06:30 -0700 [-] twistd 2.5.0 (/usr/bin/python 2.5.2) starting up
2009/07/15 06:30 -0700 [-] reactor class: <class 'twisted.internet.selectreactor.SelectReactor'>
2009/07/15 06:30 -0700 [-] Loading tahoe-client.tac...
2009-07-15 13:30:44.135Z [-] Loaded.
2009-07-15 13:30:44.139Z [-] foolscap.pb.Listener starting on 45963
2009-07-15 13:30:44.141Z [-] nevow.appserver.NevowSite starting on 9051
2009-07-15 13:30:44.141Z [-] Starting factory <nevow.appserver.NevowSite instance at 0x8b7e74c>
2009-07-15 13:30:44.145Z [-] twisted.internet.protocol.DatagramProtocol starting on 50791
2009-07-15 13:30:44.145Z [-] Starting protocol <twisted.internet.protocol.DatagramProtocol instance at 0x8b7ec2c>
2009-07-15 13:30:44.161Z [-] (Port 50791 Closed)
2009-07-15 13:30:44.162Z [-] Stopping protocol <twisted.internet.protocol.DatagramProtocol instance at 0x8b7ec2c>
2009-07-27 05:00:32.314Z [Negotiation,158,98.202.225.214] got banana ERROR from remote side: internal server error, see logs
2009-08-07 00:49:15.226Z [-] Unhandled error in Deferred:
2009-08-07 00:49:15.259Z [-] Unhandled Error
        Traceback (most recent call last):
        Failure: foolscap.ipb.DeadReferenceError: Connection was lost

2009-08-31 01:56:53.611Z [-] Unhandled error in Deferred:
2009-08-31 01:56:53.635Z [-] Unhandled Error
        Traceback (most recent call last):
        Failure: foolscap.ipb.DeadReferenceError: Connection was lost

2009-08-31 01:58:11.104Z [-] Unhandled error in Deferred:
2009-08-31 01:58:11.104Z [-] Unhandled Error
        Traceback (most recent call last):
        Failure: foolscap.ipb.DeadReferenceError: Connection was lost

2009-08-31 12:32:18.472Z [-] Unhandled error in Deferred:
2009-08-31 12:32:18.496Z [-] Unhandled Error
        Traceback (most recent call last):
        Failure: foolscap.ipb.DeadReferenceError: Connection was lost

2009-09-11 03:10:34.056Z [Negotiation,1031,97.118.104.193] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.5/site-packages/twisted/python/log.py", line 48, in callWithLogger
            return callWithContext({"system": lp}, func, *args, **kw)
          File "/usr/lib/python2.5/site-packages/twisted/python/log.py", line 33, in callWithContext
            return context.call({ILogContext: newCtx}, func, *args, **kw)
          File "/usr/lib/python2.5/site-packages/twisted/python/context.py", line 59, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/usr/lib/python2.5/site-packages/twisted/python/context.py", line 37, in callWithContext
            return func(*args,**kw)
        --- <exception caught here> ---
          File "/usr/lib/python2.5/site-packages/twisted/internet/selectreactor.py", line 139, in _doReadOrWrite
            why = getattr(selectable, method)()
          File "/usr/lib/python2.5/site-packages/twisted/internet/tcp.py", line 154, in doWrite
            return Connection.doWrite(self)
          File "/usr/lib/python2.5/site-packages/twisted/internet/abstract.py", line 104, in doWrite
            self.dataBuffer = buffer(self.dataBuffer, self.offset) + "".join(self._tempDataBuffer)
        exceptions.OverflowError: join() result is too long for a Python string

Here is the startup version announcement from an incident report file which was recorded on 2009-06-02:

Application versions (embedded in logfile):
          Nevow: 0.9.26
        Twisted: 2.5.0
allmydata-tahoe: 1.3.0-r3747
       argparse: 0.8.0
       foolscap: 0.3.2
       platform: Linux-Ubuntu_8.04-i686-32bit
      pyOpenSSL: 0.6
     pycryptopp: 0.5.2-1
         python: 2.5.2
         pyutil: 1.3.16-12
     setuptools: 0.6c8
     simplejson: 1.7.3
        twisted: 2.5.0
           zfec: 1.4.0-4
 zope.interface: 3.3.1
PID: 4189

Change History (7)

comment:1 Changed at 2009-10-27T03:39:47Z by zooko

  • Version changed from 1.5.0 to 1.3.0

comment:2 Changed at 2009-10-27T04:05:37Z by zooko

  • Owner changed from somebody to warner

http://svn.python.org/view/python/tags/r252/Objects/stringobject.c?revision=60915&view=markup

Search in text for "join() result is too long for a Python string". It is guarded by if (sz < old_sz || sz > PY_SSIZE_T_MAX). PY_SSIZE_T_MAX is defined in http://svn.python.org/view/python/tags/r252/Include/pyport.h?revision=60915&view=markup to be ((size_t)-1)>>1 which would be 2,147,483,647 on that platform. So, I guess we were trying to .join() together a string that would have been more than 2.1 billion bytes.

I don't see how to investigate this, reproduce it, or determine if it has been fixed in newer versions of Tahoe-LAFS. One of the reasons why not is that the exception raised by the selectreactor's _doReadOrWrite() apparently didn't get propagated to foolscap, because no accompanying incident report file was generated.

Brian: am I interpreting that correctly? Is there a way to make sure that all unhandled exceptions get registered with the foolscap logging system so that they can be reported as incidents? Do you have any other ideas how to learn more about this issue, or should we just close it as "wontfix"?

comment:3 Changed at 2009-12-13T02:21:24Z by davidsarah

  • Keywords logging foolscap added

comment:4 Changed at 2010-04-13T04:18:03Z by zooko

Brian: I'm still concerned about the meta-issue here. Can we somehow ensure that all exceptions get logged as foolscap logging incidents or at least as twistd.log lines? I really don't like the feeling that exceptions silently disappear sometimes.

comment:5 Changed at 2010-04-13T04:23:50Z by zooko

Okay, that meta-issue about disappearing exceptions is now #1021.

comment:6 Changed at 2011-12-20T18:47:26Z by zooko

  • Cc warner added
  • Resolution set to cannot reproduce
  • Status changed from new to closed

I wonder if we should just close this ticket as "irreproducible and possibly fixed by some other change in the interim". This may have been related to a bug that we had a long time ago in the combination of Tahoe-LAFS+Twisted which caused very large strings to be produced during exceptions (#379). This issue may have been caused by that, and there may be nothing more we need to do about it.

I'm going to add Cc: Brian as I close this so he has once last chance to look at it and think if we should do anything else about it. :-)

comment:7 Changed at 2011-12-21T20:31:49Z by warner

No idea. Smells like an out-of-memory problem, or memory corruption, or something. It's unlikely that this server tried to buffer 2GiB of output data to a client, unless maybe someone had a multi-gigabyte file with k=1 and a client which overparallelized segment downloads (which I suppose we did have briefly, but which is fixed now).

So yeah, CANNOTREPRODUCE sounds fine.

Note: See TracTickets for help on using tickets.