#1824 new defect

Tahoe process gone wild

Reported by: kpreid Owned by: daira
Priority: critical Milestone: soon
Component: code Version: 1.9.2
Keywords: hang repair memory Cc:
Launchpad Bug:

Description

I woke up this morning (~07:20 PDT) to find my machine sluggish to respond. It turned out to be my Volunteer Grid 1 tahoe node, having grown to 24GB virtual, 12GB (of 16GB) real. It was not responsive to the web port. It was using about 0.3% CPU.

It appears to have hung in the middle of my regularly scheduled repair/lease process, which started 14 hours ago; here is the output from that. Note that this cronjob mail didn't come to me until I killed the process, so I suspect it hung in the middle of the third repair, and the tracebacks are merely due to the death of the TCP server.

Further details will be added later; I have no time to hunt for logs now.

Randomised start: Sun Oct 14 18:42:35 PDT 2012
*** calendar-publish-dir
	done: 3 objects checked
	 pre-repair: 3 healthy, 0 unhealthy
	 0 repairs attempted, 0 successful, 0 failed
	 post-repair: 3 healthy, 0 unhealthy

*** family
	 repair successful
	done: 2 objects checked
	 pre-repair: 1 healthy, 1 unhealthy
	 1 repairs attempted, 1 successful, 0 failed
	 post-repair: 1 healthy, 1 unhealthy

*** publish
Traceback (most recent call last):
 File "/External/Projects/tahoe/support/bin/tahoe", line 9, in <module>
   load_entry_point('allmydata-tahoe==1.9.0.post28', 'console_scripts', 'tahoe')()
 File "/External/Projects/tahoe/src/allmydata/scripts/runner.py", line 113, in run
   rc = runner(sys.argv[1:], install_node_control=install_node_control)
 File "/External/Projects/tahoe/src/allmydata/scripts/runner.py", line 99, in runner
   rc = cli.dispatch[command](so)
 File "/External/Projects/tahoe/src/allmydata/scripts/cli.py", line 594, in deepcheck
   rc = tahoe_check.deepcheck(options)
 File "/External/Projects/tahoe/src/allmydata/scripts/tahoe_check.py", line 306, in deepcheck
   return DeepCheckStreamer().run(options)
 File "/External/Projects/tahoe/src/allmydata/scripts/tahoe_check.py", line 287, in run
   resp = do_http("POST", url)
 File "/External/Projects/tahoe/src/allmydata/scripts/common_http.py", line 68, in do_http
   return c.getresponse()
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1013, in getresponse
   response.begin()
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 402, in begin
   version, status, reason = self._read_status()
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 366, in _read_status
   raise BadStatusLine(line)
httplib.BadStatusLine: ''

*** tahoe-illustration
Traceback (most recent call last):
 File "/External/Projects/tahoe/support/bin/tahoe", line 9, in <module>
   load_entry_point('allmydata-tahoe==1.9.0.post28', 'console_scripts', 'tahoe')()
 File "/External/Projects/tahoe/src/allmydata/scripts/runner.py", line 113, in run
   rc = runner(sys.argv[1:], install_node_control=install_node_control)
 File "/External/Projects/tahoe/src/allmydata/scripts/runner.py", line 99, in runner
   rc = cli.dispatch[command](so)
 File "/External/Projects/tahoe/src/allmydata/scripts/cli.py", line 594, in deepcheck
   rc = tahoe_check.deepcheck(options)
 File "/External/Projects/tahoe/src/allmydata/scripts/tahoe_check.py", line 306, in deepcheck
   return DeepCheckStreamer().run(options)
 File "/External/Projects/tahoe/src/allmydata/scripts/tahoe_check.py", line 287, in run
   resp = do_http("POST", url)
 File "/External/Projects/tahoe/src/allmydata/scripts/common_http.py", line 60, in do_http
   c.endheaders()
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 937, in endheaders
   self._send_output(message_body)
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 797, in _send_output
   self.send(msg)
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 759, in send
   self.connect()
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 740, in connect
   self.timeout, self.source_address)
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 571, in create_connection
   raise err
socket.error: [Errno 61] Connection refused

Change History (24)

comment:1 Changed at 2012-10-15T15:17:49Z by zooko

When you back to this, please attach any incident report files from the relevant time period, and please paste in the output of tahoe --version. By the way, I didn't know Volunteer Grid 1 was still in operation! Does it work reliably for you? How many storage servers and how many users does it gave?

comment:2 in reply to: ↑ description Changed at 2012-10-15T15:23:40Z by davidsarah

  • Component changed from unknown to code
  • Keywords hang repair added
  • Owner changed from davidsarah to kpreid
  • Version changed from unknown to 1.9.2

Replying to kpreid:

*** family
	 repair successful
	done: 2 objects checked
	 pre-repair: 1 healthy, 1 unhealthy
	 1 repairs attempted, 1 successful, 0 failed
	 post-repair: 1 healthy, 1 unhealthy

This confusing report is probably due to #766, and probably unrelated to the main issue in this ticket.

comment:3 Changed at 2012-10-15T15:24:40Z by davidsarah

  • Keywords memory added

comment:4 Changed at 2012-10-16T03:14:16Z by kpreid

There are no recent incident report files. There are no recent twistd.log entries. My tahoesvc.log is more interesting, and is full of repetitions (0.11 second intervals) of the following sequence, followed at the end by the startup log after I killed it.

2012-10-15 14:05:34.469Z [-] Unhandled error in Deferred:
2012-10-15 14:05:38.016Z [-] Unhandled Error
	Traceback (most recent call last):
	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 610, in _download_current_segment
	    d = self._process_segment(self._current_segment)
	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 638, in _process_segment
	    dl.addErrback(self._validation_or_decoding_failed, [reader])
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 308, in addErrback
	    errbackKeywords=kw)
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 286, in addCallbacks
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 542, in _runCallbacks
	    current.result = callback(current.result, *args, **kw)
	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 736, in _validation_or_decoding_failed
	    self._mark_bad_share(reader.server, reader.shnum, reader, f)
	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 595, in _mark_bad_share
	    self.notify_server_corruption(server, shnum, str(f.value))
	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 938, in notify_server_corruption
	    rref.callRemoteOnly("advise_corrupt_share",
	exceptions.AttributeError: 'NoneType' object has no attribute 'callRemoteOnly'

Going back to the beginning of this error (85 megabytes of logs ago), I find this as immediately preceding the spew, and nothing recent before it:

2012-10-15 01:43:37.119Z [-] Unhandled error in Deferred:
2012-10-15 01:43:37.136Z [-] Unhandled Error
	Traceback (most recent call last):
	Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=gtnn) (during method=RIStorageServer.tahoe.allmydata.com:slot_readv)

comment:5 follow-ups: Changed at 2012-10-16T15:07:55Z by davidsarah

OK, so:

  • we still have cases where a disconnected rref is causing errors due to trying to call None. Wasn't there a previous ticket about that?
  • the unhandled exception is causing an infinite deferred loop. Let's fix that separately, even if we change this particular case to be handled.

comment:6 Changed at 2012-10-17T02:08:45Z by davidsarah

  • Milestone changed from undecided to 1.11.0
  • Priority changed from normal to major

comment:7 Changed at 2012-10-17T02:09:30Z by davidsarah

  • Owner kpreid deleted

comment:8 Changed at 2012-11-12T23:24:46Z by davidsarah

This tahoe-dev thread might be the same problem: https://tahoe-lafs.org/pipermail/tahoe-dev/2012-November/007829.html

comment:9 Changed at 2012-11-13T23:25:08Z by zooko

  • Priority changed from major to critical

Marking this as Priority: Critical, because it seems more important to me than most of the tickets marked Priority: Major.

comment:10 Changed at 2012-11-15T11:43:09Z by zooko

Iantcho posted an incident report (trigger: he pushed the button to generate an incident report):

https://tahoe-lafs.org/pipermail/tahoe-dev/2012-November/007836.html

comment:11 Changed at 2012-12-25T13:22:23Z by kpreid

Had this happen again today. However, slightly different logs — there was nothing immediately before the repeating message by timestamp. Here is the first repeating message and a couple of its predecessors:

2012-12-25 08:04:08.619Z [-] Unhandled Error
	Traceback (most recent call last):
	Failure: allmydata.interfaces.DownloadStopped: our Consumer called stopProducing()
	
2012-12-25 08:04:08.677Z [-] Unhandled error in Deferred:
2012-12-25 08:04:08.677Z [-] Unhandled Error
	Traceback (most recent call last):
	  File "/External/Projects/tahoe/support/lib/python2.7/site-packages/Nevow-0.10.0-py2.7.egg/nevow/appserver.py", line 163, in gotPageContext
	    pageContext.tag.renderHTTP, pageContext
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 133, in maybeDeferred
	    result = f(*args, **kw)
	  File "/External/Projects/tahoe/src/allmydata/web/filenode.py", line 490, in renderHTTP
	    d.addCallbacks(_finished, _error)
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 286, in addCallbacks
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 542, in _runCallbacks
	    current.result = callback(current.result, *args, **kw)
	  File "/External/Projects/tahoe/src/allmydata/web/filenode.py", line 489, in _error
	    eh.renderHTTP_exception(ctx, f)
	  File "/External/Projects/tahoe/src/allmydata/web/common.py", line 326, in renderHTTP_exception
	    return super.renderHTTP_exception(self, ctx, f)
	  File "/External/Projects/tahoe/support/lib/python2.7/site-packages/Nevow-0.10.0-py2.7.egg/nevow/appserver.py", line 61, in renderHTTP_exception
	    request.finishRequest( False )
	  File "/External/Projects/tahoe/support/lib/python2.7/site-packages/Nevow-0.10.0-py2.7.egg/nevow/appserver.py", line 176, in finishRequest
	    server.Request.finish(self)
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/web/http.py", line 866, in finish
	    "Request.finish called on a request after its connection was lost; "
	exceptions.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this.
	
2012-12-25 09:16:06.623Z [-] Unhandled error in Deferred:
2012-12-25 09:16:06.909Z [-] Unhandled Error
	Traceback (most recent call last):
	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 610, in _download_current_segment
	    d = self._process_segment(self._current_segment)
	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 638, in _process_segment
	    dl.addErrback(self._validation_or_decoding_failed, [reader])
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 308, in addErrback
	    errbackKeywords=kw)
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 286, in addCallbacks
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 542, in _runCallbacks
	    current.result = callback(current.result, *args, **kw)
	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 736, in _validation_or_decoding_failed
	    self._mark_bad_share(reader.server, reader.shnum, reader, f)
	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 595, in _mark_bad_share
	    self.notify_server_corruption(server, shnum, str(f.value))
	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 938, in notify_server_corruption
	    rref.callRemoteOnly("advise_corrupt_share",
	exceptions.AttributeError: 'NoneType' object has no attribute 'callRemoteOnly'

comment:12 Changed at 2013-01-02T07:03:54Z by zooko

Thank you, kpreid! That last exception showing that rref was None is a smoking gun that may be useful!

comment:13 in reply to: ↑ 5 Changed at 2013-01-03T04:54:02Z by zooko

Okay, I just (re-?) noticed that the rref is None was reported by kpreid earlier (comment:4) and that then davidsarah wrote:

Replying to davidsarah:

OK, so:

  • we still have cases where a disconnected rref is causing errors due to trying to call None. Wasn't there a previous ticket about that?
  • the unhandled exception is causing an infinite deferred loop. Let's fix that separately, even if we change this particular case to be handled.

Why do you say there is an infinite deferred loop? Your proposed approach sounds good to me!

comment:14 Changed at 2013-04-03T01:48:55Z by daira

#1939 is a possible duplicate.

comment:15 Changed at 2013-04-03T01:52:08Z by daira

  • Owner set to kpreid

kpreid, can you check whether the incident logs on #1939 are consistent with this ticket?

comment:16 follow-up: Changed at 2013-04-03T02:22:20Z by kpreid

  • Owner changed from kpreid to daira

They aren't, I think. There's no major spam and the errors don't look similar.

comment:17 in reply to: ↑ 16 Changed at 2013-04-03T04:07:51Z by daira

Replying to kpreid:

They aren't, I think. There's no major spam and the errors don't look similar.

Okay, thanks for looking at that.

comment:18 in reply to: ↑ 5 Changed at 2013-04-23T21:09:30Z by zooko

Replying to davidsarah:

OK, so:

  • we still have cases where a disconnected rref is causing errors due to trying to call None. Wasn't there a previous ticket about that?

#1636. Which was fixed in Tahoe-LAFS 1.9.2. kpreid: what version of Tahoe-LAFS generated the output you show here?

comment:19 follow-up: Changed at 2013-04-24T01:10:00Z by kpreid

kpreid: what version of Tahoe-LAFS generated the output you show here?

The traceback contains a mention of "1.9.0.post28". The same string is still in my build/lib/allmydata/_version.py, if that's the right place to look. I don't know offhand whether I've updated at all since I had this error, but if I haven't the git revision is dd2e7a1127353a60 plus the patch at https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1628#comment:7 .

(Why yes, I'm not especially keeping my Tahoe in tip-top shape these days...)

comment:20 in reply to: ↑ 19 Changed at 2013-04-24T21:00:06Z by zooko

Replying to kpreid:

kpreid: what version of Tahoe-LAFS generated the output you show here?

The traceback contains a mention of "1.9.0.post28". The same string is still in my build/lib/allmydata/_version.py, if that's the right place to look. I don't know offhand whether I've updated at all since I had this error, but if I haven't the git revision is dd2e7a1127353a60 plus the patch at https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1628#comment:7 .

Great! Then I believe this issue to be fixed in 1.9.2, because it is an instance of #1636. Daira, what did you mean in comment:5 by "the unhandled exception is causing an infinite deferred loop. Let's fix that separately, even if we change this particular case to be handled."? Do you think we can close this ticket as as dup of #1636 now?

kpreid: you could try upgrading to 1.9.2 (or the shiny new 1.10 pre-release!) and then let us know if this problem or anything like it ever recurs. Thanks for the bug report!

comment:21 follow-up: Changed at 2013-04-25T01:35:11Z by daira

The tracebacks in comment:4 and comment:11 do look the same as in #1636.

What I meant was: why was the traceback repeating every 0.11 seconds in comment:4? There's no mention of that symptom on #1636, and it's difficult to tell whether the patches to retrieve.py in ticket:1636#comment:9 would have incidentally fixed it.

comment:22 in reply to: ↑ 21 Changed at 2013-04-25T15:22:48Z by zooko

Replying to daira:

The tracebacks in comment:4 and comment:11 do look the same as in #1636.

What I meant was: why was the traceback repeating every 0.11 seconds in comment:4? There's no mention of that symptom on #1636, and it's difficult to tell whether the patches to retrieve.py in ticket:1636#comment:9 would have incidentally fixed it.

Maybe it wasn't a loop, but simply that kpreid's deep-check-and-repair job was going to a new file every 0.11 seconds.

comment:23 Changed at 2013-04-25T15:26:08Z by kpreid

Maybe it wasn't a loop, but simply that kpreid's deep-check-and-repair job was going to a new file every 0.11 seconds.

That seems unlikely: I have a total of 15 files.

comment:24 Changed at 2013-04-25T19:53:13Z by zooko

I looked at the code a bit and I didn't notice a way for a loop to happen. The stack trace from the original report features this line:

	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 736, in _validation_or_decoding_failed

But in the 1.9.2 version of retrieve.py and in the current version of retrieve.py there isn't the same code.

Hm... let's see. There is no way that I can see that _validation_or_decoding_failed could lead to a loop, in 1.9.0. Is there any way that there could be a loop that calls out to _validation_or_decoding_failed over and over? Well, this stuff here is complicated. Could it have a loop? No, I don't see it. Any errback from the call to _validation_or_decoding_failed should... Wait, what happens when you get an exception from your errback? I assume it just errbacks again? Hm, this code is very confusing because there are two variables in different scopes named dl. Anyway, I think the intent is that the errback goes here, where it terminates and doesn't loop: loop().

The modern code that replaced this doesn't have the two variables named dl. It still looks like you'll get an exception from the errback (named _handle_bad_shares in this version). Hm, in the current version that exception will then get ignored by line 642 dl = deferredutil.gatherResults(ds) since deferredutil.gatherResults() always sets consumeErrors=True. Could that lead to a loop?

Daira: help!

Note: See TracTickets for help on using tickets.