Opened at 2012-10-15T15:04:12Z
Last modified at 2013-04-25T19:53:13Z
#1824 new defect
Tahoe process gone wild
Reported by: | kpreid | Owned by: | daira |
---|---|---|---|
Priority: | critical | Milestone: | soon |
Component: | code | Version: | 1.9.2 |
Keywords: | hang repair memory | Cc: | |
Launchpad Bug: |
Description
I woke up this morning (~07:20 PDT) to find my machine sluggish to respond. It turned out to be my Volunteer Grid 1 tahoe node, having grown to 24GB virtual, 12GB (of 16GB) real. It was not responsive to the web port. It was using about 0.3% CPU.
It appears to have hung in the middle of my regularly scheduled repair/lease process, which started 14 hours ago; here is the output from that. Note that this cronjob mail didn't come to me until I killed the process, so I suspect it hung in the middle of the third repair, and the tracebacks are merely due to the death of the TCP server.
Further details will be added later; I have no time to hunt for logs now.
Randomised start: Sun Oct 14 18:42:35 PDT 2012 *** calendar-publish-dir done: 3 objects checked pre-repair: 3 healthy, 0 unhealthy 0 repairs attempted, 0 successful, 0 failed post-repair: 3 healthy, 0 unhealthy *** family repair successful done: 2 objects checked pre-repair: 1 healthy, 1 unhealthy 1 repairs attempted, 1 successful, 0 failed post-repair: 1 healthy, 1 unhealthy *** publish Traceback (most recent call last): File "/External/Projects/tahoe/support/bin/tahoe", line 9, in <module> load_entry_point('allmydata-tahoe==1.9.0.post28', 'console_scripts', 'tahoe')() File "/External/Projects/tahoe/src/allmydata/scripts/runner.py", line 113, in run rc = runner(sys.argv[1:], install_node_control=install_node_control) File "/External/Projects/tahoe/src/allmydata/scripts/runner.py", line 99, in runner rc = cli.dispatch[command](so) File "/External/Projects/tahoe/src/allmydata/scripts/cli.py", line 594, in deepcheck rc = tahoe_check.deepcheck(options) File "/External/Projects/tahoe/src/allmydata/scripts/tahoe_check.py", line 306, in deepcheck return DeepCheckStreamer().run(options) File "/External/Projects/tahoe/src/allmydata/scripts/tahoe_check.py", line 287, in run resp = do_http("POST", url) File "/External/Projects/tahoe/src/allmydata/scripts/common_http.py", line 68, in do_http return c.getresponse() File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1013, in getresponse response.begin() File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 402, in begin version, status, reason = self._read_status() File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 366, in _read_status raise BadStatusLine(line) httplib.BadStatusLine: '' *** tahoe-illustration Traceback (most recent call last): File "/External/Projects/tahoe/support/bin/tahoe", line 9, in <module> load_entry_point('allmydata-tahoe==1.9.0.post28', 'console_scripts', 'tahoe')() File "/External/Projects/tahoe/src/allmydata/scripts/runner.py", line 113, in run rc = runner(sys.argv[1:], install_node_control=install_node_control) File "/External/Projects/tahoe/src/allmydata/scripts/runner.py", line 99, in runner rc = cli.dispatch[command](so) File "/External/Projects/tahoe/src/allmydata/scripts/cli.py", line 594, in deepcheck rc = tahoe_check.deepcheck(options) File "/External/Projects/tahoe/src/allmydata/scripts/tahoe_check.py", line 306, in deepcheck return DeepCheckStreamer().run(options) File "/External/Projects/tahoe/src/allmydata/scripts/tahoe_check.py", line 287, in run resp = do_http("POST", url) File "/External/Projects/tahoe/src/allmydata/scripts/common_http.py", line 60, in do_http c.endheaders() File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 937, in endheaders self._send_output(message_body) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 797, in _send_output self.send(msg) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 759, in send self.connect() File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 740, in connect self.timeout, self.source_address) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 571, in create_connection raise err socket.error: [Errno 61] Connection refused
Change History (24)
comment:1 Changed at 2012-10-15T15:17:49Z by zooko
comment:2 in reply to: ↑ description Changed at 2012-10-15T15:23:40Z by davidsarah
- Component changed from unknown to code
- Keywords hang repair added
- Owner changed from davidsarah to kpreid
- Version changed from unknown to 1.9.2
comment:3 Changed at 2012-10-15T15:24:40Z by davidsarah
- Keywords memory added
comment:4 Changed at 2012-10-16T03:14:16Z by kpreid
There are no recent incident report files. There are no recent twistd.log entries. My tahoesvc.log is more interesting, and is full of repetitions (0.11 second intervals) of the following sequence, followed at the end by the startup log after I killed it.
2012-10-15 14:05:34.469Z [-] Unhandled error in Deferred: 2012-10-15 14:05:38.016Z [-] Unhandled Error Traceback (most recent call last): File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 610, in _download_current_segment d = self._process_segment(self._current_segment) File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 638, in _process_segment dl.addErrback(self._validation_or_decoding_failed, [reader]) File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 308, in addErrback errbackKeywords=kw) File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 286, in addCallbacks self._runCallbacks() --- <exception caught here> --- File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 542, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 736, in _validation_or_decoding_failed self._mark_bad_share(reader.server, reader.shnum, reader, f) File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 595, in _mark_bad_share self.notify_server_corruption(server, shnum, str(f.value)) File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 938, in notify_server_corruption rref.callRemoteOnly("advise_corrupt_share", exceptions.AttributeError: 'NoneType' object has no attribute 'callRemoteOnly'
Going back to the beginning of this error (85 megabytes of logs ago), I find this as immediately preceding the spew, and nothing recent before it:
2012-10-15 01:43:37.119Z [-] Unhandled error in Deferred: 2012-10-15 01:43:37.136Z [-] Unhandled Error Traceback (most recent call last): Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=gtnn) (during method=RIStorageServer.tahoe.allmydata.com:slot_readv)
comment:5 follow-ups: ↓ 13 ↓ 18 Changed at 2012-10-16T15:07:55Z by davidsarah
OK, so:
- we still have cases where a disconnected rref is causing errors due to trying to call None. Wasn't there a previous ticket about that?
- the unhandled exception is causing an infinite deferred loop. Let's fix that separately, even if we change this particular case to be handled.
comment:6 Changed at 2012-10-17T02:08:45Z by davidsarah
- Milestone changed from undecided to 1.11.0
- Priority changed from normal to major
comment:7 Changed at 2012-10-17T02:09:30Z by davidsarah
- Owner kpreid deleted
comment:8 Changed at 2012-11-12T23:24:46Z by davidsarah
This tahoe-dev thread might be the same problem: https://tahoe-lafs.org/pipermail/tahoe-dev/2012-November/007829.html
comment:9 Changed at 2012-11-13T23:25:08Z by zooko
- Priority changed from major to critical
Marking this as Priority: Critical, because it seems more important to me than most of the tickets marked Priority: Major.
comment:10 Changed at 2012-11-15T11:43:09Z by zooko
Iantcho posted an incident report (trigger: he pushed the button to generate an incident report):
https://tahoe-lafs.org/pipermail/tahoe-dev/2012-November/007836.html
comment:11 Changed at 2012-12-25T13:22:23Z by kpreid
Had this happen again today. However, slightly different logs — there was nothing immediately before the repeating message by timestamp. Here is the first repeating message and a couple of its predecessors:
2012-12-25 08:04:08.619Z [-] Unhandled Error Traceback (most recent call last): Failure: allmydata.interfaces.DownloadStopped: our Consumer called stopProducing() 2012-12-25 08:04:08.677Z [-] Unhandled error in Deferred: 2012-12-25 08:04:08.677Z [-] Unhandled Error Traceback (most recent call last): File "/External/Projects/tahoe/support/lib/python2.7/site-packages/Nevow-0.10.0-py2.7.egg/nevow/appserver.py", line 163, in gotPageContext pageContext.tag.renderHTTP, pageContext File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 133, in maybeDeferred result = f(*args, **kw) File "/External/Projects/tahoe/src/allmydata/web/filenode.py", line 490, in renderHTTP d.addCallbacks(_finished, _error) File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 286, in addCallbacks self._runCallbacks() --- <exception caught here> --- File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 542, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/External/Projects/tahoe/src/allmydata/web/filenode.py", line 489, in _error eh.renderHTTP_exception(ctx, f) File "/External/Projects/tahoe/src/allmydata/web/common.py", line 326, in renderHTTP_exception return super.renderHTTP_exception(self, ctx, f) File "/External/Projects/tahoe/support/lib/python2.7/site-packages/Nevow-0.10.0-py2.7.egg/nevow/appserver.py", line 61, in renderHTTP_exception request.finishRequest( False ) File "/External/Projects/tahoe/support/lib/python2.7/site-packages/Nevow-0.10.0-py2.7.egg/nevow/appserver.py", line 176, in finishRequest server.Request.finish(self) File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/web/http.py", line 866, in finish "Request.finish called on a request after its connection was lost; " exceptions.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this. 2012-12-25 09:16:06.623Z [-] Unhandled error in Deferred: 2012-12-25 09:16:06.909Z [-] Unhandled Error Traceback (most recent call last): File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 610, in _download_current_segment d = self._process_segment(self._current_segment) File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 638, in _process_segment dl.addErrback(self._validation_or_decoding_failed, [reader]) File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 308, in addErrback errbackKeywords=kw) File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 286, in addCallbacks self._runCallbacks() --- <exception caught here> --- File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 542, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 736, in _validation_or_decoding_failed self._mark_bad_share(reader.server, reader.shnum, reader, f) File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 595, in _mark_bad_share self.notify_server_corruption(server, shnum, str(f.value)) File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 938, in notify_server_corruption rref.callRemoteOnly("advise_corrupt_share", exceptions.AttributeError: 'NoneType' object has no attribute 'callRemoteOnly'
comment:12 Changed at 2013-01-02T07:03:54Z by zooko
Thank you, kpreid! That last exception showing that rref was None is a smoking gun that may be useful!
comment:13 in reply to: ↑ 5 Changed at 2013-01-03T04:54:02Z by zooko
Okay, I just (re-?) noticed that the rref is None was reported by kpreid earlier (comment:4) and that then davidsarah wrote:
Replying to davidsarah:
OK, so:
- we still have cases where a disconnected rref is causing errors due to trying to call None. Wasn't there a previous ticket about that?
- the unhandled exception is causing an infinite deferred loop. Let's fix that separately, even if we change this particular case to be handled.
Why do you say there is an infinite deferred loop? Your proposed approach sounds good to me!
comment:14 Changed at 2013-04-03T01:48:55Z by daira
#1939 is a possible duplicate.
comment:15 Changed at 2013-04-03T01:52:08Z by daira
- Owner set to kpreid
kpreid, can you check whether the incident logs on #1939 are consistent with this ticket?
comment:16 follow-up: ↓ 17 Changed at 2013-04-03T02:22:20Z by kpreid
- Owner changed from kpreid to daira
They aren't, I think. There's no major spam and the errors don't look similar.
comment:17 in reply to: ↑ 16 Changed at 2013-04-03T04:07:51Z by daira
Replying to kpreid:
They aren't, I think. There's no major spam and the errors don't look similar.
Okay, thanks for looking at that.
comment:18 in reply to: ↑ 5 Changed at 2013-04-23T21:09:30Z by zooko
Replying to davidsarah:
OK, so:
- we still have cases where a disconnected rref is causing errors due to trying to call None. Wasn't there a previous ticket about that?
#1636. Which was fixed in Tahoe-LAFS 1.9.2. kpreid: what version of Tahoe-LAFS generated the output you show here?
comment:19 follow-up: ↓ 20 Changed at 2013-04-24T01:10:00Z by kpreid
kpreid: what version of Tahoe-LAFS generated the output you show here?
The traceback contains a mention of "1.9.0.post28". The same string is still in my build/lib/allmydata/_version.py, if that's the right place to look. I don't know offhand whether I've updated at all since I had this error, but if I haven't the git revision is dd2e7a1127353a60 plus the patch at https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1628#comment:7 .
(Why yes, I'm not especially keeping my Tahoe in tip-top shape these days...)
comment:20 in reply to: ↑ 19 Changed at 2013-04-24T21:00:06Z by zooko
Replying to kpreid:
kpreid: what version of Tahoe-LAFS generated the output you show here?
The traceback contains a mention of "1.9.0.post28". The same string is still in my build/lib/allmydata/_version.py, if that's the right place to look. I don't know offhand whether I've updated at all since I had this error, but if I haven't the git revision is dd2e7a1127353a60 plus the patch at https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1628#comment:7 .
Great! Then I believe this issue to be fixed in 1.9.2, because it is an instance of #1636. Daira, what did you mean in comment:5 by "the unhandled exception is causing an infinite deferred loop. Let's fix that separately, even if we change this particular case to be handled."? Do you think we can close this ticket as as dup of #1636 now?
kpreid: you could try upgrading to 1.9.2 (or the shiny new 1.10 pre-release!) and then let us know if this problem or anything like it ever recurs. Thanks for the bug report!
comment:21 follow-up: ↓ 22 Changed at 2013-04-25T01:35:11Z by daira
The tracebacks in comment:4 and comment:11 do look the same as in #1636.
What I meant was: why was the traceback repeating every 0.11 seconds in comment:4? There's no mention of that symptom on #1636, and it's difficult to tell whether the patches to retrieve.py in ticket:1636#comment:9 would have incidentally fixed it.
comment:22 in reply to: ↑ 21 Changed at 2013-04-25T15:22:48Z by zooko
Replying to daira:
The tracebacks in comment:4 and comment:11 do look the same as in #1636.
What I meant was: why was the traceback repeating every 0.11 seconds in comment:4? There's no mention of that symptom on #1636, and it's difficult to tell whether the patches to retrieve.py in ticket:1636#comment:9 would have incidentally fixed it.
Maybe it wasn't a loop, but simply that kpreid's deep-check-and-repair job was going to a new file every 0.11 seconds.
comment:23 Changed at 2013-04-25T15:26:08Z by kpreid
Maybe it wasn't a loop, but simply that kpreid's deep-check-and-repair job was going to a new file every 0.11 seconds.
That seems unlikely: I have a total of 15 files.
comment:24 Changed at 2013-04-25T19:53:13Z by zooko
I looked at the code a bit and I didn't notice a way for a loop to happen. The stack trace from the original report features this line:
File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 736, in _validation_or_decoding_failed
But in the 1.9.2 version of retrieve.py and in the current version of retrieve.py there isn't the same code.
Hm... let's see. There is no way that I can see that _validation_or_decoding_failed could lead to a loop, in 1.9.0. Is there any way that there could be a loop that calls out to _validation_or_decoding_failed over and over? Well, this stuff here is complicated. Could it have a loop? No, I don't see it. Any errback from the call to _validation_or_decoding_failed should... Wait, what happens when you get an exception from your errback? I assume it just errbacks again? Hm, this code is very confusing because there are two variables in different scopes named dl. Anyway, I think the intent is that the errback goes here, where it terminates and doesn't loop: loop().
The modern code that replaced this doesn't have the two variables named dl. It still looks like you'll get an exception from the errback (named _handle_bad_shares in this version). Hm, in the current version that exception will then get ignored by line 642 dl = deferredutil.gatherResults(ds) since deferredutil.gatherResults() always sets consumeErrors=True. Could that lead to a loop?
Daira: help!
When you back to this, please attach any incident report files from the relevant time period, and please paste in the output of tahoe --version. By the way, I didn't know Volunteer Grid 1 was still in operation! Does it work reliably for you? How many storage servers and how many users does it gave?