#889 closed defect (fixed)

'tahoe deep-check --repair --add-lease' triggers an exception in the client

Reported by: francois Owned by: francois
Priority: critical Milestone: 1.6.0
Component: code Version: 1.5.0
Keywords: logging repair Cc:
Launchpad Bug:

Description

After an upgrade to the latest trunk and a new backup using immutable dirnodes, the daily cronjob which verifies that my data is safely stored logged the following error in twistd.log.

2010-01-10 00:08:27+0100 [-] Unhandled error in Deferred:
2010-01-10 00:08:27+0100 [-] Unhandled Error
        Traceback (most recent call last):
          File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/eventual.py", line 26, in _turn
            cb(*args, **kwargs)
          File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/call.py", line 90, in fail
            self.deferred.errback(why)
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 269, in errback
            self._startRunCallbacks(fail)
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 312, in _startRunCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 328, in _runCallbacks
            self.result = callback(self.result, *args, **kw)
          File "/home/francois/dev/tahoe/src/allmydata/immutable/checker.py", line 123, in _add_lease_failed
            level=log.WEIRD, umid="hEGuQg")
          File "/home/francois/dev/tahoe/src/allmydata/util/log.py", line 24, in err
            tw_log.err(*args, **kwargs)
          File "/usr/lib/python2.5/site-packages/twisted/python/log.py", line 153, in err
            _stuff = failure.Failure()
          File "/usr/lib/python2.5/site-packages/twisted/python/failure.py", line 174, in __init__
            raise NoCurrentExceptionError()
        twisted.python.failure.NoCurrentExceptionError:

checker.py near line 13 looks like that:

        log.err(format="local error in add_lease to [%(peerid)s]: %(f_value)s",
                peerid=idlib.shortnodeid_b2a(peerid),
                f_value=str(f.value),
                failure=f,
                level=log.WEIRD, umid="hEGuQg")

And no accident file was created.

Attachments (1)

debug-log.txt (19.2 KB) - added by francois at 2010-01-10T22:34:05Z.
Debug log coming from 'flogtool tail' running during the error.

Download all attachments as: .zip

Change History (18)

comment:1 Changed at 2010-01-10T02:09:01Z by francois

  • Owner changed from somebody to francois
  • Status changed from new to assigned
(01:06:02 AM) zooko_afk: francois: i'm looking at util/log.py
(01:06:20 AM) zooko_afk: it calls twisted's log.err just to force any unit test that might be running to fail
(01:06:37 AM) zooko_afk: but twisted's log.err expects there to be an exception in existence
(01:06:43 AM) zooko_afk: which in your case there isn't
(01:06:55 AM) zooko_afk: I suggestr that...  we either remove that 
(01:07:13 AM) zooko_afk: or catch this error frpom twoisted and ignore it (ugh)
(01:07:38 AM) zooko_afk: I suggest the former, as a patch for tahoe-lafs v1.6, to be reviewed by brian.
(01:07:41 AM) zooko_afk: want to test it?
(01:07:54 AM) zooko_afk: thos will also reveal the original exception that you found.
(01:08:09 AM) zooko_afk: gotta go do something w irby for a while...
(01:08:39 AM) francois: it won't breaks things if I completely remove it?
(01:08:46 AM) francois: I'll give it a try
(01:44:54 AM) zooko_afk: It looks to me like the tw_log.err() line is unnecessary and is the cause of the problem where the original error is not reported.

comment:2 Changed at 2010-01-10T02:26:59Z by francois

  • Keywords review-needed added
  • Owner changed from francois to warner
  • Status changed from assigned to new

The incriminated lines comes from this patch.

Sat Sep 20 19:34:27 CEST 2008  warner@lothar.com          
  * util.log: send log.err to Twisted too, so that Trial tests are flunked
diff -rN old-tahoe-upstream/src/allmydata/util/log.py new-tahoe-upstream/src/allmydata/util/log.py
2a3
> from twisted.python import log as tw_log
15a17,21
> # If log.err() happens during a unit test, the unit test should fail. We
> # accomplish this by sending it to twisted.log too. When a WEIRD/SCARY/BAD
> # thing happens that is nevertheless handled, use log.msg(failure=f,
> # level=WEIRD) instead.
> 
17c23
<     # this should probably be in foolscap
---
>     tw_log.err(*args, **kwargs)

Brian, what do you think? I'm not sure whether this patch should be rollback.

comment:3 Changed at 2010-01-10T05:58:16Z by zooko

  • Priority changed from major to critical

Looks critical to me, because it looks like we can't add leases to immutable directories (guessing that this is the underlying WEIRD that didn't get logged). Hopefully François will report more after removing the tw_log.err() feature.

comment:4 Changed at 2010-01-10T05:59:13Z by zooko

  • Owner changed from warner to francois

François: don't wait for Brian to look at the tw_log.err() thing -- investigating the underlying error is more urgent. Also I hereby take responsibility if Brian later says that removing that line was totally wrong. :-)

comment:5 Changed at 2010-01-10T08:03:22Z by warner

Huh, I didn't know that twisted.python.log was so picky about tw_log.err being called outside of an exception context. Looks like this has been around since twisted-2.4.0, though.

So, we want the tw_log.err to be there, but we also want to make sure that an exception in it won't cause the subsequent foolscap log.err to get bypassed (and worse, any other code following the call to allmydata.util.log.err will also be bypassed, although usually log.err lives at the end of the callback chain so there is rarely much code sitting after it). log.msg and log.err must never raise exceptions.

It's probably still a good idea to move this functionality into Foolscap. But for now, I'd do two things:

  • have util.log.err catch-and-ignore any exceptions raised by tw_log.err, or maybe just this NoCurrentExceptionError
  • have util.log.err pull any failure= argument out of its kwargs and pass it as the first positional argument to tw_log.err . The root problem here is that the _add_lease_failed function is an errback, so the relevant exception has already been captured and bundled up into a Failure instance, but tw_log.err isn't being told about it properly. Foolscap's log.err looks for a failure in the failure= argument, but tw_log.err wants it to be the first positional argument. If it doesn't find one there, then it tries to create a new Failure, which explodes when you're in an errback and not in a synchronous exception handler.

I agree with Zooko that commenting out the tw_log.err and reproducing the problem is a good direction to go. I'd like to know what the local error was.

comment:6 Changed at 2010-01-10T14:51:37Z by francois

Ok, after commenting tw_log.err(), the same exception occurs, but in foolscap this time. I'll try to add some code to dump more informations but it's gonna be slow because it takes about 30 minutes before the problem occurs.

2010-01-10 14:38:24+0100 [-] Unhandled error in Deferred:
2010-01-10 14:38:24+0100 [-] Unhandled Error
        Traceback (most recent call last):
          File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/eventual.py", line 26, in _turn
            cb(*args, **kwargs)
          File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/call.py", line 90, in fail
            self.deferred.errback(why)
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 269, in errback
            self._startRunCallbacks(fail)
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 312, in _startRunCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 328, in _runCallbacks
            self.result = callback(self.result, *args, **kw)
          File "/home/francois/dev/tahoe/src/allmydata/immutable/checker.py", line 123, in _add_lease_failed
            level=log.WEIRD, umid="hEGuQg")
          File "/home/francois/dev/tahoe/src/allmydata/util/log.py", line 27, in err
            return log.err(*args, **kwargs)
          File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/logging/log.py", line 218, in err
            _stuff = failure.Failure()
          File "/usr/lib/python2.5/site-packages/twisted/python/failure.py", line 174, in __init__
            raise NoCurrentExceptionError()
        twisted.python.failure.NoCurrentExceptionError:

comment:7 Changed at 2010-01-10T20:58:30Z by francois

Here is the arguments that log.err() received.

2010-01-10 20:08:21+0100 [-] {'level': 30, 'format': 'local error in add_lease to [%(peerid)s]: %(f_value)s', 'failure': <twisted.python.failure.Failure <class 'foolscap.ipb.DeadReferenceError'>>, 'f_value': 'Connection was lost', 'peerid': 'wpxpieqx', 'umid': 'hEGuQg'}

comment:8 Changed at 2010-01-10T22:27:57Z by zooko

class 'foolscap.ipb.DeadReferenceError' suggests that the connection to a storage server turned out to be broken during the add lease process, and there is some part of the add-lease code which doesn't have a .addErrback() which knows how to deal with the connection failing.

Maybe try the fix to logging that Brian suggested and see if the correctly-logged diagnostics still suggest that this is the problem?

Changed at 2010-01-10T22:34:05Z by francois

Debug log coming from 'flogtool tail' running during the error.

comment:9 Changed at 2010-01-10T22:37:04Z by francois

The incriminated peer 'wpxpieqx' is running the following Tahoe version and is still accepting new shares, ie the disk is not full.

allmydata-tahoe: 1.5.0-r4054, foolscap: 0.4.2, pycryptopp: 0.5.16-r672, zfec: 1.4.6-r333, Twisted: 8.1.0, Nevow: 0.9.31, zope.interface: 3.3.1, python: 2.5.2, platform: Linux-debian_5.0.3-i686-32bit, sqlite: 3.5.9, simplejson: 1.9.2, argparse: 1.0.1, pyOpenSSL: 0.7, pyutil: 1.4.1, zbase32: 1.1.1, setuptools: 0.6c8, pysqlite: 2.3.2

comment:10 Changed at 2010-01-10T22:54:41Z by francois

This is what gets logged in the storage node itself.

2010-01-10 20:08:23+0100 [-] Unhandled Error
        Traceback (most recent call last):
        Failure: foolscap.ipb.DeadReferenceError: [b5le] replaced by a new connection

I also saw occurrences of this exception from long ago when the Tahoe client was running 1.5.0, so it isn't likely to be a new bug introduced in 1.6.0.

comment:11 Changed at 2010-01-11T01:27:59Z by zooko

  • Summary changed from 'tahoe deep-check --repair --add-lease' triggers a logging error to 'tahoe deep-check --repair --add-lease' triggers an exception in the client

comment:12 Changed at 2010-01-11T04:21:00Z by zooko

  • Keywords review-needed removed

comment:13 Changed at 2010-01-11T04:21:24Z by zooko

  • Keywords logging added; log removed

comment:14 Changed at 2010-01-11T19:13:48Z by warner

hrm, DeadReferenceError is supposed to be a remote failure, not a local one. It should have followed the other error path (the one which says "error in add_lease from ..." instead of "local error in add_lease to..").

That said, yeah, even the remote-failure branch isn't looking for DeadReferenceError. That should certainly be ignored.

Ok, so I think we have three issues here:

  • allmydata.util.log.err must be fixed to pass a failure into tw_log.err properly
  • decide if DeadReferenceError is supposed to pass f.check(RemoteException), fix it to do so
  • fix _add_lease_failed to ignore DeadReferenceError, in both immutable and mutable cases

comment:15 Changed at 2010-01-12T00:15:35Z by warner

I've opened (and closed as WONTFIX) Foolscap#145 for the second issue: I decided that DeadReferenceError should remain separate. And therefore I just pushed bacb6fe5aad1f416 to cover the last issue: ignore DeadReferenceError in add_lease calls.

So the remaining issue is to get util.log.err fixed.

comment:16 follow-up: Changed at 2010-01-12T01:45:13Z by warner

d888bf33772327f2 includes the other fix, to call log.err properly (either with a Failure as the first positional argument, or in an exception-handler context).

I think that should fix this ticket, but I'm hoping Francois can confirm that the original problem goes away. (since it looks like that error was being raised because one of the storage servers was disconnecting during the deep-check, and the new code should correctly ignore that disconnection. If you were seeing that message a lot, then there will still be an open question as to why a server is coming and going frequently).

comment:17 in reply to: ↑ 16 Changed at 2010-01-12T13:37:43Z by francois

  • Resolution set to fixed
  • Status changed from new to closed

Replying to warner:

I think that should fix this ticket, but I'm hoping Francois can confirm that the original problem goes away. (since it looks like that error was being raised because one of the storage servers was disconnecting during the deep-check, and the new code should correctly ignore that disconnection. If you were seeing that message a lot, then there will still be an open question as to why a server is coming and going frequently).

Yes, the original problem (exception trigged on the client) is fixed now, thanks for that. I still see foolscap.ipb.DeadReferenceError on the storage nodes a few times a day, does that qualify as frequently? Any hint on how to debug those foolscap disconnections would be very welcome.

Anyway, it's probably time move the disconnection problem to another ticket, see #896.

Note: See TracTickets for help on using tickets.