Opened at 2012-10-09T19:54:32Z
Last modified at 2015-08-25T17:39:49Z
#1823 new defect
is this server telling me over foolscap that its foolscap connection to me just broke?
Reported by: | zooko | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | undecided |
Component: | code-network | Version: | 1.9.2 |
Keywords: | foolscap | Cc: | |
Launchpad Bug: |
Description (last modified by zooko)
My local wifi is flapping up and down as I try to fix it, and I just got this failure from my local Tahoe-LAFS gateway, which is connected to exactly one storage server (a LeastAuthority.com server).
yu3a is my gateway, 66ra is the storage server.
06:20:34.655 [1422]: UNUSUAL an outbound callRemote (that we [yu3a] sent to someone else [66ra]) failed on the far end
06:20:34.655 [1423]: reqID=1899, rref=<RemoteReference at 0x33cf3d0 [pb://66rayrcb444dtmcbkuu4typhxcijdg4l@75.101.197.126:12346,10.120.62.180:12346/mbgetcemfcxlejev25bwfyhzrbdiowiu]>, methname=RIStorageServer.tahoe.allmydata.com.get_buckets
06:20:34.655 [1424]: the REMOTE failure was: FAILURE: [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last): Failure: foolscap.tokens.RemoteException: <RemoteException around '[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last): Failure: twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion. ]'> ]
06:20:34.655 [1425]: WEIRD <Checker #947>(67dnhwu46gw4): failure from server on 'get_buckets' the REMOTE failure was: FAILURE: [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last): Failure: foolscap.tokens.RemoteException: <RemoteException around '[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last): Failure: twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion. ]'> ] [INCIDENT-TRIGGER]
I get confused about how many levels of "wrapping of remote thing" I'm seeing, but since there are only two nodes involved (or three if you count the introducer), I guess this means that the storage server told my local gateway that it had lost a foolscap connection. Does that make sense? Pretty confusing.
The server was marked as not-connected on the welcome page after this, but Foolscap did eventually reconnect.
Is this a bug, or should we just close this ticket as not-a-bug?
Change History (2)
comment:1 Changed at 2012-10-09T20:49:07Z by warner
I think this is really the near end telling you that the connection was lost. It might be nice to clean up the error reporting (maybe suppressing it entirely).
I believe the root cause was just a TCP connection that dropped while a storage-server message (get_buckets) was outstanding. The first log message (UNUSUAL) was recorded by Foolscap itself, in foolscap/call.py around line 79, because tahoe turned on logRemoteFailures. It's not being entirely accurate when it says "failed on the far end": what it really means is that the failure occurred after the serialized arguments were sent (so it's not a failure in serialization). Network partitions are, of course, indistinguishable from the remote end suddenly terminating, so they're reported in the same way as real remote-side errors (such as "unknown swissnum" or a schema violation that made it through to the far side).
The second log message (WEIRD) was recorded by Tahoe, in src/allmydata/immutable/checker.py around line 515. It looks like the preceding line was meant to catch it (if f.check(DeadReferenceError)), but the RemoteException-wrapping prevented the check from working. I'm guessing this code was written before you convinced me to make Foolscap report errors more uniformly, and it didn't get updated.
I'm not sure how exactly it should get updated, though. We probably settled on a scheme... maybe if f.check(RemoteException) and f.failure.check(DeadReferenceError)?
comment:2 Changed at 2015-08-25T17:39:49Z by zooko
- Description modified (diff)
Would f.check(RemoteException) and f.failure.check(DeadReferenceError) evaluate to true when the Foolscap reference to the peer broke, and be guaranteed never to evaluate to true as long as the Foolscap reference didn't break, even if the remote peer was sending us messages maliciously crafted to make it evaluate to true? That's what I want.
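To make the wrapping problem concrete, here is a minimal sketch of the check discussed in the comments. It does not import Twisted or Foolscap: the Failure, RemoteException, DeadReferenceError and ConnectionLost classes below are simplified stand-ins written for this example, and is_lost_connection is a hypothetical helper name, not anything in checker.py. Accepting ConnectionLost alongside DeadReferenceError is also an assumption, made only because ConnectionLost is the inner failure shown in the log. In this stand-in the wrapped exception hangs off f.value, so the inner check is spelled f.value.failure.check(...); the exact attribute path in the real code should be confirmed against the Foolscap source.

```python
class DeadReferenceError(Exception):
    """Stand-in for Foolscap's DeadReferenceError (assumption: simplified)."""


class ConnectionLost(Exception):
    """Stand-in for twisted.internet.error.ConnectionLost (assumption: simplified)."""


class Failure:
    """Tiny stand-in for a Twisted-style Failure that wraps one exception."""

    def __init__(self, value):
        self.value = value
        self.type = type(value)

    def check(self, *error_types):
        # Return the first matching type, or None if nothing matches.
        for error_type in error_types:
            if issubclass(self.type, error_type):
                return error_type
        return None


class RemoteException(Exception):
    """Stand-in for Foolscap's RemoteException: carries the wrapped failure."""

    def __init__(self, failure):
        super().__init__(failure)
        self.failure = failure


# Assumption: both of these are treated as "the connection went away".
CONNECTION_ERRORS = (DeadReferenceError, ConnectionLost)


def is_lost_connection(f):
    """Hypothetical helper: True for a connection loss, bare or wrapped."""
    if f.check(*CONNECTION_ERRORS):
        return True
    if f.check(RemoteException):
        # Look one level down, inside the RemoteException wrapper.
        return bool(f.value.failure.check(*CONNECTION_ERRORS))
    return False


# The failure shape from the log: ConnectionLost wrapped in RemoteException.
wrapped = Failure(RemoteException(Failure(ConnectionLost("lost"))))

# The original one-level check misses it because the outer type is RemoteException.
missed = wrapped.check(DeadReferenceError)  # None

# The two-level check sees through the wrapper.
caught = is_lost_connection(wrapped)  # True
```

This sketch only shows why the one-level check stops matching once the failure is wrapped. It does not answer whether a peer could forge a wrapped failure that passes the two-level check; that depends on how Foolscap constructs RemoteException, which the stand-ins do not model.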