Opened at 2007-05-02T21:46:39Z
Closed at 2012-06-12T22:56:04Z
#26 closed defect (fixed)
introducer doesn't seem to forget about old peers, or peers don't forget about old peers
Reported by: | warner | Owned by: | warner |
---|---|---|---|
Priority: | major | Milestone: | eventually |
Component: | code-network | Version: | 0.6.0 |
Keywords: | availability introducer scalability performance | Cc: | |
Launchpad Bug: |
Description
If you restart a node, the new instance retains the same TubID as the old one (since we stash the SSL certificate in the 'client.pem' file), but it gets a new so-called "Swiss Number". The introducer node is then informed of both, and this gets announced to everybody else.
The problem is that it looks like the rest of the world doesn't forget about the old swissnumber when the first instance of the node is shut down. They keep trying to reconnect to both the old one and the new one. The connection attempts to the old one *are* able to attach to the Tub (since the certificate is still the same), but of course the old swissnumber is gone, and thus you get benign but annoying error messages in everybody's log files as the getReferenceByName call fails.
I think that our introducer scheme is the absolute simplest thing that could possibly work, and as such it doesn't ever send out negative announcements. Without these, the clients will not know they should turn off their Reconnectors for the missing peers, and they'll keep trying to reach them over and over again.
We should fix this.
Change History (14)
comment:1 Changed at 2007-05-02T23:25:00Z by warner
comment:2 Changed at 2007-05-04T05:17:50Z by warner
- Owner changed from somebody to warner
- Priority changed from minor to major
- Status changed from new to assigned
I'm working on a patch for this. It will change the RIIntroducerClient protocol, though, by adding a lost_peers message.
comment:3 Changed at 2007-05-08T02:18:41Z by warner
- Resolution set to fixed
- Status changed from assigned to closed
I've checked in that patch, so this issue is closed. I'll probably wait until the 0.2.1 release before upgrading testnet, though.
comment:4 Changed at 2007-05-08T15:27:29Z by zooko
I'm not sure what the problem is exactly:
- annoying messages in logs
- wasted effort trying to reconnect to a node that never comes back
- wasted effort trying to reconnect to a node that you already have a new, better connection to
?
comment:5 Changed at 2007-05-08T15:28:39Z by zooko
- Resolution fixed deleted
- Status changed from closed to reopened
Oh I didn't realize you'd already implemented the lost_peers message. So far as I understand the issue, that doesn't seem like a good idea. Let's re-open this ticket and talk about it.
comment:6 Changed at 2007-05-09T07:42:24Z by warner
The log messages are annoying, but that's just a symptom of the real problem. #3 is the real issue, although it would probably be more accurate to describe the new connection as "better" because the old one is completely useless.
The way we create the IntroducerClient FURLs causes us to pick a new one each time the node boots, so once a node has shut down, the FURL that was used for that incarnation will never be used again, so it's inappropriate for anyone else to remember it. The log messages show up because the new incarnation of the node *does* use the same SSL certificate, and therefore gets the same TubID, and probably has the same IP address, so every other node in the system is trying to talk to the ghosts of our previous lives, and we log a getReferenceByName failure for each attempt.
Our current peer-connection/mesh-maintenance algorithm (v1 = fully-connected mesh) obligates us to treat nodes which are connected to the Introducer as active, and nodes which are not as inactive. So for both these reasons, once the node goes away, we need to stop trying to talk to it.
Part of this problem could be addressed by persisting the randomly-generated FURL so that each node's IntroducerClient would have the same name each time. (I'm not convinced this is the right approach for this issue; independent of that, I think this sort of persistence is an important pattern that I want Foolscap to provide convenient access to.) If we had that, then the client would get the same FURL each time, and the other nodes would only keep trying to connect to a single object per Tub instead of lots of them. This would still treat nodes that are no longer in contact with the Introducer as being active, however, which doesn't match what our v1 specs call for, and will cause 1 connection attempt per hour per active peer for each of the IP addresses that have been used by nodes in the past.
comment:7 Changed at 2007-05-09T18:39:47Z by warner
I've rolled back the lost_peers change pending further discussion.
comment:8 Changed at 2007-05-09T18:50:00Z by zooko
Chatting with Brian on the phone.
A better way to describe this problem is "#4: wasted effort (on both initiator and recipient) trying to use a connection which you know can never succeed".
The annoying logging is also very important -- we want to improve logging. We want to add ways to control what gets logged and the way it gets logged, and it is also important to reduce false positives in the log so as to facilitate effective use of the logs.
The centralized introducer is an implementation expedience -- the long term goal is to reduce and eliminate such points of centralization.
Restating my original list of possible problems that we might want to solve:
1. False alarms in log.
2. Wasted effort trying to connect and failing.
3. Wasted effort trying to connect on a channel which has been superseded by a "better" channel.
4. Wasted effort trying to connect on a channel which you know can never succeed.
We'll get back to this when Brian has more design bandwidth. For the moment we're leaving ugly log messages and wasted effort in place.
comment:9 Changed at 2007-06-07T19:02:57Z by warner
Zooko and I chatted a bit, and we've settled on a short-term (0.2.1) fix: just persist the IntroducerClient's swissnumbers. That will get rid of the noise from clients who restart (and would thus change their FURLs). It will leave the noise from clients who shut down and never restart, but I think that's less of an issue.
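A minimal sketch of what "persist the swissnumber" could look like, assuming the node keeps it in a small text file in its base directory. The function name, the file name, and the 20-byte length are illustrative assumptions, not what the actual patch does:

```python
import os
import secrets


def load_or_create_swissnum(path: str, nbytes: int = 20) -> str:
    """Return the swissnumber stored at `path`, generating one on first use.

    Hypothetical helper: the name, file layout, and length are assumptions.
    """
    try:
        with open(path, "r", encoding="ascii") as f:
            swissnum = f.read().strip()
        if swissnum:
            # A previous incarnation already chose a value: reuse it, so the
            # FURL built from it stays the same across restarts.
            return swissnum
    except FileNotFoundError:
        pass

    # First boot (or empty file): pick a fresh unguessable value.
    swissnum = secrets.token_hex(nbytes)

    # Write to a temporary file and rename it into place, so a crash
    # mid-write does not leave a truncated swissnumber behind.
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="ascii") as f:
        f.write(swissnum + "\n")
    os.replace(tmp, path)
    return swissnum
```

With this in place, a restarted node re-registers under the same name, so peers holding the old FURL reconnect to a live object instead of failing in getReferenceByName.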
One datapoint:
- the testnet machines are emitting 300kB of logs per day, with no user activity. This is entirely a result of these wasted connection attempts.
Zooko is concerned about inadvertently adding (one might say "introducing") centralization to our architecture. Giving the Introducer the ability to forcibly disconnect peers turns it from an Introducer into a Dictator-Of-Who-Gets-To-Be-In-The-Mesh, and if we actually want that, it should be explicit and distinct. There are several reasons for wanting to create "private" meshes, but these will be implemented with membership credentials rather than by trying to control introduction.
The next connection-management milestone will be updated to specify that the Introducer exists to improve peer-discovery, but the mesh should not be dependent upon it. In particular, if the Introducer goes away, all existing connections should continue to work, and nodes should maintain their own independent decisions about which peers are useful to connect to and which are not. Eventually we want the mesh to grow via peer-learned gossip and reduce our dependence upon the Introducer until it goes away completely or only exists to bootstrap the network.
comment:10 Changed at 2007-08-14T18:53:15Z by warner
- Component changed from code to code-network
comment:11 Changed at 2007-09-25T04:18:16Z by zooko
- Milestone set to undecided
- Version set to 0.6.0
comment:12 Changed at 2009-12-13T04:03:44Z by davidsarah
- Keywords availability introducer scalability added
comment:13 Changed at 2010-12-16T00:57:28Z by davidsarah
- Keywords performance added
comment:14 Changed at 2012-06-12T22:56:04Z by warner
- Resolution set to fixed
- Status changed from reopened to closed
The "short term fix" has been in place successfully for 5 years now, and is working fine. Nodes use stable FURLs, so the only deadwood/noise is from nodes that have left for good. Until clients remember servers independently of the Introducer, occasional reboots will clear out dead announcements (specifically: after the node leaves, after the next Introducer reboot, then after the following client reboot, the client will no longer attempt to contact the missing node).
When clients are modified to remember servers on their own (as part of the #68 distributed-introduction work), they should include some timeouts, so nodes that haven't been heard from in any form (gossip/introduction or actual connection) can be forgotten.
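A rough sketch of the timeout idea, assuming the client records the last time it heard from each server (by announcement or by connection) and periodically drops the ones that have been silent too long. The class, its method names, and the injectable clock are all hypothetical, not part of the existing code:

```python
import time
from typing import Callable, Dict, List


class ServerMemory:
    """Hypothetical client-side record of servers, with silence timeouts."""

    def __init__(self, max_silence: float,
                 clock: Callable[[], float] = time.time) -> None:
        self._max_silence = max_silence  # seconds of silence tolerated
        self._clock = clock              # injectable for testing
        self._last_heard: Dict[str, float] = {}

    def heard_from(self, furl: str) -> None:
        # Call this on any sign of life: gossip, introduction, or an
        # actual successful connection.
        self._last_heard[furl] = self._clock()

    def expire(self) -> List[str]:
        """Forget servers silent for longer than max_silence; return them."""
        cutoff = self._clock() - self._max_silence
        stale = sorted(f for f, t in self._last_heard.items() if t < cutoff)
        for furl in stale:
            del self._last_heard[furl]
        return stale

    def known(self) -> List[str]:
        return sorted(self._last_heard)
```

The decision to forget stays local to each client, which fits the goal of not having the Introducer dictate mesh membership.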
I checked the code, and the introducer does indeed forget about peers that go away, but it does not tell anyone about their loss.
I think we need to add a 'lost_peers' method next to the existing 'new_peers' one. It should take a set of FURLs that are no longer in the mesh. The IntroducerClient should shut down any reconnector it has for a FURL that appears in a 'lost_peers' message.
The fact that the introducer sends these out in a strict order means (I believe) that the new_peers/lost_peers messages for a given FURL should be strictly interleaved: no add/add/lost/lost situations. I think that makes this safe from races.
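For illustration only, here is a sketch of the client-side bookkeeping that the 'lost_peers' proposal implies, assuming messages arrive in the order the introducer sent them. `FakeReconnector` and `PeerTracker` are hypothetical stand-ins; they are not Foolscap's Reconnector API or the real IntroducerClient:

```python
from typing import Callable, Dict, Iterable, List


class FakeReconnector:
    """Hypothetical stand-in for a per-FURL reconnector (not Foolscap's API)."""

    def __init__(self, furl: str) -> None:
        self.furl = furl
        self.running = True

    def stop(self) -> None:
        self.running = False


class PeerTracker:
    """Hypothetical tracker: one reconnector per currently-announced FURL."""

    def __init__(self,
                 make_reconnector: Callable[[str], FakeReconnector] = FakeReconnector
                 ) -> None:
        self._make = make_reconnector
        self._reconnectors: Dict[str, FakeReconnector] = {}

    def new_peers(self, furls: Iterable[str]) -> None:
        for furl in furls:
            # Start a reconnector only for FURLs we are not already tracking.
            if furl not in self._reconnectors:
                self._reconnectors[furl] = self._make(furl)

    def lost_peers(self, furls: Iterable[str]) -> None:
        for furl in furls:
            # Stop and forget the reconnector; unknown FURLs are ignored.
            rc = self._reconnectors.pop(furl, None)
            if rc is not None:
                rc.stop()

    def active(self) -> List[str]:
        return sorted(self._reconnectors)
```

Because add and lost messages for a given FURL alternate under the ordering assumption, each 'lost_peers' entry matches at most one live reconnector, and a stray duplicate is harmless.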