[tahoe-dev] [tahoe-lafs] #653: introducer client: connection count is wrong, VersionedRemoteReference needs EQ

Wed Sep 2 15:00:54 PDT 2009

#653: introducer client: connection count is wrong, VersionedRemoteReference
needs EQ
--------------------------+-------------------------------------------------
 Reporter:  warner        |           Owner:  warner  
     Type:  defect        |          Status:  assigned
 Priority:  major         |       Milestone:  1.6.0   
Component:  code-network  |         Version:  1.3.0   
 Keywords:                |   Launchpad_bug:          
--------------------------+-------------------------------------------------

Comment(by warner):

 Zooko and I talked and did some more analysis. Based on that, we think
 there's a high probability of a foolscap bug (still present in the latest
 0.4.2) that causes notifyOnDisconnect to sometimes not get called,
 probably triggered by "replacement connections" (i.e. where NAT table
 expiries or something cause an asymmetric close, one side reconnects, and
 the other side must displace an existing-but-really-dead connection with
 the new inbound one).

 The tahoe code was rewritten to reduce the damage caused by this sort of
 thing. We could change it further, to remove the use of notifyOnDisconnect
 altogether, with two negative consequences:

  * the welcome-page status display would be unable to show "Connected /
 Not Connected" status for each known server. Instead, it could say "Last
 Connection Established At / Not Connected". Basically we'd know when the
 connection was established, and (with extra code) we could know when we
 last successfully used the connection. And when we tried to use the
 connection and found it down, we could mark the connection as down until
 we'd reestablished it. But we wouldn't notice the actual event of
 connection loss (or the resulting period of not-being-connected) until we
 actually tried to use it. So we couldn't claim to be "connected", we could
 merely claim that we *had* connected at some point, and that we haven't
 noticed becoming disconnected yet (but aren't trying very hard to notice).
  * the share-allocation algorithm allocates share numbers ahead of time
 for each batch of requests, but it wouldn't learn about disconnected
 servers until it tried to send a message to them (this would fail quickly,
 but still not synchronously). This could wind up with shares placed
 0,1,3,4,2 instead of 0,1,2,3,4.
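
 To illustrate the second point, here is a minimal sketch of how the
 out-of-order placement could arise. It is not the actual tahoe uploader:
 place_shares, is_alive, and the server names A..G are hypothetical, and
 the retry policy (failed shares go to the next unused server) is an
 assumption made for the example.

```python
def place_shares(servers, is_alive, num_shares):
    """Bind share numbers to servers in batches, before any reply arrives.

    servers: server names in preference order (hypothetical names).
    is_alive: hypothetical callable standing in for "the request succeeded".
    Returns a list of (server, share_number) pairs in server order.
    """
    placements = {}
    pending = list(range(num_shares))
    position = 0
    while pending and position < len(servers):
        batch = servers[position:position + len(pending)]
        position += len(batch)
        # Share numbers are assigned here, ahead of time for the whole batch.
        assigned = list(zip(batch, pending))
        pending = pending[len(batch):]
        failed = []
        for server, share in assigned:
            if is_alive(server):
                placements[server] = share
            else:
                # The failure is only discovered after the request was sent.
                failed.append(share)
        # Assumed retry policy: failed shares move to the next unused servers.
        pending = failed + pending
    return [(s, placements[s]) for s in servers if s in placements]


servers = ["A", "B", "C", "D", "E", "F", "G"]
dead = {"C"}

# Without disconnect notification: C is still in the list when shares are bound.
lazy = place_shares(servers, lambda s: s not in dead, 5)

# With disconnect notification: C was already dropped from the list.
live_servers = [s for s in servers if s not in dead]
eager = place_shares(live_servers, lambda s: True, 5)

print([share for _, share in lazy])   # [0, 1, 3, 4, 2]
print([share for _, share in eager])  # [0, 1, 2, 3, 4]
```

 In both runs the same five servers end up holding the same five shares;
 only the mapping of share numbers to servers differs.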

 The first problem would be annoying, so I think we're going to leave tahoe
 alone for now. I'll add a note to the foolscap docs to warn users about
 the notifyOnDisconnect bug, and encourage people not to rely upon it in
 environments where replacement connections are likely.

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/653#comment:18>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid