[tahoe-dev] [tahoe-lafs] #653: introducer client: connection count is wrong, !VersionedRemoteReference needs EQ
tahoe-lafs
trac at allmydata.org
Sat Jul 18 16:57:10 PDT 2009
#653: introducer client: connection count is wrong, !VersionedRemoteReference
needs EQ
--------------------------+-------------------------------------------------
Reporter: warner | Owner: warner
Type: defect | Status: assigned
Priority: major | Milestone: 1.6.0
Component: code-network | Version: 1.3.0
Keywords: | Launchpad_bug:
--------------------------+-------------------------------------------------
Comment(by zooko):
The only part that I'm not sure about is the way that multiple
{{{RemoteReferences}}} piled up. That is, I'm confident that there is no
bug in TahoeLAFS with regard to this issue, but I'm not confident that
there isn't a bug in foolscap about it. Note that we didn't change the
version of foolscap on the testgrid webapi (foolscap 0.4.1). Maybe
writing down what I know about it in this ticket will help me understand
it better.
Could you summarize what was changed that made me think (in comment:4)
that foolscap v0.4.0 fixed the ''original'' issue of having multiple
{{{RemoteReferences}}} over the same TCP connection to the same peer?
Whatever it was, it didn't fix the problem that was demonstrated using
TahoeLAFS [3897] and foolscap 0.4.1, in which there were many more tuples
of {{{(serverid, servicetype, rref)}}} than there were servers. This
suggests that foolscap 0.4.1 must have been creating new
{{{RemoteReferences}}} and not invoking the callbacks registered via
{{{notifyOnDisconnect()}}} to clean up old {{{RemoteReferences}}} to the
same peers. Now, TahoeLAFS in version [3997] no longer keeps tuples of
{{{(peerid, servicetype, rref)}}}; instead it keeps a dict mapping from
{{{peerid}}} to {{{NativeStorageClientDescriptor}}}, where a
{{{NativeStorageClientDescriptor}}} holds at most one {{{RemoteReference}}}.
When foolscap calls {{{NativeStorageClientDescriptor._got_connection()}}},
the {{{NativeStorageClientDescriptor}}} eventually sets {{{self.rref}}}
to reference the new {{{RemoteReference}}} (see
[source:src/allmydata/storage_client.py@20090717050709-66853-80eda37caf0df376be79f45cbed728999b68a843#L218
storage_client.py line 218]). If {{{self.rref}}} already held a
{{{RemoteReference}}}, it is simply overwritten by the new one, so the
problem of lots of redundant {{{RemoteReference}}}s piling up certainly
can't happen in TahoeLAFS anymore.
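To make that concrete, here is a minimal sketch of the
one-rref-per-descriptor idea. It is ''not'' the actual
{{{storage_client.py}}} code (the real class does much more, and
{{{FakeRemoteReference}}} and {{{DescriptorSketch}}} are invented names
for illustration); it just shows why stale references can no longer
accumulate on the TahoeLAFS side:
{{{
#!python
class FakeRemoteReference:
    """Stand-in for a foolscap RemoteReference, for illustration only."""
    def __init__(self, name):
        self.name = name
        self._disconnect_watchers = []

    def notifyOnDisconnect(self, cb, *args, **kwargs):
        # remember who to call when this connection goes away
        self._disconnect_watchers.append((cb, args, kwargs))

    def fire_disconnect(self):
        # pretend the underlying TCP connection was just lost
        for (cb, args, kwargs) in self._disconnect_watchers:
            cb(*args, **kwargs)


class DescriptorSketch:
    """Toy version of NativeStorageClientDescriptor: at most one rref."""
    def __init__(self, peerid):
        self.peerid = peerid
        self.rref = None

    def _got_connection(self, rref):
        # a new connection simply replaces whatever rref was there before
        self.rref = rref
        rref.notifyOnDisconnect(self._lost)

    def _lost(self):
        # naive cleanup: forget the rref unconditionally
        self.rref = None
}}}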
But if foolscap is generating redundant {{{RemoteReference}}}s and
passing them to TahoeLAFS, could this cause other problems? For example,
I see that {{{NativeStorageClientDescriptor}}} calls
{{{rref.notifyOnDisconnect(self._lost)}}}. If foolscap is generating
redundant {{{rref}}}s, then {{{self._lost}}} might eventually be called
by one of the old ones that is no longer referenced by the
{{{NativeStorageClientDescriptor}}}; if so, wouldn't the current
{{{rref}}} get spuriously removed?
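Using the toy classes from the sketch above, the sequence I'm worried
about would look something like this (hypothetical: it only matters if
foolscap really does hand us redundant rrefs for the same peer):
{{{
#!python
d = DescriptorSketch(peerid="v0-aaaa")

old_rref = FakeRemoteReference("old connection")
new_rref = FakeRemoteReference("new connection")

d._got_connection(old_rref)   # first (soon-to-be-redundant) reference
d._got_connection(new_rref)   # second reference overwrites self.rref

# Later foolscap notices the *old* connection is gone and fires the
# callback that old_rref registered:
old_rref.fire_disconnect()

# The naive _lost() cleared self.rref unconditionally, so the live
# new_rref was spuriously discarded:
assert d.rref is None
}}}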
Maybe we should add some sanity checking to
{{{NativeStorageClientDescriptor}}}, such as asserting that its
{{{self.rref}}} is {{{None}}} whenever its {{{_got_connection()}}} is
called, or asserting that when its {{{self._lost()}}} method is called
the current value of {{{self.rref}}} is the same as it was when
{{{notifyOnDisconnect()}}} was called? In the interests of a stable
v1.5, maybe we could make it so that if these checks fail it does
something "safe": for example, if {{{self._lost()}}} is called when
{{{self.rref}}} is no longer the same object it was when
{{{notifyOnDisconnect()}}} was called, then it leaves {{{self.rref}}}
alone (as well as logging a high-scariness incident).
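Sketched on the toy descriptor above, that "safe" variant might look
like this. It assumes {{{notifyOnDisconnect()}}} can carry extra
callback arguments the way the toy one does, and the {{{print}}} is
just a placeholder for however we would actually record the incident:
{{{
#!python
class SaferDescriptorSketch(DescriptorSketch):
    def _got_connection(self, rref):
        self.rref = rref
        # pass the rref along so _lost() knows which connection is
        # reporting the disconnect
        rref.notifyOnDisconnect(self._lost, rref)

    def _lost(self, rref):
        if rref is not self.rref:
            # a stale RemoteReference disconnected: leave the current
            # self.rref alone and (in the real code) log a
            # high-scariness incident instead of this print
            print("WEIRD: disconnect from stale rref %s" % rref.name)
            return
        self.rref = None
}}}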
What do you think?
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/653#comment:16>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid