[tahoe-dev] [tahoe-lafs] #653: introducer client: connection count is wrong, !VersionedRemoteReference needs EQ
tahoe-lafs
trac at allmydata.org
Sat Jul 18 16:57:10 PDT 2009
#653: introducer client: connection count is wrong, !VersionedRemoteReference
needs EQ
--------------------------+-------------------------------------------------
Reporter: warner | Owner: warner
Type: defect | Status: assigned
Priority: major | Milestone: 1.6.0
Component: code-network | Version: 1.3.0
Keywords: | Launchpad_bug:
--------------------------+-------------------------------------------------
Comment(by zooko):
The only part that I'm not sure about is the way that multiple
{{{RemoteReferences}}} piled up. That is, I'm confident that there is no
bug in TahoeLAFS with regard to this issue, but I'm not confident that
there isn't a bug in foolscap about it. Note that we didn't change the
version of foolscap on the testgrid webapi (foolscap 0.4.1). Maybe
writing down what I know about it in this ticket will help me understand
it better.
Could you summarize what was changed that made me think (in comment:4)
that foolscap v0.4.0 fixed the ''original'' issue of having multiple
{{{RemoteReferences}}} over the same TCP connection to the same peer?
Whatever it was, it didn't fix the problem that was demonstrated using
TahoeLAFS [3897] and foolscap 0.4.1, in which there were many more tuples
of {{{(serverid, servicetype, rref)}}} than there were servers. This
suggests that foolscap 0.4.1 must have been creating new
{{{RemoteReferences}}} and not invoking the callbacks registered via
{{{notifyOnDisconnect()}}} to clean up old {{{RemoteReferences}}} to the
same peers. Now, TahoeLAFS in version [3997] no longer keeps tuples of
{{{(peerid, servicetype, rref)}}}; instead it keeps a dict mapping from
{{{peerid}}} to {{{NativeStorageClientDescriptor}}}, where a
{{{NativeStorageClientDescriptor}}} holds at most one {{{RemoteReference}}}.
When foolscap calls {{{NativeStorageClientDescriptor._got_connection()}}},
the {{{NativeStorageClientDescriptor}}} eventually sets {{{self.rref}}}
to reference the new {{{RemoteReference}}} (see
[source:src/allmydata/storage_client.py@20090717050709-66853-80eda37caf0df376be79f45cbed728999b68a843#L218
storage_client.py line 218]). If {{{self.rref}}} already held a
{{{RemoteReference}}}, it is simply overwritten by the new one, so the
problem of lots of redundant {{{RemoteReference}}}s piling up certainly
can't happen in TahoeLAFS anymore.
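To make that concrete, here is a minimal sketch of the
one-rref-per-descriptor idea. It is ''not'' the actual
{{{storage_client.py}}} code (the real class does much more, and
{{{FakeRemoteReference}}} and {{{DescriptorSketch}}} are invented names
for illustration); it just shows why stale references can no longer
accumulate on the TahoeLAFS side:
{{{
#!python
class FakeRemoteReference:
    """Stand-in for a foolscap RemoteReference, for illustration only."""
    def __init__(self, name):
        self.name = name
        self._disconnect_watchers = []

    def notifyOnDisconnect(self, cb, *args, **kwargs):
        # remember who to call when this connection goes away
        self._disconnect_watchers.append((cb, args, kwargs))

    def fire_disconnect(self):
        # pretend the underlying TCP connection was just lost
        for (cb, args, kwargs) in self._disconnect_watchers:
            cb(*args, **kwargs)


class DescriptorSketch:
    """Toy version of NativeStorageClientDescriptor: at most one rref."""
    def __init__(self, peerid):
        self.peerid = peerid
        self.rref = None

    def _got_connection(self, rref):
        # a new connection simply replaces whatever rref was there before
        self.rref = rref
        rref.notifyOnDisconnect(self._lost)

    def _lost(self):
        # naive cleanup: forget the rref unconditionally
        self.rref = None
}}}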
But if foolscap is generating redundant {{{RemoteReference}}}s and
passing them to TahoeLAFS, could this cause other problems? For example,
I see that {{{NativeStorageClientDescriptor}}} calls
{{{rref.notifyOnDisconnect(self._lost)}}}. If foolscap is generating
redundant {{{rref}}}s, then {{{self._lost}}} might eventually be called
by one of the old ones that is no longer referenced by the
{{{NativeStorageClientDescriptor}}}; if so, wouldn't the current
{{{rref}}} get spuriously removed?
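Using the toy classes from the sketch above, the sequence I'm worried
about would look something like this (hypothetical: it only matters if
foolscap really does hand us redundant rrefs for the same peer):
{{{
#!python
d = DescriptorSketch(peerid="v0-aaaa")

old_rref = FakeRemoteReference("old connection")
new_rref = FakeRemoteReference("new connection")

d._got_connection(old_rref)   # first (soon-to-be-redundant) reference
d._got_connection(new_rref)   # second reference overwrites self.rref

# Later foolscap notices the *old* connection is gone and fires the
# callback that old_rref registered:
old_rref.fire_disconnect()

# The naive _lost() cleared self.rref unconditionally, so the live
# new_rref was spuriously discarded:
assert d.rref is None
}}}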
Maybe we should add some sanity checking to
{{{NativeStorageClientDescriptor}}}, such as asserting that its
{{{self.rref}}} is {{{None}}} whenever its {{{_got_connection()}}} is
called, or asserting that when its {{{self._lost()}}} method is called
the current value of {{{self.rref}}} is the same as it was when
{{{notifyOnDisconnect()}}} was called? In the interests of a stable
v1.5, maybe we could make it so that if these checks fail it does
something "safe": for example, if {{{self._lost()}}} is called when
{{{self.rref}}} is no longer the same object it was when
{{{notifyOnDisconnect()}}} was called, then it leaves {{{self.rref}}}
alone (as well as logging a high-scariness incident).
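Sketched on the toy descriptor above, that "safe" variant might look
like this. It assumes {{{notifyOnDisconnect()}}} can carry extra
callback arguments the way the toy one does, and the {{{print}}} is
just a placeholder for however we would actually record the incident:
{{{
#!python
class SaferDescriptorSketch(DescriptorSketch):
    def _got_connection(self, rref):
        self.rref = rref
        # pass the rref along so _lost() knows which connection is
        # reporting the disconnect
        rref.notifyOnDisconnect(self._lost, rref)

    def _lost(self, rref):
        if rref is not self.rref:
            # a stale RemoteReference disconnected: leave the current
            # self.rref alone and (in the real code) log a
            # high-scariness incident instead of this print
            print("WEIRD: disconnect from stale rref %s" % rref.name)
            return
        self.rref = None
}}}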
What do you think?
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/653#comment:16>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid