[tahoe-lafs-trac-stream] [Tahoe-LAFS] #541: remove foolscap 'reference'-token bug workaround in mutable publish (was: foolscap 'reference'-token bug workaround in mutable publish)
Tahoe-LAFS
trac at tahoe-lafs.org
Tue Aug 25 17:41:44 UTC 2015
#541: remove foolscap 'reference'-token bug workaround in mutable publish
-----------------------------+---------------------------------------------
     Reporter:  warner       |          Owner:
         Type:  defect       |         Status:  new
     Priority:  major        |      Milestone:  1.12.0
    Component:  code-mutable |        Version:  1.2.0
   Resolution:               |       Keywords:  backward-compatibility foolscap
Launchpad Bug:               |
-----------------------------+---------------------------------------------
Description changed by zooko:
Old description:
> Foolscap 0.2.5 had a bug in which an inbound constraint (say a
> {{{TupleConstraint}}}) wouldn't accept a 'reference' token that pointed
> at a tuple. This would occur when the same tuple was sent multiple times
> in a single request.
>
> The mutable-file publish code, when it is creating a brand-new mutable
> file, and when there are fewer than N servers (so multiple shares are
> sent to the same server), can trigger this case, because the same assert-
> that-there-are-no-shares test vector (a static tuple of numbers and
> strings) is used once per share. Python seems to compile the tuple once,
> so foolscap sees the same object appear in the outbound arguments
> multiple times, and uses the 'reference' token to preserve this
> (unintentional) relationship.
>
> The workaround is to construct this tuple at runtime, basically changing:
>
> {{{
> for x in something:
>     testv = (0, 1, 'eq', '')
>     stuff.append(testv)
> }}}
>
> into:
>
> {{{
> for x in something:
>     testv = tuple([0, 1, 'eq', ''])
>     stuff.append(testv)
> }}}
>
> The constraint bug was fixed in foolscap-0.2.6, but since it's an
> inbound-side bug, what really matters is the version of foolscap on the
> other end of the wire, so to interoperate with older versions of tahoe
> (which may be using older versions of foolscap), we must avoid sending
> the 'reference' token at all. So this workaround (which was removed in
> [changeset:f7b4c45d4651b25b] because tahoe now requires a newer version
> of foolscap) needs to be reinstated.
>
> In addition, I've seen another problem that relates to this code. A few
> days ago I started noticing failures in the buildbot 'speed test', which
> runs trunk code against a statically configured 'perfnet' (with a client
> node on prodcs2 and four storage nodes plus introducer on tahoebs1). Four
> nodes and ten shares means multiple shares per node, so the testv-and-
> readv-and-writev call gets multiple shares, which means a 'reference'
> token.
>
> The storage server appears to still have problems with the 'reference'
> token, even though it's running foolscap-0.3.1. Sometimes I see a
> Violation that claims getObject(reftoken) returned a non-tuple. Sometimes
> I see a "dangling reference" !BananaError, which indicates that
> getObject() failed to find the reference. The !BananaError causes the
> connection to be dropped, so any remaining writev calls fail with a
> !DeadReferenceError.
>
> I can provoke this behavior with a local trunk tree against the storage
> nodes on tahoebs1. I haven't been able to reproduce this with local
> storage nodes (running foolscap-0.3.1). If I reintroduce the
> {{{tuple([0,1,'eq',''])}}} workaround, the problem seems to go away.
>
> If I revert the recent versioning changes (which try to call
> 'get_version' on the storage server and fall back to a default value if
> that fails), then the problem seems to go away. The tahoebs1 perfnet
> nodes are a few months old (they are running r2901), so they don't have
> {{{get_version()}}}. So I suspect that the Violation that gets
> raised somehow messes things up, such that getObject() remembers some bit
> of bogus state, so later getObject() calls either get a bad value or no
> value at all.
>
> I'm going to commit the {{{tuple([0,1,'eq',''])}}} workaround, since we
> need it anyway. But at some point we need to come back to this and
> examine it further, because I still don't understand the new problem.
>
> The first step will be to add instrumentation to the storage servers,
> which means first bouncing one of the tahoebs1 nodes and seeing if it
> still has the problem. (Running the same code locally didn't exhibit the
> problem; the only difference I can think of is that the tahoebs1 nodes
> have been running for two months and might have experienced something in
> that time that messed them up, whereas my local nodes are younger.)
New description:
To close this ticket, remove the workaround, which was needed only to
avoid a Foolscap bug that has long since been fixed.
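Roughly, removing the workaround means the runtime-constructed test vector
shown in the old description below can go back to being a plain literal (a
sketch of the shape of the change, not the exact patch):
{{{
# with the workaround (avoids ever sending Foolscap's 'reference' token):
testv = tuple([0, 1, 'eq', ''])
# after removing it (safe now that the Foolscap bug is long fixed):
testv = (0, 1, 'eq', '')
}}}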
------- old description:
Foolscap 0.2.5 had a bug in which an inbound constraint (say a
{{{TupleConstraint}}}) wouldn't accept a 'reference' token that pointed
at a tuple. This would occur when the same tuple was sent multiple times
in a single request.
The mutable-file publish code, when it is creating a brand-new mutable
file, and when there are fewer than N servers (so multiple shares are sent
to the same server), can trigger this case, because the same assert-that-
there-are-no-shares test vector (a static tuple of numbers and strings) is
used once per share. Python seems to compile the tuple once, so foolscap
sees the same object appear in the outbound arguments multiple times, and
uses the 'reference' token to preserve this (unintentional) relationship.
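A minimal sketch (plain Python, no Tahoe or Foolscap needed) of the object
sharing described above; the identity behaviour is a CPython detail, but it
is what makes Foolscap emit a 'reference' token for the repeats:
{{{
# CPython folds the constant tuple into the compiled code, so every
# iteration yields the very same object:
vectors = [(0, 1, 'eq', '') for _ in range(3)]
assert vectors[0] is vectors[1] is vectors[2]
}}}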
The workaround is to construct this tuple at runtime, basically changing:
{{{
for x in something:
    testv = (0, 1, 'eq', '')
    stuff.append(testv)
}}}
into:
{{{
for x in something:
    testv = tuple([0, 1, 'eq', ''])
    stuff.append(testv)
}}}
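Built this way, each pass through the loop produces a distinct tuple
object, so Foolscap serializes every test vector in full and never has a
repeated object to encode as a 'reference' token.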
The constraint bug was fixed in foolscap-0.2.6, but since it's an inbound-
side bug, what really matters is the version of foolscap on the other end
of the wire, so to interoperate with older versions of tahoe (which may be
using older versions of foolscap), we must avoid sending the 'reference'
token at all. So this workaround (which was removed in
[changeset:f7b4c45d4651b25b] because tahoe now requires a newer version of
foolscap) needs to be reinstated.
In addition, I've seen another problem that relates to this code. A few
days ago I started noticing failures in the buildbot 'speed test', which
runs trunk code against a statically configured 'perfnet' (with a client
node on prodcs2 and four storage nodes plus introducer on tahoebs1). Four
nodes and ten shares means multiple shares per node, so the testv-and-
readv-and-writev call gets multiple shares, which means a 'reference'
token.
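A hedged sketch of how that comes about (argument shapes simplified, not
the exact Tahoe wire signature): every share assigned to a given server
reuses the same test-vector object, so the serialized call carries the
tuple once plus 'reference' tokens for the repeats:
{{{
testv = (0, 1, 'eq', '')             # a single test-vector tuple
tw_vectors = {}
for shnum in [0, 3, 7]:              # several shares landing on one server
    # each share's entry points at the very same testv object
    tw_vectors[shnum] = ([testv], [], None)
}}}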
The storage server appears to still have problems with the 'reference'
token, even though it's running foolscap-0.3.1. Sometimes I see a
Violation that claims getObject(reftoken) returned a non-tuple. Sometimes
I see a "dangling reference" !BananaError, which indicates that
getObject() failed to find the reference. The !BananaError causes the
connection to be dropped, so any remaining writev calls fail with a
!DeadReferenceError.
I can provoke this behavior with a local trunk tree against the storage
nodes on tahoebs1. I haven't been able to reproduce this with local
storage nodes (running foolscap-0.3.1). If I reintroduce the
{{{tuple([0,1,'eq',''])}}} workaround, the problem seems to go away.
If I revert the recent versioning changes (which try to call
'get_version' on the storage server and fall back to a default value if
that fails), then the problem seems to go away. The tahoebs1 perfnet nodes
are a few months old (they are running r2901), so they don't have
{{{get_version()}}}. So I suspect that the Violation that gets raised
somehow messes things up, such that getObject() remembers some bit of
bogus state, so later getObject() calls either get a bad value or no value
at all.
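For reference, a hedged sketch of the probe-and-fall-back pattern described
above (illustrative names, not the exact Tahoe code):
{{{
def probe_server_version(rref, default_version):
    # Ask the storage server for its version over Foolscap; old servers
    # (like the r2901 perfnet nodes) don't implement get_version, so any
    # remote failure falls back to the default instead of aborting.
    d = rref.callRemote("get_version")
    d.addErrback(lambda failure: default_version)
    return d
}}}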
I'm going to commit the {{{tuple([0,1,'eq',''])}}} workaround, since we
need it anyway. But at some point we need to come back to this and
examine it further, because I still don't understand the new problem.
The first step will be to add instrumentation to the storage servers,
which means first bouncing one of the tahoebs1 nodes and seeing if it
still has the problem. (Running the same code locally didn't exhibit the
problem; the only difference I can think of is that the tahoebs1 nodes
have been running for two months and might have experienced something in
that time that messed them up, whereas my local nodes are younger.)
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/541#comment:11>
Tahoe-LAFS <https://Tahoe-LAFS.org>
secure decentralized storage