[tahoe-lafs-trac-stream] [tahoe-lafs] #1679: Nondeterministic NoSharesError for direct CHK download in 1.8.3 and 1.9.1

Wed Oct 31 10:02:56 UTC 2012

#1679: Nondeterministic NoSharesError for direct CHK download in 1.8.3 and 1.9.1
------------------------------+--------------------------------------------
     Reporter:  nejucomo      |      Owner:  nejucomo
         Type:  defect        |     Status:  new
     Priority:  critical      |  Milestone:  soon
    Component:  code-network  |    Version:  1.8.3
   Resolution:                |   Keywords:  download heisenbug lae test-needed review-needed
Launchpad Bug:                |
------------------------------+--------------------------------------------

Comment (by zooko):

 Replying to [comment:20 nejucomo]:
 > Can I test this manually without waiting for or writing a unittest?
 >
 > In order to do so, I need some more clarification:
 >
 > Where does the invalid cache live?  Is it in the downloading gateway?

 The cache is in the downloading gateway.

 > Does that mean if the gateway cannot connect to the storage server
 > during an immutable download, then the cache records this fact and is not
 > correctly bypassed later?

 I think so. I just checked the code
 ([source:git/src/allmydata/immutable/download/finder.py]), which I think is
 at fault. It asks the storage broker for a list of connected servers when
 it starts, then it tries to use those servers. I think if a server is
 excluded from that list by the storage broker because it isn't connected,
 or if the finder tries to use the server and gets an error (because the
 connection just failed), then the finder will never again try to use that
 server. The cache causes new downloads to use the same finder object.
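
 To make the suspected failure mode concrete, here is a minimal toy model
 of it. This is only a sketch of the hypothesis above, not the real code:
 the classes StorageBroker, Finder and Gateway below are hypothetical
 stand-ins (the real finder.py is more involved), and NoSharesError is
 redefined locally just for illustration. The point is that a finder which
 snapshots the connected-server list once, combined with a per-cap cache
 of finder objects, keeps failing even after the server comes back.

```python
class NoSharesError(Exception):
    """Local stand-in for the error seen by the user (illustration only)."""


class StorageBroker:
    """Hypothetical toy broker: tracks which servers are currently connected."""

    def __init__(self):
        self.connected = set()

    def get_connected_servers(self):
        # Return a copy, so callers hold a snapshot rather than a live view.
        return set(self.connected)


class Finder:
    """Hypothetical toy finder: asks the broker once, when it is created."""

    def __init__(self, broker):
        self._servers = broker.get_connected_servers()  # never refreshed

    def find_share(self):
        if not self._servers:
            raise NoSharesError("no servers left to ask")
        return "share from %s" % sorted(self._servers)[0]


class Gateway:
    """Hypothetical toy gateway: caches one Finder per cap."""

    def __init__(self, broker):
        self._broker = broker
        self._finders = {}

    def download(self, cap):
        # A later download of the same cap reuses the (possibly stale) Finder.
        if cap not in self._finders:
            self._finders[cap] = Finder(self._broker)
        return self._finders[cap].find_share()


def demo():
    """Download the same cap while disconnected, then again after reconnect."""
    broker = StorageBroker()
    gateway = Gateway(broker)
    results = []
    for _ in range(2):
        try:
            results.append(gateway.download("URI:CHK:example"))
        except NoSharesError:
            results.append("NoSharesError")
        broker.connected.add("server1")  # the network is repaired after try 1
    return results
```

 In this model demo() fails both times for the same cap, while a cap that
 is first requested after the reconnect gets a fresh Finder and succeeds.
 If the real code behaves the same way, that would match the symptom.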

 > If all those are true, a manual test would be:
 >
 > a. Pick a known-uploaded CHK cap which has *not* been recently
 > downloaded by the target gateway.
 >
 > b. Prevent the gateway from connecting to relevant storage servers.
 > (For LAE service this is easier because there's only one storage node;
 > ifdown $iface can work for a local gateway test, or adding a special
 > temporary black-hole route for the storage node IP might work for a remote
 > gateway.)
 >
 > c. Attempt to fetch that CAP on the network-impaired gateway.
 >
 > d. Repair the network of the gateway.
 >
 > e. Attempt to fetch that CAP again.  If the fetch fails, this is
 > evidence of the bug.  If not, there's some flaw in these assumptions.
 >
 > f. If e. produces evidence of the bug, then stop that gateway, apply the
 > patch, start the patched gateway and repeat steps a. through e. (with a
 > *new* CAP to help control the experiment).
 >
 > g. Publish the results of step e. in the first iteration (unpatched) and
 > the second iteration (patched).
 >
 > At the same time or afterwards, write a unittest.

 This sounds like a good protocol!
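
 Steps c. and e. could be scripted roughly as follows, so that both
 iterations are run the same way. This is a sketch under assumptions: it
 assumes the gateway's web API is reachable at a URL such as
 http://127.0.0.1:3456 and serves files at /uri/<cap>; the helper names
 fetch_cap and interpret are made up for this example. Impairing and
 repairing the network (steps b. and d.) is still done by hand.

```python
import urllib.error
import urllib.parse
import urllib.request


def fetch_cap(gateway_url, cap, opener=urllib.request.urlopen, timeout=60):
    """Try to download `cap` through the gateway; return (ok, detail).

    `gateway_url` is an assumption, e.g. "http://127.0.0.1:3456".
    `opener` is injectable so the function can be exercised offline.
    """
    url = "%s/uri/%s" % (gateway_url.rstrip("/"),
                         urllib.parse.quote(cap, safe=""))
    try:
        with opener(url, timeout=timeout) as response:
            return True, "%d bytes" % len(response.read())
    except urllib.error.HTTPError as e:
        # The gateway answered, but with an error status.
        return False, "HTTP %d" % e.code
    except (urllib.error.URLError, OSError) as e:
        # The gateway itself could not be reached or the read failed.
        return False, str(e)


def interpret(first_ok, second_ok):
    """Classify one iteration from the outcomes of steps c. and e."""
    if first_ok:
        return "inconclusive: fetch succeeded while the network was impaired"
    if second_ok:
        return "no evidence of the bug: fetch recovered after repair"
    return "evidence of the bug: fetch still fails after repair"
```

 Usage would be: call fetch_cap() once with the network impaired (step
 c.), repair the network, call it again (step e.), and pass the two
 booleans to interpret(). The returned details can go into the report for
 step g.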

-- 
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1679#comment:21>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage