[tahoe-lafs-trac-stream] [tahoe-lafs] #1679: Nondeterministic NoSharesError for direct CHK download in 1.8.3 and 1.9.1
tahoe-lafs
trac at tahoe-lafs.org
Sun Oct 28 03:47:48 UTC 2012
#1679: Nondeterministic NoSharesError for direct CHK download in 1.8.3 and 1.9.1
--------------------------+------------------------------------
Reporter: nejucomo | Owner: nejucomo
Type: defect | Status: new
Priority: critical | Milestone: soon
Component: code | Version: 1.8.3
Resolution: | Keywords: download heisenbug lae
Launchpad Bug: |
--------------------------+------------------------------------
Comment (by zooko):
Okay, I'm leaving this under this ticket on the assumption that nejucomo's
bug was the same as thedod's. If nejucomo could confirm or disconfirm
that, I would appreciate it!
Thedod's bug proved to be due to caching of filenode objects in
nodemaker.py. An immutable filenode object fails to download (because
thedod's internet connection is flaky), and then a new download of that
same file is attempted. For some reason the old filenode object is still
in place (I don't know why), so the cache hands the new download that old
filenode object, and it immediately fails too.
Brian diagnosed it in the Tahoe-LAFS Sixth Birthday Party hangout. I wrote
a patch which thedod is testing on his live site:
https://dubiousdod.org/uri/URI:DIR2-RO:4ihykxpazbbd3b4y527o5ibeom:edmce7aavjq7lh7git6p4bm5errgc4t65ndewhpivz7pqjj37l7q/jetpack.html#[[28%20October%202012%5D%5D
The patch, which I'll attach, solves it by excluding immutables from the
filenode cache. Brian and I searched for ways that mutables could suffer
from the same sort of problem and concluded that there aren't any, and
mutables need the cache to make the write-lock work, so we're leaving
mutables cached.
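For illustration only, here is a minimal sketch of the idea behind the
patch; it is not the attached nodemaker.py change, and every name in it is
invented:

    # Sketch: cache mutable filenodes (one object per cap, for the
    # write-lock), but always build immutable filenodes fresh, so a node
    # stuck in a failed state is never handed back to a later download.
    class NodeMakerSketch:
        def __init__(self, make_mutable_node, make_immutable_node):
            self._make_mutable_node = make_mutable_node
            self._make_immutable_node = make_immutable_node
            self._node_cache = {}  # cap string -> mutable filenode

        def create_from_cap(self, cap):
            if cap.startswith("URI:SSK"):  # mutable cap (simplified check)
                node = self._node_cache.get(cap)
                if node is None:
                    node = self._make_mutable_node(cap)
                    self._node_cache[cap] = node
                return node
            # immutable cap (e.g. URI:CHK or URI:LIT): no caching
            return self._make_immutable_node(cap)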
There could be much nicer solutions to this, which would keep caching the
filenode objects but make them resume work after a failure: either because
they've reached the end of their rope (I don't like that approach so
much), or by constantly renewing their work as part of how they operate
(serendipity, trying to call back servers which used to work really well
but then disconnected, etc.) (I like that approach a bit better), or in
response to a user initiating a new download (I like that approach a lot:
when a user indicates interest, the program should pull out all the
stops!).
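To illustrate that last alternative, here is a hedged sketch; nothing in
it is existing Tahoe-LAFS code, and has_failed()/make_node are invented:

    # Sketch of "restart on renewed user interest": keep the cache, but if
    # the cached node has already failed, replace it whenever the user
    # initiates a new download instead of reusing the dead object.
    class RetryOnInterestCache:
        def __init__(self, make_node):
            self._make_node = make_node
            self._cache = {}  # cap string -> filenode

        def get_for_download(self, cap):
            node = self._cache.get(cap)
            if node is None or node.has_failed():
                # fresh user interest: drop the dead node, start over
                node = self._make_node(cap)
                self._cache[cap] = node
            return node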
We still need to write a unit test that demonstrates thedod's bug
happening (a toy sketch of such a test is at the end of this comment),
confirm that this patch makes that test go from red to green, commit this
to trunk, and then talk about whether we want to make a 1.9.3 release just
for this bugfix or make people wait for 1.10. This hits LeastAuthority
customers, because they have only one storage server. Other users probably
have this problem masked by automatic fail-over to other servers when the
first few servers get stuck in this state.
Also, other users probably have a more reliable internet connection than
thedod does.
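As a placeholder until the real unit test exists, here is a self-contained
toy model of the failure mode; it does not use the Tahoe-LAFS test
fixtures or a running grid, and every class in it is invented. With the
cache in place, the second download reuses the already-failed node and
fails again; without the cache, it succeeds:

    import unittest

    class FakeImmutableNode:
        def __init__(self, server_up):
            self._server_up = server_up
            self._broken = False

        def download(self):
            if self._broken or not self._server_up():
                self._broken = True  # node never recovers after a failure
                raise RuntimeError("NoSharesError (simulated)")
            return b"plaintext"

    class CachingBugTest(unittest.TestCase):
        def _run(self, use_cache):
            state = {"up": False}
            cache = {}

            def get_node(cap):
                if use_cache:
                    node = cache.get(cap)
                    if node is None:
                        node = cache[cap] = FakeImmutableNode(lambda: state["up"])
                    return node
                return FakeImmutableNode(lambda: state["up"])

            # first download fails while the (only) server is unreachable
            self.assertRaises(RuntimeError, get_node("URI:CHK:x").download)
            state["up"] = True  # connectivity comes back
            return get_node("URI:CHK:x").download()

        def test_without_cache_second_download_succeeds(self):
            self.assertEqual(self._run(use_cache=False), b"plaintext")

        def test_with_cache_second_download_still_fails(self):
            # demonstrates the bug: the cached, already-failed node is reused
            self.assertRaises(RuntimeError, lambda: self._run(use_cache=True))

    if __name__ == "__main__":
        unittest.main()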
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1679#comment:17>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage