[tahoe-lafs-trac-stream] [tahoe-lafs] #1679: Nondeterministic NoSharesError for direct CHK download in 1.8.3 and 1.9.1
tahoe-lafs
trac at tahoe-lafs.org
Sun Oct 28 03:47:48 UTC 2012
#1679: Nondeterministic NoSharesError for direct CHK download in 1.8.3 and 1.9.1
--------------------------+------------------------------------
Reporter: nejucomo | Owner: nejucomo
Type: defect | Status: new
Priority: critical | Milestone: soon
Component: code | Version: 1.8.3
Resolution: | Keywords: download heisenbug lae
Launchpad Bug: |
--------------------------+------------------------------------
Comment (by zooko):
Okay, I'm leaving this under this ticket on the assumption that nejucomo's
bug was the same as thedod's. If nejucomo could confirm or disconfirm
that, I would appreciate it!
Thedod's bug proved to be due to caching of filenode objects in
nodemaker.py. An immutable filenode object fails to download (because
thedod's internet connection is flaky), and then a new download of that
same file is attempted. For some reason the old filenode object is still
in place (I don't know why), so the cache hands the new download that old
filenode object, and it immediately fails too.
Brian diagnosed it in the Tahoe-LAFS Sixth Birthday Party hangout. I wrote
a patch which thedod is testing on his live site:
https://dubiousdod.org/uri/URI:DIR2-RO:4ihykxpazbbd3b4y527o5ibeom:edmce7aavjq7lh7git6p4bm5errgc4t65ndewhpivz7pqjj37l7q/jetpack.html#[[28%20October%202012%5D%5D
The patch, which I'll attach, solves it by excluding immutables from the
filenode cache. Brian and I searched for ways that mutables could suffer
from the same sort of problem and concluded that there aren't any, and
mutables need the cache to make the write-lock work, so we're leaving
mutables cached.
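For illustration only, here is a minimal sketch of the idea behind the
patch; it is not the attached nodemaker.py change, and every name in it is
invented:

    # Sketch: cache mutable filenodes (one object per cap, for the
    # write-lock), but always build immutable filenodes fresh, so a node
    # stuck in a failed state is never handed back to a later download.
    class NodeMakerSketch:
        def __init__(self, make_mutable_node, make_immutable_node):
            self._make_mutable_node = make_mutable_node
            self._make_immutable_node = make_immutable_node
            self._node_cache = {}  # cap string -> mutable filenode

        def create_from_cap(self, cap):
            if cap.startswith("URI:SSK"):  # mutable cap (simplified check)
                node = self._node_cache.get(cap)
                if node is None:
                    node = self._make_mutable_node(cap)
                    self._node_cache[cap] = node
                return node
            # immutable cap (e.g. URI:CHK or URI:LIT): no caching
            return self._make_immutable_node(cap)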
There could be much nicer solutions to this, which would keep caching the
filenode objects but make them resume work after a failure: either because
they've reached the end of their rope (I don't like that approach so
much), or by constantly renewing their work as part of how they operate
(serendipity, trying to call back servers which used to work really well
but then disconnected, etc.) (I like that approach a bit better), or in
response to a user initiating a new download (I like that approach a lot:
when a user indicates interest, the program should pull out all the
stops!).
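To illustrate that last alternative, here is a hedged sketch; nothing in
it is existing Tahoe-LAFS code, and has_failed()/make_node are invented:

    # Sketch of "restart on renewed user interest": keep the cache, but if
    # the cached node has already failed, replace it whenever the user
    # initiates a new download instead of reusing the dead object.
    class RetryOnInterestCache:
        def __init__(self, make_node):
            self._make_node = make_node
            self._cache = {}  # cap string -> filenode

        def get_for_download(self, cap):
            node = self._cache.get(cap)
            if node is None or node.has_failed():
                # fresh user interest: drop the dead node, start over
                node = self._make_node(cap)
                self._cache[cap] = node
            return node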
We still need to write a unit test that demonstrates thedod's bug
happening (a toy sketch of such a test is at the end of this comment),
confirm that this patch makes that test go from red to green, commit this
to trunk, and then talk about whether we want to make a 1.9.3 release just
for this bugfix or make people wait for 1.10. This hits LeastAuthority
customers, because they have only one storage server. Other users probably
have this problem masked by automatic fail-over to other servers when the
first few servers get stuck in this state.
Also, other users probably have a more reliable internet connection than
thedod does.
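As a placeholder until the real unit test exists, here is a self-contained
toy model of the failure mode; it does not use the Tahoe-LAFS test
fixtures or a running grid, and every class in it is invented. With the
cache in place, the second download reuses the already-failed node and
fails again; without the cache, it succeeds:

    import unittest

    class FakeImmutableNode:
        def __init__(self, server_up):
            self._server_up = server_up
            self._broken = False

        def download(self):
            if self._broken or not self._server_up():
                self._broken = True  # node never recovers after a failure
                raise RuntimeError("NoSharesError (simulated)")
            return b"plaintext"

    class CachingBugTest(unittest.TestCase):
        def _run(self, use_cache):
            state = {"up": False}
            cache = {}

            def get_node(cap):
                if use_cache:
                    node = cache.get(cap)
                    if node is None:
                        node = cache[cap] = FakeImmutableNode(lambda: state["up"])
                    return node
                return FakeImmutableNode(lambda: state["up"])

            # first download fails while the (only) server is unreachable
            self.assertRaises(RuntimeError, get_node("URI:CHK:x").download)
            state["up"] = True  # connectivity comes back
            return get_node("URI:CHK:x").download()

        def test_without_cache_second_download_succeeds(self):
            self.assertEqual(self._run(use_cache=False), b"plaintext")

        def test_with_cache_second_download_still_fails(self):
            # demonstrates the bug: the cached, already-failed node is reused
            self.assertRaises(RuntimeError, lambda: self._run(use_cache=True))

    if __name__ == "__main__":
        unittest.main()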
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1679#comment:17>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage