[tahoe-dev] [tahoe-lafs] #1191: unit test failure: failed to download file with 2 shares on one server and one share on another
tahoe-lafs
trac at tahoe-lafs.org
Tue Sep 7 05:56:59 UTC 2010
#1191: unit test failure: failed to download file with 2 shares on one server and
one share on another
------------------------------------+---------------------------------------
Reporter: zooko | Owner:
Type: defect | Status: new
Priority: major | Milestone: 1.8.0
Component: code-peerselection | Version: 1.8β
Resolution: | Keywords: immutable download
Launchpad Bug: |
------------------------------------+---------------------------------------
Comment (by zooko):
Okay, here is a log created by instrumenting {{{ShareFinder}}} to log each
method call and {{{DownloadNode}}} to log {{{got_shares}}} and
{{{no_more_shares}}}. This is without [attachment:1191-fix.diff].
{{{
local#231 23:01:59.969: xxx
<allmydata.immutable.downloader.finder.ShareFinder instance at
0x1049f6050>._request_retired(<allmydata.immutable.downloader.finder.RequestToken
instance at 0x104ef9c68>)
local#232 23:01:59.969: xxx
<allmydata.immutable.downloader.finder.ShareFinder instance at
0x1049f6050>._got_response({9: <allmydata.test.no_network.LocalWrapper
instance at 0x1049dc3f8>},
{'http://allmydata.org/tahoe/protocols/storage/v1': {'maximum-immutable-
share-size': 128704925696, 'tolerates-immutable-read-overrun': True,
'delete-mutable-shares-with-zero-length-writev': True}, 'application-
version': 'allmydata-tahoe/1.8.0c3-r4715'},
<B9><A3>N<80>u<9C>_<F7><97>FSS<A7><BD>^B<F9>f$: ,
<allmydata.immutable.downloader.finder.RequestToken instance at
0x104ef9c68>, <allmydata.immutable.downloader.status.DYHBEvent instance at
0x104ef9cb0>, 1283835719.96, 210)
local#233 23:01:59.969: got shnums [9] from [xgru5adv]
local#234 23:01:59.969: xxx
<allmydata.immutable.downloader.finder.ShareFinder instance at
0x1049f6050>._create_share(9, <allmydata.test.no_network.LocalWrapper
instance at 0x1049dc3f8>,
{'http://allmydata.org/tahoe/protocols/storage/v1': {'maximum-immutable-
share-size': 128704925696, 'tolerates-immutable-read-overrun': True,
'delete-mutable-shares-with-zero-length-writev': True}, 'application-
version': 'allmydata-tahoe/1.8.0c3-r4715'},
<B9><A3>N<80>u<9C>_<F7><97>FSS<A7><BD>^B<F9>f$: , 0.0133891105652)
local#235 23:01:59.970: Share(sh9-on-xgru5) created
local#236 23:01:59.970: xxx
<allmydata.immutable.downloader.finder.ShareFinder instance at
0x1049f6050>._deliver_shares([Share(sh9-on-xgru5)])
local#237 23:01:59.970: delivering shares: Share(sh9-on-xgru5)
local#238 23:01:59.970: xxx
<allmydata.immutable.downloader.finder.ShareFinder instance at
0x1049f6050>.loop()
local#239 23:01:59.970: ShareFinder loop: running=True hungry=False,
pending=
local#240 23:01:59.971: xxx
<allmydata.immutable.downloader.finder.ShareFinder instance at
0x1049f6050>.loop()
local#241 23:01:59.971: ShareFinder loop: running=True hungry=False,
pending=
local#242 23:01:59.972: xxx
<allmydata.immutable.downloader.finder.ShareFinder instance at
0x1049f6050>.hungry()
local#243 23:01:59.972: ShareFinder[si=dglevpj4ueb7] hungry
local#244 23:01:59.972: xxx
<allmydata.immutable.downloader.finder.ShareFinder instance at
0x1049f6050>.start_finding_servers()
local#245 23:01:59.973: xxx
<allmydata.immutable.downloader.finder.ShareFinder instance at
0x1049f6050>.loop()
local#246 23:01:59.973: ShareFinder loop: running=True hungry=True,
pending=
local#247 23:01:59.973: ShareFinder.loop: no_more_shares, ever
local#248 23:01:59.973: xxx
ImmutableDownloadNode(dglevpj4ueb7).no_more_shares() ; _active_segment:
<allmydata.immutable.downloader.fetcher.SegmentFetcher instance at
0x1049f6638>
local#249 23:01:59.975: xxx
<allmydata.immutable.downloader.finder.ShareFinder instance at
0x1049f6050>.loop()
local#250 23:01:59.975: ShareFinder loop: running=True hungry=True,
pending=
local#251 23:01:59.975: ShareFinder.loop: no_more_shares, ever
local#252 23:01:59.975: xxx
ImmutableDownloadNode(dglevpj4ueb7).no_more_shares() ; _active_segment:
<allmydata.immutable.downloader.fetcher.SegmentFetcher instance at
0x1049f6638>
local#253 23:01:59.976: ran out of shares: complete=sh1,sh8 pending=
overdue= unused= need 3. Last failure: None
local#254 23:01:59.976: SegmentFetcher(dglevpj4ueb7).stop
local#255 23:01:59.977: xxx
ImmutableDownloadNode(dglevpj4ueb7).got_shares([Share(sh9-on-xgru5)])
}}}
(Sorry about the wide lines there.)
So at "local#231 23:01:59.969" the request is retired but the resulting
eventual {{{got_shares}}} won't happen until "local#255 23:01:59.977"
which is shortly after the {{{loop}}} at "local#247 23:01:59.973" which
said {{{no_more_shares, ever}}}, which set a flag named
{{{_no_more_shares}}} in the {{{SegmentFetcher}}} so that the next time
{{{SegmentFetcher._do_loop}}} runs then it gives up and says {{{ran out of
shares}}} at "local#253 23:01:59.976".
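Here is a toy model of that interleaving (plain Python, not the real
downloader code; the class and method names only echo the ones in the log,
and the eventual-send machinery is reduced to "fires later"):
{{{
# Toy model of the interleaving in the log above; NOT the real tahoe code.
class DownloadNode:
    def __init__(self):
        self._shares = ["sh1", "sh8"]   # already delivered, per local#253
    def got_shares(self, shares):
        self._shares.extend(shares)

class SegmentFetcher:
    def __init__(self, node, k=3):
        self.node = node
        self.k = k                      # need 3 shares
        self._no_more_shares = False
    def no_more_shares(self):
        self._no_more_shares = True
    def _do_loop(self):
        if len(self.node._shares) >= self.k:
            return "fetching segment from %s" % ",".join(self.node._shares)
        if self._no_more_shares:
            return "ran out of shares: complete=%s need %d" % (
                ",".join(self.node._shares), self.k)
        return "waiting for more shares"

node = DownloadNode()
fetcher = SegmentFetcher(node)

# local#231: the request is retired immediately; got_shares(["sh9"]) is now
# sitting on the eventual queue but has not fired yet.
pending_requests = set()

# local#247/#248: ShareFinder.loop sees no pending requests, decides
# "no_more_shares, ever", and (without 1191-fix.diff) sets the flag now:
fetcher.no_more_shares()

# local#253: the fetcher's loop runs before the eventual got_shares:
print(fetcher._do_loop())   # "ran out of shares: complete=sh1,sh8 need 3"

# local#255: only now does got_shares fire -- one tick too late.
node.got_shares(["sh9"])
print(fetcher._do_loop())   # "fetching segment from sh1,sh8,sh9"
}}}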
Now [attachment:1191-fix.diff] makes it so that when {{{loop}}} decides
{{{no_more_shares, ever}}}, it schedules an eventual task to set the
{{{_no_more_shares}}} flag in {{{SegmentFetcher}}} instead of setting it
immediately. Is this guaranteed to always prevent this bug? I guess it is,
because the {{{_request_retired}}} (local#231 23:01:59.969) is done
immediately and the {{{got_shares}}} (local#255 23:01:59.977) is put on the
eventual queue during that same tick, so when the setting of
{{{_no_more_shares}}} is later put on the eventual queue it will always
take effect after the {{{got_shares}}} does.
Okay.
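That ordering argument can be seen with a toy FIFO list standing in for the
eventual-send queue (a sketch, not the real scheduler):
{{{
# Toy FIFO "eventual queue"; in the real code this is foolscap's eventually().
eventual_queue = []

def eventually(f, *args):
    eventual_queue.append((f, args))

def flush():
    while eventual_queue:
        f, args = eventual_queue.pop(0)
        f(*args)

log = []

# The tick that retires the request also puts got_shares on the queue:
eventually(log.append, "got_shares(sh9)")

# Any later decision of "no_more_shares, ever" now enqueues the flag-set
# (this is what 1191-fix.diff changes) instead of setting it directly:
eventually(log.append, "set _no_more_shares")

flush()
print(log)   # ['got_shares(sh9)', 'set _no_more_shares'] -- always this order
}}}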
But this still feels fragile to me. For example, even after we apply
[attachment:1191-fix.diff], if someone were to change
{{{ShareFinder._got_response}}} so that it invoked {{{_deliver_shares}}}
eventually instead of immediately, or change {{{_deliver_shares}}} so that
it invoked {{{DownloadNode.got_shares}}} eventually instead of immediately,
or change {{{DownloadNode.got_shares}}} so that it updated its
{{{_shares}}} data structure eventually instead of immediately, then that
would reintroduce this bug.
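For instance, re-using the toy queue from above: if {{{got_shares}}} were
reached through one more eventual hop, the FIFO ordering that the fix
relies on would be lost again (again just a sketch):
{{{
eventual_queue = []

def eventually(f, *args):
    eventual_queue.append((f, args))

def flush():
    while eventual_queue:
        f, args = eventual_queue.pop(0)
        f(*args)

log = []

# Suppose _deliver_shares invoked got_shares eventually instead of
# immediately: got_shares is now reached via a second hop.
eventually(eventually, log.append, "got_shares(sh9)")

# The flag-set is enqueued as in 1191-fix.diff:
eventually(log.append, "set _no_more_shares")

flush()
print(log)   # ['set _no_more_shares', 'got_shares(sh9)'] -- the bug is back
}}}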
It would feel nicer to me if we could update both the
{{{ShareFinder.pending_requests}}} data structure and the
{{{DownloadNode._shares}}} data structure in the same immediate call, so
that no tick ever begins with those two data structures in a mutually
inconsistent state (the request removed from the former but the share not
yet added to the latter).
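Something like this shape (illustrative only; these are not the actual
{{{ShareFinder}}}/{{{DownloadNode}}} signatures):
{{{
class DownloadNode:
    def __init__(self):
        self._shares = []
    def got_shares(self, shares):
        self._shares.extend(shares)

class ShareFinder:
    def __init__(self, node):
        self.node = node
        self.pending_requests = set()

    def _got_response(self, req, shares):
        # Retire the request and hand over the shares within one immediate
        # call, so no tick can begin with the request already gone from
        # pending_requests but the shares not yet in node._shares.
        self.pending_requests.discard(req)
        self.node.got_shares(shares)
}}}
Any follow-on work (kicking the fetcher's loop, delivering
{{{no_more_shares}}}) could still be deferred eventually; the point is only
that those two data structures never straddle a tick boundary in an
inconsistent state.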
Okay, now I'll try to make a narrow test case of this issue.
--
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1191#comment:12>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage