[tahoe-dev] [tahoe-lafs] #1191: unit test failure: failed to download file with 2 shares on one server and one share on another
tahoe-lafs
trac at tahoe-lafs.org
Tue Sep 7 05:56:59 UTC 2010
#1191: unit test failure: failed to download file with 2 shares on one server and
one share on another
------------------------------------+---------------------------------------
Reporter: zooko | Owner:
Type: defect | Status: new
Priority: major | Milestone: 1.8.0
Component: code-peerselection | Version: 1.8β
Resolution: | Keywords: immutable download
Launchpad Bug: |
------------------------------------+---------------------------------------
Comment (by zooko):
Okay, here is a log created by instrumenting {{{ShareFinder}}} to log each
method call and {{{DownloadNode}}} to log {{{got_shares}}} and
{{{no_more_shares}}}. This is without [attachment:1191-fix.diff].
{{{
local#231 23:01:59.969: xxx
<allmydata.immutable.downloader.finder.ShareFinder instance at
0x1049f6050>._request_retired(<allmydata.immutable.downloader.finder.RequestToken
instance at 0x104ef9c68>)
local#232 23:01:59.969: xxx
<allmydata.immutable.downloader.finder.ShareFinder instance at
0x1049f6050>._got_response({9: <allmydata.test.no_network.LocalWrapper
instance at 0x1049dc3f8>},
{'http://allmydata.org/tahoe/protocols/storage/v1': {'maximum-immutable-
share-size': 128704925696, 'tolerates-immutable-read-overrun': True,
'delete-mutable-shares-with-zero-length-writev': True}, 'application-
version': 'allmydata-tahoe/1.8.0c3-r4715'},
<B9><A3>N<80>u<9C>_<F7><97>FSS<A7><BD>^B<F9>f$: ,
<allmydata.immutable.downloader.finder.RequestToken instance at
0x104ef9c68>, <allmydata.immutable.downloader.status.DYHBEvent instance at
0x104ef9cb0>, 1283835719.96, 210)
local#233 23:01:59.969: got shnums [9] from [xgru5adv]
local#234 23:01:59.969: xxx
<allmydata.immutable.downloader.finder.ShareFinder instance at
0x1049f6050>._create_share(9, <allmydata.test.no_network.LocalWrapper
instance at 0x1049dc3f8>,
{'http://allmydata.org/tahoe/protocols/storage/v1': {'maximum-immutable-
share-size': 128704925696, 'tolerates-immutable-read-overrun': True,
'delete-mutable-shares-with-zero-length-writev': True}, 'application-
version': 'allmydata-tahoe/1.8.0c3-r4715'},
<B9><A3>N<80>u<9C>_<F7><97>FSS<A7><BD>^B<F9>f$: , 0.0133891105652)
local#235 23:01:59.970: Share(sh9-on-xgru5) created
local#236 23:01:59.970: xxx
<allmydata.immutable.downloader.finder.ShareFinder instance at
0x1049f6050>._deliver_shares([Share(sh9-on-xgru5)])
local#237 23:01:59.970: delivering shares: Share(sh9-on-xgru5)
local#238 23:01:59.970: xxx
<allmydata.immutable.downloader.finder.ShareFinder instance at
0x1049f6050>.loop()
local#239 23:01:59.970: ShareFinder loop: running=True hungry=False,
pending=
local#240 23:01:59.971: xxx
<allmydata.immutable.downloader.finder.ShareFinder instance at
0x1049f6050>.loop()
local#241 23:01:59.971: ShareFinder loop: running=True hungry=False,
pending=
local#242 23:01:59.972: xxx
<allmydata.immutable.downloader.finder.ShareFinder instance at
0x1049f6050>.hungry()
local#243 23:01:59.972: ShareFinder[si=dglevpj4ueb7] hungry
local#244 23:01:59.972: xxx
<allmydata.immutable.downloader.finder.ShareFinder instance at
0x1049f6050>.start_finding_servers()
local#245 23:01:59.973: xxx
<allmydata.immutable.downloader.finder.ShareFinder instance at
0x1049f6050>.loop()
local#246 23:01:59.973: ShareFinder loop: running=True hungry=True,
pending=
local#247 23:01:59.973: ShareFinder.loop: no_more_shares, ever
local#248 23:01:59.973: xxx
ImmutableDownloadNode(dglevpj4ueb7).no_more_shares() ; _active_segment:
<allmydata.immutable.downloader.fetcher.SegmentFetcher instance at
0x1049f6638>
local#249 23:01:59.975: xxx
<allmydata.immutable.downloader.finder.ShareFinder instance at
0x1049f6050>.loop()
local#250 23:01:59.975: ShareFinder loop: running=True hungry=True,
pending=
local#251 23:01:59.975: ShareFinder.loop: no_more_shares, ever
local#252 23:01:59.975: xxx
ImmutableDownloadNode(dglevpj4ueb7).no_more_shares() ; _active_segment:
<allmydata.immutable.downloader.fetcher.SegmentFetcher instance at
0x1049f6638>
local#253 23:01:59.976: ran out of shares: complete=sh1,sh8 pending=
overdue= unused= need 3. Last failure: None
local#254 23:01:59.976: SegmentFetcher(dglevpj4ueb7).stop
local#255 23:01:59.977: xxx
ImmutableDownloadNode(dglevpj4ueb7).got_shares([Share(sh9-on-xgru5)])
}}}
(Sorry about the wide lines there.)
So at "local#231 23:01:59.969" the request is retired but the resulting
eventual {{{got_shares}}} won't happen until "local#255 23:01:59.977"
which is shortly after the {{{loop}}} at "local#247 23:01:59.973" which
said {{{no_more_shares, ever}}}, which set a flag named
{{{_no_more_shares}}} in the {{{SegmentFetcher}}} so that the next time
{{{SegmentFetcher._do_loop}}} runs then it gives up and says {{{ran out of
shares}}} at "local#253 23:01:59.976".
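Here is a toy model of that interleaving (plain Python, not the real
downloader code; the class and method names only echo the ones in the log,
and the eventual-send machinery is reduced to "fires later"):
{{{
# Toy model of the interleaving in the log above; NOT the real tahoe code.
class DownloadNode:
    def __init__(self):
        self._shares = ["sh1", "sh8"]   # already delivered, per local#253
    def got_shares(self, shares):
        self._shares.extend(shares)

class SegmentFetcher:
    def __init__(self, node, k=3):
        self.node = node
        self.k = k                      # need 3 shares
        self._no_more_shares = False
    def no_more_shares(self):
        self._no_more_shares = True
    def _do_loop(self):
        if len(self.node._shares) >= self.k:
            return "fetching segment from %s" % ",".join(self.node._shares)
        if self._no_more_shares:
            return "ran out of shares: complete=%s need %d" % (
                ",".join(self.node._shares), self.k)
        return "waiting for more shares"

node = DownloadNode()
fetcher = SegmentFetcher(node)

# local#231: the request is retired immediately; got_shares(["sh9"]) is now
# sitting on the eventual queue but has not fired yet.
pending_requests = set()

# local#247/#248: ShareFinder.loop sees no pending requests, decides
# "no_more_shares, ever", and (without 1191-fix.diff) sets the flag now:
fetcher.no_more_shares()

# local#253: the fetcher's loop runs before the eventual got_shares:
print(fetcher._do_loop())   # "ran out of shares: complete=sh1,sh8 need 3"

# local#255: only now does got_shares fire -- one tick too late.
node.got_shares(["sh9"])
print(fetcher._do_loop())   # "fetching segment from sh1,sh8,sh9"
}}}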
Now [attachment:1191-fix.diff] makes it so that when {{{loop}}} decides
{{{no_more_shares, ever}}}, it schedules an eventual task to set the
{{{_no_more_shares}}} flag in {{{SegmentFetcher}}} instead of setting it
immediately. Is this guaranteed to always prevent this bug? I guess it is,
because the {{{_request_retired}}} (local#231 23:01:59.969) is done
immediately and the {{{got_shares}}} (local#255 23:01:59.977) is put on the
eventual queue during that same tick, so when the setting of
{{{_no_more_shares}}} is later put on the eventual queue it will always
take effect after the {{{got_shares}}} does.
Okay.
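That ordering argument can be seen with a toy FIFO list standing in for the
eventual-send queue (a sketch, not the real scheduler):
{{{
# Toy FIFO "eventual queue"; in the real code this is foolscap's eventually().
eventual_queue = []

def eventually(f, *args):
    eventual_queue.append((f, args))

def flush():
    while eventual_queue:
        f, args = eventual_queue.pop(0)
        f(*args)

log = []

# The tick that retires the request also puts got_shares on the queue:
eventually(log.append, "got_shares(sh9)")

# Any later decision of "no_more_shares, ever" now enqueues the flag-set
# (this is what 1191-fix.diff changes) instead of setting it directly:
eventually(log.append, "set _no_more_shares")

flush()
print(log)   # ['got_shares(sh9)', 'set _no_more_shares'] -- always this order
}}}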
But this still feels fragile to me. For example, even after we apply
[attachment:1191-fix.diff], if someone were to change
{{{ShareFinder._got_response}}} so that it invoked {{{_deliver_shares}}}
eventually instead of immediately, or change {{{_deliver_shares}}} so that
it invoked {{{DownloadNode.got_shares}}} eventually instead of immediately,
or change {{{DownloadNode.got_shares}}} so that it updated its
{{{_shares}}} data structure eventually instead of immediately, then that
would reintroduce this bug.
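For instance, re-using the toy queue from above: if {{{got_shares}}} were
reached through one more eventual hop, the FIFO ordering that the fix
relies on would be lost again (again just a sketch):
{{{
eventual_queue = []

def eventually(f, *args):
    eventual_queue.append((f, args))

def flush():
    while eventual_queue:
        f, args = eventual_queue.pop(0)
        f(*args)

log = []

# Suppose _deliver_shares invoked got_shares eventually instead of
# immediately: got_shares is now reached via a second hop.
eventually(eventually, log.append, "got_shares(sh9)")

# The flag-set is enqueued as in 1191-fix.diff:
eventually(log.append, "set _no_more_shares")

flush()
print(log)   # ['set _no_more_shares', 'got_shares(sh9)'] -- the bug is back
}}}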
It would feel nicer to me if we could update both the
{{{ShareFinder.pending_requests}}} data structure and the
{{{DownloadNode._shares}}} data structure in the same immediate call, so
that no tick ever begins with those two data structures in a mutually
inconsistent state (the request removed from the former but the share not
yet added to the latter).
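Something like this shape (illustrative only; these are not the actual
{{{ShareFinder}}}/{{{DownloadNode}}} signatures):
{{{
class DownloadNode:
    def __init__(self):
        self._shares = []
    def got_shares(self, shares):
        self._shares.extend(shares)

class ShareFinder:
    def __init__(self, node):
        self.node = node
        self.pending_requests = set()

    def _got_response(self, req, shares):
        # Retire the request and hand over the shares within one immediate
        # call, so no tick can begin with the request already gone from
        # pending_requests but the shares not yet in node._shares.
        self.pending_requests.discard(req)
        self.node.got_shares(shares)
}}}
Any follow-on work (kicking the fetcher's loop, delivering
{{{no_more_shares}}}) could still be deferred eventually; the point is only
that those two data structures never straddle a tick boundary in an
inconsistent state.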
Okay, now I'll try to make a narrow test case of this issue.
--
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1191#comment:12>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage