#1154 closed defect (fixed)

mplayer triggers two bugs in Tahoe's new downloader

Reported by: francois
Owned by: warner
Priority: critical
Milestone: 1.8.0
Component: code-network
Version: 1.8β
Keywords: download regression random-access error
Cc: francois@…
Launchpad Bug:

Description (last modified by francois)

I tried to play a movie hosted on a Tahoe-LAFS grid with 'mplayer http://127.0.0.1:3456/file/URI-blah/@@named=/test.avi'. The error seems to be triggered by mplayer seeking to the end of the file to download the AVI index before seeking back to the beginning for the actual rendering.

Preventing mplayer from reading the index with the '--noidx' parameter avoids this error.

2010-08-03 20:06:40+0200 [-] Unhandled Error
	Traceback (most recent call last):
	  File "/usr/lib/python2.6/dist-packages/twisted/application/app.py", line 348, in runReactorWithLogging
	    reactor.run()
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 1170, in run
	    self.mainLoop()
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 1179, in mainLoop
	    self.runUntilCurrent()
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 778, in runUntilCurrent
	    call.func(*call.args, **call.kw)
	--- <exception caught here> ---
	  File "/usr/lib/pymodules/python2.6/foolscap/eventual.py", line 26, in _turn
	    cb(*args, **kwargs)
	  File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/immutable/downloader/share.py", line 197, in loop
	    self._do_loop()
	  File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/immutable/downloader/share.py", line 253, in _do_loop
	    wanted, needed = self._desire()
	  File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/immutable/downloader/share.py", line 611, in _desire
	    % (want_it.dump(), need_it.dump(), gotta_gotta_have_it.dump()))
	  File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/util/spans.py", line 143, in dump
	    return "len=%d: %s" % (len(self),
	exceptions.TypeError: __len__() should return an int
2010-08-03 20:06:51+0200 [HTTPChannel,198,127.0.0.1] Unhandled Error
	Traceback (most recent call last):
	  File "/usr/lib/python2.6/dist-packages/twisted/application/app.py", line 348, in runReactorWithLogging
	    reactor.run()
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 1170, in run
	    self.mainLoop()
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 1182, in mainLoop
	    self.doIteration(t)
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/selectreactor.py", line 140, in doSelect
	    _logrun(selectable, _drdw, selectable, method, dict)
	--- <exception caught here> ---
	  File "/usr/lib/python2.6/dist-packages/twisted/python/log.py", line 84, in callWithLogger
	    return callWithContext({"system": lp}, func, *args, **kw)
	  File "/usr/lib/python2.6/dist-packages/twisted/python/log.py", line 69, in callWithContext
	    return context.call({ILogContext: newCtx}, func, *args, **kw)
	  File "/usr/lib/python2.6/dist-packages/twisted/python/context.py", line 59, in callWithContext
	    return self.currentContext().callWithContext(ctx, func, *args, **kw)
	  File "/usr/lib/python2.6/dist-packages/twisted/python/context.py", line 37, in callWithContext
	    return func(*args,**kw)
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/selectreactor.py", line 156, in _doReadOrWrite
	    self._disconnectSelectable(selectable, why, method=="doRead")
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/posixbase.py", line 191, in _disconnectSelectable
	    selectable.readConnectionLost(f)
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/tcp.py", line 508, in readConnectionLost
	    self.connectionLost(reason)
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/tcp.py", line 513, in connectionLost
	    abstract.FileDescriptor.connectionLost(self, reason)
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/abstract.py", line 64, in connectionLost
	    self.producer.stopProducing()
	  File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/immutable/downloader/segmentation.py", line 146, in stopProducing
	    self._cancel_segment_request.cancel()
	  File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/immutable/downloader/node.py", line 27, in cancel
	    self._f(self)
	  File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/immutable/downloader/node.py", line 468, in _cancel_request
	    if self._active_segment.segnum not in segnums:
	exceptions.AttributeError: 'NoneType' object has no attribute 'segnum'

Attachments (2)

mplayer-with-idx.log (557.5 KB) - added by francois at 2010-08-04T20:31:00Z.
mplayer-console.log (1.2 KB) - added by francois at 2010-08-04T20:31:09Z.


Change History (14)

comment:1 Changed at 2010-08-03T18:35:08Z by francois

  • Keywords download added

comment:2 Changed at 2010-08-03T18:36:43Z by francois

This ticket is related to #798.

comment:3 Changed at 2010-08-03T21:38:59Z by warner

  • Milestone changed from undecided to 1.8.0
  • Status changed from new to assigned

ooh, good bugs. I'll look into them.

comment:4 Changed at 2010-08-04T01:36:31Z by davidsarah

  • Component changed from code to code-network
  • Keywords regression random-access error added
  • Priority changed from major to critical
  • Version changed from n/a to 1.8β

comment:5 Changed at 2010-08-04T01:37:25Z by davidsarah

  • Description modified (diff)

comment:6 Changed at 2010-08-04T07:53:55Z by francois

  • Description modified (diff)

comment:7 Changed at 2010-08-04T19:15:09Z by warner

I figured out the first one: it turns out that webapi GETs that use a Range header were completely broken: the root cause was that python's __len__ method is, apparently, not allowed to return a long (who knew?). I just pushed the patch (8844655705e4fb76).
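The failure is easy to reproduce outside Tahoe, since Python itself rejects a non-int return from `__len__`. A minimal sketch (the class names here are illustrative, not the actual Spans class; on Python 2 the offending value was a long, while Python 3 rejects any non-int the same way):

```python
class BadSpans(object):
    """Illustrative stand-in for the broken Spans: __len__ returns a non-int."""
    def __len__(self):
        # Python 2 raised "TypeError: __len__() should return an int"
        # when this was a long; Python 3 rejects any non-int similarly.
        return 2.0

class FixedSpans(object):
    """Illustrative fix: always hand len() a plain int."""
    def __init__(self, size):
        self._size = size
    def __len__(self):
        return int(self._size)

try:
    len(BadSpans())
except TypeError:
    print("TypeError from non-int __len__")

print(len(FixedSpans(5)))  # works: 5
```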

I'm still trying to figure out the stopProducing exception: it feels like the HTTP client dropped the connection at an unexpected time (when no actual segments were being fetched). There's a simple fix, but I want to build a reproducible test case first.

comment:8 Changed at 2010-08-04T20:30:39Z by francois

Hi Brian, Thanks for looking into this!

I've attached the full debug log (mplayer-with-idx.log) as well as the mplayer console output (mplayer-console.log) to this ticket. I hope this helps.

BTW, during this test the file download doesn't stop when mplayer exits; the file still gets completely downloaded.

And, when mplayer is run a second time, after Tahoe has finished downloading the whole file, it plays the movie just fine and even allows seeking through it.

Changed at 2010-08-04T20:31:00Z by francois

Changed at 2010-08-04T20:31:09Z by francois

comment:9 Changed at 2010-08-05T15:33:12Z by warner

Hm, it looks like mplayer-with-idx.log is actually using the old downloader: the log messages don't match the new code.

I've found a few places where that second exception could possibly be raised, but I'm still trying to find a way to reproduce it for a test.

comment:10 Changed at 2010-08-05T17:40:53Z by warner

Oh, nevermind, I think I figured it out. There are actually three bugs overlapping here:

  1. the Spans/DataSpans classes used __len__ methods that returned longs instead of ints, causing an exception during download. (my 8844655705e4fb76 fix was incorrect: it turns out that __nonzero__ is not allowed to return a long either).
  2. there is a lost-progress bug in DownloadNode, where a failure in one segment-fetch will cause all other pending segment-fetches to hang forever
  3. a stopProducing that occurs during this hang-forever period causes an exception, because there is no active segment-fetch in place

The bug1 fix is easy: replace self.__len__ with self.len and make __nonzero__ always return a bool. The bug3 fix is also easy: DownloadNode._cancel_request should tolerate self._active_segment being None.
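Both easy fixes can be sketched in a few lines (these are hypothetical stand-ins, not the actual Tahoe classes):

```python
class SpansSketch(object):
    """Stand-in for Spans (bug1 fix): expose the size as a plain-int
    attribute and make truthiness return a real bool."""
    def __init__(self, length):
        self.len = int(length)    # attribute instead of __len__
    def __nonzero__(self):        # Python 2 name
        return bool(self.len)     # must be a bool, never a long
    __bool__ = __nonzero__        # Python 3 name

class NodeSketch(object):
    """Stand-in for DownloadNode (bug3 fix): tolerate cancellation
    arriving while no segment-fetch is active."""
    def __init__(self):
        self._active_segment = None
    def _cancel_request(self, segnums):
        if self._active_segment is None:
            return                # stopProducing during the idle gap
        if self._active_segment.segnum in segnums:
            pass                  # cancellation bookkeeping goes here
```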

The bug2 fix is not trivial but not hard. The start-next-fetch code in DownloadNode should be factored out, and DownloadNode.fetch_failed code should invoke it after sending errbacks to the requests which failed. This will add a nice property: if you get unrecoverable bit errors in one segment, you might still be able to get valid data from other segments (as opposed to giving up on the whole file because of a single error). I think there are some other changes that must be made to really get this property, though.. when we get to the point where we sort shares by "goodness", we'll probably clean this up. The basic idea will be that shares with errors go to the bottom of the list but are not removed from it entirely: if we really can't find the data we need somewhere else, we'll give the known-corrupted share a try, in the hopes that there are some uncorrupted parts of the share.
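The bug2 fix can be modeled with a tiny serialized-fetch queue (a sketch of the shape described above, not the real DownloadNode API):

```python
class FetchSketch(object):
    """Hypothetical model of the lost-progress fix: a failed
    segment-fetch errbacks only its own request, then the factored-out
    starter kicks off the next pending fetch instead of hanging."""
    def __init__(self):
        self._pending = []            # (segnum, callback, errback)
        self._active = None           # one fetch at a time

    def get_segment(self, segnum, cb, eb):
        self._pending.append((segnum, cb, eb))
        if self._active is None:
            self._start_next_fetch()

    def _start_next_fetch(self):
        # factored-out starter, callable from both success and failure
        self._active = self._pending.pop(0) if self._pending else None

    def fetch_failed(self, reason):
        segnum, cb, eb = self._active
        eb(reason)                    # errback the failed request only
        self._active = None
        self._start_next_fetch()      # the fix: keep making progress

    def fetch_succeeded(self, data):
        segnum, cb, eb = self._active
        cb(data)
        self._active = None
        self._start_next_fetch()
```

With this shape, a corrupt index segment no longer strands the fetch for segment 0: the second request is started as soon as the first one errbacks.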

I've got a series of test cases to exercise these three bugs.. I just have to build them in the right order to make sure that I'm not fixing the wrong one first (and thus hiding one of the others from my test).

comment:11 Changed at 2010-08-05T18:15:21Z by warner

Oh, I should mention what I'm guessing mplayer was doing. I think it issued two simultaneous parallel webapi GET requests: one for the index near the end of the file, and another for the first block of the file. Both would have been delivered to the same FileNode instance, creating two Segmentation requests (one for each read() call) and thus two calls to DownloadNode.get_segment(). The second one would wait for the first to finish, since DownloadNode only works on one segment at a time to keep the memory footprint low. The first segment-fetch failed because of the __len__ bug, leaving the second fetch hanging (because of the lost-progress bug). When mplayer got an error on the index GET, I believe it gave up on the other GET, dropping the HTTP connection, causing connectionLost and stopProducing, causing a DownloadNode._cancel_request when no segment-fetch was active, triggering the third bug.

comment:12 Changed at 2010-08-05T19:23:36Z by warner

  • Resolution set to fixed
  • Status changed from assigned to closed

Should be fixed now, in 43c5032105288a58 and f6f9a97627d210a6.
