#474 new defect

uncaught exception in mutable-retrieve: UCW between mapupdate and retrieve

Reported by: warner Owned by:
Priority: major Milestone: soon
Component: code-mutable Version: 1.1.0
Keywords: mutable upload download hang reliability test-needed ucwe Cc:
Launchpad Bug:


If a mutable file is modified (by some external uncoordinated writer) after we've done a servermap update but before we do a retrieve, then the exception raised by mutable.Retrieve._maybe_send_more_queries (source:src/allmydata/mutable/retrieve.py#L415) is not caught by the Deferred chain. This means that the caller (who is probably in the middle of a filenode.modify sequence) will wait forever for the answer.

The exception that shows up in the twisted log looks like this:

2008-06-22 00:19:31.726Z [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.5/site-packages/foolscap/call.py", line 667, in _done
          File "/usr/lib/python2.5/site-packages/foolscap/call.py", line 53, in complete
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 239, in callback
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 304, in _startRunCallbacks
        --- <exception caught here> ---
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 317, in _runCallbacks
            self.result = callback(self.result, *args, **kw)
          File "/usr/lib/python2.5/site-packages/allmydata/mutable/retrieve.py", line 325, in _check_for_done
            return self._maybe_send_more_queries(k)
          File "/usr/lib/python2.5/site-packages/allmydata/mutable/retrieve.py", line 415, in _maybe_send_more_queries
            raise err
        allmydata.encode.NotEnoughSharesError: ran out of peers: have 0 shares (k=3), 2 queries in flight, need 1 more, found 0 bad shares, last failure: [Failure instance: Traceback: <class 'allmydata.mutable.common.UncoordinatedWriteError'>: someone wrote to the data since we read the servermap: prefix changed
        --- <exception caught here> ---

The Deferred chaining needs to be investigated to make sure that this exception is properly returned to the caller via the errback on their Deferred.
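The property to verify is the standard Twisted one: an exception raised inside a chained callback must be captured as a Failure and handed to the next errback, not left to become an Unhandled Error in the log. Here is a minimal stand-in (not Twisted itself; `MiniDeferred`, `_raise_stored_error`, and the chain names are hypothetical) sketching that propagation rule:

```python
# Minimal sketch of Deferred-style callback chaining (NOT Twisted itself):
# an exception raised in a callback must flow to the next errback.
class MiniDeferred:
    def __init__(self):
        self._chain = []

    def add_callbacks(self, callback, errback):
        # Each stage has a success handler and a failure handler.
        self._chain.append((callback, errback))

    def fire(self, result):
        failed = False
        for callback, errback in self._chain:
            handler = errback if failed else callback
            try:
                result = handler(result)
                failed = False
            except Exception as e:
                # A raise inside a handler becomes a failure result for
                # the next stage, instead of escaping the chain.
                result = e
                failed = True
        return result, failed

def _raise_stored_error(share_data):
    # Stands in for _maybe_send_more_queries re-raising its stored error.
    raise ValueError("boom")

d = MiniDeferred()
d.add_callbacks(_raise_stored_error, lambda f: f)
d.add_callbacks(lambda r: "never reached", lambda f: "caught: %s" % f)
result, failed = d.fire("share data")
assert (result, failed) == ("caught: boom", False)
```

If the real chain in retrieve.py behaves like this (as comment:3 below suggests it does), the raise in _maybe_send_more_queries should reach the caller's errback rather than the twisted log.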

In addition, the code in filenode.modify needs to be examined to make sure that this kind of uncoordinated write error is caught and retried. My concern is that Retrieve is returning a NotEnoughSharesError that wraps an UncoordinatedWriteError, rather than the UCWE directly, and that the f.trap in modify() might not know to look for that.
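The trap concern can be sketched in isolation. Failure.trap matches on the type of the outer exception only, so if Retrieve hands back a NotEnoughSharesError that merely references the UCWE, a plain trap(UncoordinatedWriteError) would miss it. This is a hypothetical illustration (the `underlying` attribute and `trap` helper are assumptions for the sketch, not the real allmydata classes):

```python
# Hedged sketch of the trap concern: matching is on the OUTER type only.
class UncoordinatedWriteError(Exception):
    pass

class NotEnoughSharesError(Exception):
    def __init__(self, msg, underlying=None):
        Exception.__init__(self, msg)
        # hypothetical: the wrapped failure that caused us to run out
        self.underlying = underlying

def trap(exc, *types):
    # Mimics Twisted's Failure.trap: checks the exception's own type.
    return isinstance(exc, types)

ucwe = UncoordinatedWriteError("prefix changed")
wrapped = NotEnoughSharesError("ran out of peers", underlying=ucwe)

# A direct UCWE is trapped fine...
assert trap(ucwe, UncoordinatedWriteError)
# ...but the wrapped form (what Retrieve appears to raise) is not:
assert not trap(wrapped, UncoordinatedWriteError)
```

If this is what happens, modify() would treat the wrapped UCWE as an unrelated fatal error instead of retrying.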

Change History (10)

comment:1 Changed at 2008-06-23T20:30:31Z by warner

in addition, I think we need test coverage for an UCWE that occurs between mapupdate and retrieve. I believe that we already have coverage for one that occurs between retrieve and publish, but not before the retrieve. Note that this will require using a file that is large enough to not fit in the filenode's read-cache (20kB should be enough).

comment:2 Changed at 2008-06-24T16:19:41Z by warner

Unit tests to write:

  • do mapupdate(MODE_READ), remove all storage servers, do Retrieve -> should fail
  • do mapupdate, disable all storage servers (return error), do Retrieve -> failure
  • do mapupdate, rollback all storage servers, do Retrieve(ver) -> failure
  • do mapupdate, corrupt all shares, do Retrieve -> failure
  • do modify, between the mapupdate and the retrieve rollback all servers -> retry
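The last case is the retry behaviour: modify() should catch an uncoordinated-write failure and redo the mapupdate/retrieve/publish cycle a bounded number of times. A minimal sketch of that loop, with hypothetical names (`modify_with_retry`, `max_retries` are assumptions, not the real filenode API):

```python
# Hypothetical sketch of the retry loop the last test case expects.
class UncoordinatedWriteError(Exception):
    pass

def modify_with_retry(attempt_modify, max_retries=3):
    for _ in range(max_retries):
        try:
            return attempt_modify()
        except UncoordinatedWriteError:
            # Servermap is stale: redo mapupdate + retrieve + publish.
            continue
    raise UncoordinatedWriteError("gave up after %d attempts" % max_retries)

# A writer that is clobbered twice, then succeeds on the third try:
calls = {"n": 0}
def flaky_modify():
    calls["n"] += 1
    if calls["n"] < 3:
        raise UncoordinatedWriteError("prefix changed")
    return "new contents written"

assert modify_with_retry(flaky_modify) == "new contents written"
assert calls["n"] == 3
```

The test would arrange for the rollback to happen between the mapupdate and the retrieve, then assert that the modify still completes (or fails with UCWE only after exhausting its retries).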

comment:3 Changed at 2008-06-24T18:09:16Z by warner

hrm, a quick check of the code doesn't reveal any paths by which _check_for_done is called outside of a Deferred chain. I added a few unit tests dba0ad3d2841b7a0 to corrupt all shares between the mapupdate and retrieve, and they pass. I also increased the size of the test file to 20kB to allow the post-mapupdate corruption to actually get noticed.

comment:4 Changed at 2008-06-25T00:20:36Z by warner

  • Milestone changed from 1.1.1 to eventually
  • Priority changed from critical to major

At least one of these errors occurred while the webapi server in question was throwing out-of-memory errors, so I think I'm going to wait until it happens again before investigating it deeply. For reference, the times (UTC) at which these errors occurred were:

2008-06-21 22:53:42.575Z    (06-21 15:53 PDT)
2008-06-22 00:00:57.964Z    (06-21 17:00 PDT)
2008-06-22 00:05:58.949Z    (06-21 17:05 PDT)
2008-06-22 00:19:31.726Z    (06-21 17:19 PDT)

with a MemoryError occurring at 2008-06-22 00:25:46.020Z . We have no events logged from this node (2b6w) between 17:02 and 17:36. Because of the foolscap log-buffer overflow problem, we have no foolscap log events past about 15:25.

comment:5 Changed at 2009-12-29T20:07:36Z by warner

  • Component changed from code-encoding to code-mutable

comment:6 Changed at 2010-02-27T08:45:23Z by davidsarah

  • Keywords upload download hang reliability test added
  • Milestone changed from eventually to 1.7.0

comment:7 Changed at 2010-03-24T23:05:52Z by davidsarah

  • Keywords ucwe added

comment:8 Changed at 2010-05-26T14:51:46Z by zooko

  • Milestone changed from 1.7.0 to 1.8.0

It's really bothering me that mutable file upload and download behavior is so finicky, buggy, inefficient, hard to understand, different from immutable file upload and download behavior, etc. So I'm putting a bunch of tickets into the "1.8" Milestone. I am not, however, at this time, volunteering to work on these tickets, so it might be a mistake to put them into the 1.8 Milestone, but I really hope that someone else will volunteer or that I will decide to do it myself. :-)

comment:9 Changed at 2010-08-10T04:16:32Z by davidsarah

  • Keywords test-needed added; test removed
  • Milestone changed from 1.8.0 to 1.9.0

comment:10 Changed at 2011-07-16T21:02:52Z by davidsarah

  • Milestone changed from 1.9.0 to soon