Opened at 2013-07-22T22:26:34Z
Last modified at 2015-03-31T16:48:00Z
#2034 closed defect
Update for mutable files is sensitive to the number of servers — at Version 10
Reported by: | markberger | Owned by: | markberger |
---|---|---|---|
Priority: | normal | Milestone: | 1.10.1 |
Component: | code-mutable | Version: | 1.10.0 |
Keywords: | mutable publish review-needed | Cc: | |
Launchpad Bug: |
Description (last modified by zooko)
When the number of servers used in a grid is greater than or equal to N + k, there is a race condition when mutable files are updated. While waiting for responses, ServermapUpdater will hit a threshold after EPSILON servers do not have any shares (currently EPSILON = k). This causes the process to finish and reject any remaining responses.
However, this can occur before the ServermapUpdater has added the server to the servermap, causing the client to reject the server's response. Since the server is not added to the servermap, a writer for that server is not created. This causes a key error when the updater attempts to retrieve the corresponding writer for each share.
This race condition can be replicated by setting the number of servers for allmydata.test.test_mutable.Update to 13.
=============================================================================== [FAIL] Traceback (most recent call last): File "/Users/markberger/Code/tahoe-lafs/src/allmydata/test/test_mutable.py", line 3663, in <lambda> self.failUnlessEqual(results, new_data)) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/trial/_synctest.py", line 356, in assertEqual % (msg, pformat(first), pformat(second))) twisted.trial.unittest.FailTest: not equal: a = 'test datatest datatest datatest datatest datatest datatest datatest datatest datatest data' b = 'test datatest datatest datatest datatest datatest datatest datatest datatest datatest dataappended' allmydata.test.test_mutable.Update.test_update_sdmf =============================================================================== [ERROR] Traceback (most recent call last): Failure: allmydata.mutable.common.NotEnoughServersError: ("Publish ran out of good servers, last failure was: [Failure instance: Traceback: <type 'exceptions.KeyError'>: 0\n/Users/markberger/Code/tahoe-lafs/src/allmydata/mutable/publish.py:642:_push\n/Users/markberger/Code/tahoe-lafs/src/allmydata/mutable/publish.py:662:push_segment\n/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py:304:addCallback\n/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py:293:addCallbacks\n--- <exception caught here> ---\n/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py:575:_runCallbacks\n/Users/markberger/Code/tahoe-lafs/src/allmydata/mutable/publish.py:767:_push_segment\n]", None) allmydata.test.test_mutable.Update.test_append allmydata.test.test_mutable.Update.test_append_power_of_two allmydata.test.test_mutable.Update.test_multiple_segment_replace allmydata.test.test_mutable.Update.test_replace_beginning allmydata.test.test_mutable.Update.test_replace_segstart1 =============================================================================== [ERROR] Traceback (most recent call last): Failure: allmydata.mutable.common.NotEnoughServersError: ("Publish ran out of good servers, last failure was: [Failure instance: Traceback: <type 'exceptions.KeyError'>: 1\n/Users/markberger/Code/tahoe-lafs/src/allmydata/mutable/publish.py:642:_push\n/Users/markberger/Code/tahoe-lafs/src/allmydata/mutable/publish.py:662:push_segment\n/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py:304:addCallback\n/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py:293:addCallbacks\n--- <exception caught here> ---\n/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py:575:_runCallbacks\n/Users/markberger/Code/tahoe-lafs/src/allmydata/mutable/publish.py:767:_push_segment\n]", None) allmydata.test.test_mutable.Update.test_replace_and_extend allmydata.test.test_mutable.Update.test_replace_in_last_segment allmydata.test.test_mutable.Update.test_replace_locations allmydata.test.test_mutable.Update.test_replace_middle allmydata.test.test_mutable.Update.test_replace_zero_length_middle =============================================================================== [ERROR] Traceback (most recent call last): File "/Users/markberger/Code/tahoe-lafs/src/allmydata/mutable/filenode.py", line 1177, in _build_uploadable_and_finish return p.update(u, offset, segments_and_bht[2], self._version) File "/Users/markberger/Code/tahoe-lafs/src/allmydata/mutable/publish.py", line 341, in update self._push() File "/Users/markberger/Code/tahoe-lafs/src/allmydata/mutable/publish.py", line 642, in _push return self.push_segment(self._current_segment) File "/Users/markberger/Code/tahoe-lafs/src/allmydata/mutable/publish.py", line 659, in push_segment return self._push() File "/Users/markberger/Code/tahoe-lafs/src/allmydata/mutable/publish.py", line 645, in _push return self.push_everything_else() File "/Users/markberger/Code/tahoe-lafs/src/allmydata/mutable/publish.py", line 778, in push_everything_else self.push_blockhashes() File "/Users/markberger/Code/tahoe-lafs/src/allmydata/mutable/publish.py", line 806, in push_blockhashes writers = self.writers[shnum] exceptions.KeyError: 0 allmydata.test.test_mutable.Update.test_replace_zero_length_beginning allmydata.test.test_mutable.Update.test_replace_zero_length_segstart1 ------------------------------------------------------------------------------- Ran 14 tests in 58.570s
Change History (10)
comment:1 Changed at 2013-07-23T13:22:21Z by markberger
- Description modified (diff)
comment:2 Changed at 2013-07-23T16:20:37Z by markberger
- Description modified (diff)
- Summary changed from Mutable update tests are sensitive to the number of servers to Update for mutable files is sensitive to the number of servers
comment:3 Changed at 2013-07-23T16:26:49Z by markberger
comment:4 Changed at 2013-07-23T17:14:25Z by markberger
- Keywords review-needed added
- Milestone undecided deleted
- Owner set to markberger
Here is a patch that will solve this problem: https://github.com/markberger/tahoe-lafs/tree/2034-mutable-update-race-condition
comment:5 Changed at 2013-07-23T22:55:32Z by markberger
This patch is necessary in order to have tests for #1057 with a reasonably large number of servers.
comment:6 Changed at 2013-08-05T23:13:46Z by markberger
- Description modified (diff)
comment:7 Changed at 2013-08-28T16:00:01Z by zooko
- Milestone set to 1.11.0
comment:8 Changed at 2013-08-28T16:02:57Z by daira
- Keywords blocks-release added
comment:9 Changed at 2013-09-20T19:35:20Z by zooko
I just had a look at https://github.com/markberger/tahoe-lafs/commit/08f70b511345c9ae267ae90fe31adf5cee1fd723, but I wasn't sure I understood the issue. I thought maybe looking at a unit test would help me understand what this patch is fixing, but there is no unit test included in the patch. Then I came back and read this ticket's original message and started to understand it better.
However, I still don't really understand it.
However, this can occur before the ServermapUpdater? has added the server to the servermap, causing the client to reject the server's response. Since the server is not added to the servermap, a writer for that server is not created.
When can this happen? Is this if there are > K servers in existence, and K servers have responded to queries, saying that they have no shares (or, perhaps some servers fail instead of responding "I have no shares"), and then… what happens? Is this server that "is not added to the servermap", mentioned above, one of the servers that has not yet been queried? Or it has been queried but it has not yet responded?
Ideas:
- Should we patch the unit tests, as described in the original message::
"This race condition can be replicated by setting the number of servers for allmydata.test.test_mutable.Update to 13."
?
- Should the resulting unit test have a comment giving this ticket # and URL to this ticket?
comment:10 Changed at 2013-09-20T19:36:33Z by zooko
- Description modified (diff)
Possible related tickets: #549, #1042