Opened at 2010-07-18T16:18:57Z
Last modified at 2021-03-30T18:40:19Z
#1124 new defect
upload is unhappy even though the shares are already distributed
Reported by: | zooko | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | soon |
Component: | code-peerselection | Version: | 1.7.0 |
Keywords: | upload availability reliability unfinished-business servers-of-happiness | Cc: | mjberger@…, tahoe-lafs.org@… |
Launchpad Bug: | | | |
Description (last modified by markberger)
Here's a test case that I added to src/allmydata/test/test_upload.py:
```python
def test_problem_layout_0123_03_2_1(self):
    # Zooko stumbled on this case when investigating #1118.
    self.basedir = self.mktemp()
    d = self._setup_and_upload(k=2, n=4)

    # server 0: shares 0, 1, 2, 3
    # server 1: shares 0, 3
    # server 2: share 1
    # server 3: share 2
    # With this layout, an upload should just be satisfied that the current
    # distribution is good enough, right?
    def _setup(ign):
        self._add_server_with_share(server_number=0, share_number=None)
        self._add_server_with_share(server_number=1, share_number=0)
        self._add_server_with_share(server_number=2, share_number=1)
        self._add_server_with_share(server_number=3, share_number=2)
        # Copy shares
        self._copy_share_to_server(3, 1)
        client = self.g.clients[0]
        client.DEFAULT_ENCODING_PARAMETERS['happy'] = 4
        return client
    d.addCallback(_setup)
    d.addCallback(lambda client:
        client.upload(upload.Data("data" * 10000, convergence="")))
    return d
```
When I run it, it fails like this:
```
[ERROR]: allmydata.test.test_upload.EncodingParameters.test_problem_layout_ticket_1118

Traceback (most recent call last):
  File "/Users/zooko/playground/tahoe-lafs/trunk/src/allmydata/immutable/upload.py", line 510, in _got_response
    return self._loop()
  File "/Users/zooko/playground/tahoe-lafs/trunk/src/allmydata/immutable/upload.py", line 363, in _loop
    self._get_progress_message()))
allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 3 server(s) such that any 2 of them have enough shares to recover the file, but we were asked to place shares on at least 4 such servers. (placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them have enough shares to recover the file, sent 4 queries to 4 peers, 4 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error))
```
Why does the upload not succeed? I added debugprints and here are some of them:
```
2010-07-18 09:54:15-0600 [-] response to allocate_buckets() from peer xgru5adv: alreadygot=(0, 1, 2, 3), allocated=()
2010-07-18 09:54:15-0600 [-] response to allocate_buckets() from peer ysbz4st7: alreadygot=(0, 3), allocated=()
2010-07-18 09:54:15-0600 [-] response to allocate_buckets() from peer b3llgpww: alreadygot=(2,), allocated=(1,)
2010-07-18 09:54:15-0600 [-] response to allocate_buckets() from peer rvsry4kn: alreadygot=(1,), allocated=(0,)
2010-07-18 09:54:15-0600 [-] server selection unsuccessful for <Tahoe2PeerSelector for upload t2uvf>: shares could be placed on only 3 server(s) such that any 2 of them have enough shares to recover the file, but we were asked to place shares on at least 4 such servers.: placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them have enough shares to recover the file, sent 4 queries to 4 peers, 4 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error): sh0: rvsry4kn, sh1: b3llgpww+rvsry4kn, sh2: b3llgpww, sh3: ysbz4st7+xgru5adv
```
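The "only 3 server(s)" in that error is the size of a maximum matching between shares and the servers the peer selector thinks hold them, taken over its final placement (the sh0/sh1/sh2/sh3 line at the end of the log). Here is a minimal, self-contained sketch of that computation (illustrative only, not the Tahoe code; the `max_matching` name and variable names are made up, and the server names are the abbreviated peer ids from the log). It reproduces the 3, and also shows that the shares as they actually sit on the grid, per the alreadygot= responses, would support a matching of 4:

```python
def max_matching(sharemap):
    """Size of a maximum matching between shares and the servers that hold them.

    `sharemap` maps share number -> set of servers believed to hold that share
    (the same shape used by allmydata.util.happinessutil).
    """
    matched = {}  # server -> the share currently assigned to it

    def try_assign(share, seen):
        for server in sharemap[share]:
            if server in seen:
                continue
            seen.add(server)
            # Use this server if it is free, or if the share it currently
            # holds can be re-homed onto some other server.
            if server not in matched or try_assign(matched[server], seen):
                matched[server] = share
                return True
        return False

    return sum(1 for share in sharemap if try_assign(share, set()))

# What the selector ends up believing (the "sh0: rvsry4kn, ..." line above):
final_placement = {
    0: {"rvsry4kn"},
    1: {"b3llgpww", "rvsry4kn"},
    2: {"b3llgpww"},
    3: {"ysbz4st7", "xgru5adv"},
}

# What is actually on the grid (the alreadygot= responses above):
actual_layout = {
    0: {"xgru5adv", "ysbz4st7"},
    1: {"xgru5adv", "rvsry4kn"},
    2: {"xgru5adv", "b3llgpww"},
    3: {"xgru5adv", "ysbz4st7"},
}

print(max_matching(final_placement))  # 3 -- matches the "only 3 server(s)" in the error
print(max_matching(actual_layout))    # 4 -- the existing distribution is already happy
```

So the grid already satisfies happy=4; the selector just never ends up believing it.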
Attachments (1)
Change History (24)
comment:1 follow-up: ↓ 2 Changed at 2010-07-18T19:54:59Z by davidsarah

(Nitpick: the name of the test method should be 0123_03_1_2.)

Is it the maximum matching implementation that is incorrect? If it is, then it should be possible to write a narrower test case for that.
comment:2 in reply to: ↑ 1 Changed at 2010-07-18T20:07:57Z by davidsarah
Replying to davidsarah:
Is it the maximum matching implementation that is incorrect? If it is, then it should be possible to write a narrower test case for that.
That isn't the problem:
```
>>> from allmydata.util import happinessutil
>>> sharemap = {0: set([0, 1, 2, 3]), 1: set([0, 3]), 2: set([1]), 3: set([2])}
>>> happinessutil.servers_of_happiness(sharemap)
4
```
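For comparison, a hypothetical continuation of that session, feeding in the placement map the selector ended up with (the "sh0: rvsry4kn, ..." line in the description), ought to report 3, exactly the count in the UploadUnhappinessError. In other words, the matching code and the error message agree; the placement decisions, not the happiness computation, are at fault. (The peer-id strings here abbreviate the names from the log and are used purely for illustration; the output shown assumes servers_of_happiness is the maximum-matching computation checked above.)

```
>>> placement = {0: set(['rvsry4kn']),
...              1: set(['b3llgpww', 'rvsry4kn']),
...              2: set(['b3llgpww']),
...              3: set(['ysbz4st7', 'xgru5adv'])}
>>> happinessutil.servers_of_happiness(placement)
3
```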
comment:3 Changed at 2010-07-18T21:30:44Z by zooko
Swapping the order of the last two servers in the test changes the upload from failing to succeeding. This version fails:
```python
def test_problem_layout_0123_03_1_2(self):
    # Zooko stumbled on this case when investigating #1118.
    self.basedir = self.mktemp()
    d = self._setup_and_upload(k=2, n=4)

    # server 0: shares 0, 1, 2, 3
    # server 1: shares 0, 3
    # server 2: share 1
    # server 3: share 2
    # With this layout, an upload should just be satisfied that the current
    # distribution is good enough, right?
    def _setup(ign):
        self._add_server_with_share(server_number=0, share_number=None)
        self._add_server_with_share(server_number=1, share_number=0)
        self._add_server_with_share(server_number=2, share_number=1)
        self._add_server_with_share(server_number=3, share_number=2)
        # Copy shares
        self._copy_share_to_server(3, 1)
        client = self.g.clients[0]
        client.DEFAULT_ENCODING_PARAMETERS['happy'] = 4
        return client
    d.addCallback(_setup)
    d.addCallback(lambda client:
        client.upload(upload.Data("data" * 10000, convergence="")))
    return d
```
while logging the following:
```
$ grep -Ee"response|success" _trial_temp/test.log
2010-07-18 15:16:55-0600 [-] response from peer xgru5adv: alreadygot=(), allocated=(0,)
2010-07-18 15:16:55-0600 [-] response from peer xgru5adv: alreadygot=(), allocated=(1, 2, 3)
2010-07-18 15:16:55-0600 [-] peer selection successful for <Tahoe2PeerSelector for upload t2uvf>: placed all 4 shares, want to place shares on at least 1 servers such that any 2 of them have enough shares to recover the file, sent 2 queries to 1 peers, 2 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error)
2010-07-18 15:16:55-0600 [-] response from peer xgru5adv: alreadygot=(0, 1, 2, 3), allocated=()
2010-07-18 15:16:55-0600 [-] response from peer ysbz4st7: alreadygot=(0, 3), allocated=()
2010-07-18 15:16:55-0600 [-] response from peer b3llgpww: alreadygot=(2,), allocated=(1,)
2010-07-18 15:16:55-0600 [-] response from peer rvsry4kn: alreadygot=(1,), allocated=(0,)
```
This succeeds:
```python
def test_problem_layout_0123_03_2_1(self):
    # Zooko stumbled on this case when investigating #1118.
    self.basedir = self.mktemp()
    d = self._setup_and_upload(k=2, n=4)

    # server 0: shares 0, 1, 2, 3
    # server 1: shares 0, 3
    # server 2: share 2
    # server 3: share 1
    # With this layout, an upload should just be satisfied that the current
    # distribution is good enough, right?
    def _setup(ign):
        self._add_server_with_share(server_number=0, share_number=None)
        self._add_server_with_share(server_number=1, share_number=0)
        self._add_server_with_share(server_number=2, share_number=2)
        self._add_server_with_share(server_number=3, share_number=1)
        # Copy shares
        self._copy_share_to_server(3, 1)
        client = self.g.clients[0]
        client.DEFAULT_ENCODING_PARAMETERS['happy'] = 4
        return client
    d.addCallback(_setup)
    d.addCallback(lambda client:
        client.upload(upload.Data("data" * 10000, convergence="")))
    return d
```
while logging the following:
```
$ grep -Ee"response|success" _trial_temp/test.log
2010-07-18 15:16:01-0600 [-] response from peer xgru5adv: alreadygot=(), allocated=(0,)
2010-07-18 15:16:01-0600 [-] response from peer xgru5adv: alreadygot=(), allocated=(1, 2, 3)
2010-07-18 15:16:01-0600 [-] peer selection successful for <Tahoe2PeerSelector for upload t2uvf>: placed all 4 shares, want to place shares on at least 1 servers such that any 2 of them have enough shares to recover the file, sent 2 queries to 1 peers, 2 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error)
2010-07-18 15:16:02-0600 [-] response from peer xgru5adv: alreadygot=(0, 1, 2, 3), allocated=()
2010-07-18 15:16:02-0600 [-] response from peer ysbz4st7: alreadygot=(0, 3), allocated=()
2010-07-18 15:16:02-0600 [-] response from peer b3llgpww: alreadygot=(1,), allocated=()
2010-07-18 15:16:02-0600 [-] response from peer rvsry4kn: alreadygot=(2,), allocated=()
2010-07-18 15:16:02-0600 [-] peer selection successful for <Tahoe2PeerSelector for upload t2uvf>: placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them have enough shares to recover the file, sent 4 queries to 4 peers, 4 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error)
```
Changed at 2010-07-18T22:47:38Z by zooko
comment:4 Changed at 2010-07-18T22:48:01Z by zooko
attachment:test-1124.dpatch.txt is a test for this issue, marked TODO.
comment:5 Changed at 2010-07-18T23:45:15Z by davidsarah
The problem seems to be the share redistribution algorithm implemented by Tahoe2PeerSelector (as modified by #778). If we instrument it to print out the sharemap (i.e. sharenum -> set of peerids) after the call to merge_peers in each iteration of _loop, we get (abbreviating each peerid to its first hex byte):
```
{0: set(['b9']), 1: set(['b9']), 2: set(['b9']), 3: set(['b9'])}
{0: set(['c4']), 1: set(['0e']), 2: set(['0e']), 3: set(['c4', 'b9'])}
{0: set(['8d']), 1: set(['0e', '8d']), 2: set(['0e']), 3: set(['c4', 'b9'])}
```
Presumably 'b9' is server 0. So we merge in all of the shares found for that server, but then in the same iteration of _loop, we move all but one of them into homeless_shares. I don't understand the intent of the algorithm fully, but it seems as though we're forgetting that server 0 had additional shares that, if taken into account, would have increased servers_of_happiness -- even though they didn't increase it in that particular iteration.
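To make that concrete, here is a small brute-force check of the three traced sharemaps above (an illustrative sketch; the `happiness` helper name is made up, and the share counts are small enough that trying every assignment is fine). Once 'b9' has been dropped from shares 0, 1, and 2, the working map never scores better than 3, even though the shares actually sitting on the grid (comment:2's sharemap) score 4:

```python
from itertools import permutations

def happiness(sharemap):
    """Brute-force servers-of-happiness: the largest number of shares that can
    each be paired with a distinct server believed to hold a copy of it."""
    shares = list(sharemap)
    servers = sorted({s for owners in sharemap.values() for s in owners})
    # Pad with placeholders so every share can be paired with something.
    padded = servers + [None] * max(0, len(shares) - len(servers))
    best = 0
    for assignment in permutations(padded, len(shares)):
        score = sum(1 for share, server in zip(shares, assignment)
                    if server in sharemap[share])
        best = max(best, score)
    return best

# The three sharemaps printed by the instrumented _loop, copied from above:
traced = [
    {0: {'b9'}, 1: {'b9'}, 2: {'b9'}, 3: {'b9'}},
    {0: {'c4'}, 1: {'0e'}, 2: {'0e'}, 3: {'c4', 'b9'}},
    {0: {'8d'}, 1: {'0e', '8d'}, 2: {'0e'}, 3: {'c4', 'b9'}},
]
print([happiness(m) for m in traced])   # [1, 3, 3] -- stuck at 3 after 'b9' is forgotten
```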
comment:6 Changed at 2010-07-18T23:46:50Z by davidsarah
- Keywords reviewed added
comment:7 Changed at 2010-07-19T02:48:38Z by kevan
The intent of the algorithm is to identify servers with more than one share, and make some of the shares on those servers homeless so that they can be redistributed to peers that might not have had any shares assigned to them yet. It is a greedy algorithm that doesn't quite do the trick in a lot of situations, and it seems that this is one of them. test_problem_layout_comment_187 is another such layout; it is marked as todo because we hope to change the uploader to do a better job of share redistribution in 1.8. This might end up as part of #1126, or as another ticket that hasn't been filed yet.
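For what it's worth, the behaviour this ticket asks for can be stated simply: before re-homing anything, check whether the shares that already exist are by themselves happy enough, and if so leave them alone. A rough sketch of that check, reusing happinessutil as shown in comment:2 (the `existing_placement_is_happy` name and the server labels are made up for illustration; this is the idea, not a proposed patch):

```python
from allmydata.util import happinessutil

def existing_placement_is_happy(alreadygot_by_server, needed_happiness):
    """True if the shares already on the grid satisfy servers-of-happiness
    on their own, i.e. no re-homing or new allocation is needed at all.

    alreadygot_by_server maps server -> set of share numbers it reported
    in its alreadygot= response.
    """
    # Invert into sharemap form: share number -> set of servers holding it.
    sharemap = {}
    for server, shares in alreadygot_by_server.items():
        for share in shares:
            sharemap.setdefault(share, set()).add(server)
    return happinessutil.servers_of_happiness(sharemap) >= needed_happiness

# The layout from this ticket, as reported by the alreadygot= responses:
existing = {
    'server0': set([0, 1, 2, 3]),
    'server1': set([0, 3]),
    'server2': set([1]),
    'server3': set([2]),
}
# With happy=4 this is already satisfied (comment:2 shows the equivalent
# sharemap scoring 4), so an uploader doing this check first would accept
# the layout as-is instead of raising UploadUnhappinessError.
```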
comment:8 Changed at 2010-07-21T16:00:28Z by zooko
There was a bug in the code that david-sarah was testing in comment:5. That bug was fixed by 13b5e44fbc2effd0. We should re-evaluate this ticket.
comment:9 Changed at 2010-07-24T00:27:33Z by davidsarah
- Keywords availability reliability added; reviewed removed
Test was committed in 0f46766a51792eb5.
comment:10 Changed at 2010-08-12T23:33:50Z by davidsarah
- Milestone changed from undecided to 1.9.0
comment:11 Changed at 2010-08-12T23:34:48Z by davidsarah
- Keywords unfinished-business added
comment:12 Changed at 2010-12-29T09:15:58Z by zooko
- Keywords servers-of-happiness added
comment:13 follow-up: ↓ 14 Changed at 2011-08-29T15:33:15Z by warner
- Milestone changed from 1.9.0 to 1.10.0
what needs to happen to resolve this? do we have a plan to improve the share-distribution algorithm? It seems to me that there's no chance of this being a small, safe fix, so I'm kicking it out of 1.9.
comment:14 in reply to: ↑ 13 Changed at 2011-08-30T23:15:04Z by davidsarah
Replying to warner:
what needs to happen to resolve this? do we have a plan to improve the share-distribution algorithm?
That was being done in #1382. If I understand correctly, Kevan's patches there try to implement the algorithm of ticket:1130#comment:12, or something very much like it.
comment:15 Changed at 2013-05-10T20:31:56Z by markberger
- Cc mjberger@… added
- Description modified (diff)
comment:16 Changed at 2013-09-01T05:29:51Z by daira
- Milestone changed from soon to 1.11.0
comment:17 Changed at 2014-04-29T22:46:37Z by zooko
I believe this will be fixed by #1382.
comment:18 Changed at 2015-01-20T17:25:53Z by warner
- Milestone changed from 1.10.1 to 1.11.0
comment:19 Changed at 2015-02-09T10:55:42Z by lpirl
- Cc tahoe-lafs.org@… added
comment:20 Changed at 2016-03-22T05:02:52Z by warner
- Milestone changed from 1.11.0 to 1.12.0
Milestone renamed
comment:21 Changed at 2016-06-28T18:20:37Z by warner
- Milestone changed from 1.12.0 to 1.13.0
moving most tickets from 1.12 to 1.13 so we can release 1.12 with magic-folders
comment:22 Changed at 2020-06-30T14:45:13Z by exarkun
- Milestone changed from 1.13.0 to 1.15.0
Moving open issues out of closed milestones.
comment:23 Changed at 2021-03-30T18:40:19Z by meejah
- Milestone changed from 1.15.0 to soon
Ticket retargeted after milestone closed