Opened at 2012-07-06T23:32:42Z
Last modified at 2016-08-05T22:02:46Z
#1791 new defect
UploadUnhappinessError with available storage nodes > shares.happy
Reported by: | gyver | Owned by: | gyver |
---|---|---|---|
Priority: | major | Milestone: | soon |
Component: | code-peerselection | Version: | 1.9.2 |
Keywords: | servers-of-happiness upload error | Cc: | zooko, vladimir@… |
Launchpad Bug: |
Description (last modified by daira)
The error happened with 1.9.1 too. I just upgraded to 1.9.2 and fixed some files/dir that 1.9.1 couldn't repair reliably hoping the following problem would get away too (it didn't).
There are some peculiarities in my setup: I use USB disks connected to a single server so all storage nodes are running on the same server although physically on a disk that can easily be sent away for increasing the durability of the whole storage. At the time of failure there were 7 such storage nodes in my setup and my whole store was fully repaired on these 7 nodes, all the content is/was uploaded with shares.needed = 4 shares.happy = 6 shares.total = 6
Although 7 >= 6 I get this error when trying to tahoe cp a new file:
Traceback (most recent call last): File \"/usr/lib64/python2.7/site-packages/foolscap/call.py\", line 677, in _done self.request.complete(res) File \"/usr/lib64/python2.7/site-packages/foolscap/call.py\", line 60, in complete self.deferred.callback(res) File \"/usr/lib64/python2.7/site-packages/twisted/internet/defer.py\", line 361, in callback self._startRunCallbacks(result) File \"/usr/lib64/python2.7/site-packages/twisted/internet/defer.py\", line 455, in _startRunCallbacks self._runCallbacks() --- <exception caught here> --- File \"/usr/lib64/python2.7/site-packages/twisted/internet/defer.py\", line 542, in _runCallbacks current.result = callback(current.result, *args, **kw) File \"/usr/lib64/python2.7/site-packages/allmydata/immutable/upload.py\", line 553, in _got_response return self._loop() File \"/usr/lib64/python2.7/site-packages/allmydata/immutable/upload.py\", line 404, in _loop return self._failed(\"%s (%s)\" % (failmsg, self._get_progress_message())) File \"/usr/lib64/python2.7/site-packages/allmydata/immutable/upload.py\", line 566, in _failed raise UploadUnhappinessError(msg) allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 5 server(s) such that any 4 of them have enough shares to recover the file, but we were asked to place shares on at least 6 such servers. (placed all 6 shares, want to place shares on at least 6 servers such that any 4 of them have enough shares to recover the file, sent 5 queries to 5 servers, 5 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error))
I recently found out about flogtool, so I run it on the client node (which is one of the 7 storage nodes btw), I only pasted the last part from CHKUploader (I can attach the whole log if needs be):
01:04:01.314 L20 []#2339 CHKUploader starting 01:04:01.314 L20 []#2340 starting upload of <allmydata.immutable.upload.EncryptAnUploadable instance at 0x2c9b5a8> 01:04:01.314 L20 []#2341 creating Encoder <Encoder for unknown storage index> 01:04:01.314 L20 []#2342 file size: 4669394 01:04:01.314 L10 []#2343 my encoding parameters: (4, 6, 6, 131072) 01:04:01.314 L20 []#2344 got encoding parameters: 4/6/6 131072 01:04:01.314 L20 []#2345 now setting up codec 01:04:01.348 L20 []#2346 using storage index k5ga2 01:04:01.348 L20 []#2347 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): starting 01:04:01.363 L10 []#2348 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server zp6jpfeu: alreadygot=(0,), allocated=() 01:04:01.372 L10 []#2349 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server pa2myijh: alreadygot=(2,), allocated=(1,) 01:04:01.375 L20 []#2350 storage: allocate_buckets k5ga2suaoz7gju523f5ni3mswe 01:04:01.377 L10 []#2351 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server omkzwfx5: alreadygot=(3,), allocated=() 01:04:01.381 L10 []#2352 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server wo6akhxt: alreadygot=(4,), allocated=() 01:04:01.404 L10 []#2353 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server ughwvrtu: alreadygot=(), allocated=(5,) 01:04:01.405 L25 []#2354 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): server selection unsuccessful for <Tahoe2ServerSelector for upload k5ga2>: shares could be placed on only 5 server(s) such that any 4 of them have enough shares to recover the file, but we were asked to place shares on at least 6 such servers. (placed all 6 shares, want to place shares on at least 6 servers such that any 4 of them have enough shares to recover the file, sent 5 queries to 5 servers, 5 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error)), merged=sh0: zp6jpfeu, sh1: pa2myijh, sh2: pa2myijh, sh3: omkzwfx5, sh4: wo6akhxt, sh5: ughwvrtu 01:04:01.407 L20 []#2355 web: 127.0.0.1 PUT /uri/[CENSORED].. 500 1644
Change History (20)
comment:1 Changed at 2012-07-07T12:55:47Z by davidsarah
comment:2 Changed at 2012-07-07T13:38:21Z by gyver
Yes, I'm quite sure all 7 were active. I start all nodes with a single script that in turns:
- starts the introducer,
- starts each storage node.
If one of the expected node can't be started I see it right away in the script output when starting the grid. At the time of failure, I even checked the web interface of both the node I use as a client and the introducer and they both listed all 7 storage nodes. I even checked that there was plenty of free space on each storage node and that there was no configured reserved space that could explain a possible node refusal of storing data.
I just rechecked and noticed something. The server has 3 IP addresses: the loopback, a private IP on a local network and a private IP on a VPN (managed by OpenVPN). Apparently each node advertises its services on all 3 IPs (I assume it's by design). But the listing of storage nodes given by my "client" node isn't exactly consistent with the one given by the introducer.
Here are the current outputs (there shouldn't be any security problem publishing this so I didn't obfuscate anything):
Introducer's Service Announcements:
Nickname | Advertised IPs | Announced | Version | Service name |
---|---|---|---|---|
store_1 omkzwfx5fconrg4fjqxwt3hdzsmvhfdv | 192.168.0.1 10.8.0.10 | 00:45:24 07-Jul-2012 | allmydata-tahoe/1.9.2 | storage |
store_2 d3fycappvcjuma7vwc45spfolaei6n24 | 192.168.0.1 10.8.0.10 | 00:45:25 07-Jul-2012 | allmydata-tahoe/1.9.2 | storage |
store_3 lehmccp7am6cydjh3qsfmdp3scxzprmv | 192.168.0.1 10.8.0.10 | 00:45:26 07-Jul-2012 | allmydata-tahoe/1.9.2 | storage |
store_4 pa2myijhnj3ld4owmqxsihow4hv2botv | 192.168.0.1 10.8.0.10 | 00:45:27 07-Jul-2012 | allmydata-tahoe/1.9.2 | storage |
store_5 wo6akhxtren7esuhbtjscmpzvjwuvken | 192.168.0.1 10.8.0.10 | 00:45:29 07-Jul-2012 | allmydata-tahoe/1.9.2 | storage |
store_6 zp6jpfeudsrxv3n2sz6he3kep3mw3tgr | 192.168.0.1 10.8.0.10 | 00:45:30 07-Jul-2012 | allmydata-tahoe/1.9.2 | storage |
store_8 ughwvrtu7ht6x4xpgr2abbsb457ipcvr | 192.168.0.1 10.8.0.10 | 00:45:31 07-Jul-2012 | allmydata-tahoe/1.9.2 | storage |
Introducer's Subscribed Clients:
Nickname | PeerID | Advertised IPs | Connected From | Since | Version | Subscribed To |
---|---|---|---|---|---|---|
store_1 | omkzwfx5fconrg4fjqxwt3hdzsmvhfdv | 192.168.0.1 10.8.0.10 | 192.168.0.1:54872 | 00:45:24 07-Jul-2012 | allmydata-tahoe/1.9.2 | storage |
store_5 | wo6akhxtren7esuhbtjscmpzvjwuvken | 192.168.0.1 10.8.0.10 | 192.168.0.1:54949 | 00:45:29 07-Jul-2012 | allmydata-tahoe/1.9.2 | storage |
store_2 | d3fycappvcjuma7vwc45spfolaei6n24 | 192.168.0.1 10.8.0.10 | 192.168.0.1:54884 | 00:45:25 07-Jul-2012 | allmydata-tahoe/1.9.2 | storage |
store_3 | lehmccp7am6cydjh3qsfmdp3scxzprmv | 192.168.0.1 10.8.0.10 | 127.0.0.1:55124 | 00:45:26 07-Jul-2012 | allmydata-tahoe/1.9.2 | storage |
store_4 | pa2myijhnj3ld4owmqxsihow4hv2botv | 192.168.0.1 10.8.0.10 | 127.0.0.1:55143 | 00:45:27 07-Jul-2012 | allmydata-tahoe/1.9.2 | storage |
store_6 | zp6jpfeudsrxv3n2sz6he3kep3mw3tgr | 192.168.0.1 10.8.0.10 | 127.0.0.1:55207 | 00:45:30 07-Jul-2012 | allmydata-tahoe/1.9.2 | storage |
store_8 | ughwvrtu7ht6x4xpgr2abbsb457ipcvr | 192.168.0.1 10.8.0.10 | 127.0.0.1:55252 | 00:45:31 07-Jul-2012 | allmydata-tahoe/1.9.2 | storage |
"The storage node I use as a client"'s status:
Connected to 7 of 7 known storage servers:
Service Name | Nickname PeerID | Connected? | Since | First Announced | Version |
---|---|---|---|---|---|
storage | store_2 d3fycappvcjuma7vwc45spfolaei6n24 | Yes: to 127.0.0.1:48397 | 00:45:25 07-Jul-2012 | 00:45:25 07-Jul-2012 | allmydata-tahoe/1.9.2 |
storage | store_3 lehmccp7am6cydjh3qsfmdp3scxzprmv | Yes: to 192.168.0.1:38708 | 00:45:26 07-Jul-2012 | 00:45:26 07-Jul-2012 | allmydata-tahoe/1.9.2 |
storage | store_1 omkzwfx5fconrg4fjqxwt3hdzsmvhfdv | Yes: to (loopback) | 00:45:24 07-Jul-2012 | 00:45:24 07-Jul-2012 | allmydata-tahoe/1.9.2 |
storage | store_4 pa2myijhnj3ld4owmqxsihow4hv2botv | Yes: to 10.8.0.10:40626 | 00:45:27 07-Jul-2012 | 00:45:27 07-Jul-2012 | allmydata-tahoe/1.9.2 |
storage | store_8 ughwvrtu7ht6x4xpgr2abbsb457ipcvr | Yes: to 10.8.0.10:49604 | 00:45:31 07-Jul-2012 | 00:45:31 07-Jul-2012 | allmydata-tahoe/1.9.2 |
storage | store_5 wo6akhxtren7esuhbtjscmpzvjwuvken | Yes: to 127.0.0.1:50958 | 00:45:29 07-Jul-2012 | 00:45:29 07-Jul-2012 | allmydata-tahoe/1.9.2 |
storage | store_6 zp6jpfeudsrxv3n2sz6he3kep3mw3tgr | Yes: to 10.8.0.10:52661 | 00:45:30 07-Jul-2012 | 00:45:30 07-Jul-2012 | allmydata-tahoe/1.9.2 |
I'm not sure how the service announcement and IP selection works, but there seems to be at least some amount of chance involved in the IP selection. All nodes should behave themselves in the same way so AFAIK the same IP should be selected.
comment:3 follow-up: ↓ 5 Changed at 2012-07-07T14:59:54Z by davidsarah
Hmm. d3fycapp and lehmccp7 were the servers that were not contacted in the failed upload, and they have IPs 127.0.0.1 and 192.168.0.1 respectively. But wo6akhxt also has IP 127.0.0.1 and that was contacted in the failed upload, so that might be a red herring.
I don't know why the determination of which IP to use is nondeterministic. warner?
comment:4 Changed at 2012-07-07T15:02:02Z by davidsarah
BTW, I have seen nondeterministic choice of IP on the Least Authority Enterprises servers (which are EC2 instances running Ubuntu) as well.
comment:5 in reply to: ↑ 3 Changed at 2012-07-07T15:11:42Z by gyver
Replying to davidsarah:
Hmm. d3fycapp and lehmccp7 were the servers that were not contacted in the failed upload, and they have IPs 127.0.0.1 and 192.168.0.1 respectively. But wo6akhxt also has IP 127.0.0.1 and that was contacted in the failed upload, so that might be a red herring.
Note that the assigned IPs are not always the same: a restart of all storage nodes reshuffles them. My last failed attempt was with d3fycapp and lehmccp7 seen both on 10.8.0.10 (VPN IP) by the client node (5 nodes where seen on 10.8.0.10 with ughwvrtu on 192.168.0.1 and the client (omkzwfx5) on loopback. It seems the IP addresses used don't change the error: I've always seen the same (only 5 servers queried) since the problem appeared.
The problem may not have anything to do with IP address choices but it seems to me these inconsistencies are odd enough to keep in mind.
comment:6 follow-up: ↓ 9 Changed at 2012-07-07T19:38:30Z by davidsarah
Please add the following just after line 225 (i.e. after readonly_servers = ... and before # decide upon the renewal/cancel secrets...) of src/allmydata/immutable/upload.py in 1.9.2:
self.log(format="servers before upload: %(connected_servers)r permuted to %(all_servers)r, of which" "%(writeable_servers)r are writeable and %(readonly_servers)r are read-only", connected_servers=storage_broker.get_connected_servers(), all_servers=all_servers, writeable_servers=writeable_servers, readonly_servers=readonly_servers, level=log.OPERATIONAL)
and then show the log for a failing upload.
(You need to restart the gateway after changing the code, but it's not necessary to rebuild it.)
comment:7 Changed at 2012-07-07T19:40:15Z by davidsarah
- Component changed from unknown to code-peerselection
- Milestone changed from undecided to 1.10.0
- Owner changed from davidsarah to gyver
comment:8 Changed at 2012-07-07T19:40:38Z by davidsarah
- Keywords happiness upload error added; happy removed
comment:9 in reply to: ↑ 6 Changed at 2012-07-07T21:30:23Z by gyver
Replying to davidsarah:
Please add the following just after line 225 (i.e. after readonly_servers = ... and before # decide upon the renewal/cancel secrets...) of src/allmydata/immutable/upload.py in 1.9.2:
I may not have done it right : I got the same output with this at the end:
23:09:02.238 L23 []#2436 an outbound callRemote (that we [omkz] sent to someone else [zqxq]) failed on the far end 23:09:02.238 L10 []#2437 reqID=873, rref=<RemoteReference at 0x2e780d0>, methname=RILogObserver.foolscap.lothar.com.msg 23:09:02.238 L10 []#2438 the REMOTE failure was: FAILURE: [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last): File "/usr/lib64/python2.7/site-packages/foolscap/slicers/root.py", line 107, in send d.callback(None) File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 361, in callback self._startRunCallbacks(result) File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 455, in _startRunCallbacks self._runCallbacks() File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 542, in _runCallbacks current.result = callback(current.result, *args, **kw) --- <exception caught here> --- File "/usr/lib64/python2.7/site-packages/foolscap/banana.py", line 215, in produce slicer = self.newSlicerFor(obj) File "/usr/lib64/python2.7/site-packages/foolscap/banana.py", line 314, in newSlicerFor return topSlicer.slicerForObject(obj) File "/usr/lib64/python2.7/site-packages/foolscap/slicer.py", line 48,
BUT... I may have a lead looking at the last error message in my original log dump.
server selection unsuccessful for <Tahoe2ServerSelector for upload k5ga2>: shares could be placed on only 5 server(s) [...], merged=sh0: zp6jpfeu, sh1: pa2myijh, sh2: pa2myijh, sh3: omkzwfx5, sh4: wo6akhxt, sh5: ughwvrtu
I assume the sh<n> are the shares to be placed. sh1 and sh2 were affected to pa2myijh. I'm not sure if this repartition is the result of share detection (my guess) or the result of a share placement algorithm that could produce invalid placement and needs a check before upload (late error detection isn't good practice so I bet it's not the case).
What if these shares are already stored on pa2myijh before the upload attempt (due to past uploads with a buggy version or whatever happened in the store directory out of Tahoe's control). Is the code able to detect such a case and reupload one of the two shares on a free (without one of the 6 shares) server? If not, it might be the cause of my problem (the file was part of a long list of files I tried to upload with only partial success weeks ago...) and my storage nodes are most probably polluted by "dangling" shares.
comment:10 follow-up: ↓ 18 Changed at 2012-07-08T00:47:59Z by davidsarah
- Keywords servers-of-happiness added; happiness removed
There was a bug in the statement I asked you to add; please replace it entirely with this one:
self.log(format="servers before upload: %(connected_servers)s permuted to %(all_servers)s, of which" "%(writeable_servers)s are writeable and %(readonly_servers)s are read-only", connected_servers=repr(storage_broker.get_connected_servers()), all_servers=repr(all_servers), writeable_servers=repr(writeable_servers), readonly_servers=repr(readonly_servers), level=log.OPERATIONAL)
In answer to your last question, the fact that there are existing shares should not cause an UploadUnhappinessError. However, bugs #1124 and #1130 describe cases where we don't achieve that. I don't think that your problem is due to that, though, because it seems from the logs that the gateway is not contacting enough servers to make it possible to achieve happiness, regardless of the existing share distribution.
comment:11 follow-up: ↓ 12 Changed at 2012-07-08T00:52:25Z by davidsarah
I'm hoping that this bug is the same one that has been occasionally reported on VolunteerGrid2 with uploads where shares.happy is close to shares.total (and to the number of servers). It has very similar symptoms, but gyver seems to be able to reproduce it more easily.
comment:12 in reply to: ↑ 11 ; follow-up: ↓ 13 Changed at 2012-07-08T01:12:29Z by gyver
Replying to davidsarah:
[...] but gyver seems to be able to reproduce it more easily.
You bet: I take more time firing ssh and joining the screen session that the tahoe cp takes to fail :-)
Here's the log with a bit of context:
03:09:17.432 L20 []#2487 CHKUploader starting 03:09:17.432 L20 []#2488 starting upload of <allmydata.immutable.upload.EncryptAnUploadable instance at 0x2fe8998> 03:09:17.432 L20 []#2489 creating Encoder <Encoder for unknown storage index> 03:09:17.432 L20 []#2490 file size: 4669394 03:09:17.432 L10 []#2491 my encoding parameters: (4, 6, 6, 131072) 03:09:17.432 L20 []#2492 got encoding parameters: 4/6/6 131072 03:09:17.432 L20 []#2493 now setting up codec 03:09:17.468 L20 []#2494 using storage index k5ga2 03:09:17.468 L20 []#2495 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): starting 03:09:17.469 L20 []#2496 servers before upload: frozenset([<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for d3fycapp>]) permuted to [<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for d3fycapp>], of which[<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for d3fycapp>] are writeable and set([]) are read-only 03:09:17.485 L10 []#2497 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server zp6jpfeu: alreadygot=(0,), allocated=() 03:09:17.497 L10 []#2498 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server pa2myijh: alreadygot=(2,), allocated=(1,) 03:09:17.500 L20 []#2499 storage: allocate_buckets k5ga2suaoz7gju523f5ni3mswe 03:09:17.515 L10 []#2500 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server omkzwfx5: alreadygot=(3,), allocated=() 03:09:17.523 L10 []#2501 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server wo6akhxt: alreadygot=(4,), allocated=() 03:09:17.530 L10 []#2502 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server ughwvrtu: alreadygot=(), allocated=(5,) 03:09:17.531 L25 []#2503 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): server selection unsuccessful for <Tahoe2ServerSelector for upload k5ga2>: shares could be placed on only 5 server(s) such that any 4 of them have enough shares to recover the file, but we were asked to place shares on at least 6 such servers. (placed all 6 shares, want to place shares on at least 6 servers such that any 4 of them have enough shares to recover the file, sent 5 queries to 5 servers, 5 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error)), merged=sh0: zp6jpfeu, sh1: pa2myijh, sh2: pa2myijh, sh3: omkzwfx5, sh4: wo6akhxt, sh5: ughwvrtu 03:09:17.533 L20 []#2504 web: 127.0.0.1 PUT /uri/[CENSORED].. 500 1644
comment:13 in reply to: ↑ 12 Changed at 2012-07-08T16:44:48Z by davidsarah
Replying to gyver:
> 03:09:17.469 L20 []#2496 servers before upload: > frozenset([<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for d3fycapp>]) > permuted to [<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for d3fycapp>], > of which [<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for d3fycapp>] > are writeable and set([]) are read-only
OK, that proves that the problem occurs after deciding which servers are writeable. We seem to be logging only responses to remote allocate_buckets requests at the gateway, so the next steps are:
a) Log when the gateway sends an allocate_buckets request.
b) Look at the logs of the storage servers to see how many of them receive an allocate_buckets request (which is logged here as "storage: allocate_buckets <SI>") and what they do about it.
To do a), add this at line 105 of src/allmydata/immutable/upload.py (in the query method of ServerTracker after rref = self._server.get_rref()):
log.msg("query(%(sharenums)s) for %(rref)s", sharenums=repr(sharenums), rref=repr(rref), level=log.OPERATIONAL)
comment:14 Changed at 2012-07-09T18:26:31Z by zooko
- Cc zooko added
comment:15 Changed at 2012-09-04T16:59:28Z by warner
- Milestone changed from 1.10.0 to 1.11.0
comment:17 Changed at 2013-07-09T14:45:41Z by daira
From #2016 which has now been marked as a duplicate:
daira wrote:
Here's the most important part of the log:
local#675138 20:33:50.290: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server i76mi6: alreadygot=(0,), allocated=() local#675139 20:33:50.457: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server lxmst5: alreadygot=(2,), allocated=(1,) local#675140 20:33:50.667: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server sf7ehc: alreadygot=(3,), allocated=() local#675141 20:33:50.822: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server ddvfcd: alreadygot=(4,), allocated=() local#675142 20:33:50.839: <Tahoe2ServerSelector for upload jbljj>(jbljj): server selection unsuccessful for <Tahoe2ServerSelector for upload jbljj>: shares could be placed on only 4 server(s) such that any 3 of them have enough shares to recover the file, but we were asked to place shares on at least 5 such servers. (placed all 5 shares, want to place shares on at least 5 servers such that any 3 of them have enough shares to recover the file, sent 6 queries to 6 servers, 4 queries placed some shares, 2 placed none (of which 2 placed none due to the server being full and 0 placed none due to an error)), merged=sh0: i76mi6en, sh1: lxmst5bx, sh2: lxmst5bx, sh3: sf7ehcpn, sh4: ddvfcdns
daira wrote:
Here's my interpretation: with h = N = 5, as soon as the Tahoe2ServerSelector decides to put two shares on the same server (here sh1 and sh2 on lxmst5bx), the upload is doomed. The shares all have to be on different servers whenever h = N, but the termination condition is just that all shares have been placed, not that they have been placed in a way that meets the happiness condition.
If that's the problem, then #1382 should fix it. This would also explain why VG2 was unreliable with h close to N.
zooko replied:
Daira: excellent work diagnosing this!! Ed: thanks so much for the bug report. Daira: it looks like you are right, and I think this does explain those bugs that the volunteergrid2 people reported and that I never understood. Thank you!
kapiteined wrote:
And to check if that is the case, i changed to 3-7-10 encoding, and now the upload succeeds! Success: file copied
Does this call for a change in code, or for a big warning sticker: "don't choose h and n to close together" ?
daira wrote:
We intend to fix it for v1.11 (Mark Berger's branch for #1382 already basically works), but there would be no harm in pointing out this problem on tahoe-dev in the meantime.
daira wrote:
Same bug as #1791?
Yes, that bug also had h = N and two shares that were placed on the same server, so almost identical. I'll copy the conclusions here to that ticket.
comment:18 in reply to: ↑ 10 Changed at 2013-07-09T14:55:05Z by daira
Replying to myself in comment:10:
In answer to your last question (gyver in comment:9), the fact that there are existing shares should not cause an UploadUnhappinessError. However, bugs #1124 and #1130 describe cases where we don't achieve that. I don't think that your problem is due to that, though, because it seems from the logs that the gateway is not contacting enough servers to make it possible to achieve happiness, regardless of the existing share distribution.
[Edit: more precisely it isn't receiving responses from enough servers. At this point we're not sure whether it is contacting them, although the "sent 5 queries to 5 servers" in the {{{UploadUnhappinessError}}} message suggests that it isn't.]
I was wrong here; it *is* quite similar to #1130, which also has h = N. (#1130 has some additional oddities in the share distribution that was chosen, but I don't think they're relevant.) The fact that we terminate the distribution algorithm as soon as all shares are placed is the underlying problem in all these cases.
comment:19 Changed at 2014-04-29T22:44:07Z by zooko
So, the branch from #1382 will fix this bug.
comment:20 Changed at 2016-08-05T22:02:46Z by rvs
- Cc vladimir@… added
Thanks for the log, that's very useful.
The uploader only tried to contact 5 servers, which is the problem. Are you absolutely sure that more than the 5 servers mentioned (i.e. zp6jpfeu, pa2myijh, omkzwfx5, wo6akhxt, ughwvrtu) are connected?
Maybe we should include the set of connected servers in the log or the error message.