#1791 new defect

UploadUnhappinessError with available storage nodes > shares.happy

Reported by: gyver Owned by: gyver
Priority: major Milestone: soon
Component: code-peerselection Version: 1.9.2
Keywords: servers-of-happiness upload error Cc: zooko, vladimir@…
Launchpad Bug:

Description (last modified by daira)

The error happened with 1.9.1 too. I just upgraded to 1.9.2 and fixed some files/dirs that 1.9.1 couldn't repair reliably, hoping the following problem would go away too (it didn't).

There are some peculiarities in my setup: I use USB disks connected to a single server, so all storage nodes run on the same host, although each one is physically on a disk that can easily be taken off-site to increase the durability of the whole store. At the time of failure there were 7 such storage nodes in my setup and my whole store was fully repaired on these 7 nodes; all the content is/was uploaded with shares.needed = 4, shares.happy = 6, shares.total = 6.
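
(For reference, these are the encoding settings from the [client] section of each node's tahoe.cfg; a minimal sketch, assuming the stock option names:)

    [client]
    shares.needed = 4
    shares.happy = 6
    shares.total = 6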

Although 7 >= 6, I get this error when trying to tahoe cp a new file:

Traceback (most recent call last):
  File \"/usr/lib64/python2.7/site-packages/foolscap/call.py\", line 677, in _done
    self.request.complete(res)
  File \"/usr/lib64/python2.7/site-packages/foolscap/call.py\", line 60, in complete
    self.deferred.callback(res)
  File \"/usr/lib64/python2.7/site-packages/twisted/internet/defer.py\", line 361, in callback
    self._startRunCallbacks(result)
  File \"/usr/lib64/python2.7/site-packages/twisted/internet/defer.py\", line 455, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File \"/usr/lib64/python2.7/site-packages/twisted/internet/defer.py\", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File \"/usr/lib64/python2.7/site-packages/allmydata/immutable/upload.py\", line 553, in _got_response
    return self._loop()
  File \"/usr/lib64/python2.7/site-packages/allmydata/immutable/upload.py\", line 404, in _loop
    return self._failed(\"%s (%s)\" % (failmsg, self._get_progress_message()))
  File \"/usr/lib64/python2.7/site-packages/allmydata/immutable/upload.py\", line 566, in _failed
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 5 server(s) such that any 4 of them have enough shares to recover the file, but we were asked to place shares on at least 6 such servers. (placed all 6 shares, want to place shares on at least 6 servers such that any 4 of them have enough shares to recover the file, sent 5 queries to 5 servers, 5 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error))

I recently found out about flogtool, so I ran it on the client node (which, by the way, is one of the 7 storage nodes). I only pasted the last part, from CHKUploader (I can attach the whole log if need be):

01:04:01.314 L20 []#2339 CHKUploader starting
01:04:01.314 L20 []#2340 starting upload of <allmydata.immutable.upload.EncryptAnUploadable instance at 0x2c9b5a8>
01:04:01.314 L20 []#2341 creating Encoder <Encoder for unknown storage index>
01:04:01.314 L20 []#2342 file size: 4669394
01:04:01.314 L10 []#2343 my encoding parameters: (4, 6, 6, 131072)
01:04:01.314 L20 []#2344 got encoding parameters: 4/6/6 131072
01:04:01.314 L20 []#2345 now setting up codec
01:04:01.348 L20 []#2346 using storage index k5ga2
01:04:01.348 L20 []#2347 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): starting
01:04:01.363 L10 []#2348 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server zp6jpfeu: alreadygot=(0,), allocated=()
01:04:01.372 L10 []#2349 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server pa2myijh: alreadygot=(2,), allocated=(1,)
01:04:01.375 L20 []#2350 storage: allocate_buckets k5ga2suaoz7gju523f5ni3mswe
01:04:01.377 L10 []#2351 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server omkzwfx5: alreadygot=(3,), allocated=()
01:04:01.381 L10 []#2352 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server wo6akhxt: alreadygot=(4,), allocated=()
01:04:01.404 L10 []#2353 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server ughwvrtu: alreadygot=(), allocated=(5,)
01:04:01.405 L25 []#2354 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): server selection unsuccessful for <Tahoe2ServerSelector for upload k5ga2>: shares could be placed on only 5 server(s) such that any 4 of them have enough shares to recover the file, but we were asked to place shares on at least 6 such servers. (placed all 6 shares, want to place shares on at least 6 servers such that any 4 of them have enough shares to recover the file, sent 5 queries to 5 servers, 5 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error)), merged=sh0: zp6jpfeu, sh1: pa2myijh, sh2: pa2myijh, sh3: omkzwfx5, sh4: wo6akhxt, sh5: ughwvrtu
01:04:01.407 L20 []#2355 web: 127.0.0.1 PUT /uri/[CENSORED].. 500 1644
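
To make the "only 5 server(s)" number concrete: the servers-of-happiness count is the size of a maximum matching between servers and the shares they hold, and because sh1 and sh2 both landed on pa2myijh, the merged placement above can only match 5 distinct servers, one short of shares.happy = 6. A minimal sketch of that computation (an illustration only, not Tahoe's actual implementation):

    # Illustration only -- not the code in allmydata/immutable/upload.py.
    # servers-of-happiness = size of a maximum matching between servers
    # and the shares they hold (Kuhn's augmenting-path algorithm).
    def servers_of_happiness(placement):
        # placement: dict mapping share number -> set of server ids
        match = {}  # server id -> share number currently matched to it

        def try_place(share, seen):
            for server in placement[share]:
                if server in seen:
                    continue
                seen.add(server)
                # Use a free server, or re-place the share currently
                # matched to it if that share can move elsewhere.
                if server not in match or try_place(match[server], seen):
                    match[server] = share
                    return True
            return False

        return sum(1 for share in placement if try_place(share, set()))

    # The merged placement from the log above:
    merged = {0: {"zp6jpfeu"}, 1: {"pa2myijh"}, 2: {"pa2myijh"},
              3: {"omkzwfx5"}, 4: {"wo6akhxt"}, 5: {"ughwvrtu"}}
    print(servers_of_happiness(merged))  # -> 5, which is < shares.happy = 6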

Change History (20)

comment:1 Changed at 2012-07-07T12:55:47Z by davidsarah

Thanks for the log, that's very useful.

The uploader only tried to contact 5 servers, which is the problem. Are you absolutely sure that more than the 5 servers mentioned (i.e. zp6jpfeu, pa2myijh, omkzwfx5, wo6akhxt, ughwvrtu) are connected?

Maybe we should include the set of connected servers in the log or the error message.

comment:2 Changed at 2012-07-07T13:38:21Z by gyver

Yes, I'm quite sure all 7 were active. I start all nodes with a single script that in turn:

  • starts the introducer,
  • starts each storage node.

If one of the expected nodes can't be started, I see it right away in the script output when starting the grid. At the time of failure, I even checked the web interface of both the node I use as a client and the introducer, and they both listed all 7 storage nodes. I also checked that there was plenty of free space on each storage node and that there was no configured reserved space that could explain a node refusing to store data.

I just rechecked and noticed something. The server has 3 IP addresses: the loopback, a private IP on a local network and a private IP on a VPN (managed by OpenVPN). Apparently each node advertises its services on all 3 IPs (I assume it's by design). But the listing of storage nodes given by my "client" node isn't exactly consistent with the one given by the introducer.

Here are the current outputs (there shouldn't be any security problem publishing this so I didn't obfuscate anything):

Introducer's Service Announcements:

Nickname PeerID Advertised IPs Announced Version Service name
store_1 omkzwfx5fconrg4fjqxwt3hdzsmvhfdv 192.168.0.1 10.8.0.10 00:45:24 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_2 d3fycappvcjuma7vwc45spfolaei6n24 192.168.0.1 10.8.0.10 00:45:25 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_3 lehmccp7am6cydjh3qsfmdp3scxzprmv 192.168.0.1 10.8.0.10 00:45:26 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_4 pa2myijhnj3ld4owmqxsihow4hv2botv 192.168.0.1 10.8.0.10 00:45:27 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_5 wo6akhxtren7esuhbtjscmpzvjwuvken 192.168.0.1 10.8.0.10 00:45:29 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_6 zp6jpfeudsrxv3n2sz6he3kep3mw3tgr 192.168.0.1 10.8.0.10 00:45:30 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_8 ughwvrtu7ht6x4xpgr2abbsb457ipcvr 192.168.0.1 10.8.0.10 00:45:31 07-Jul-2012 allmydata-tahoe/1.9.2 storage

Introducer's Subscribed Clients:

Nickname PeerID Advertised IPs Connected From Since Version Subscribed To
store_1 omkzwfx5fconrg4fjqxwt3hdzsmvhfdv 192.168.0.1 10.8.0.10 192.168.0.1:54872 00:45:24 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_5 wo6akhxtren7esuhbtjscmpzvjwuvken 192.168.0.1 10.8.0.10 192.168.0.1:54949 00:45:29 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_2 d3fycappvcjuma7vwc45spfolaei6n24 192.168.0.1 10.8.0.10 192.168.0.1:54884 00:45:25 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_3 lehmccp7am6cydjh3qsfmdp3scxzprmv 192.168.0.1 10.8.0.10 127.0.0.1:55124 00:45:26 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_4 pa2myijhnj3ld4owmqxsihow4hv2botv 192.168.0.1 10.8.0.10 127.0.0.1:55143 00:45:27 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_6 zp6jpfeudsrxv3n2sz6he3kep3mw3tgr 192.168.0.1 10.8.0.10 127.0.0.1:55207 00:45:30 07-Jul-2012 allmydata-tahoe/1.9.2 storage
store_8 ughwvrtu7ht6x4xpgr2abbsb457ipcvr 192.168.0.1 10.8.0.10 127.0.0.1:55252 00:45:31 07-Jul-2012 allmydata-tahoe/1.9.2 storage

"The storage node I use as a client"'s status:

Connected to 7 of 7 known storage servers:

Service Name Nickname PeerID Connected? Since First Announced Version
storage store_2 d3fycappvcjuma7vwc45spfolaei6n24 Yes: to 127.0.0.1:48397 00:45:25 07-Jul-2012 00:45:25 07-Jul-2012 allmydata-tahoe/1.9.2
storage store_3 lehmccp7am6cydjh3qsfmdp3scxzprmv Yes: to 192.168.0.1:38708 00:45:26 07-Jul-2012 00:45:26 07-Jul-2012 allmydata-tahoe/1.9.2
storage store_1 omkzwfx5fconrg4fjqxwt3hdzsmvhfdv Yes: to (loopback) 00:45:24 07-Jul-2012 00:45:24 07-Jul-2012 allmydata-tahoe/1.9.2
storage store_4 pa2myijhnj3ld4owmqxsihow4hv2botv Yes: to 10.8.0.10:40626 00:45:27 07-Jul-2012 00:45:27 07-Jul-2012 allmydata-tahoe/1.9.2
storage store_8 ughwvrtu7ht6x4xpgr2abbsb457ipcvr Yes: to 10.8.0.10:49604 00:45:31 07-Jul-2012 00:45:31 07-Jul-2012 allmydata-tahoe/1.9.2
storage store_5 wo6akhxtren7esuhbtjscmpzvjwuvken Yes: to 127.0.0.1:50958 00:45:29 07-Jul-2012 00:45:29 07-Jul-2012 allmydata-tahoe/1.9.2
storage store_6 zp6jpfeudsrxv3n2sz6he3kep3mw3tgr Yes: to 10.8.0.10:52661 00:45:30 07-Jul-2012 00:45:30 07-Jul-2012 allmydata-tahoe/1.9.2

I'm not sure how the service announcement and IP selection work, but there seems to be at least some amount of chance involved in the IP selection. All nodes should behave the same way, so AFAIK the same IP should be selected for each of them.
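
If I understand the docs correctly, the announced addresses are auto-detected when tub.location is left unset in tahoe.cfg, so pinning it explicitly should at least make the announcements deterministic. A sketch of what I mean for one storage node's [node] section (the port number is made up; each node would need its own):

    [node]
    tub.port = 50001
    tub.location = 192.168.0.1:50001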

comment:3 follow-up: Changed at 2012-07-07T14:59:54Z by davidsarah

Hmm. d3fycapp and lehmccp7 were the servers that were not contacted in the failed upload, and they have IPs 127.0.0.1 and 192.168.0.1 respectively. But wo6akhxt also has IP 127.0.0.1 and that was contacted in the failed upload, so that might be a red herring.

I don't know why the determination of which IP to use is nondeterministic. warner?

Last edited at 2012-07-07T15:01:00Z by davidsarah

comment:4 Changed at 2012-07-07T15:02:02Z by davidsarah

BTW, I have seen nondeterministic choice of IP on the Least Authority Enterprises servers (which are EC2 instances running Ubuntu) as well.

comment:5 in reply to: ↑ 3 Changed at 2012-07-07T15:11:42Z by gyver

Replying to davidsarah:

Hmm. d3fycapp and lehmccp7 were the servers that were not contacted in the failed upload, and they have IPs 127.0.0.1 and 192.168.0.1 respectively. But wo6akhxt also has IP 127.0.0.1 and that was contacted in the failed upload, so that might be a red herring.

Note that the assigned IPs are not always the same: a restart of all storage nodes reshuffles them. In my last failed attempt, d3fycapp and lehmccp7 were both seen on 10.8.0.10 (the VPN IP) by the client node (5 nodes were seen on 10.8.0.10, with ughwvrtu on 192.168.0.1 and the client (omkzwfx5) on loopback). It seems the IP addresses used don't change the error: I've always seen the same thing (only 5 servers queried) since the problem appeared.

The problem may not have anything to do with IP address choices but it seems to me these inconsistencies are odd enough to keep in mind.

comment:6 follow-up: Changed at 2012-07-07T19:38:30Z by davidsarah

Please add the following just after line 225 (i.e. after readonly_servers = ... and before # decide upon the renewal/cancel secrets...) of src/allmydata/immutable/upload.py in 1.9.2:

        self.log(format="servers before upload: %(connected_servers)r permuted to %(all_servers)r, of which"
                        "%(writeable_servers)r are writeable and %(readonly_servers)r are read-only",
                 connected_servers=storage_broker.get_connected_servers(),
                 all_servers=all_servers,
                 writeable_servers=writeable_servers,
                 readonly_servers=readonly_servers,
                 level=log.OPERATIONAL)

and then show the log for a failing upload.

(You need to restart the gateway after changing the code, but it's not necessary to rebuild it.)

Last edited at 2012-07-07T19:42:10Z by davidsarah

comment:7 Changed at 2012-07-07T19:40:15Z by davidsarah

  • Component changed from unknown to code-peerselection
  • Milestone changed from undecided to 1.10.0
  • Owner changed from davidsarah to gyver

comment:8 Changed at 2012-07-07T19:40:38Z by davidsarah

  • Keywords happiness upload error added; happy removed

comment:9 in reply to: ↑ 6 Changed at 2012-07-07T21:30:23Z by gyver

Replying to davidsarah:

Please add the following just after line 225 (i.e. after readonly_servers = ... and before # decide upon the renewal/cancel secrets...) of src/allmydata/immutable/upload.py in 1.9.2:

I may not have done it right: I got the same output, with this at the end:

23:09:02.238 L23 []#2436 an outbound callRemote (that we [omkz] sent to someone else [zqxq]) failed on the far end
23:09:02.238 L10 []#2437  reqID=873, rref=<RemoteReference at 0x2e780d0>, methname=RILogObserver.foolscap.lothar.com.msg
23:09:02.238 L10 []#2438  the REMOTE failure was:
 FAILURE:
 [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
   File "/usr/lib64/python2.7/site-packages/foolscap/slicers/root.py", line 107, in send
     d.callback(None)
   File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 361, in callback
     self._startRunCallbacks(result)
   File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 455, in _startRunCallbacks
     self._runCallbacks()
   File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 542, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
 --- <exception caught here> ---
   File "/usr/lib64/python2.7/site-packages/foolscap/banana.py", line 215, in produce
     slicer = self.newSlicerFor(obj)
   File "/usr/lib64/python2.7/site-packages/foolscap/banana.py", line 314, in newSlicerFor
     return topSlicer.slicerForObject(obj)
   File "/usr/lib64/python2.7/site-packages/foolscap/slicer.py", line 48,

BUT... I may have a lead looking at the last error message in my original log dump.

server selection unsuccessful for <Tahoe2ServerSelector for upload k5ga2>: shares could be placed on only 5 server(s) [...], merged=sh0: zp6jpfeu, sh1: pa2myijh, sh2: pa2myijh, sh3: omkzwfx5, sh4: wo6akhxt, sh5: ughwvrtu

I assume the sh<n> are the shares to be placed. sh1 and sh2 were assigned to pa2myijh. I'm not sure whether this distribution is the result of detecting existing shares (my guess) or of a share placement algorithm that could produce an invalid placement and needs a check before upload (late error detection isn't good practice, so I bet that's not the case).

What if these shares were already stored on pa2myijh before the upload attempt (due to past uploads with a buggy version, or whatever happened in the store directory outside Tahoe's control)? Is the code able to detect such a case and re-upload one of the two shares to a free server (one without any of the 6 shares)? If not, this might be the cause of my problem (the file was part of a long list of files I tried to upload with only partial success weeks ago...), and my storage nodes are most probably polluted by "dangling" shares.

comment:10 follow-up: Changed at 2012-07-08T00:47:59Z by davidsarah

  • Keywords servers-of-happiness added; happiness removed

There was a bug in the statement I asked you to add; please replace it entirely with this one:

        self.log(format="servers before upload: %(connected_servers)s permuted to %(all_servers)s, of which"
                        "%(writeable_servers)s are writeable and %(readonly_servers)s are read-only",
                 connected_servers=repr(storage_broker.get_connected_servers()),
                 all_servers=repr(all_servers),
                 writeable_servers=repr(writeable_servers),
                 readonly_servers=repr(readonly_servers),
                 level=log.OPERATIONAL)

In answer to your last question, the fact that there are existing shares should not cause an UploadUnhappinessError. However, bugs #1124 and #1130 describe cases where we don't achieve that. I don't think that your problem is due to that, though, because it seems from the logs that the gateway is not contacting enough servers to make it possible to achieve happiness, regardless of the existing share distribution.

[Edit: more precisely, it isn't receiving responses from enough servers. At this point we're not sure whether it is contacting them, although the "sent 5 queries to 5 servers" in the UploadUnhappinessError message suggests that it isn't.]

Last edited at 2012-07-08T16:50:10Z by davidsarah

comment:11 follow-up: Changed at 2012-07-08T00:52:25Z by davidsarah

I'm hoping that this bug is the same one that has been occasionally reported on VolunteerGrid2 with uploads where shares.happy is close to shares.total (and to the number of servers). It has very similar symptoms, but gyver seems to be able to reproduce it more easily.

Last edited at 2012-07-08T00:53:34Z by davidsarah

comment:12 in reply to: ↑ 11 ; follow-up: Changed at 2012-07-08T01:12:29Z by gyver

Replying to davidsarah:

[...] but gyver seems to be able to reproduce it more easily.

You bet: firing up ssh and joining the screen session takes me longer than the tahoe cp takes to fail :-)

Here's the log with a bit of context:

03:09:17.432 L20 []#2487 CHKUploader starting
03:09:17.432 L20 []#2488 starting upload of <allmydata.immutable.upload.EncryptAnUploadable instance at 0x2fe8998>
03:09:17.432 L20 []#2489 creating Encoder <Encoder for unknown storage index>
03:09:17.432 L20 []#2490 file size: 4669394
03:09:17.432 L10 []#2491 my encoding parameters: (4, 6, 6, 131072)
03:09:17.432 L20 []#2492 got encoding parameters: 4/6/6 131072
03:09:17.432 L20 []#2493 now setting up codec
03:09:17.468 L20 []#2494 using storage index k5ga2
03:09:17.468 L20 []#2495 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): starting
03:09:17.469 L20 []#2496 servers before upload: frozenset([<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for d3fycapp>]) permuted to [<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for d3fycapp>], of which[<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for d3fycapp>] are writeable and set([]) are read-only
03:09:17.485 L10 []#2497 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server zp6jpfeu: alreadygot=(0,), allocated=()
03:09:17.497 L10 []#2498 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server pa2myijh: alreadygot=(2,), allocated=(1,)
03:09:17.500 L20 []#2499 storage: allocate_buckets k5ga2suaoz7gju523f5ni3mswe
03:09:17.515 L10 []#2500 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server omkzwfx5: alreadygot=(3,), allocated=()
03:09:17.523 L10 []#2501 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server wo6akhxt: alreadygot=(4,), allocated=()
03:09:17.530 L10 []#2502 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): response to allocate_buckets() from server ughwvrtu: alreadygot=(), allocated=(5,)
03:09:17.531 L25 []#2503 <Tahoe2ServerSelector for upload k5ga2>(k5ga2): server selection unsuccessful for <Tahoe2ServerSelector for upload k5ga2>: shares could be placed on only 5 server(s) such that any 4 of them have enough shares to recover the file, but we were asked to place shares on at least 6 such servers. (placed all 6 shares, want to place shares on at least 6 servers such that any 4 of them have enough shares to recover the file, sent 5 queries to 5 servers, 5 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error)), merged=sh0: zp6jpfeu, sh1: pa2myijh, sh2: pa2myijh, sh3: omkzwfx5, sh4: wo6akhxt, sh5: ughwvrtu
03:09:17.533 L20 []#2504 web: 127.0.0.1 PUT /uri/[CENSORED].. 500 1644

comment:13 in reply to: ↑ 12 Changed at 2012-07-08T16:44:48Z by davidsarah

Replying to gyver:

> 03:09:17.469 L20 []#2496 servers before upload:
> frozenset([<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for d3fycapp>])
> permuted to [<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for d3fycapp>],
> of which [<NativeStorageServer for zp6jpfeu>, <NativeStorageServer for pa2myijh>, <NativeStorageServer for omkzwfx5>, <NativeStorageServer for wo6akhxt>, <NativeStorageServer for ughwvrtu>, <NativeStorageServer for lehmccp7>, <NativeStorageServer for d3fycapp>]
> are writeable and set([]) are read-only

OK, that proves that the problem occurs after deciding which servers are writeable. We seem to be logging only responses to remote allocate_buckets requests at the gateway, so the next steps are:

a) Log when the gateway sends an allocate_buckets request.

b) Look at the logs of the storage servers to see how many of them receive an allocate_buckets request (which is logged here as "storage: allocate_buckets <SI>") and what they do about it.

To do a), add this at line 105 of src/allmydata/immutable/upload.py (in the query method of ServerTracker after rref = self._server.get_rref()):

        log.msg("query(%(sharenums)s) for %(rref)s", sharenums=repr(sharenums), rref=repr(rref), level=log.OPERATIONAL)
Last edited at 2012-09-04T16:55:17Z by zooko

comment:14 Changed at 2012-07-09T18:26:31Z by zooko

  • Cc zooko added

comment:15 Changed at 2012-09-04T16:59:28Z by warner

  • Milestone changed from 1.10.0 to 1.11.0

comment:16 Changed at 2013-07-09T14:32:38Z by daira

  • Description modified (diff)

Same bug as #2016?

comment:17 Changed at 2013-07-09T14:45:41Z by daira

From #2016 which has now been marked as a duplicate:

daira wrote:

Here's the most important part of the log:

local#675138 20:33:50.290: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server i76mi6: alreadygot=(0,), allocated=()
local#675139 20:33:50.457: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server lxmst5: alreadygot=(2,), allocated=(1,)
local#675140 20:33:50.667: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server sf7ehc: alreadygot=(3,), allocated=()
local#675141 20:33:50.822: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server ddvfcd: alreadygot=(4,), allocated=()
local#675142 20:33:50.839: <Tahoe2ServerSelector for upload jbljj>(jbljj): server selection unsuccessful for <Tahoe2ServerSelector for upload jbljj>:
 shares could be placed on only 4 server(s) such that any 3 of them have enough shares to recover the file, but we were asked to place shares on at least 5 such servers.
 (placed all 5 shares, want to place shares on at least 5 servers such that any 3 of them have enough shares to recover the file, sent 6 queries to 6 servers, 4 queries placed some shares, 2 placed none (of which 2 placed none due to the server being full and 0 placed none due to an error)),
 merged=sh0: i76mi6en, sh1: lxmst5bx, sh2: lxmst5bx, sh3: sf7ehcpn, sh4: ddvfcdns

daira wrote:

Here's my interpretation: with h = N = 5, as soon as the Tahoe2ServerSelector decides to put two shares on the same server (here sh1 and sh2 on lxmst5bx), the upload is doomed. The shares all have to be on different servers whenever h = N, but the termination condition is just that all shares have been placed, not that they have been placed in a way that meets the happiness condition.

If that's the problem, then #1382 should fix it. This would also explain why VG2 was unreliable with h close to N.
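
Restating that diagnosis as a sketch (assuming, as in the merged placements in this ticket, that each share ends up on exactly one server, so the happiness value is just the number of distinct servers used; this is not the real Tahoe2ServerSelector loop):

    # Sketch only -- not the real Tahoe2ServerSelector termination test.
    def placement_is_good_enough(homeless_shares, placement, happy):
        all_placed = not homeless_shares
        # With one server per share, happiness is the number of distinct
        # servers that received at least one share.
        distinct_servers = len(set(placement.values()))
        # 1.9.x effectively stops as soon as all_placed is true; when
        # happy == total (h = N), a single doubled-up share caps happiness
        # at N - 1 and guarantees an UploadUnhappinessError.
        return all_placed and distinct_servers >= happy

For the k5ga2 upload in this ticket, all six shares had homes after only five queries (sh1 and sh2 both on pa2myijh), so the loop stopped without ever querying d3fycapp or lehmccp7; an exit test based on the happiness value would have kept going.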

zooko replied:

Daira: excellent work diagnosing this!! Ed: thanks so much for the bug report. Daira: it looks like you are right, and I think this does explain those bugs that the volunteergrid2 people reported and that I never understood. Thank you!

kapiteined wrote:

And to check whether that is the case, I changed to 3-7-10 encoding, and now the upload succeeds! Success: file copied

Does this call for a change in the code, or for a big warning sticker: "don't choose h and N too close together"?

daira wrote:

We intend to fix it for v1.11 (Mark Berger's branch for #1382 already basically works), but there would be no harm in pointing out this problem on tahoe-dev in the meantime.

daira wrote:

Same bug as #1791?

Yes, that bug also had h = N and two shares that were placed on the same server, so almost identical. I'll copy the conclusions here to that ticket.

comment:18 in reply to: ↑ 10 Changed at 2013-07-09T14:55:05Z by daira

Replying to myself in comment:10:

In answer to your last question (gyver in comment:9), the fact that there are existing shares should not cause an UploadUnhappinessError. However, bugs #1124 and #1130 describe cases where we don't achieve that. I don't think that your problem is due to that, though, because it seems from the logs that the gateway is not contacting enough servers to make it possible to achieve happiness, regardless of the existing share distribution.

[Edit: more precisely, it isn't receiving responses from enough servers. At this point we're not sure whether it is contacting them, although the "sent 5 queries to 5 servers" in the UploadUnhappinessError message suggests that it isn't.]

I was wrong here; it *is* quite similar to #1130, which also has h = N. (#1130 has some additional oddities in the share distribution that was chosen, but I don't think they're relevant.) The fact that we terminate the distribution algorithm as soon as all shares are placed is the underlying problem in all these cases.

Last edited at 2013-07-09T14:56:56Z by daira

comment:19 Changed at 2014-04-29T22:44:07Z by zooko

So, the branch from #1382 will fix this bug.

comment:20 Changed at 2016-08-05T22:02:46Z by rvs

  • Cc vladimir@… added