[tahoe-lafs-trac-stream] [tahoe-lafs] #1791: UploadUnhappinessError with available storage nodes > shares.happy

tahoe-lafs trac at tahoe-lafs.org
Tue Jul 9 14:32:38 UTC 2013


#1791: UploadUnhappinessError with available storage nodes > shares.happy
---------------------------+-----------------------------------------------
     Reporter:  gyver      |      Owner:  gyver
         Type:  defect     |     Status:  new
     Priority:  major      |  Milestone:  1.11.0
    Component:  code-      |    Version:  1.9.2
  peerselection            |   Keywords:  servers-of-happiness upload error
   Resolution:             |
Launchpad Bug:             |
---------------------------+-----------------------------------------------

Description:

 The error happened with 1.9.1 too. I just upgraded to 1.9.2 and fixed some
 files/directories that 1.9.1 couldn't repair reliably, hoping the following
 problem would go away too (it didn't).

 There are some peculiarities in my setup: I use USB disks connected to a
 single server, so all the storage nodes run on the same machine, though
 each keeps its data on a disk that can easily be taken off-site to
 increase the durability of the whole store. At the time of failure there
 were 7 such storage nodes in my setup, my whole store was fully repaired
 on these 7 nodes, and all the content is/was uploaded with:
 shares.needed = 4
 shares.happy = 6
 shares.total = 6
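
 (For reference, a stock-configuration sketch rather than a copy of my
 actual file: these parameters live in the [client] section of each node's
 tahoe.cfg.)

 {{{
 [client]
 shares.needed = 4
 shares.happy = 6
 shares.total = 6
 }}}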

 Although 7 >= 6, I get this error when trying to tahoe cp a new file:

 {{{
 Traceback (most recent call last):
   File "/usr/lib64/python2.7/site-packages/foolscap/call.py", line 677, in _done
     self.request.complete(res)
   File "/usr/lib64/python2.7/site-packages/foolscap/call.py", line 60, in complete
     self.deferred.callback(res)
   File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 361, in callback
     self._startRunCallbacks(result)
   File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 455, in _startRunCallbacks
     self._runCallbacks()
 --- <exception caught here> ---
   File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 542, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/usr/lib64/python2.7/site-packages/allmydata/immutable/upload.py", line 553, in _got_response
     return self._loop()
   File "/usr/lib64/python2.7/site-packages/allmydata/immutable/upload.py", line 404, in _loop
     return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
   File "/usr/lib64/python2.7/site-packages/allmydata/immutable/upload.py", line 566, in _failed
     raise UploadUnhappinessError(msg)
 allmydata.interfaces.UploadUnhappinessError: shares could be placed on
 only 5 server(s) such that any 4 of them have enough shares to recover the
 file, but we were asked to place shares on at least 6 such servers.
 (placed all 6 shares, want to place shares on at least 6 servers such that
 any 4 of them have enough shares to recover the file, sent 5 queries to 5
 servers, 5 queries placed some shares, 0 placed none (of which 0 placed
 none due to the server being full and 0 placed none due to an error))
 }}}

 I recently found out about flogtool, so I ran it on the client node (which
 is one of the 7 storage nodes, by the way). I only pasted the last part,
 from CHKUploader (I can attach the whole log if need be):
 {{{
 01:04:01.314 L20 []#2339 CHKUploader starting
 01:04:01.314 L20 []#2340 starting upload of
 <allmydata.immutable.upload.EncryptAnUploadable instance at 0x2c9b5a8>
 01:04:01.314 L20 []#2341 creating Encoder <Encoder for unknown storage
 index>
 01:04:01.314 L20 []#2342 file size: 4669394
 01:04:01.314 L10 []#2343 my encoding parameters: (4, 6, 6, 131072)
 01:04:01.314 L20 []#2344 got encoding parameters: 4/6/6 131072
 01:04:01.314 L20 []#2345 now setting up codec
 01:04:01.348 L20 []#2346 using storage index k5ga2
 01:04:01.348 L20 []#2347 <Tahoe2ServerSelector for upload k5ga2>(k5ga2):
 starting
 01:04:01.363 L10 []#2348 <Tahoe2ServerSelector for upload k5ga2>(k5ga2):
 response to allocate_buckets() from server zp6jpfeu: alreadygot=(0,),
 allocated=()
 01:04:01.372 L10 []#2349 <Tahoe2ServerSelector for upload k5ga2>(k5ga2):
 response to allocate_buckets() from server pa2myijh: alreadygot=(2,),
 allocated=(1,)
 01:04:01.375 L20 []#2350 storage: allocate_buckets
 k5ga2suaoz7gju523f5ni3mswe
 01:04:01.377 L10 []#2351 <Tahoe2ServerSelector for upload k5ga2>(k5ga2):
 response to allocate_buckets() from server omkzwfx5: alreadygot=(3,),
 allocated=()
 01:04:01.381 L10 []#2352 <Tahoe2ServerSelector for upload k5ga2>(k5ga2):
 response to allocate_buckets() from server wo6akhxt: alreadygot=(4,),
 allocated=()
 01:04:01.404 L10 []#2353 <Tahoe2ServerSelector for upload k5ga2>(k5ga2):
 response to allocate_buckets() from server ughwvrtu: alreadygot=(),
 allocated=(5,)
 01:04:01.405 L25 []#2354 <Tahoe2ServerSelector for upload k5ga2>(k5ga2):
 server selection unsuccessful for <Tahoe2ServerSelector for upload k5ga2>:
 shares could be placed on only 5 server(s) such that any 4 of them have
 enough shares to recover the file, but we were asked to place shares on at
 least 6 such servers. (placed all 6 shares, want to place shares on at
 least 6 servers such that any 4 of them have enough shares to recover the
 file, sent 5 queries to 5 servers, 5 queries placed some shares, 0 placed
 none (of which 0 placed none due to the server being full and 0 placed
 none due to an error)), merged=sh0: zp6jpfeu, sh1: pa2myijh, sh2:
 pa2myijh, sh3: omkzwfx5, sh4: wo6akhxt, sh5: ughwvrtu
 01:04:01.407 L20 []#2355 web: 127.0.0.1 PUT /uri/[CENSORED].. 500 1644
 }}}
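
 For context (an illustration, not Tahoe-LAFS's actual code): "servers of
 happiness" is the size of a maximum matching in the bipartite graph between
 shares and the servers holding them. A minimal Python sketch, fed the merged
 placement from the log above, shows why this upload fails with shares.happy
 = 6: sh1 and sh2 both landed on pa2myijh, so only 5 distinct servers hold
 shares even though all 6 shares were placed.

 ```python
 def happiness(share_to_servers):
     """Size of a maximum share<->server matching, via simple
     augmenting paths (illustration only, not Tahoe's implementation)."""
     match = {}  # server -> share currently matched to it

     def augment(share, visited):
         for server in share_to_servers.get(share, ()):
             if server in visited:
                 continue
             visited.add(server)
             # Take this server if it is free, or re-route its current share.
             if server not in match or augment(match[server], visited):
                 match[server] = share
                 return True
         return False

     return sum(1 for share in share_to_servers if augment(share, set()))

 # merged placement from the log: sh1 and sh2 both landed on pa2myijh
 placements = {
     0: ["zp6jpfeu"],
     1: ["pa2myijh"],
     2: ["pa2myijh"],
     3: ["omkzwfx5"],
     4: ["wo6akhxt"],
     5: ["ughwvrtu"],
 }
 print(happiness(placements))  # -> 5, which is < shares.happy = 6
 ```

 A matching of size 6 would require a sixth distinct server; placing sh1 or
 sh2 on the unqueried 7th node would have satisfied shares.happy = 6, which
 is why the selector's failure despite 7 available servers looks like a
 placement bug rather than a genuine shortage.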

--

Comment (by daira):

 Same bug as #2016?

-- 
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1791#comment:16>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage


More information about the tahoe-lafs-trac-stream mailing list