[tahoe-dev] Manual rebalancing in 1.10.0?
Kyle Markley
kyle at arbyte.us
Sun Sep 22 16:46:24 UTC 2013
Mark Berger, et al.,
I believe I have tried my scenario with your code, and it does not fix
the behavior I have been seeing.
Given a file on the grid for which all shares exist, but which needs
rebalancing, "tahoe put" for that same file will fail. (And "tahoe
check --repair" does not attempt to rebalance.)
This is what I did. I'm a git novice, so maybe I didn't get the right code:
$ git clone https://github.com/markberger/tahoe-lafs.git
$ cd tahoe-lafs/
$ git checkout 1382-rewrite
Branch 1382-rewrite set up to track remote branch 1382-rewrite from origin.
Switched to a new branch '1382-rewrite'
$ python setup.py build
<snip>
$ bin/tahoe --version
allmydata-tahoe: 1.10b1.post68 [1382-rewrite: 7b95f937089d59b595dfe5e85d2d81ec36d5cf9d]
foolscap: 0.6.4
pycryptopp: 0.6.0.1206569328141510525648634803928199668821045408958
zfec: 1.4.24
Twisted: 13.0.0
Nevow: 0.10.0
zope.interface: unknown
python: 2.7.3
platform: OpenBSD-5.3-amd64-64bit
pyOpenSSL: 0.13
simplejson: 3.3.0
pycrypto: 2.6
pyasn1: 0.1.7
mock: 1.0.1
setuptools: 0.6c16dev4
Original output from tahoe check --raw:
{
  "results": {
    "needs-rebalancing": true,
    "count-unrecoverable-versions": 0,
    "count-good-share-hosts": 2,
    "count-shares-good": 10,
    "count-corrupt-shares": 0,
    "list-corrupt-shares": [],
    "count-shares-expected": 10,
    "healthy": true,
    "count-shares-needed": 4,
    "sharemap": {
      "0": ["v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"],
      "1": ["v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"],
      "2": ["v0-7ags2kynskk5rrmbyk6yzjzmceswxh7x5lekghwsfbwdpfeaztxa"],
      "3": ["v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"],
      "4": ["v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"],
      "5": ["v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"],
      "6": ["v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"],
      "7": ["v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"],
      "8": ["v0-7ags2kynskk5rrmbyk6yzjzmceswxh7x5lekghwsfbwdpfeaztxa"],
      "9": ["v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"]
    },
    "count-recoverable-versions": 1,
    "count-wrong-shares": 0,
    "servers-responding": [
      "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya",
      "v0-7ags2kynskk5rrmbyk6yzjzmceswxh7x5lekghwsfbwdpfeaztxa",
      "v0-jqs2izy4yo2wusmsso2mzkfqpqrmmbhegtxcyup7heisfrf4octa",
      "v0-rbwrud2e6alixe4xwlaynv7jbzvhn2wxbs4jniqlgu6wd5sk724q"
    ],
    "recoverable": true
  },
  "storage-index": "rfomclj5ogk434v2gchspipv3i",
  "summary": "Healthy"
}
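To make the imbalance concrete, here is a small Python 2 sketch that
tallies shares per server from the sharemap above (check.json is a
hypothetical file holding the --raw output; this is my own throwaway
script, not anything shipped with Tahoe-LAFS):

    # Summarize share placement from `tahoe check --raw` output.
    import json
    from collections import defaultdict

    with open("check.json") as f:  # hypothetical dump of the JSON above
        results = json.load(f)["results"]

    shares_per_server = defaultdict(list)
    for sharenum, servers in results["sharemap"].items():
        for server in servers:
            shares_per_server[server].append(int(sharenum))

    for server, shares in sorted(shares_per_server.items()):
        print "%s... holds %d shares: %s" % (server[:12], len(shares),
                                             sorted(shares))

For this file it reports one server holding 8 shares and another
holding 2, so all 10 shares sit on only 2 of the 4 responding servers.
That is why needs-rebalancing is true even though the file is "Healthy".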
Then I try to re-upload the unbalanced file:
$ bin/tahoe put /tmp/temp_file
Error: 500 Internal Server Error
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/foolscap-0.6.4-py2.7.egg/foolscap/call.py", line 677, in _done
    self.request.complete(res)
  File "/usr/local/lib/python2.7/site-packages/foolscap-0.6.4-py2.7.egg/foolscap/call.py", line 60, in complete
    self.deferred.callback(res)
  File "/usr/local/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-openbsd-5.3-amd64.egg/twisted/internet/defer.py", line 380, in callback
    self._startRunCallbacks(result)
  File "/usr/local/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-openbsd-5.3-amd64.egg/twisted/internet/defer.py", line 488, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-openbsd-5.3-amd64.egg/twisted/internet/defer.py", line 575, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/site-packages/allmydata/immutable/upload.py", line 604, in _got_response
    return self._loop()
  File "/usr/local/lib/python2.7/site-packages/allmydata/immutable/upload.py", line 455, in _loop
    return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
  File "/usr/local/lib/python2.7/site-packages/allmydata/immutable/upload.py", line 617, in _failed
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed or
found on 4 server(s), but they are not spread out evenly enough to
ensure that any 4 of these servers would have enough shares to recover
the file. We were asked to place shares on at least 4 servers such that
any 4 of them have enough shares to recover the file. (placed all 10
shares, want to place shares on at least 4 servers such that any 4 of
them have enough shares to recover the file, sent 4 queries to 4
servers, 4 queries placed some shares, 0 placed none (of which 0 placed
none due to the server being full and 0 placed none due to an error))
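If I understand the servers-of-happiness metric correctly, happiness is
the size of a maximum matching in the bipartite graph between servers
and the shares they hold, and an upload fails when that number is below
shares.happy. Here is a toy Python 2 sketch of that computation (my own
paraphrase of the idea, not Tahoe-LAFS code), applied to the placement
from my check output:

    # Happiness as maximum bipartite matching (Kuhn's algorithm).
    def happiness(server_to_shares):
        match = {}  # share number -> server matched to it

        def augment(server, seen):
            # Try to give `server` a share of its own, re-routing other
            # servers along an augmenting path if necessary.
            for share in server_to_shares[server]:
                if share not in seen:
                    seen.add(share)
                    if share not in match or augment(match[share], seen):
                        match[share] = server
                        return True
            return False

        return sum(1 for server in server_to_shares
                   if augment(server, set()))

    # Placement from my check output: all 10 shares on 2 servers.
    placement = {
        "server-ylkb": [0, 1, 3, 4, 5, 6, 7, 9],
        "server-7ags": [2, 8],
    }
    print happiness(placement)  # -> 2, below shares.happy = 4

Since no matching can use more servers than actually hold shares, this
placement can never reach a happiness of 4 without moving shares, which
(as far as I can tell) neither "tahoe put" nor "tahoe check --repair"
is willing to do.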
On 09/17/13 08:45, Kyle Markley wrote:
> It would be my pleasure. But I won't have time to do it until the
> weekend.
>
> It might be faster, and all-around better, to create a unit test that
> exercises the scenario in my original message. Then my buildbot
> (which has way more free time than I do) can try it for me.
>
> Incidentally, I understand how I created that scenario. The machine
> that had all the shares is always on, and runs deep-check --repair
> from cron. My other machines aren't reliably on the grid, so after
> repeated repair operations, the always-on machine tends to get a lot
> of shares. Eventually, it accumulated shares.needed shares, and then a
> repair happened while it was the only machine on the grid. Because
> repair didn't care about shares.happy, this machine got all
> shares.total shares. Then, because an upload cares about shares.happy
> but wouldn't rebalance, it had to fail.
>
> A grid whose nodes don't have similar uptime is surprisingly fragile.
> Failure of that single always-on machine makes the file totally
> unretrievable, which is definitely not the desired behavior.
>
>
>
> On 09/16/13 09:57, Zooko O'Whielacronx wrote:
>> Dear Kyle:
>>
>> Could you try Mark Berger's #1382 patch on your home grid and tell us
>> if it fixes the problem?
>>
>> https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1382 ("immutable peer
>> selection refactoring and enhancements")
>>
>> https://github.com/tahoe-lafs/tahoe-lafs/pull/60
>>
>> Regards,
>>
>> Zooko
>> _______________________________________________
>> tahoe-dev mailing list
>> tahoe-dev at tahoe-lafs.org
>> https://tahoe-lafs.org/cgi-bin/mailman/listinfo/tahoe-dev
>
>
--
Kyle Markley