[tahoe-dev] Manual rebalancing in 1.10.0?
Mark Berger
mjberger at stanford.edu
Sat Sep 28 23:07:18 UTC 2013
Hi Kyle, sorry for not getting back to you. I just started school so I've
been pretty busy.
Tahoe considers a new file upload to be a special instance of file repair,
so my branch should address your issue. Currently, file repair is necessary
in order to rebalance a file. Brian has written an extensive ticket on
rebalancing shares <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/543>, but
I don't think anyone is actively working on it.
I think the first error you received is caused by two things:
1. While the original scope of 1382 is pretty large, my patch only
implements a new upload algorithm which distributes shares effectively. The
checker doesn't upload anything unless an item needs to be repaired, so you
need to supply the --repair flag to the CLI (see the example below).
2. The checker does not consider the file to be unhealthy. Since the file
is "healthy", Tahoe doesn't attempt to repair it and rebalancing doesn't
occur. There is a separate ticket for this issue
<https://tahoe-lafs.org/trac/tahoe-lafs/ticket/614>, but it hasn't been
committed to trunk and it's not included in the 1382 branch.
To fix your problem, you can manually repair the file from the web UI using
my branch.
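Alternatively, something along these lines from the CLI should kick off the
repair (and, with my branch, the rebalancing); "tahoe:temp_file" here is
just a placeholder for your own alias:path or cap:
  # repair pass on a single file; substitute your own alias:path or cap
  $ bin/tahoe check --repair tahoe:temp_file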
As for the UploadUnhappinessError, I'm not sure why that is happening. My
initial guess would be that you weren't connected to four servers at the
time you ran the command. Are you able to upload other files with my
branch? Are there any details about your grid that you think might have
caused the issue?
- Mark
On Sat, Sep 28, 2013 at 2:17 PM, Kyle Markley <kyle at arbyte.us> wrote:
> Reading the text of 1382, it isn't clear to me whether it's expected to
> address the error from "tahoe put" at all. (It only mentions check,
> verify, and repair.) And although it mentions repair, in my scenario all
> the shares are present on the grid, so the file doesn't need "repair" at
> all... it *only* needs rebalancing.
>
> Should I file my scenario in a new ticket? Or is it actually intended to
> be covered by 1382?
> Did I test the new code from 1382 correctly?
>
>
>
> On 09/22/13 09:46, Kyle Markley wrote:
>
>> Mark Berger, et al,
>>
>> I (believe I have) tried my scenario with your code, and it doesn't fix
>> the behavior I have been seeing.
>>
>> Given a file on the grid for which all shares exist, but which needs
>> rebalancing, "tahoe put" for that same file will fail. (And "tahoe check
>> --repair" does not attempt to rebalance.)
>>
>> This is what I did. I'm a git novice, so maybe I didn't get the right
>> code:
>> $ git clone https://github.com/markberger/tahoe-lafs.git
>> $ cd tahoe-lafs/
>> $ git checkout 1382-rewrite
>> Branch 1382-rewrite set up to track remote branch 1382-rewrite from
>> origin.
>> Switched to a new branch '1382-rewrite'
>> $ python setup.py build
>> <snip>
>> $ bin/tahoe --version
>> allmydata-tahoe: 1.10b1.post68 [1382-rewrite: 7b95f937089d59b595dfe5e85d2d81ec36d5cf9d]
>> foolscap: 0.6.4
>> pycryptopp: 0.6.0.1206569328141510525648634803928199668821045408958
>> zfec: 1.4.24
>> Twisted: 13.0.0
>> Nevow: 0.10.0
>> zope.interface: unknown
>> python: 2.7.3
>> platform: OpenBSD-5.3-amd64-64bit
>> pyOpenSSL: 0.13
>> simplejson: 3.3.0
>> pycrypto: 2.6
>> pyasn1: 0.1.7
>> mock: 1.0.1
>> setuptools: 0.6c16dev4
>>
>>
>> Original output from tahoe check --raw:
>> {
>>   "results": {
>>     "needs-rebalancing": true,
>>     "count-unrecoverable-versions": 0,
>>     "count-good-share-hosts": 2,
>>     "count-shares-good": 10,
>>     "count-corrupt-shares": 0,
>>     "list-corrupt-shares": [],
>>     "count-shares-expected": 10,
>>     "healthy": true,
>>     "count-shares-needed": 4,
>>     "sharemap": {
>>       "0": ["v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"],
>>       "1": ["v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"],
>>       "2": ["v0-7ags2kynskk5rrmbyk6yzjzmceswxh7x5lekghwsfbwdpfeaztxa"],
>>       "3": ["v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"],
>>       "4": ["v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"],
>>       "5": ["v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"],
>>       "6": ["v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"],
>>       "7": ["v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"],
>>       "8": ["v0-7ags2kynskk5rrmbyk6yzjzmceswxh7x5lekghwsfbwdpfeaztxa"],
>>       "9": ["v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"]
>>     },
>>     "count-recoverable-versions": 1,
>>     "count-wrong-shares": 0,
>>     "servers-responding": [
>>       "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya",
>>       "v0-7ags2kynskk5rrmbyk6yzjzmceswxh7x5lekghwsfbwdpfeaztxa",
>>       "v0-jqs2izy4yo2wusmsso2mzkfqpqrmmbhegtxcyup7heisfrf4octa",
>>       "v0-rbwrud2e6alixe4xwlaynv7jbzvhn2wxbs4jniqlgu6wd5sk724q"
>>     ],
>>     "recoverable": true
>>   },
>>   "storage-index": "rfomclj5ogk434v2gchspipv3i",
>>   "summary": "Healthy"
>> }
>>
>>
>> Then I try to re-upload the unbalanced file:
>> $ bin/tahoe put /tmp/temp_file
>>
>> Error: 500 Internal Server Error
>> Traceback (most recent call last):
>> File "/usr/local/lib/python2.7/**site-packages/foolscap-0.6.4-**py2.7.egg/foolscap/call.py",
>> line 677, in _done
>> self.request.complete(res)
>> File "/usr/local/lib/python2.7/**site-packages/foolscap-0.6.4-**py2.7.egg/foolscap/call.py",
>> line 60, in complete
>> self.deferred.callback(res)
>> File "/usr/local/lib/python2.7/**site-packages/Twisted-13.0.0-**
>> py2.7-openbsd-5.3-amd64.egg/**twisted/internet/defer.py", line 380, in
>> callback
>> self._startRunCallbacks(**result)
>> File "/usr/local/lib/python2.7/**site-packages/Twisted-13.0.0-**
>> py2.7-openbsd-5.3-amd64.egg/**twisted/internet/defer.py", line 488, in
>> _startRunCallbacks
>> self._runCallbacks()
>> --- <exception caught here> ---
>> File "/usr/local/lib/python2.7/**site-packages/Twisted-13.0.0-**
>> py2.7-openbsd-5.3-amd64.egg/**twisted/internet/defer.py", line 575, in
>> _runCallbacks
>> current.result = callback(current.result, *args, **kw)
>> File "/usr/local/lib/python2.7/**site-packages/allmydata/**immutable/upload.py",
>> line 604, in _got_response
>> return self._loop()
>> File "/usr/local/lib/python2.7/**site-packages/allmydata/**immutable/upload.py",
>> line 455, in _loop
>> return self._failed("%s (%s)" % (failmsg,
>> self._get_progress_message()))
>> File "/usr/local/lib/python2.7/**site-packages/allmydata/**immutable/upload.py",
>> line 617, in _failed
>> raise UploadUnhappinessError(msg)
>> allmydata.interfaces.**UploadUnhappinessError: shares could be placed or
>> found on 4 server(s), but they are not spread out evenly enough to ensure
>> that any 4 of these servers would have enough shares to recover the file.
>> We were asked to place shares on at least 4 servers such that any 4 of them
>> have enough shares to recover the file. (placed all 10 shares, want to
>> place shares on at least 4 servers such that any 4 of them have enough
>> shares to recover the file, sent 4 queries to 4 servers, 4 queries placed
>> some shares, 0 placed none (of which 0 placed none due to the server being
>> full and 0 placed none due to an error))
>>
>>
>>
>>
>>
>> On 09/17/13 08:45, Kyle Markley wrote:
>>
>>> It would be my pleasure. But I won't have time to do it until the
>>> weekend.
>>>
>>> It might be faster, and all-around better, to create a unit test that
>>> exercises the scenario in my original message. Then my buildbot (which has
>>> way more free time than I do) can try it for me.
>>>
>>> Incidentally, I understand how I created that scenario. The machine
>>> that had all the shares is always on, and runs deep-check --repair cron
>>> jobs. My other machines aren't reliably on the grid, so after repeated
>>> repair operations, the always-on machine tends to accumulate shares.
>>> Eventually it held at least shares.needed shares, and then a repair
>>> happened while it was the only machine on the grid. Because repair
>>> didn't care about shares.happy, this machine got all shares.total
>>> shares. Then, because an upload cares about shares.happy but wouldn't
>>> rebalance, it had to fail.
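>>> In tahoe.cfg terms, the parameters involved are roughly (reading the
>>> values off the check output and the error message above):
>>>   [client]
>>>   # 4-of-10 encoding, with a happiness threshold of 4 servers
>>>   shares.needed = 4
>>>   shares.happy = 4
>>>   shares.total = 10
>>> Repair only looks at shares.needed and shares.total, while a fresh
>>> upload also enforces shares.happy.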
>>>
>>> A grid whose nodes don't have similar uptime is surprisingly fragile.
>>> Failure of that single always-on machine makes the file totally
>>> unretrievable, definitely not the desired behavior.
>>>
>>>
>>>
>>> On 09/16/13 09:57, Zooko O'Whielacronx wrote:
>>>
>>>> Dear Kyle:
>>>>
>>>> Could you try Mark Berger's #1382 patch on your home grid and tell us
>>>> if it fixes the problem?
>>>>
>>>> https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1382# immutable peer selection refactoring and enhancements
>>>>
>>>> https://github.com/tahoe-lafs/tahoe-lafs/pull/60
>>>>
>>>> Regards,
>>>>
>>>> Zooko
>>>> _______________________________________________
>>>> tahoe-dev mailing list
>>>> tahoe-dev at tahoe-lafs.org
>>>> https://tahoe-lafs.org/cgi-bin/mailman/listinfo/tahoe-dev
>>>>
>>>
>>>
>>>
>>
>>
>
> --
> Kyle Markley
>
> _______________________________________________
> tahoe-dev mailing list
> tahoe-dev at tahoe-lafs.org
> https://tahoe-lafs.org/cgi-bin/mailman/listinfo/tahoe-dev
>