[tahoe-dev] Manual rebalancing in 1.10.0?

Sun Sep 1 06:47:37 UTC 2013

I have run into an interesting scenario on my home grid.

This grid has 6 machines and all are configured for shares.total=10, 
shares.happy=4, shares.needed=4.  (Five machines are running 1.10.0, one 
is running 1.9.2, but I don't think this matters.)

Somehow, over time, I've managed to get a file seriously unbalanced:
one node holds all ten shares, another holds just two (extra copies of
shares 3 and 9), and that's it!

{
  "results": {
   "needs-rebalancing": true,
   "count-unrecoverable-versions": 0,
   "count-good-share-hosts": 2,
   "count-shares-good": 10,
   "count-corrupt-shares": 0,
   "list-corrupt-shares": [],
   "count-shares-expected": 10,
   "healthy": true,
   "count-shares-needed": 4,
   "sharemap": {
    "0": [
     "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"
    ],
    "1": [
     "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"
    ],
    "2": [
     "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"
    ],
    "3": [
     "v0-7ags2kynskk5rrmbyk6yzjzmceswxh7x5lekghwsfbwdpfeaztxa",
     "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"
    ],
    "4": [
     "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"
    ],
    "5": [
     "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"
    ],
    "6": [
     "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"
    ],
    "7": [
     "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"
    ],
    "8": [
     "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"
    ],
    "9": [
     "v0-7ags2kynskk5rrmbyk6yzjzmceswxh7x5lekghwsfbwdpfeaztxa",
     "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"
    ]
   },
   "count-recoverable-versions": 1,
   "count-wrong-shares": 0,
   "servers-responding": [
    "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya",
    "47cslusczp3uu2kygodi3nlalcruscif",
    "v0-7zw5vd263ktna2nuxouv5byodxrsxo4pfdokz3qixdgft7bkvlmq",
    "v0-jqs2izy4yo2wusmsso2mzkfqpqrmmbhegtxcyup7heisfrf4octa",
    "v0-rbwrud2e6alixe4xwlaynv7jbzvhn2wxbs4jniqlgu6wd5sk724q",
    "v0-7ags2kynskk5rrmbyk6yzjzmceswxh7x5lekghwsfbwdpfeaztxa"
   ],
   "recoverable": true
  },
  "storage-index": "x4ahcfbulwcaltkohuz55ttwke",
  "summary": "Healthy"
}
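
For what it's worth, here is a rough way to quantify the imbalance from
the sharemap above. It's a minimal sketch that assumes "servers of
happiness" means the size of a maximum matching between servers and
distinct share numbers (my reading of the servers-of-happiness spec, not
code taken from Tahoe). The helper names load_sharemap,
shares_per_server and happiness are made up for illustration.

```python
import json

# Server IDs that appear in the sharemap above.
YLKB = "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"
SEVEN = "v0-7ags2kynskk5rrmbyk6yzjzmceswxh7x5lekghwsfbwdpfeaztxa"

# The same sharemap in compact form: share number -> servers holding a copy.
SHAREMAP = {str(n): [YLKB] for n in range(10)}
SHAREMAP["3"] = [SEVEN, YLKB]
SHAREMAP["9"] = [SEVEN, YLKB]


def load_sharemap(check_json_text):
    """Pull the sharemap out of check-results JSON shaped like the dump above."""
    return json.loads(check_json_text)["results"]["sharemap"]


def shares_per_server(sharemap):
    """Count how many shares each server holds (duplicate copies included)."""
    counts = {}
    for servers in sharemap.values():
        for server in servers:
            counts[server] = counts.get(server, 0) + 1
    return counts


def happiness(sharemap):
    """Size of a maximum server<->share matching (Kuhn's augmenting paths).

    Each server is credited with at most one distinct share, and each
    share with at most one server.
    """
    matched_share_of = {}  # server -> share it is currently credited with

    def try_assign(share, visited):
        for server in sharemap[share]:
            if server in visited:
                continue
            visited.add(server)
            # Take a free server, or move its current share to another server.
            if server not in matched_share_of or try_assign(
                matched_share_of[server], visited
            ):
                matched_share_of[server] = share
                return True
        return False

    return sum(1 for share in sharemap if try_assign(share, set()))


print(shares_per_server(SHAREMAP))
print(happiness(SHAREMAP))  # 2
```

On this file it reports a happiness of 2, below shares.happy=4, even
though all 10 shares are counted as good. So the checker's "healthy"
verdict appears to be based on share count, not on how the shares are
spread.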

I'm not able to re-upload this file; the upload fails with:
allmydata.interfaces.UploadUnhappinessError: shares could be placed or 
found on 5 server(s), but they are not spread out evenly enough to 
ensure that any 4 of these servers would have enough shares to recover 
the file. We were asked to place shares on at least 4 servers such that 
any 4 of them have enough shares to recover the file. (placed all 10 
shares, want to place shares on at least 4 servers such that any 4 of 
them have enough shares to recover the file, sent 6 queries to 6 
servers, 5 queries placed some shares, 1 placed none (of which 1 placed 
none due to the server being full and 0 placed none due to an error))
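
Read literally, the error asks for at least 4 servers holding shares,
such that any 4 of them together hold at least 4 distinct shares. The
sketch below checks that literal reading by brute force. It shows that
a simple round-robin spread of the 10 shares over the 6 responding
servers would satisfy it, while the current two-server layout cannot.
This is an illustration under those assumptions, not how the uploader
actually places shares. The server names A to F and both helpers are
hypothetical.

```python
from itertools import combinations


def round_robin_layout(servers, total_shares):
    """Hypothetical layout: share n goes to servers[n % len(servers)]."""
    layout = {server: set() for server in servers}
    for share in range(total_shares):
        layout[servers[share % len(servers)]].add(share)
    return layout


def meets_happiness_wording(layout, needed, happy):
    """Literal reading of the error text: at least `happy` servers hold
    shares, and any `needed` of them together hold >= `needed` distinct
    shares."""
    holders = [server for server, shares in layout.items() if shares]
    if len(holders) < happy:
        return False
    for group in combinations(holders, needed):
        distinct = set().union(*(layout[server] for server in group))
        if len(distinct) < needed:
            return False
    return True


current = {"ylkb": set(range(10)), "7ags": {3, 9}}
balanced = round_robin_layout(["A", "B", "C", "D", "E", "F"], 10)

print(meets_happiness_wording(current, needed=4, happy=4))   # False
print(meets_happiness_wording(balanced, needed=4, happy=4))  # True
```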

Is there any command line mechanism in 1.10.0 for me to fix this, or 
must I go "outside" the system and delete files from .tahoe/storage/shares/?

Critically, this situation causes my "tahoe backup" command to fail.  
The root directory of the backup is the last thing to be created, so 
although this failure occurs ~80% of the way through my backup, it 
results in 0% of the backup being actually available.  My backups are 
getting stale.

It would be best if the backup rebalanced that file for me (which I
know is a long-requested feature), but even without that, wouldn't it
be better for the backup to keep running instead of stopping?  The
backup could then complete with every file recoverable, even if not
well-balanced, and that would still have value.

And a minor gripe: when running deep-check --repair, a failed repair
produces a message saying that a repair failed, but it doesn't identify
which file couldn't be repaired!  To find out, I have to rerun the
command in verbose mode, and then there's so much output that I have to
redirect it to a file and search it for the failures.  The non-verbose
error message should tell me specifically what failed.
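
Until the summary names the file, a small filter over machine-readable
output might beat searching the verbose log. This sketch assumes one
JSON object per line, with a "path" list and a
"check-and-repair-results" dict containing "repair-attempted" and
"repair-successful" flags (roughly the shape of the web API's streaming
deep-check results). Those field names are assumptions to check against
the docs, and failed_repairs is a hypothetical helper.

```python
import json


def failed_repairs(lines):
    """Yield the path of every object whose repair was attempted but failed.

    ASSUMPTION: each non-blank line is one JSON object shaped roughly like
    {"path": ["dir", "file"], "check-and-repair-results":
        {"repair-attempted": true, "repair-successful": false, ...}}.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        results = record.get("check-and-repair-results") or {}
        if results.get("repair-attempted") and not results.get("repair-successful"):
            # An empty path list is taken to mean the starting directory itself.
            yield "/".join(record.get("path", [])) or "<root>"
```

You would iterate failed_repairs() over a saved output file
(e.g. a hypothetical deep-check.jsonl) and print each path it yields.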

-- 
Kyle Markley