[tahoe-dev] Question about changing k/N

Mon Jun 4 04:48:42 UTC 2012

Zooko,

As I went about gathering information to submit a ticket, I discovered 
something interesting.  This probably has something to do with the 
health of the target directories.  I discovered one 2/4 directory that 
works cleanly, and another 2/4 directory that gives an error (after 
*successfully* creating the link).  I haven't created a ticket yet 
because I'm not sure what's supposed to happen here.  :)

Working directory:

$ tahoe check --raw kyle:
{
  "results": {
   "needs-rebalancing": true,
   "count-shares-expected": 4,
   "healthy": false,
   "count-unrecoverable-versions": 0,
   "count-shares-needed": 2,
   "sharemap": {
    "seq52-2x5m-sh3": [
     "xxaj2tgmnl7debjdpn4mgv2oks6pjjnx"
    ],
    "seq52-2x5m-sh2": [
     "xxaj2tgmnl7debjdpn4mgv2oks6pjjnx"
    ],
    "seq52-2x5m-sh1": [
     "xxaj2tgmnl7debjdpn4mgv2oks6pjjnx"
    ],
    "seq52-2x5m-sh0": [
     "xxaj2tgmnl7debjdpn4mgv2oks6pjjnx"
    ],
    "seq49-cqp3-sh2": [
     "juwmgssmwnhrhfdcpxxmrz3bghh37esx"
    ],
    "seq49-cqp3-sh3": [
     "vjqcroalrgmft66mgiwfjug667fl6qjd"
    ]
   },
   "count-recoverable-versions": 2,
   "servers-responding": [
    "vjqcroalrgmft66mgiwfjug667fl6qjd",
    "juwmgssmwnhrhfdcpxxmrz3bghh37esx",
    "xxaj2tgmnl7debjdpn4mgv2oks6pjjnx",
    "47cslusczp3uu2kygodi3nlalcruscif",
    "eqtprtidz5emkvzlqt27dylgocdf3f77"
   ],
   "count-good-share-hosts": 1,
   "count-wrong-shares": 2,
   "count-shares-good": 4,
   "count-corrupt-shares": 0,
   "list-corrupt-shares": [],
   "recoverable": true
  },
  "storage-index": "rsi6ge4hmbzhxplyqjzkmd254e",
  "summary": "Unhealthy: multiple versions are recoverable"
}

Gives an error:

$ tahoe check --raw share:
{
  "results": {
   "needs-rebalancing": true,
   "count-shares-expected": 4,
   "healthy": false,
   "count-unrecoverable-versions": 1,
   "count-shares-needed": 2,
   "sharemap": {
    "seq45-7bxx-sh3": [
     "juwmgssmwnhrhfdcpxxmrz3bghh37esx"
    ],
    "seq46-jiym-sh1": [
     "xxaj2tgmnl7debjdpn4mgv2oks6pjjnx"
    ],
    "seq46-jiym-sh0": [
     "xxaj2tgmnl7debjdpn4mgv2oks6pjjnx"
    ],
    "seq56-aiot-sh0": [
     "vjqcroalrgmft66mgiwfjug667fl6qjd"
    ],
    "seq56-aiot-sh1": [
     "vjqcroalrgmft66mgiwfjug667fl6qjd"
    ],
    "seq56-aiot-sh2": [
     "xxaj2tgmnl7debjdpn4mgv2oks6pjjnx"
    ],
    "seq56-aiot-sh3": [
     "xxaj2tgmnl7debjdpn4mgv2oks6pjjnx"
    ]
   },
   "count-recoverable-versions": 2,
   "servers-responding": [
    "vjqcroalrgmft66mgiwfjug667fl6qjd",
    "juwmgssmwnhrhfdcpxxmrz3bghh37esx",
    "xxaj2tgmnl7debjdpn4mgv2oks6pjjnx",
    "47cslusczp3uu2kygodi3nlalcruscif",
    "eqtprtidz5emkvzlqt27dylgocdf3f77"
   ],
   "count-good-share-hosts": 2,
   "count-wrong-shares": 3,
   "count-shares-good": 4,
   "count-corrupt-shares": 0,
   "list-corrupt-shares": [],
   "recoverable": true
  },
  "storage-index": "lb2mpyg4fnznnfebfayevcdpki",
  "summary": "Unhealthy: some versions are unrecoverable multiple 
versions are recoverable"
}

My error looks something like this.  The exact message has changed; it 
used to say there was an UncoordinatedWriteError, but my experimentation 
seems to have changed things a bit, and right now I only see that error 
mentioned in the incident report.

$ tahoe ln foo: share:foo
Error: 500 Internal Server Error
"Traceback (most recent call last):\x0a  File 
\"/usr/local/lib/python2.7/site-packages/twisted/internet/base.py\", 
line 800, in runUntilCurrent\x0a    call.func(*call.args, 
**call.kw)\x0a  File 
\"/usr/local/lib/python2.7/site-packages/foolscap-0.6.3-py2.7.egg/foolscap/eventual.py\", 
line 26, in _turn\x0a    cb(*args, **kwargs)\x0a  File 
\"/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py\", 
line 368, in callback\x0a    self._startRunCallbacks(result)\x0a  File 
\"/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py\", 
line 464, in _startRunCallbacks\x0a    self._runCallbacks()\x0a--- 
<exception caught here> ---\x0a  File 
\"/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py\", 
line 551, in _runCallbacks\x0a    current.result = 
callback(current.result, *args, **kw)\x0a  File 
\"/usr/local/lib/python2.7/site-packages/allmydata/mutable/filenode.py\", line 
855, in <lambda>\x0a    self._modify_once(modifier, first_time))\x0a  
File 
\"/usr/local/lib/python2.7/site-packages/allmydata/mutable/filenode.py\", line 
881, in _modify_once\x0a    d = self._try_to_download_data()\x0a  File 
\"/usr/local/lib/python2.7/site-packages/allmydata/mutable/filenode.py\", line 
959, in _try_to_download_data\x0a    d = self._read(c, 
fetch_privkey=True)\x0a  File 
\"/usr/local/lib/python2.7/site-packages/allmydata/mutable/filenode.py\", line 
980, in _read\x0a    d = r.download(consumer, offset, size)\x0a  File 
\"/usr/local/lib/python2.7/site-packages/allmydata/mutable/retrieve.py\", line 
237, in download\x0a    self._setup_download()\x0a  File 
\"/usr/local/lib/python2.7/site-packages/allmydata/mutable/retrieve.py\", line 
277, in _setup_download\x0a    shares = 
versionmap[self.verinfo]\x0aexceptions.KeyError: (57, 
'\\x16i\\xdb\\xa8\\xbc\\xd7\\xabrY\\xcdpv\\xa4I\\x82\\xfe\\xa5i\\xed\\x82;\\xca\\xe8\\xcaL\\xf7\\xdav\\xa9\\xf2O\\t', 
'\\x19f_S!&\\xb0\\xa1\\xeb\\x94\\x81F)\\xbb\\x89q', 336, 335, 2, 4, 
'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x009\\x16i\\xdb\\xa8\\xbc\\xd7\\xabrY\\xcdpv\\xa4I\\x82\\xfe\\xa5i\\xed\\x82;\\xca\\xe8\\xcaL\\xf7\\xdav\\xa9\\xf2O\\t\\x19f_S!&\\xb0\\xa1\\xeb\\x94\\x81F)\\xbb\\x89q\\x02\\x04\\x00\\x00\\x00\\x00\\x00\\x00\\x01P\\x00\\x00\\x00\\x00\\x00\\x00\\x01O', 
(('enc_privkey', 923), ('EOF', 2138), ('share_data', 755), ('signature', 
399), ('block_hash_tree', 723), ('share_hash_chain', 655)))\x0a"

 From the incident report:
local#11372110 21:30:59.251: current goal: before update: , sh0 to 
[vjqcroal], sh0 to [xxaj2tgm], sh1 to [vjqcroal], sh1 to [xxaj2tgm], sh2 
to [xxaj2tgm], sh3 to [juwmgssm], sh3 to [xxaj2tgm]
local#11372111 21:30:59.251: we are planning to push new seqnum=#58
local#11372112 21:30:59.252: Starting push
local#11372113 21:30:59.252: Pushing segment 1 of 1
local#11372114 21:30:59.275: storage: slot_writev lb2mpyg4fnznnfebfayevcdpki
local#11372115 21:30:59.277: storage: slot_writev lb2mpyg4fnznnfebfayevcdpki
local#11372116 21:30:59.281: _got_write_answer from xxaj2tgm, share 2
local#11372117 21:30:59.281: found the following surprise shares: 
set([0, 1])
local#11372118 21:30:59.281: they had shares [0, 1] that we didn't know 
about [INCIDENT-TRIGGER]
local#11372119 21:31:00.433: wrote successfully: adding new share to 
servermap
local#11372120 21:31:00.435: _got_write_answer from xxaj2tgm, share 3
local#11372121 21:31:00.435: found the following surprise shares: 
set([0, 1])
local#11372122 21:31:00.435: they had shares [0, 1] that we didn't know 
about
local#11372123 21:31:00.435: wrote successfully: adding new share to 
servermap
local#11372124 21:31:00.436: _got_write_answer from vjqcroal, share 0
local#11372125 21:31:00.436: found the following surprise shares: set([])
local#11372126 21:31:00.436: wrote successfully: adding new share to 
servermap
local#11372127 21:31:00.437: _got_write_answer from vjqcroal, share 1
local#11372128 21:31:00.437: found the following surprise shares: set([])
local#11372129 21:31:00.437: wrote successfully: adding new share to 
servermap
local#11372130 21:31:00.437: Publish failed with UncoordinatedWriteError

What's up with these surprise shares?

On 06/03/12 21:04, Zooko Wilcox-O'Hearn wrote:
> Dear Kyle:
>
> Thanks for the information!
>
> There are no known bugs which would cause changing K or N values to
> lead to UncoordinatedWriteError or other errors. Please report more
> details about the errors you've encountered and let's try to reproduce
> them and narrow down their cause.
>
> https://tahoe-lafs.org/trac/tahoe-lafs/wiki/HowToReportABug
>
> Regards,
>
> Zooko
> _______________________________________________
> tahoe-dev mailing list
> tahoe-dev at tahoe-lafs.org
> https://tahoe-lafs.org/cgi-bin/mailman/listinfo/tahoe-dev

-- 
Kyle Markley