#1689 closed defect (fixed)
assertion failure
Reported by: | jg71 | Owned by: | zooko |
---|---|---|---|
Priority: | critical | Milestone: | 1.9.2 |
Component: | code-mutable | Version: | 1.9.1 |
Keywords: | tahoe-check verify regression assertion error mutable review-needed | Cc: | |
Launchpad Bug: |
Description (last modified by jg71)
tahoe deep-check -v --repair --add-lease tahoe:
'<root>': healthy
done: 1 objects checked
pre-repair: 1 healthy, 0 unhealthy
0 repairs attempted, 0 successful, 0 failed
post-repair: 1 healthy, 0 unhealthy
when 9 out of 9 storage servers are available.
The error I have found is reproducible by just stopping one storage server and running once more:
tahoe deep-check -v --repair --add-lease tahoe:
ERROR: AssertionError() "[Failure instance: Traceback: <type 'exceptions.AssertionError'>: "
/usr/lib64/python2.6/site-packages/allmydata/mutable/filenode.py:563:upload
/usr/lib64/python2.6/site-packages/allmydata/mutable/filenode.py:661:_do_serialized
/usr/lib64/python2.6/site-packages/twisted/internet/defer.py:298:addCallback
/usr/lib64/python2.6/site-packages/twisted/internet/defer.py:287:addCallbacks
--- <exception caught here> ---
/usr/lib64/python2.6/site-packages/twisted/internet/defer.py:545:_runCallbacks
/usr/lib64/python2.6/site-packages/allmydata/mutable/filenode.py:661:<lambda>
/usr/lib64/python2.6/site-packages/allmydata/mutable/filenode.py:689:_upload
/usr/lib64/python2.6/site-packages/allmydata/mutable/publish.py:404:publish
I can see from the second capture (attached flogtool.tail-2.txt) that tahoe connects to exactly the storage server that has been stopped: connectTCP to ('256.256.256.256', 66666)
There were no incident report files.
Versions used locally:
Nevow: 0.10.0, foolscap: 0.6.3, setuptools: 0.6c11, Twisted: 11.1.0, zfec: 1.4.22, zbase32: 1.1.3, pyOpenSSL: 0.13, simplejson: 2.3.2, mock: 0.7.2, argparse: 1.2.1, pycryptopp: 0.6.0.1206569328141510525648634803928199668821045408958, pyutil: 1.8.4, zope.interface: 3.8.0, allmydata-tahoe: allmydata-tahoe-1.9.0-94-gcef646c, pycrypto: 2.5, pyasn1: 0.0.13
all 9 storage servers use these: Nevow: 0.10.0, foolscap: 0.6.3, setuptools: 0.6c11, Twisted: 11.1.0, zfec: 1.4.22, pycrypto: 2.4.1, zbase32: 1.1.3, pyOpenSSL: 0.13, simplejson: 2.3.2, mock: 0.7.2, argparse: 1.2.1, pyutil: 1.8.4, zope.interface: 3.8.0, allmydata-tahoe: 1.9.1, pyasn1: 0.0.13, pycryptopp: 0.5.29
Might this be related to https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1648 ?
Attachments (2)
Change History (21)
Changed at 2012-03-21T13:49:03Z by jg71
comment:1 Changed at 2012-03-21T13:51:21Z by jg71
- Description modified (diff)
comment:2 Changed at 2012-03-21T14:19:20Z by jg71
- Description modified (diff)
comment:3 Changed at 2012-03-21T14:39:52Z by zooko
- Priority changed from major to critical
comment:4 Changed at 2012-03-21T15:31:23Z by jg71
Placing a single file in that directory is sufficient to hide the error. Even a LIT cap works.
comment:5 Changed at 2012-03-29T23:45:19Z by warner
I've been able to reproduce the error locally. It's weird: the CLI command returns the AssertionError, but nothing is logged by the node. Usually it's the other way around: the node notices an error, logs it, recovers by using some alternate share, and the CLI command doesn't see anything wrong.
The assertion error is on the assert self._privkey line, which means that the repair's Publish operation is trying to use a filenode that never got a privkey. This can happen when the filenode was created from a readcap, or when we have a writecap but were unable (or maybe just forgot) to fetch the privkey earlier.
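To make the failure mode concrete, here is a minimal sketch, assuming hypothetical class and method names rather than the actual allmydata.mutable code: a filenode only gains a privkey via an explicit fetch, so a Publish-style upload that asserts on it fails exactly as in the traceback above.

```python
# Hypothetical sketch, not the real Tahoe-LAFS classes: illustrates why a
# Publish-style upload trips over "assert self._privkey" when the filenode
# never obtained its signing key.

class MutableFileNodeSketch(object):
    def __init__(self, writecap=None, readcap=None):
        self._writecap = writecap
        self._readcap = readcap
        self._privkey = None      # only set by an explicit privkey fetch

    def learn_privkey(self, privkey_bytes):
        # In the real code the (encrypted) privkey lives inside the shares,
        # and a servermap update in the right mode arranges to fetch it.
        self._privkey = privkey_bytes

    def upload(self, new_contents):
        # The repair's Publish assumes the key is already in hand; a node
        # built from a readcap, or one whose privkey fetch never happened,
        # fails right here -- the AssertionError from this ticket.
        assert self._privkey is not None
        return ("signed-and-published", new_contents)
```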
If it depends upon the contents of the directory, then maybe it really depends upon the length of the share: if something were fetching a portion of the share that usually contains the privkey, but didn't for really short shares (empty directories), that might explain it. I'll look earlier in the repair process to see where the privkey is usually fetched, and whether there's any code to complain if that fails.
comment:6 Changed at 2012-03-30T00:32:14Z by warner
More findings:
- doing a non-deep tahoe check --repair hits the same failure, but as a webapi 500 Internal Server Error (it still successfully repairs the file, I think).
- the Retrieve half of the repair is doing a mapupdate with MODE_READ (which doesn't try to fetch the privkey), but it needs to be using MODE_WRITE (to force a privkey fetch)
- the mapupdate does a few queries that do, in fact, get the privkey, but the responses arrive too late for the Publish to use them.
Maybe there's a race condition, and larger files allow mapupdate to win the race and get the privkey in time for Publish to use it, but small (empty) files allow the Publish to start too early. The MODE_READ is definitely wrong, though: not only does it not bother to get the privkey in that case, it also doesn't search as hard to find all shares (MODE_WRITE tries harder, to avoid leaving any share behind at an old version).
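The shape of the fix described above can be sketched roughly as follows; this is an illustration under assumed helper names (repair_mutable, make_servermap, publish, get_privkey are all invented), not the actual patch:

```python
# Sketch only; the MODE_* names come from the ticket, everything else is a
# stand-in. The point: repair must map the servers in a write-oriented mode
# before starting Publish.
MODE_READ, MODE_WRITE = "MODE_READ", "MODE_WRITE"

def repair_mutable(filenode, make_servermap, publish):
    # 1) Update the servermap in a write-oriented mode: this both fetches
    #    the privkey (which Publish asserts on) and searches hard enough to
    #    find every share, so none is left behind at an old version.
    smap = make_servermap(filenode, mode=MODE_WRITE)

    # 2) Only once the privkey is known is it safe to start the Publish half
    #    of the repair.
    assert filenode.get_privkey() is not None
    return publish(filenode, smap)
```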
comment:7 Changed at 2012-03-30T01:03:15Z by davidsarah
- Component changed from unknown to code-mutable
- Milestone changed from undecided to 1.9.2
comment:8 Changed at 2012-03-31T19:51:01Z by zooko
- Keywords review-needed added
comment:9 Changed at 2012-03-31T20:12:18Z by warner
OK, 2b8a312c and 5bae4a1b should fix that (and include a test that fails without the fix). To make the test fail, I had to introduce some extra delays into the fake storage-server response (so the privkey would arrive too late to help the Publish succeed). I'm still trying to understand why the filesize matters, but this basic issue should be solved now.
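The delay trick can be approximated like this with Twisted; the FakeStorageServer-style wrapper and the slot_readv method name are assumptions about the test fixture, not the real test code:

```python
# Rough sketch of the test technique: wrap a fake storage server so its read
# responses (which carry the privkey) arrive late, letting an un-fixed
# Publish start before the key is available.
from twisted.internet import reactor, task

class SlowFakeStorageServer(object):
    """Wrap a fake storage server so its read responses arrive late."""

    def __init__(self, wrapped, delay=0.5):
        self._wrapped = wrapped
        self._delay = delay

    def slot_readv(self, *args, **kwargs):
        # Delay only the read responses; chaining a deferLater makes the
        # outer Deferred wait out the extra delay before firing.
        d = self._wrapped.slot_readv(*args, **kwargs)
        d.addCallback(lambda res: task.deferLater(
            reactor, self._delay, lambda: res))
        return d
```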
comment:10 follow-up: ↓ 13 Changed at 2012-04-01T01:22:46Z by davidsarah
- Owner changed from nobody to warner
On line 1066 of servermap.py as changed by 5bae4a1b:
if self.mode == (MODE_CHECK, MODE_REPAIR):
See the bug? :-)
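For anyone skimming: the == compares a single mode value against the whole tuple, which can never match, so that branch is dead; a membership test is presumably what was intended. A tiny self-contained illustration (string constants stand in for the real MODE_* values):

```python
MODE_CHECK, MODE_REPAIR = "MODE_CHECK", "MODE_REPAIR"

def branch_fires_buggy(mode):
    # Compares `mode` to the whole two-element tuple -- never true for a
    # single mode constant, so the special-case code is silently skipped.
    return mode == (MODE_CHECK, MODE_REPAIR)

def branch_fires_intended(mode):
    # Membership test: true when mode is either MODE_CHECK or MODE_REPAIR.
    return mode in (MODE_CHECK, MODE_REPAIR)

assert branch_fires_buggy(MODE_REPAIR) is False      # the silent bug
assert branch_fires_intended(MODE_REPAIR) is True
```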
comment:11 follow-up: ↓ 12 Changed at 2012-04-01T01:32:03Z by jg71
I applied fix-1689.diff and the error is gone. But I noticed the following while running "tahoe deep-check -v --repair --add-lease alias:"
- if one storage server is shut down, every dir cap (incl. root) is marked unhealthy and repaired, as are all files, except lit caps, which are shown as healthy
- if all servers are online, all dir caps are still marked unhealthy and repaired, but not quite:
done: 4 objects checked
pre-repair: 3 healthy, 1 unhealthy
1 repairs attempted, 1 successful, 0 failed
post-repair: 3 healthy, 1 unhealthy
The unhealthy one is the dir root cap. Another alias which includes a directory shows the same behaviour:
done: 69 objects checked
pre-repair: 0 healthy, 69 unhealthy
69 repairs attempted, 69 successful, 0 failed
post-repair: 67 healthy, 2 unhealthy
Noteworthy is the fact that lit caps always show up as healthy, whether all servers are present or one is missing.
comment:12 in reply to: ↑ 11 Changed at 2012-04-01T01:41:54Z by davidsarah
Replying to jg71:
Noteworthy is the fact that lit caps always show up as healthy, whether all servers are present or one is missing.
That's as expected; a LIT cap can't be unhealthy, since the cap itself has the data.
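A minimal sketch of why that holds; the URI:LIT: prefix is real Tahoe-LAFS cap syntax, but the checker helper below is hypothetical:

```python
def check_health(cap, check_on_servers):
    # A literal (LIT) cap embeds the file's entire contents in the cap
    # string itself, so there are no shares on any storage server to lose
    # and nothing to repair: it is healthy by construction.
    if cap.startswith("URI:LIT:"):
        return "healthy"
    # Every other cap refers to shares on storage servers, which can go
    # missing when servers are down.
    return check_on_servers(cap)
```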
comment:13 in reply to: ↑ 10 Changed at 2012-04-01T22:43:36Z by warner
Replying to davidsarah:
On line 1066 of servermap.py as changed by 5bae4a1b:
if self.mode == (MODE_CHECK, MODE_REPAIR):
See the bug? :-)
Oops.. good catch! I just pushed the fix, in 470acbf1
comment:14 Changed at 2012-04-02T00:03:01Z by zooko
470acbf1e1d0a525 apparently didn't cause any test to go red→green. Maybe that code is redundant?
comment:15 Changed at 2012-04-05T16:22:37Z by zooko
- Owner changed from warner to zooko
- Status changed from new to assigned
comment:16 Changed at 2012-04-05T17:07:00Z by jg71
I tested latest git and ran the same tests again. I cannot reproduce the error anymore and no weirdness occurs. Looks like it is fixed.
comment:17 Changed at 2012-04-05T18:35:02Z by jg71
- Resolution set to fixed
- Status changed from assigned to closed
comment:18 Changed at 2012-04-05T19:46:16Z by davidsarah
Before it was edited, comment:16 said:
I tested latest git and ran the same tests again. At first it looked OK: the weirdness was gone (that suddenly all dir caps show up as unhealthy). But after re-running the repair job a few times with one storage node down, some aliases always show up completely healthy, and others show the weird behaviour again: all files are shown healthy, but for those few aliases all dir caps are shown unhealthy, get repaired, and still show as unhealthy even though their repair was claimed to be successful:
*** test3
'<root>': not healthy
repair successful
'test.txt': healthy
done: 2 objects checked
pre-repair: 1 healthy, 1 unhealthy
1 repairs attempted, 1 successful, 0 failed
post-repair: 1 healthy, 1 unhealthy

*** test4
'<root>': not healthy
repair successful
done: 1 objects checked
pre-repair: 0 healthy, 1 unhealthy
1 repairs attempted, 1 successful, 0 failed
post-repair: 0 healthy, 1 unhealthy
The fact that caps were shown as unhealthy even though they had been successfully repaired sounds like ticket #766.
comment:19 Changed at 2012-04-05T20:18:13Z by jg71
That was due to a PEBKAC on my part earlier today, which I have since fixed. Afterwards I ran the tests twice to make sure. Therefore I think the https://tahoe-lafs.org/trac/tahoe-lafs/ticket/766 chase is a false alarm caused by my original comment:16.
Sorry for the noise.