#1689 closed defect (fixed)

assertion failure

Reported by: jg71 Owned by: zooko
Priority: critical Milestone: 1.9.2
Component: code-mutable Version: 1.9.1
Keywords: tahoe-check verify regression assertion error mutable review-needed Cc:
Launchpad Bug:

Description (last modified by jg71)

tahoe deep-check -v --repair --add-lease tahoe:

'<root>': healthy
done: 1 objects checked
 pre-repair: 1 healthy, 0 unhealthy
 0 repairs attempted, 0 successful, 0 failed
 post-repair: 1 healthy, 0 unhealthy

This is the result when 9 out of 9 storage servers are available.

The error I found is reproducible by just stopping one storage server and running once more: tahoe deep-check -v --repair --add-lease tahoe:

ERROR: AssertionError()
"[Failure instance: Traceback: <type 'exceptions.AssertionError'>: "
/usr/lib64/python2.6/site-packages/allmydata/mutable/filenode.py:563:upload
/usr/lib64/python2.6/site-packages/allmydata/mutable/filenode.py:661:_do_serialized
/usr/lib64/python2.6/site-packages/twisted/internet/defer.py:298:addCallback
/usr/lib64/python2.6/site-packages/twisted/internet/defer.py:287:addCallbacks
--- <exception caught here> ---
/usr/lib64/python2.6/site-packages/twisted/internet/defer.py:545:_runCallbacks
/usr/lib64/python2.6/site-packages/allmydata/mutable/filenode.py:661:<lambda>
/usr/lib64/python2.6/site-packages/allmydata/mutable/filenode.py:689:_upload
/usr/lib64/python2.6/site-packages/allmydata/mutable/publish.py:404:publish

I can see from the second capture (attached flogtool.tail-2.txt) that tahoe tries to connect to exactly the storage server that has been stopped: connectTCP to ('256.256.256.256', 66666)

There were no incident report files.

Versions used locally:

Nevow: 0.10.0, foolscap: 0.6.3, setuptools: 0.6c11, Twisted: 11.1.0, zfec: 1.4.22, zbase32: 1.1.3, pyOpenSSL: 0.13, simplejson: 2.3.2, mock: 0.7.2, argparse: 1.2.1, pycryptopp: 0.6.0.1206569328141510525648634803928199668821045408958, pyutil: 1.8.4, zope.interface: 3.8.0, allmydata-tahoe: allmydata-tahoe-1.9.0-94-gcef646c, pycrypto: 2.5, pyasn1: 0.0.13

all 9 storage servers use these: Nevow: 0.10.0, foolscap: 0.6.3, setuptools: 0.6c11, Twisted: 11.1.0, zfec: 1.4.22, pycrypto: 2.4.1, zbase32: 1.1.3, pyOpenSSL: 0.13, simplejson: 2.3.2, mock: 0.7.2, argparse: 1.2.1, pyutil: 1.8.4, zope.interface: 3.8.0, allmydata-tahoe: 1.9.1, pyasn1: 0.0.13, pycryptopp: 0.5.29

Might this be related to https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1648 ?

Attachments (2)

flogtool.tail-2.txt.gz (5.8 KB) - added by jg71 at 2012-03-21T13:49:03Z.
fix-1689.diff (12.7 KB) - added by warner at 2012-03-31T07:00:15Z.
fix and test for the bug


Change History (21)

Changed at 2012-03-21T13:49:03Z by jg71

comment:1 Changed at 2012-03-21T13:51:21Z by jg71

  • Description modified (diff)

comment:2 Changed at 2012-03-21T14:19:20Z by jg71

  • Description modified (diff)

comment:3 Changed at 2012-03-21T14:39:52Z by zooko

  • Priority changed from major to critical

comment:4 Changed at 2012-03-21T15:31:23Z by jg71

Placing a single file in that directory is sufficient to hide the error; even a LIT cap works.

comment:5 Changed at 2012-03-29T23:45:19Z by warner

I've been able to reproduce the error locally. It's weird: the CLI command returns the AssertionError, but nothing is logged by the node. Usually it's the other way around: the node notices an error, logs it, recovers by using some alternate share, and the CLI command doesn't see anything wrong.

The assertion error is on the assert self._privkey line, which means that the repair's Publish operation is trying to use a filenode that never got a privkey. This can happen when the filenode was created from a readcap, or when we have a writecap but were unable (or maybe just forgot) to fetch the privkey earlier.

If it depends upon the contents of the directory, then maybe it really depends upon the length of the share: if something were fetching the portion of the share that usually contains the privkey, but didn't for really short shares (empty directories), that might explain it. I'll look earlier in the repair process to see where the privkey is usually fetched, and whether there's any code to complain if that fails.
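
As a purely illustrative sketch of that failure mode (hypothetical names, not Tahoe's actual Publish code): publishing a new mutable version requires signing it with the private key, so a node whose preceding servermap update never fetched the privkey trips the assertion.

    # Purely illustrative; class and attribute names are hypothetical.
    class FakePublish(object):
        def __init__(self, writekey=None, privkey=None):
            self._writekey = writekey
            self._privkey = privkey   # stays None if the mapupdate never fetched it

        def publish(self, new_contents):
            # Publishing a new version means signing it, so a missing private
            # key is fatal -- the AssertionError reported in this ticket.
            assert self._privkey
            return ("signed and pushed", new_contents)

    # A repair whose preceding mapupdate skipped the privkey fetch fails here:
    # FakePublish(privkey=None).publish("new contents")  # raises AssertionError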

comment:6 Changed at 2012-03-30T00:32:14Z by warner

More findings:

  • doing a non-deep tahoe check --repair hits the same failure, but as a webapi 500 Internal Server Error (though it still successfully repairs the file, I think).
  • the Retrieve half of the repair is doing a mapupdate with MODE_READ (which doesn't try to fetch the privkey), but it needs to be using MODE_WRITE (to force a privkey fetch)
  • the mapupdate does a few queries that do, in fact, get the privkey, but the responses arrive too late for the Publish to use them.

Maybe there's a race condition, and larger files allow mapupdate to win the race and get the privkey in time for Publish to use it, but small (empty) files allow the Publish to start too early. The MODE_READ is definitely wrong, though: not only does it not bother to get the privkey in that case, it also doesn't search as hard to find all shares (MODE_WRITE tries harder, to avoid leaving any share behind at an old version).
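
As a rough sketch of the fix direction (not the actual patch), the mode selection might look like this, with MODE_READ and MODE_WRITE as stand-in constants rather than the real servermap values:

    # Stand-in constants; a sketch of the fix direction, not the actual patch.
    MODE_READ, MODE_WRITE = "MODE_READ", "MODE_WRITE"

    def mapupdate_mode(for_repair):
        # MODE_READ does not bother to fetch the private key and stops
        # searching once enough shares are found; a retrieve done on behalf
        # of a repair needs MODE_WRITE so the privkey is fetched and no share
        # is left behind at an old version.
        return MODE_WRITE if for_repair else MODE_READ

    assert mapupdate_mode(for_repair=True) == MODE_WRITE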

comment:7 Changed at 2012-03-30T01:03:15Z by davidsarah

  • Component changed from unknown to code-mutable
  • Milestone changed from undecided to 1.9.2

Changed at 2012-03-31T07:00:15Z by warner

fix and test for the bug

comment:8 Changed at 2012-03-31T19:51:01Z by zooko

  • Keywords review-needed added

comment:9 Changed at 2012-03-31T20:12:18Z by warner

OK, 2b8a312c and 5bae4a1b should fix that (and they include a test that fails without the fix). To make the test fail, I had to introduce some extra delays into the fake storage-server responses (so the privkey would arrive too late to help the Publish succeed). I'm still trying to understand why the filesize matters, but this basic issue should be solved now.
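
The delay trick can be sketched roughly like this, using Twisted's deferLater; the wrapper class and the wrapped method name are assumptions, not the actual test-suite code:

    # A rough sketch of delaying fake storage-server responses in a test.
    # Only twisted.internet.task.deferLater is a real API here; the class
    # and the wrapped method name are assumptions.
    from twisted.internet import reactor, task

    class DelayedFakeServer(object):
        """Wrap a fake storage server so every answer arrives `delay` seconds late."""
        def __init__(self, wrapped, delay=0.5):
            self._wrapped = wrapped
            self._delay = delay

        def slot_readv(self, *args, **kwargs):
            # deferLater fires its Deferred with the wrapped result after the
            # delay, so a privkey-carrying response can be made to miss the
            # window in which the Publish needs it.
            return task.deferLater(reactor, self._delay,
                                   self._wrapped.slot_readv, *args, **kwargs)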

comment:10 follow-up: Changed at 2012-04-01T01:22:46Z by davidsarah

  • Owner changed from nobody to warner

On line 1066 of servermap.py as changed by 5bae4a1b:

if self.mode == (MODE_CHECK, MODE_REPAIR):

See the bug? :-)

comment:11 follow-up: Changed at 2012-04-01T01:32:03Z by jg71

I applied fix-1689.diff and the error is gone. But I noticed the following while running "tahoe deep-check -v --repair --add-lease alias:"

  • if one storage server is shut down, every dir cap (incl. root) is marked unhealthy and repaired, as are all files, except LIT caps, which are shown as healthy
  • if all servers are online, by default all dir caps are marked unhealthy and repaired, but not quite successfully:
done: 4 objects checked
 pre-repair: 3 healthy, 1 unhealthy
 1 repairs attempted, 1 successful, 0 failed
 post-repair: 3 healthy, 1 unhealthy

The unhealthy one is the dir root cap. Another alias which includes a directory shows the same behaviour:

done: 69 objects checked
 pre-repair: 0 healthy, 69 unhealthy
 69 repairs attempted, 69 successful, 0 failed
 post-repair: 67 healthy, 2 unhealthy

Noteworthy is the fact that LIT caps always show up as healthy, whether all servers are there or one is missing.

comment:12 in reply to: ↑ 11 Changed at 2012-04-01T01:41:54Z by davidsarah

Replying to jg71:

Noteworthy is the fact that LIT caps always show up as healthy, whether all servers are there or one is missing.

That's as expected; a LIT cap can't be unhealthy, since the cap itself has the data.
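
A minimal, purely conceptual sketch of why that is (not Tahoe's checker code; query_storage_servers is a hypothetical callable for share-based caps):

    # A literal (LIT) cap carries the file's bytes inside the cap string
    # itself, so checking it needs no storage servers and it can never be
    # found unhealthy.
    def check_health(cap, query_storage_servers):
        if cap.startswith("URI:LIT:"):
            return "healthy"                   # the data travels inside the cap
        return query_storage_servers(cap)      # other caps depend on live servers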

comment:13 in reply to: ↑ 10 Changed at 2012-04-01T22:43:36Z by warner

Replying to davidsarah:

On line 1066 of servermap.py as changed by 5bae4a1b:

if self.mode == (MODE_CHECK, MODE_REPAIR):

See the bug? :-)

Oops... good catch! I just pushed the fix, in 470acbf1.
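
For readers who did not spot it: comparing a single mode value to a tuple with == never matches, so the quoted condition is always false; presumably the intended test is membership. A tiny demonstration with stand-in constants (the actual commit is not quoted here):

    # Stand-in constants to demonstrate the bug in the quoted line.
    MODE_CHECK, MODE_REPAIR = "MODE_CHECK", "MODE_REPAIR"
    mode = MODE_CHECK

    assert not (mode == (MODE_CHECK, MODE_REPAIR))   # equality against a tuple never matches
    assert mode in (MODE_CHECK, MODE_REPAIR)         # presumably the intended membership test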

comment:14 Changed at 2012-04-02T00:03:01Z by zooko

470acbf1e1d0a525 apparently didn't cause any test to go red→green. Maybe that code is redundant?

comment:15 Changed at 2012-04-05T16:22:37Z by zooko

  • Owner changed from warner to zooko
  • Status changed from new to assigned

comment:16 Changed at 2012-04-05T17:07:00Z by jg71

I tested the latest git and ran the same tests again. I can no longer reproduce the error, and no weirdness occurs. Looks like it is fixed.

Version 1, edited at 2012-04-05T17:33:02Z by jg71

comment:17 Changed at 2012-04-05T18:35:02Z by jg71

  • Resolution set to fixed
  • Status changed from assigned to closed

comment:18 Changed at 2012-04-05T19:46:16Z by davidsarah

Before it was edited, comment:16 said:

I tested the latest git and ran the same tests again. At first it looked OK: the weirdness (all dir caps suddenly showing up as unhealthy) was gone. But after re-running the repair job a few times with one storage node down, some aliases always show up completely healthy, while others show the weird behaviour again: all files are shown healthy, but for those few aliases all dir caps are shown unhealthy, get repaired, and still show as unhealthy even though their repair was claimed to be successful:

*** test3
        '<root>': not healthy
         repair successful
        'test.txt': healthy
        done: 2 objects checked
         pre-repair: 1 healthy, 1 unhealthy
         1 repairs attempted, 1 successful, 0 failed
         post-repair: 1 healthy, 1 unhealthy

*** test4
        '<root>': not healthy
         repair successful
        done: 1 objects checked
         pre-repair: 0 healthy, 1 unhealthy
         1 repairs attempted, 1 successful, 0 failed
         post-repair: 0 healthy, 1 unhealthy

The fact that caps were shown as unhealthy even though they had been successfully repaired sounds like ticket #766.

comment:19 Changed at 2012-04-05T20:18:13Z by jg71

That was due to a PEBKAC error on my part earlier today, which I have since fixed. Afterwards I ran the tests twice to make sure. Therefore I think the https://tahoe-lafs.org/trac/tahoe-lafs/ticket/766 chase is a false alarm caused by my original comment:16.

Sorry for the noise.
