#786 closed defect (fixed)

TypeError when repairing an (unrecoverable?) directory

Reported by: zooko Owned by: francois
Priority: major Milestone: 1.6.0
Component: code-mutable Version: 1.5.0
Keywords: repair verify availability usability news-done Cc:
Launchpad Bug:

Description (last modified by zooko)

I had just brought my laptop out of sleep and it hadn't yet connected to a wireless network when I clicked on the bookmark to take me to my blog. It said something to the effect that the file was unrecoverable, and I saw that the network had just about finished coming up (according to the little radio wave icon thingie at the upper-right-hand corner of my Mac OS X desktop), so I hit reload. It said:

/Users/wonwinmcbrootles/playground/allmydata/tahoe/trunk/trunk/src/allmydata/mutable/retrieve.py, line 113 in __init__
111        self._status.set_active(True)
112        (seqnum, root_hash, IV, segsize, datalength, k, N, prefix,
113         offsets_tuple) = self.verinfo
114        self._status.set_size(datalength)

Locals
verinfo	None
prefix	'6s64w'
self	<allmydata.mutable.retrieve.Retrieve instance at 0x37a4828>
<type 'exceptions.TypeError'>: 'NoneType' object is not iterable

I'll attach the full resulting error page and the two incident report files that were generated.

Attachments (3)

Change History (22)

comment:1 Changed at 2009-08-11T17:02:02Z by zooko

  • Description modified (diff)

Changed at 2009-08-11T17:02:32Z by zooko

comment:2 Changed at 2009-08-11T17:06:05Z by zooko

allmydata-tahoe: 1.4.1-r3997, foolscap: 0.4.2, pycryptopp: 0.5.15, zfec: 1.4.5, Twisted: 8.2.0, Nevow: 0.9.31-r15675, zope.interface: 3.1.0c1, python: 2.5.4, platform: Darwin-8.11.1-i386-32bit, sqlite: 3.1.3, simplejson: 2.0.9, argparse: 0.8.0, pyOpenSSL: 0.9, pyutil: 1.3.34, zbase32: 1.1.0, setuptools: 0.6c12dev, pysqlite: 2.3.2

comment:3 Changed at 2009-08-11T22:40:31Z by warner

  • Component changed from code to code-mutable
  • Owner somebody deleted

Hrm. verinfo=None is a likely way for one piece of code to tell another that there are no recoverable versions, and something should notice that and raise a file-not-recoverable error instead of a confusing+ugly not-iterable error. But of course the real question is why this behavior occurred the second time you tried to download it, when presumably all your connections had become established.

Maybe only some of your connections were established by the second attempt, and this is an error which occurs when some but not all of the shares were retrievable.

When I get a chance, I'll try to look at the incidents you attached and see if I can distinguish between these two situations. Thanks for including all the details and incident reports; that will greatly help in analyzing this one!
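
As a minimal sketch of the guard described above (illustrative only, not the actual Tahoe-LAFS code; the exception name UnrecoverableFileError and the helper function are assumed here), the unpacking shown in the traceback could refuse a None verinfo with a descriptive error instead of a TypeError:

# Hypothetical guard for the unpacking shown in the traceback above.
class UnrecoverableFileError(Exception):
    """No recoverable version of this mutable file could be found."""

def unpack_verinfo(verinfo):
    if verinfo is None:
        # Report the real problem instead of letting the tuple unpacking
        # below fail with "'NoneType' object is not iterable".
        raise UnrecoverableFileError(
            "no recoverable versions: cannot retrieve this mutable file")
    (seqnum, root_hash, IV, segsize, datalength,
     k, N, prefix, offsets_tuple) = verinfo
    return (seqnum, root_hash, IV, segsize, datalength,
            k, N, prefix, offsets_tuple)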

comment:4 Changed at 2009-08-12T14:07:27Z by zooko

Hm, I think I might have misremembered the sequence of events. I don't think I "hit reload". I think instead I started a deep-check-verify-repair-renew on the directory that contains my blog. I'm sorry that I don't remember for sure.

comment:5 Changed at 2009-08-12T14:08:29Z by zooko

By the way, if you get a chance to reconsider #653 (introducer client: connection count is wrong, VersionedRemoteReference needs EQ), the answers to the questions on that ticket might be relevant to this ticket.

comment:6 Changed at 2009-10-27T05:37:53Z by zooko

  • Owner set to warner

Brian: any ideas or suggestions of things I can do to help on this one?

comment:7 Changed at 2009-12-28T20:37:51Z by zooko

  • Owner changed from warner to francois

Here's another report of this issue:

http://allmydata.org/pipermail/tahoe-dev/2009-December/003420.html

François reported on IRC that it was the Tahoe-LAFS 1.5.0-0ubuntu1 package that comes with Ubuntu Karmic, and that the latest trunk of Tahoe-LAFS didn't have this bug. However, his follow-up on the list said that it was a specific command, deep-check -v --add-lease --repair, that failed, but that doing the same actions with subsequent commands worked:

http://allmydata.org/pipermail/tahoe-dev/2009-December/003421.html

So I'm not sure if this behavior really does differ between Tahoe-LAFS v1.5.0 and current trunk. Assigning to François for clarification.

comment:8 Changed at 2009-12-28T20:40:29Z by zooko

  • Summary changed from TypeError when loading a directory while my wireless network was down to TypeError when loading a directory

comment:9 Changed at 2009-12-29T05:58:31Z by davidsarah

  • Keywords availability usability added

comment:10 follow-up: Changed at 2009-12-29T18:33:32Z by warner

  • Summary changed from TypeError when loading a directory to TypeError when repairing an (unrecoverable?) directory

I currently believe that both of these exceptions were the result of a repair attempted on a mutable file (probably a directory) which was unrecoverable. Francois' crash uses download_version, and the only place where that is used in the tahoe codebase is in the repairer. If there were no shares available, it would call download_version with None (instead of the version to download, which is expressed as a tuple of things), and you'd see this kind of crash.

Incidentally, I found a bug in the repairer that incorrectly classifies an unrecoverable-but-not-completely-gone file (i.e. one with 1 or 2 shares, when k=3): it raises a MustForceRepairError with an explanation about there being unrecoverable newer versions. The intention was to raise this error only when, e.g., there are 9 shares of version 1 and 2 shares of version 2, since in that situation repairing the only recoverable version (1) will knowingly discard the remaining shares of version 2, abandoning any hope of recovering the current contents of the file.
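
For illustration, here is a sketch of the classification the repairer is meant to perform (hypothetical, simplified code: versions are represented by bare sequence numbers and shares by a {seqnum: share_count} dict; this is not the actual repairer API):

class MustForceRepairError(Exception):
    pass

def classify_repair(shares_by_version, k):
    recoverable = [v for v, n in shares_by_version.items() if n >= k]
    unrecoverable = [v for v, n in shares_by_version.items() if 0 < n < k]
    if not recoverable:
        # Nothing to rebuild from: this should be a plain "repair failed",
        # not MustForceRepairError (the misclassification split out as #874).
        return "unrepairable"
    if any(v > max(recoverable) for v in unrecoverable):
        # e.g. 9 shares of v1 and 2 shares of v2: repairing v1 would discard
        # the v2 shares, so the caller must explicitly force the repair.
        raise MustForceRepairError("unrecoverable newer version present")
    return "repair version %d" % max(recoverable)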

comment:11 Changed at 2009-12-29T19:09:52Z by warner

I looked at zooko's Incidents, and I think they're showing an entirely different bug. The 104230-vyc6byy incident shows a mutable file being read and written several times (there's no MODE_CHECK in there, which suggests that it's not a check-and-repair operation, just a regular read-modify-write call).

The relevant parts of the events leading up to the Incident are:

initial goal:
# 09:42:28.229 [30929]: current goal: before update: , sh0 to [jfdpabh3], sh0 to [n5hgpiok], sh1 to [fcmlx6em], sh1 to [jfdpabh3], sh2 to [jfdpabh3], sh2 to [trjdor3o], sh3 to [ivjakubr], sh3 to [xiktf6ok], sh4 to [uf7kq2sv], sh4 to [wfninubk], sh5 to [wfninubk], sh6 to [qvqv7jmm], sh6 to [trjdor3o], sh7 to [fcmlx6em], sh7 to [wfninubk], sh8 to [lwkv6cji], sh8 to [6fyx5u4z], sh8 to [7tlov7eg], sh9 to [ivjakubr], sh9 to [6fyx5u4z]

DeadReferenceError when writing sh0 to n5hgpiok

then:
# 09:42:30.089 [31045]: _got_write_answer from 6fyx5u4z
# 09:42:30.089 [31046]: entering loop
# 09:42:30.091 [31047]: current goal: before update: , sh0 to [jfdpabh3], sh1 to [fcmlx6em], sh1 to [jfdpabh3], sh2 to [jfdpabh3], sh2 to [trjdor3o], sh3 to [ivjakubr], sh3 to [xiktf6ok], sh4 to [uf7kq2sv], sh4 to [wfninubk], sh5 to [wfninubk], sh6 to [qvqv7jmm], sh6 to [trjdor3o], sh7 to [fcmlx6em], sh7 to [wfninubk], sh8 to [lwkv6cji], sh8 to [6fyx5u4z], sh8 to [7tlov7eg], sh9 to [ivjakubr], sh9 to [6fyx5u4z]
# 09:42:30.091 [31048]: we are planning to push new seqnum=#1832
# 09:42:30.091 [31049]: 6 queries still outstanding
# 09:42:30.092 [31050]: _got_write_answer from 6fyx5u4z
    * 09:42:30.092 [31051]: WEIRD our testv failed, so the write did not happen [INCIDENT-TRIGGER]
    * 09:42:31.099 [31052]: somebody modified the share on us: shnum=8: I thought they had #1832:R=lzzf, but testv reported #1832:R=lzzf
    * 09:42:31.099 [31053]: somebody modified the share on us: shnum=9: I thought they had #1832:R=lzzf, but testv reported #1832:R=lzzf

This suggests that we got *two* answers from the 6fyx server. It feels like we sent two identical requests to it. The first one succeeded normally, the second one failed. I suspect the logging code (which provides the "I thought they had.." comment) is not accurately remembering the test vector that was sent with the original message; instead, it is using a local variable that has already been updated by the time the log event is emitted. So I suspect that the second answer is in response to a query which said "I think you should have version 1831, please update to v1832", and of course, since the first message was processed by then, the server would already be at v1832.
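
The suspected logging pitfall is generic enough to illustrate without the real publish code (all names below are made up): formatting the message from shared mutable state versus from a value captured when each query was sent.

sent_testv = {}       # per-server record of the test vector actually sent
current_testv = None  # shared state, overwritten for each new query

def send_query(server, testv):
    global current_testv
    current_testv = testv
    sent_testv[server] = testv  # capture the per-request value
    # ... actually transmit the query to the server ...

def log_failed_testv_wrong(server, reported_testv):
    # By the time the answer arrives, current_testv may describe a *later*
    # query, so this can print the same value on both sides of "but".
    print("I thought they had %s, but testv reported %s"
          % (current_testv, reported_testv))

def log_failed_testv_right(server, reported_testv):
    print("I thought they had %s, but testv reported %s"
          % (sent_testv[server], reported_testv))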

The other incident (104334-hjflzua) shows the same behavior a minute later, this time trying to update to v1833.

My current guess is that we're somehow getting two copies of the same server in our peerlist, and sending updates to both.

The "error during repair" that this ticket might be about isn't reflected in these two incidents. It's likely that it wouldn't show up as an incident at all, just the traceback. I'll investigate further.

comment:12 Changed at 2009-12-29T19:32:27Z by warner

Investigating with Francois (on the 25c3 grid that he set up) shows that a shallow check of a healthy directory nevertheless reports zero shares present when --add-lease is included. I now suspect problems in the code I added to source:src/allmydata/mutable/servermap.py#L548 (in _do_read) to tolerate servers that don't understand the add-lease message. When you do add-lease on a mutable file, it sends the do-you-have-share message pipelined with the add-lease message. Older servers don't understand add-lease, so I wanted to ignore such an error but propagate any other errors that occurred (like local code bugs). I think something is wrong in this code, and errors are being thrown (thereby disrupting the normal DYHB responses) when they shouldn't be.
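
Schematically, the intended filtering looks something like this (a sketch under assumed names; the real code in servermap.py inspects foolscap failures rather than the simplified string check used here):

def _add_lease_errback(f):
    # f is a Twisted/foolscap failure from the pipelined add_lease call.
    # Old servers (e.g. tahoe-1.2.0) don't know the add_lease method; that
    # failure is harmless and must not disturb the DYHB response that was
    # pipelined with it.
    if _looks_like_unknown_method(f):
        return None
    # Anything else (local bugs, unexpected remote errors) should propagate.
    return f

def _looks_like_unknown_method(f):
    # Placeholder check; the real test examines the remote exception type.
    return "no such method" in str(getattr(f, "value", f)).lower()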

comment:13 in reply to: ↑ 10 Changed at 2009-12-29T21:02:22Z by davidsarah

Replying to warner:

Incidentally, I found a bug in the repairer that incorrectly classifies an unrecoverable-but-not-completely-gone file (i.e. one with 1 or 2 shares, when k=3): it raises a MustForceRepairError with an explanation about there being unrecoverable newer versions. The intention was to raise this error only when, e.g., there are 9 shares of version 1 and 2 shares of version 2, since in that situation repairing the only recoverable version (1) will knowingly discard the remaining shares of version 2, abandoning any hope of recovering the current contents of the file.

This is now #874.

comment:14 Changed at 2009-12-29T21:05:58Z by davidsarah

  • Keywords repair added

comment:15 Changed at 2009-12-29T21:06:40Z by davidsarah

  • Keywords verify added

comment:16 Changed at 2009-12-29T21:38:51Z by warner

Ok, so the current related problems are:

  • #874: inappropriate MustForceRepairError when repairing 0<numshares<k mutable files
  • #875: make add-lease failures not break DYHB results, causing checker false-negatives
  • #786: TypeError (instead of simple failure) when repairing numshares=0 mutable files
  • ?: how did zooko's incidents see multiple answers from the same server

Francois' node experienced #875 (because all the shares were on a tahoe-1.2.0 server) followed by #786. Zooko's exception was probably either a deep-repair when the file's servers were entirely offline or the same #875 problem, followed by #786. Zooko's incident reports capture the "?" issue.

comment:17 Changed at 2009-12-29T23:51:46Z by warner

ba0690c9d7a3bc28 should fix the TypeError problem: unrepairable files are now reported with success=False instead of a weird exception (it also fixes #874).
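
In other words, the repair path now reports failure through its results object rather than letting the exception escape; schematically (illustrative names only, not the committed code):

class RepairResultsSketch:
    def __init__(self):
        self.successful = False
        self.report = ""

def repair_sketch(best_recoverable_version):
    results = RepairResultsSketch()
    if best_recoverable_version is None:
        # No recoverable version (covers the numshares=0 case of this ticket
        # and the 0 < numshares < k case of #874): report, don't raise.
        results.report = "no recoverable versions; cannot repair"
        return results
    # ... download the best recoverable version, re-upload missing shares ...
    results.successful = True
    return results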

The remaining question in my mind is where the multiple-answers-from-the-same-server incidents came from.

comment:18 Changed at 2009-12-30T04:03:59Z by warner

  • Milestone changed from undecided to 1.6.0
  • Resolution set to fixed
  • Status changed from new to closed

I've opened #877 to handle the multiple-answers bug. It turns out that a DeadReferenceError can cause the publish loop to reenter, causing multiple requests to be sent and then an incorrect UCWE to be declared.

That means this ticket can be closed, since the only remaining issue now has its own ticket.
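
For the record, the multiple-requests symptom is what you get when a loop can be re-entered from an errback; a generic reentrancy guard looks like the following sketch (not the actual #877 fix, all names invented):

class PublishLoopSketch:
    def __init__(self, send_pending_queries):
        self._send = send_pending_queries
        self._running = False
        self._needed_again = False

    def loop(self):
        # If an errback (e.g. DeadReferenceError) calls loop() while a pass
        # is already in progress, defer it instead of sending duplicates.
        if self._running:
            self._needed_again = True
            return
        self._running = True
        try:
            self._send()
        finally:
            self._running = False
            if self._needed_again:
                self._needed_again = False
                self.loop()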

comment:19 Changed at 2010-02-02T05:53:47Z by davidsarah

  • Keywords news-done added