#1628 closed defect (fixed)

UCWE on deep check with recent version

Reported by: davidsarah Owned by: kevan
Priority: critical Milestone: 1.9.2
Component: code Version: 1.9.0
Keywords: tahoe-check verify ucwe regression review-needed Cc: kpreid, killyourtv@…
Launchpad Bug:

Description (last modified by kpreid)

Reported by kpreid: I upgraded my tahoe to a recent development version, in the git mirror:

  git://github.com/warner/tahoe-lafs.git
  commit b73aba98de93c4c0b0013f1dd435c64e73e48f4c

Now, my daily deep-check --repair --add-lease on my four aliases on the volunteer grid consistently fails as follows. The first might have a legitimate uncoordinated write, but the last two are not regularly touched by anything but the repair process, and this identical failure has occurred for the past 4 days.

I'd appreciate it if this is fixed before my leases expire. :-)

ERROR: UncoordinatedWriteError()
"[Failure instance: Traceback (failure with no frames): <class 'allmydata.mutable.common.UncoordinatedWriteError'>: "

ERROR: AssertionError()
"[Failure instance: Traceback: <type 'exceptions.AssertionError'>: "
/Volumes/Opp/External/Projects/tahoe/src/allmydata/mutable/filenode.py:563:upload
/Volumes/Opp/External/Projects/tahoe/src/allmydata/mutable/filenode.py:661:_do_serialized
/Volumes/Opp/External/Projects/tahoe/support/lib/python2.6/site-packages/Twisted-11.1.0-py2.6-macosx-10.6-x86_64.egg/twisted/internet/defer.py:298:addCallback
/Volumes/Opp/External/Projects/tahoe/support/lib/python2.6/site-packages/Twisted-11.1.0-py2.6-macosx-10.6-x86_64.egg/twisted/internet/defer.py:287:addCallbacks
--- <exception caught here> ---
/Volumes/Opp/External/Projects/tahoe/support/lib/python2.6/site-packages/Twisted-11.1.0-py2.6-macosx-10.6-x86_64.egg/twisted/internet/defer.py:545:_runCallbacks
/Volumes/Opp/External/Projects/tahoe/src/allmydata/mutable/filenode.py:661:<lambda>
/Volumes/Opp/External/Projects/tahoe/src/allmydata/mutable/filenode.py:689:_upload
/Volumes/Opp/External/Projects/tahoe/src/allmydata/mutable/publish.py:402:publish

ERROR: UncoordinatedWriteError()
"[Failure instance: Traceback (failure with no frames): <class 'allmydata.mutable.common.UncoordinatedWriteError'>: "

ERROR: UncoordinatedWriteError()
"[Failure instance: Traceback (failure with no frames): <class 'allmydata.mutable.common.UncoordinatedWriteError'>: "

Before these failures, here is a typical example of normal results. (The indentation is added by my script.) I had understood the post-repair unhealthiness to be due to disagreement between the "healthy file" and "repair (successful re-upload)", a to-be-fixed bug.

	done: 3 objects checked
	 pre-repair: 3 healthy, 0 unhealthy
	 0 repairs attempted, 0 successful, 0 failed
	 post-repair: 3 healthy, 0 unhealthy

	 repair successful
	done: 1 objects checked
	 pre-repair: 0 healthy, 1 unhealthy
	 1 repairs attempted, 1 successful, 0 failed
	 post-repair: 0 healthy, 1 unhealthy

	 repair successful
	done: 5 objects checked
	 pre-repair: 4 healthy, 1 unhealthy
	 1 repairs attempted, 1 successful, 0 failed
	 post-repair: 4 healthy, 1 unhealthy

	 repair successful
	done: 5 objects checked
	 pre-repair: 4 healthy, 1 unhealthy
	 1 repairs attempted, 1 successful, 0 failed
	 post-repair: 4 healthy, 1 unhealthy

Attachments (6)

incident-2011-12-06--13-30-43Z-kt36tjq.flog.bz2 (55.2 KB) - added by kpreid at 2011-12-06T13:32:46Z.
Repair failure #1 (UncoordinatedWriteError?)
incident-2011-12-06--13-30-54Z-uzuao2q.flog.bz2 (54.0 KB) - added by kpreid at 2011-12-06T13:33:41Z.
Repair failure 3 (UncoordinatedWriteError?)
incident-2011-12-06--13-31-29Z-oq7v3wi.flog.bz2 (53.8 KB) - added by kpreid at 2011-12-06T13:34:23Z.
Pushed the "Report an Incident" button
fix-1628.darcs.patch (82.7 KB) - added by kevan at 2011-12-13T01:23:13Z.
fix-1628.diff (18.4 KB) - added by kevan at 2011-12-13T02:19:28Z.
darcs-free version of fix-1628.darcs.patch
fix-1628.darcs.2.patch (84.2 KB) - added by kevan at 2011-12-27T20:38:48Z.

Download all attachments as: .zip

Change History (34)

comment:1 Changed at 2011-11-30T22:58:04Z by davidsarah

I had understood the post-repair unhealthiness to be due to disagreement between the "healthy file" and "repair (successful re-upload)", a to-be-fixed bug.

Yes, that sounds like #766.

comment:2 Changed at 2011-12-01T00:04:32Z by kpreid

  • Description modified (diff)

comment:3 Changed at 2011-12-01T10:17:26Z by killyourtv

I had the same problem after upgrading to 1.9, reported in #1583 (though I neglected to add the full traceback).

comment:4 Changed at 2011-12-06T05:55:16Z by zooko

Thanks for the information, kpreid and killyourtv. Kevin, would you please attempt to reproduce it and see if a foolscap incident report file is generated in ~/.tahoe/logs/incidents/, and then push the "Report A Problem" button as killyourtv did in #1583?

Changed at 2011-12-06T13:34:23Z by kpreid

Pushed the "Report an Incident" button

comment:5 follow-up: Changed at 2011-12-06T13:43:09Z by kpreid

For some reason I cannot upload the second of the four incident files, which contained an AssertionError? from mutable/filenode.py:563:upload. I have tried several times, including with a different format and filename, and Trac acts as if it succeeded but doesn't show the file. I have temporarily uploaded it here: http://switchb.org/pri/incident-2011-12-06--13-30-50Z-fhhyszi.flog.bz2

comment:6 Changed at 2011-12-10T22:55:28Z by kevan

  • Owner changed from somebody to kevan
  • Status changed from new to assigned

comment:7 Changed at 2011-12-13T01:22:29Z by kevan

Hm, it looks like the publisher forgot how to keep track of multiple copies of a single share during the MDMF implementation. So, in your case, it knows about one copy of each of a few shares that each exist on more than one server, then gets surprised when it encounters the other copies when pushing other shares, interprets the shares as evidence of an uncoordinated write, and breaks. Can you test attachment:fix-1628.darcs.patch and let me know if it resolves the issue for you?

Changed at 2011-12-13T01:23:13Z by kevan

comment:8 Changed at 2011-12-13T01:44:58Z by kpreid

I'm currently trying not to have to deal with darcs. If you can supply that as a unified diff or against the git mirror I can test it.

Changed at 2011-12-13T02:19:28Z by kevan

darcs-free version of fix-1628.darcs.patch

comment:9 Changed at 2011-12-13T02:23:18Z by kevan

Does attachment:fix-1628.diff work for you? Don't mind the references to a fifth patch; it's not related to this issue.

comment:10 follow-up: Changed at 2011-12-13T02:49:46Z by kpreid

fix-1628.diff appears to have eliminated the problem.

comment:11 in reply to: ↑ 5 Changed at 2011-12-13T21:48:16Z by zooko

Replying to kpreid:

For some reason I cannot upload the second of the four incident files, which contained an AssertionError? from mutable/filenode.py:563:upload. I have tried several times, including with a different format and filename, and Trac acts as if it succeeded but doesn't show the file.

That's due to a misconfiguration of this Trac. #1581 My apologies.

comment:12 Changed at 2011-12-13T21:49:13Z by zooko

So is this a regression in 1.9.0 vs. earlier releases, and could it result in data loss, and should we plan a 1.9.1 release to fix it?

comment:13 in reply to: ↑ 10 Changed at 2011-12-13T22:54:46Z by killyourtv

Replying to kpreid:

fix-1628.diff appears to have eliminated the problem.

I concur, this seems to have solved my problem as well (though I want to do a bit more testing).

I assume that #1583 is probably the same as this bug. I'll close mine since this one has had a bit more activity.

comment:14 Changed at 2011-12-14T00:47:54Z by davidsarah

From the patch comments:

This tests for two regressions resulting from a design flaw in the 1.9
mutable publisher; specifically, that the publisher doesn't keep track
of more than one server for each share. This can lead to spurious UCWEs,
as seen in ticket #1628. This also means that the publisher will no
longer write shares associated with a new version of a mutable file over
all of the existing shares that it can find, which potentially decreases
the robustness of the new version of the mutable file.

This sounds to me like a regression serious enough to justify a 1.9.1. Although multiple servers holding the same share shouldn't happen if there have been only publish operations with a stable set of servers, it can easily happen if the grid membership is less stable and there have been repairs.

comment:15 Changed at 2011-12-14T00:48:51Z by davidsarah

  • Milestone changed from undecided to soon

comment:16 Changed at 2011-12-14T21:12:38Z by killyourtv

  • Cc killyourtv@… added

comment:17 follow-up: Changed at 2011-12-15T03:01:03Z by kevan

I agree with comment:14. While investigating this issue, I noticed a potential regression in the way we handle #546 situations. I haven't had time to investigate that yet, and probably won't have time to investigate until this weekend. Can we wait on 1.9.1 until I make a ticket for that issue, so we can decide if it belongs in 1.9.1?

comment:18 Changed at 2011-12-16T16:52:21Z by zooko

  • Milestone changed from soon to 1.9.1
  • Priority changed from major to critical

Changed at 2011-12-27T20:38:48Z by kevan

comment:19 Changed at 2011-12-27T20:40:14Z by kevan

  • Keywords review-needed added

attachment:fix-1628.darcs.2.patch fixes a flaw in my initial patch. I think it's ready for review.

comment:20 Changed at 2011-12-28T05:39:46Z by warner

oops, I just reviewed and landed the *first* patch, in e29323f68fc5447b. I'll see if I can deduce a delta between the two darcs patches..

comment:21 Changed at 2011-12-28T05:50:43Z by Brian Warner <warner@…>

In 147670fd89a04bad:

mutable publish: fix not-enough-shares detection. Refs #1628.

This should match the "fix-1628.darcs.2.patch" attachment on that ticket.

comment:22 Changed at 2011-12-28T05:52:05Z by Brian Warner <warner@…>

In 147670fd89a04bad:

mutable publish: fix not-enough-shares detection. Refs #1628.

This should match the "fix-1628.darcs.2.patch" attachment on that ticket.

comment:23 Changed at 2011-12-28T05:57:49Z by warner

kevan: can you double-check that I got that delta right? I *think* the only part that changed was this bit:

        all_shnums = filter(lambda sh: len(self.writers[sh]) > 0,
                            self.writers.iterkeys())
        if len(all_shnums) < self.required_shares or self.surprised:
            return self._failure()

with which I fully concur. Since empty lists are falsey, you could also express it like:

all_shnums = set([shnum for shnum in self.writers if self.writers[shnum]])
# or
all_shnums = set([shnum for shnum,writers in self.writers.items() if writers])
# or, relying upon the uniqueness of dict keys:
all_shnums = [shnum for shnum,writers in self.writers.items() if writers]
# or, since we only actually care about the count of unique shnums:
shares = len([shnum for shnum,writers in self.writers.items() if writers])

(also, be aware of the DictOfSets that I use in the immutable code for tracking the shnum->servers mapping)

Should I leave this ticket open until we get that second test written?

comment:24 Changed at 2011-12-28T19:26:39Z by kevan

I altered test_multiply_placed_shares to fail if some of the shares aren't updated to the newest version on an update, so we don't need to wait for another test. I guess the git changelogs are a little stale; sorry for any confusion from that.

You caught the only important change with your delta. I also removed

        self.g.clients[0].DEFAULT_ENCODING_PARAMETERS['n'] = 75

from test_multiply_placed_shares. Placing a lot of shares (so each server holds a few shares) made the test yield an UCWE more reliably, but it still sometimes made it to the multiple version check due to #1641. It didn't seem worthwhile to set a magical encoding parameter if it didn't always work, and the test always failed without the fix in any case, so I took it out. It probably doesn't matter either way, but the test might be a little faster without that line.

Thanks for the review, the suggested alternatives, and for landing the fixes.

comment:25 Changed at 2011-12-29T00:00:40Z by warner

  • Resolution set to fixed
  • Status changed from assigned to closed

Ok, I applied that change too, in 7989fe21cc1465ac. So I think we can close this one now. Thanks!

comment:26 in reply to: ↑ 17 ; follow-up: Changed at 2011-12-31T04:53:18Z by zooko

Replying to kevan:

I agree with comment:14. While investigating this issue, I noticed a potential regression in the way we handle #546 situations. I haven't had time to investigate that yet, and probably won't have time to investigate until this weekend. Can we wait on 1.9.1 until I make a ticket for that issue, so we can decide if it belongs in 1.9.1?

Kevan: did you do this investigation? Release Manager Brian said "a week or two" and the Milestone is currently marked as due on 2012-01-15, so I think we have time.

comment:27 in reply to: ↑ 26 Changed at 2011-12-31T22:06:27Z by kevan

I did -- the result is ticket #1641.

comment:28 Changed at 2012-03-25T21:48:08Z by amir

Hi all,

I'm a tahoe-lafs novice, but playing with my first shares (4 storage servers, 2 clients on 2 of the storage servers, k=2, H=4, N=5) I succeeded within short time frame to fuckup my shares. I'm not yet fully sure what we did to confuse them, but in the end we had one share of seq2 and 5 shares of seq10. A deep-check --repair alias: always resulted in this:

$ tahoe deep-check --repair -v sound:28C3 '<root>': not healthy
repair successful
'28c3.Pausenmusik.mp3': healthy
done: 2 objects checked
pre-repair: 1 healthy, 1 unhealthy
1 repairs attempted, 1 successful, 0 failed
post-repair: 1 healthy, 1 unhealthy

It was always the mutable files which became unhealty. And no number of repairs could get them fixed.

jg71 sugested to use git head, as some bugs were fixed there. I did as proposed and just found out that git head fixes all the above problems without fuss. Great work guys!

Version 0, edited at 2012-03-25T21:48:08Z by amir (next)
Note: See TracTickets for help on using tickets.