#748 closed defect (fixed)

Exception: <class 'allmydata.mutable.common.UncoordinatedWriteError'> when trying to create directory on testgrid

Reported by: toby.murray Owned by: nobody
Priority: major Milestone: 1.5.0
Component: code-mutable Version: 1.4.1
Keywords: Cc:
Launchpad Bug:

Description

In a random attempt to see what the new CSS wui looks like, I connected to testgrid.allmydata.com:3567 just now.

I clicked the "Create a directory" button and was presented with the exception listed above. I'm afraid I see no further information to provide but the error is reproducable. The actual HTML source returned is the following:

<html><head><title>Exception</title></head><body><style type="text/css">
p.error {
  color: black;
  font-family: Verdana, Arial, helvetica, sans-serif;
  font-weight: bold;
  font-size: large;
  margin: 0.25em;
}

div {
  font-family: Verdana, Arial, helvetica, sans-serif;
}

strong.variableClass {
  font-size: small;
}

div.stackTrace {
}

div.frame {
  padding: 0.25em;
  background: white;
  border-bottom: thin black dotted;
}

div.firstFrame {
  padding: 0.25em;
  background: white;
  border-top: thin black dotted;
  border-bottom: thin black dotted;
}

div.location {
    font-size: small;
}

div.snippet {
  background: #FFFFDD;
  padding: 0.25em;
}

div.snippetHighlightLine {
  color: red;
}

span.lineno {
    font-size: small;
}

pre.code {
  margin: 0px;
  padding: 0px;
  display: inline;
  font-size: small;
  font-family: "Courier New", courier, monotype;
}

span.function {
  font-weight: bold;
  font-family: "Courier New", courier, monotype;
}

table.variables {
  border-collapse: collapse;
  width: 100%;
}

td.varName {
  width: 1in;
  vertical-align: top;
  font-style: italic;
  font-size: small;
  padding-right: 0.25em;
}

td.varValue {
  padding-left: 0.25em;
  padding-right: 0.25em;
  font-size: small;
}

div.variables {
  margin-top: 0.5em;
}

div.dict {
  background: #cccc99;
  padding: 2px;
  float: left;
}

td.dictKey {
  background: #ffff99;
  font-weight: bold;
}

td.dictValue {
  background: #ffff99;
}

div.list {
  background: #7777cc;
  padding: 2px;
  float: left;
}

div.listItem {
  background: #9999ff;
}

div.instance {
  width: 100%;
  background: #efefef;
  padding: 2px;
  float: left;
}

span.instanceName {
  font-size: small;
  display: block;
}

span.instanceRepr {
  font-family: "Courier New", courier, monotype;
}

div.function {
  background: orange;
  font-weight: bold;
  float: left;
}
</style><a href="#tracebackEnd"><p class="error">&lt;class 'allmydata.mutable.common.UncoordinatedWriteError'&gt;: </p></a><div class="stackTrace"></div><a name="tracebackEnd"><p class="error">&lt;class 'allmydata.mutable.common.UncoordinatedWriteError'&gt;: </p></a></body></html>

Change History (16)

comment:1 Changed at 2009-07-04T21:54:20Z by zooko

Please look in your ~/.tahoe/logs/incidents directory and attach any incident report files that occurred around this time.

comment:2 Changed at 2009-07-04T22:03:58Z by toby.murray

I don't understand. This is not on a Tahoe instance running on my machine. It is on the testgrid at testgrid.allmydata.com. Could you visit http://testgrid.allmydata.com:3567 and see if you can reproduce this?

comment:3 Changed at 2009-07-04T22:48:40Z by zooko

Yes, I can. Thanks for the bug report!

comment:4 Changed at 2009-07-11T11:34:58Z by warner

  • Component changed from unknown to code-mutable

comment:5 Changed at 2009-07-14T02:01:06Z by zooko

  • Milestone changed from undecided to 1.5.0

Hm, I guess we should look into this.

comment:6 Changed at 2009-07-14T02:02:13Z by zooko

Soultcer just reported the same thing on IRC.

comment:7 Changed at 2009-07-14T04:36:25Z by zooko

Okay, so this could be caused by one or more of the following known issues: #651 (handle MemoryError by failing quickly and loudly) (I added a note about this issue on that issue just now, #548 (mutable publish sends queries to servers that have already been asked), #547 (mapupdate(MODE_WRITE) triggers on a false boundary), #546 (mutable-file surprise shares raise inappropriate UCWE), #540 (inappropriate "uncoordinated write error" after handling a server failure). I will investigate more in the evening after work and next weekend.

comment:8 Changed at 2009-07-14T14:11:09Z by zooko

Attached is the foolscap logtool incident report. The critical excerpt of that report looks like this:

# 05:32:47.692 [22866716]: entering loop
# 05:32:47.693 [22866717]: current goal: before update: , sh0 to [trjdor3o], sh1 to [xiktf6ok], sh2 to [jfdpabh3], sh3 to [z2unqjbp], sh4 to [qvqv7jmm], sh5 to [fcmlx6em], sh6 to [6fyx5u4z], sh7 to [ivjakubr], sh8 to [wfninubk], sh9 to [lwkv6cji]
# 05:32:47.693 [22866718]: we are planning to push new seqnum=#1
# 05:32:47.693 [22866719]: 10 queries still outstanding
# 05:32:47.697 [22866720]: 10 shares sent
# 05:32:47.744 [22866724]: _got_write_answer from qvqv7jmm
# 05:32:47.744 [22866725]: entering loop
# 05:32:47.745 [22866726]: current goal: before update: , sh0 to [trjdor3o], sh1 to [xiktf6ok], sh2 to [jfdpabh3], sh3 to [z2unqjbp], sh4 to [qvqv7jmm], sh5 to [fcmlx6em], sh6 to [6fyx5u4z], sh7 to [ivjakubr], sh8 to [wfninubk], sh9 to [lwkv6cji]
# 05:32:47.745 [22866727]: we are planning to push new seqnum=#1
# 05:32:47.745 [22866728]: 9 queries still outstanding
# 05:32:47.746 [22866729]: _got_write_answer from 6fyx5u4z
# 05:32:47.746 [22866730]: entering loop
# 05:32:47.747 [22866731]: current goal: before update: , sh0 to [trjdor3o], sh1 to [xiktf6ok], sh2 to [jfdpabh3], sh3 to [z2unqjbp], sh4 to [qvqv7jmm], sh5 to [fcmlx6em], sh6 to [6fyx5u4z], sh7 to [ivjakubr], sh8 to [wfninubk], sh9 to [lwkv6cji]
# 05:32:47.747 [22866732]: we are planning to push new seqnum=#1
# 05:32:47.747 [22866733]: 8 queries still outstanding
# 05:32:47.748 [22866734]: _got_write_answer from trjdor3o
# 05:32:47.748 [22866735]: entering loop
# 05:32:47.749 [22866736]: current goal: before update: , sh0 to [trjdor3o], sh1 to [xiktf6ok], sh2 to [jfdpabh3], sh3 to [z2unqjbp], sh4 to [qvqv7jmm], sh5 to [fcmlx6em], sh6 to [6fyx5u4z], sh7 to [ivjakubr], sh8 to [wfninubk], sh9 to [lwkv6cji]
# 05:32:47.749 [22866737]: we are planning to push new seqnum=#1
# 05:32:47.749 [22866738]: 7 queries still outstanding
# 05:32:47.750 [22866739]: _got_write_answer from fcmlx6em
# 05:32:47.751 [22866740]: entering loop
# 05:32:47.751 [22866741]: current goal: before update: , sh0 to [trjdor3o], sh1 to [xiktf6ok], sh2 to [jfdpabh3], sh3 to [z2unqjbp], sh4 to [qvqv7jmm], sh5 to [fcmlx6em], sh6 to [6fyx5u4z], sh7 to [ivjakubr], sh8 to [wfninubk], sh9 to [lwkv6cji]
# 05:32:47.751 [22866742]: we are planning to push new seqnum=#1
# 05:32:47.752 [22866743]: 6 queries still outstanding
# 05:32:47.753 [22866744]: _got_write_answer from fcmlx6em

    * 05:32:47.753 [22866745]: WEIRD our testv failed, so the write did not happen [INCIDENT-TRIGGER]
    * 05:32:51.776 [22866746]: somebody modified the share on us: shnum=5: I thought they had #1:R=7ahx, but testv reported #1:R=7ahx

Hm, so far I don't understand how the client was surprised by this chain of events.

But, now it is time to go to work so I'll wonder about it later.

comment:9 Changed at 2009-07-15T07:09:37Z by warner

since zooko found a MemoryError on one of the testgrid servers, this is most likely another instance of #540.

comment:10 Changed at 2009-07-15T13:38:40Z by zooko

I upgraded and restarted the tahoebs5 storage servers, but I still get an UCWE from testgrid.allmydata.org:3456.

Attached are two incident reports:

comment:11 Changed at 2009-07-16T03:38:06Z by zooko

Terrell Russell pointed out that http://testgrid.allmydata.org:3567 currently says:

Connected to 350  of 61 known storage servers

Perhaps this issue (ticket #748) is caused by #653.

comment:12 Changed at 2009-07-16T16:25:29Z by zooko

Grr. My blog broke. My blog is hosted on the Test Grid. That was really the last straw! I don't mind if everyone else who uses the Test Grid is inconvenienced by it being impossible to create directories, but I'm not going to leave my blog broken! So I upgraded the web gateway from:

allmydata-tahoe: 1.4.1-r3897, foolscap: 0.4.1, pycryptopp: 0.5.2-1, zfec: 1.4.0-4, Twisted: 2.5.0, Nevow: 0.9.18, zope.interface: 3.3.1, python: 2.5.2, platform: Linux-Ubuntu_8.04-i686-32bit, simplejson: 1.7.3, argparse: 0.8.0, pyOpenSSL: 0.6, pyutil: 1.3.16-12, setuptools: 0.6c12dev
allmydata-tahoe: 1.4.1-r3995, foolscap: 0.4.1, pycryptopp: 0.5.15, zfec: 1.4.0-4, Twisted: 2.5.0, Nevow: 0.9.18, zope.interface: 3.3.1, python: 2.5.2, platform: Linux-Ubuntu_8.04-i686-32bit, sqlite: 3.4.2, simplejson: 1.7.3, argparse: 0.8.0, pyOpenSSL: 0.6, pyutil: 1.3.16-12, setuptools: 0.6c12dev, pysqlite: 2.3.2

Now instead of Connected to 350 of 61 known storage servers it says Connected to 14 of 24 known storage servers and my blog works nicely. Also, people can create directories again.

This doesn't mean that this issue is resolved for v1.5 release though. We still need to investigate.

comment:13 Changed at 2009-07-18T16:51:48Z by zooko

  • Resolution set to fixed
  • Status changed from new to closed

Okay, we investigated (details posted on #653), and I'm pretty that this was caused by #653, and I'm pretty sure that #653 has been fixed. Closing this as "fixed".

Note: See TracTickets for help on using tickets.