#899 closed defect (fixed)

UncoordinatedWriteError on prod grid

Reported by: zooko
Owned by: kmarkley86
Priority: major
Milestone: undecided
Component: code-mutable
Version: 1.5.0
Keywords: availability reliability upload
Cc: kmarkley86
Launchpad Bug:

Description

Kyle Markley reported this on the tahoe-dev list:

http://allmydata.org/pipermail/tahoe-dev/2010-January/003554.html

It could be related to #540, #877, or #893.

I'll ask Kyle to supply more diagnostic info on this ticket.

Attachments (2)

logs.tgz (1.2 MB) - added by kmarkley86 at 2010-01-14T06:32:41Z.
UncoordinatedWriteError log
tahoeIncident.7z (466.0 KB) - added by zooko at 2010-01-14T15:23:53Z.

Change History (12)

comment:1 Changed at 2010-01-13T18:45:13Z by zooko

  • Owner set to kmarkley86

comment:2 Changed at 2010-01-13T19:41:35Z by davidsarah

  • Keywords upload added

Changed at 2010-01-14T06:32:41Z by kmarkley86

comment:3 Changed at 2010-01-14T06:34:19Z by kmarkley86

  • Cc kmarkley86 added

allmydata-tahoe: 1.5.0, foolscap: 0.4.2, pycryptopp: 0.5.17, zfec: 1.4.5, Twisted: 8.2.0, Nevow: 0.9.33-r17222, zope.interface: 3.5.2, python: 2.6.2, platform: OpenBSD-4.6-amd64-Genuine_Intel-R-_CPU_000_@_2.93GHz-64bit-ELF, sqlite: 3.6.13, simplejson: 2.0.9, argparse: 0.9.1, pyOpenSSL: 0.9, pyutil: 1.3.34, zbase32: 1.1.1, setuptools: 0.6c12dev, pysqlite: 2.4.1

Mutable File Publish Status

  • Started: 00:04:12 13-Jan-2010
  • Storage Index: mcw73tlgpejftxf55c5bjmiczi
  • Helper?: No
  • Current Size: 470
  • Progress: 20.0%
  • Status: UncoordinatedWriteError

Retrieve Results

  • Encoding: 3 of 10
  • Sharemap:

    o 0 -> Placed on [ehnfmjtc]
    o 4 -> Placed on [5q4fx2pb]
    o 5 -> Placed on [ctchgzgn]

  • Timings:

    o Total: 1.24s (380Bps)
        + Setup: 581us
        + Encrypting: 37us (12.40MBps)
        + Encoding: 55us (8.53MBps)
        + Packing Shares: 9.0ms (52.1kBps)
            # RSA Signature: 8.0ms
        + Pushing: 1.23s (383Bps)
    o Per-Server Response Times:
        + [ctchgzgn]: 77ms
        + [ehnfmjtc]: 67ms
        + [fjsasmll]: 1.18s
        + [gi3daw4h]: 1.12s
        + [xc3w2uzy]: 1.19s
        + [5q4fx2pb]: 1.18s
        + [6m245fmk]: 103ms

comment:4 Changed at 2010-01-14T15:23:14Z by zooko

Andrej Falout couldn't attach his incident reports to this ticket because trac doesn't let you upload attachments larger than 1,000,000 bytes. I bunzip2'ed them and 7z'ed them and they came out half as big, so here they are.

Changed at 2010-01-14T15:23:53Z by zooko

comment:5 Changed at 2010-01-14T15:24:40Z by zooko

Oh, and I reconfigured trac to allow attachments of up to 10 MB.

comment:6 Changed at 2010-01-17T01:25:39Z by kmarkley86

I'm continuing to hit this UncoordinatedWriteError very frequently on the production grid. I think it happens most often when creating directories. I can provide lots of additional incident reports if that would be useful.

This has made it almost impossible for me to run a 'tahoe backup' command to the production grid; should the priority of this ticket be raised?

comment:7 Changed at 2010-01-17T04:20:15Z by zooko

allmydata.com is continuing to repair servers and configuration issues on the allmydata.com prod grid, so that might be the way that your problem gets solved. However, at the very least your Tahoe-LAFS client is reporting something with a wrong error message. It may also be buggy in some way that leads to this problem.

One thing that you could do that would help is to try the same thing with a newer version of Tahoe-LAFS. Could you try installing the latest version http://allmydata.org/source/tahoe/tarballs/?C=M;O=D , per these install instructions: http://allmydata.org/source/tahoe/trunk/docs/install.html ?

comment:8 Changed at 2010-01-17T17:23:08Z by kmarkley86

I haven't seen one of these errors since upgrading from tahoe 1.5.0 to 1.5.0-r4160. Between that and general repair of the grid, the problem has gone away for me.

comment:9 Changed at 2010-01-26T20:00:54Z by warner

I glanced through a couple of these Incidents, and all the ones I looked at were the artifact we already fixed, in which a DeadReferenceError is accidentally logged too severely (the ServerFailure that wrapped the DeadReferenceError prevented the errback code from identifying it as a DeadReferenceError). This got fixed with the overhaul of the add-lease code.
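
To illustrate the kind of masking described above, here is a minimal sketch. It is not the Tahoe-LAFS or foolscap code: both exception classes below are local stand-ins, and the `original` attribute and the two `classify_*` helpers are hypothetical names chosen for the example. It only shows how a type check on the outer wrapper can miss the wrapped error, and how unwrapping first avoids that. The actual fix in the add-lease overhaul may work differently.

```python
class DeadReferenceError(Exception):
    """Local stand-in for a 'connection to the server was lost' error."""


class ServerFailure(Exception):
    """Hypothetical wrapper that carries the original exception."""

    def __init__(self, original):
        super().__init__(repr(original))
        self.original = original  # assumed attribute name, for illustration


def classify_naive(exc):
    # Looks only at the outer type, so a wrapped DeadReferenceError
    # falls through to the "unexpected" branch and is logged loudly.
    if isinstance(exc, DeadReferenceError):
        return "expected: lost connection"
    return "unexpected: log as incident"


def classify_unwrapping(exc):
    # Peel off any wrappers before checking the type.
    while isinstance(exc, ServerFailure):
        exc = exc.original
    if isinstance(exc, DeadReferenceError):
        return "expected: lost connection"
    return "unexpected: log as incident"


wrapped = ServerFailure(DeadReferenceError("connection lost"))
print(classify_naive(wrapped))
print(classify_unwrapping(wrapped))
```

With the naive check, the wrapped error is treated as unexpected, which matches the over-severe logging seen in these Incidents; the unwrapping check recognizes it as an ordinary lost connection.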

comment:10 Changed at 2010-02-15T19:38:43Z by davidsarah

  • Resolution set to fixed
  • Status changed from new to closed