[volunteergrid2-l] some error
Christoph Langguth
christoph at rosenkeller.org
Fri Dec 16 20:21:32 UTC 2011
On 16.12.2011 20:43, Iantcho Vassilev wrote:
> Hi guys,
>
>
> I get a strange error:
>
> [ianchov at localhost ~]$ ./allmydata-tahoe-1.9.0/bin/tahoe deep-check
> --repair --add-lease -v ianchov2:cveti
> '<root>': not healthy
> repair successful
> ERROR: UncoordinatedWriteError()
> "[Failure instance: Traceback (failure with no frames): <class
> 'allmydata.mutable.common.UncoordinatedWriteError'>: "
>
>
Hi Iantcho,
no ideas, unfortunately, just a "me too".
I have also seen this happen with mutable directories. I haven't
found the reason for it yet, so I'm just guessing.
The name (UncoordinatedWriteError) seems to indicate an inconsistency at
the storage layer. The most prominent example would be two programs
writing to a directory simultaneously (uploading conflicting information
about the directory's contents at the same time). However, in all cases
where I have encountered this, it was definitely NOT caused by concurrent
writes, because there was only one program accessing tahoe. So, problem
1: I don't have a clue as to WHY this is happening either.
From the logs (.tahoe/logs/incidents)* it seems that the actual problem
is a "surprise share" somewhere: tahoe-lafs encounters shares for the
file/directory in question which it did not expect, and then fails. In
theory, this seems to support the "concurrent write" hypothesis, but
again, on all occasions where I encountered it there was no concurrency.
The only explanation I can think of is some kind of intermittent network
problem that causes uploads to fail temporarily, with the subsequent
retries getting in the way of normal operations.
However, the biggest problem, and what makes this really *nasty*, is that
there is no real solution to it. "tahoe check" won't work, and "tahoe
deep-check" won't work either. The "--repair" option does not help
either, because the error evidently occurs before a repair can even be
attempted. Essentially, a directory affected by this has become useless
forever, and cannot be repaired.
The only workaround that I have found so far is to abandon the directory
completely: "copy" (actually link) its contents to a new directory, and
then throw the original away. Here it is as pseudocode, written as plain
shell commands rather than tahoe ones, but it translates to tahoe
commands in a straightforward way:
    mkdir ianchov2:cveti2
    for i in $(ls ianchov2:cveti); do
        ln "ianchov2:cveti/$i" ianchov2:cveti2/
    done
    unlink ianchov2:cveti
    mv ianchov2:cveti2 ianchov2:cveti
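For what it's worth, the pseudocode above might map onto the tahoe CLI
roughly as follows. This is only a sketch and has not been run against a
real grid: it assumes that "tahoe ls" prints one child name per line and
that the mkdir, ls, ln, unlink and mv subcommands behave like their shell
counterparts. The names relink_dir and TAHOE and the ".new" suffix are
made up for illustration.

```shell
#!/bin/sh
# Sketch only: rebuild a broken mutable directory by linking its children
# into a fresh directory. TAHOE, relink_dir and ".new" are hypothetical names.
TAHOE="${TAHOE:-tahoe}"

relink_dir() {
    old="$1"            # e.g. ianchov2:cveti
    new="${old}.new"    # temporary name for the replacement directory

    "$TAHOE" mkdir "$new" || return 1

    # Assumes one entry name per line; names containing newlines are not handled.
    "$TAHOE" ls "$old" | while IFS= read -r entry; do
        "$TAHOE" ln "$old/$entry" "$new/$entry" || exit 1
    done || return 1

    # Only drop the old directory once every entry has been linked.
    "$TAHOE" unlink "$old" && "$TAHOE" mv "$new" "$old"
}
```

It would be called as "relink_dir ianchov2:cveti". Stopping at the first
failed link means the original directory is left alone unless all entries
made it into the new one.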
This is still far from optimal, but it is the only solution that I know
of at the moment (there is remarkably little, in fact close to nothing,
to be found on the internet). Linking the directory entries still takes
quite some time (about 10 secs/entry the last time I had to use it), but
at least you don't need to re-upload everything.
HTH
Chris
PS:
(*) It took me a while to find this out, so maybe it is helpful: the
incident files can be read using flogtool (Ubuntu: apt-get install
python-foolscap) like so:
    flogtool dump incident-2011-12-09--15-30-40Z-p3dg7ba.flog.bz2 | less
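To look through all incident files in one go, something like the
following might help. It is a sketch under assumptions: the incidents are
taken to live in ~/.tahoe/logs/incidents, and the search term "surprise"
is only a guess at what the dumped text contains. The names
scan_incidents and FLOGTOOL are made up for illustration.

```shell
#!/bin/sh
# Sketch: dump every incident file and keep only lines matching a pattern.
# scan_incidents, FLOGTOOL and the default pattern are hypothetical.
FLOGTOOL="${FLOGTOOL:-flogtool}"

scan_incidents() {
    dir="${1:-$HOME/.tahoe/logs/incidents}"
    pattern="${2:-surprise}"
    for f in "$dir"/incident-*.flog.bz2; do
        [ -e "$f" ] || continue    # glob did not match: no incident files
        echo "== $f"
        "$FLOGTOOL" dump "$f" | grep -i -- "$pattern"
    done
}
```

Called without arguments it uses the defaults; a different directory and
pattern can be passed as the first and second argument.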