[volunteergrid2-l] some error

Fri Dec 16 20:21:32 UTC 2011

On 16.12.2011 20:43, Iantcho Vassilev wrote:
> Hi guys,
>
>
> I get strange error:
>
> [ianchov@localhost ~]$ ./allmydata-tahoe-1.9.0/bin/tahoe deep-check
> --repair --add-lease -v ianchov2:cveti
> '<root>': not healthy
>   repair successful
> ERROR: UncoordinatedWriteError()
> "[Failure instance: Traceback (failure with no frames): <class
> 'allmydata.mutable.common.UncoordinatedWriteError'>: "
>
>
Hi Iantcho,

No ideas, unfortunately, just a "me too".

I have also seen this happening with mutable directories. I haven't 
found the reason for it yet, so I'm just guessing.

The name (UncoordinatedWriteError) seems to indicate an inconsistency at
the storage layer. The most prominent example would be two programs
writing to a directory simultaneously (uploading conflicting information
about the directory's contents at the same time). However, in every case
where I have encountered this, it was definitely NOT caused by concurrent
writes, because only one program was accessing tahoe. So, problem 1:
I don't have a clue as to WHY this is happening either.

From the logs (.tahoe/logs/incidents)* it seems that the actual problem
is some "surprise share" somewhere: tahoe-lafs encounters shares for the
file/directory in question which it did not expect -- and then fails. In
theory, this seems to support the "concurrent write" hypothesis -- but
again, on every occasion where I encountered it there was no concurrency.
The only thing I can think of is some kind of intermittent network
problem which causes uploads to fail temporarily, so that subsequent
retries get in the way of normal operations.

However, the biggest problem, and what makes this really *nasty*, is that
there is no real solution to it. "tahoe check" won't work, and "tahoe
deep-check" won't work either. The "--repair" option does not help
either, because the error apparently occurs even before a repair can be
attempted. Essentially, a directory affected by this has become
permanently useless and cannot be repaired.

The only workaround that I have found so far is to abandon the directory
completely: "copy" (actually link) its contents to a new directory, then
throw the original away. Below is pseudocode without tahoe, but it
translates to tahoe commands in a straightforward way:

mkdir ianchov2:cveti2
for i in `ls ianchov2:cveti`; do ln ianchov2:cveti/$i ianchov2:cveti2/; done
unlink ianchov2:cveti
mv ianchov2:cveti2 ianchov2:cveti
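
Translated to the tahoe CLI, the steps above might look like the following
untested sketch. It assumes that "tahoe ls" still works on the damaged
directory, that entry names contain no newlines, and that the new name
(the old name with "2" appended, as in the pseudocode) is not taken yet.
The function name relink_dir is made up for illustration.

```shell
# Hypothetical helper: rebuild a directory by linking its entries into a
# fresh one. Usage: relink_dir ALIAS:DIR   (e.g. relink_dir ianchov2:cveti)
relink_dir() {
    old="$1"
    new="${old}2"             # assumed to be unused, as in the pseudocode

    tahoe mkdir "$new" || return 1

    # Fetch the entry names first so a failing "tahoe ls" is noticed.
    names=$(tahoe ls "$old") || return 1

    # Link every entry of the old directory into the new one.
    # Reading line by line keeps names with spaces intact.
    while IFS= read -r name; do
        [ -n "$name" ] || continue
        tahoe ln "$old/$name" "$new/$name" || return 1
    done <<EOF
$names
EOF

    # Drop the old directory and move the new one into its place.
    tahoe unlink "$old" && tahoe mv "$new" "$old"
}
```

Calling "relink_dir ianchov2:cveti" would then run the four steps in
order and stop at the first command that fails, so the original directory
is only unlinked after every entry has been linked.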

This is still a far-from-optimal solution, but it is the only solution 
that I know at the moment (there is remarkably little, actually close to 
nothing, to be found on the internet). It still takes quite some time to 
link the directory entries (about 10 secs/entry the last time I had to 
use it), but at least you don't need to re-upload everything.

HTH
Chris

PS:
(*) It took me a while to find this out, so maybe it is helpful: the
incident files can be read using flogtool (Ubuntu: apt-get install
python-foolscap) like so:

flogtool dump incident-2011-12-09--15-30-40Z-p3dg7ba.flog.bz2 | less
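
To avoid typing the incident file name, a small wrapper along these lines
could pick the newest one. This is only a sketch: the function name
dump_latest_incident is made up, and the node directory is assumed to be
~/.tahoe unless another one is passed as the first argument.

```shell
# Hypothetical helper: dump the newest incident file of a tahoe node.
# Usage: dump_latest_incident [NODEDIR]   (NODEDIR defaults to ~/.tahoe)
dump_latest_incident() {
    dir="${1:-$HOME/.tahoe}/logs/incidents"

    # ls -t sorts by modification time, newest first; this relies on
    # incident file names containing no whitespace.
    latest=$(ls -t "$dir"/incident-*.flog.bz2 2>/dev/null | head -n 1)

    if [ -z "$latest" ]; then
        echo "no incident files found in $dir" >&2
        return 1
    fi

    flogtool dump "$latest"
}
```

The output can be piped as before, for example
"dump_latest_incident | less", or through "grep -i surprise" to look for
the "surprise share" messages mentioned above.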
