[tahoe-dev] [tahoe-lafs] #970: webapi PUT via multiple nodes can cause directory corruption but does not report UncoordinatedWriteError
tahoe-lafs
trac at allmydata.org
Tue Feb 23 02:00:16 PST 2010
#970: webapi PUT via multiple nodes can cause directory corruption but does not
report UncoordinatedWriteError
-------------------------------+--------------------------------------------
Reporter: stott | Owner: nobody
Type: defect | Status: new
Priority: minor | Milestone: undecided
Component: code-frontend-web | Version: 1.5.0
Keywords: error usability | Launchpad_bug:
-------------------------------+--------------------------------------------
Old description:
> Multiple simultaneous Tahoe put(s) via web API cause directory level
> corruption resulting in no recoverable data.
>
> To recreate
>
> Step 1.) Create Directory ; Get directory writecap.
>
> Step 2.) Using 61 .JPG files avg 1.7MB use test.sh script to put files
> to Tahoe-Lafs.
>
> -------------------
> bash-3.2$ du -sh .
> 102m
> -------------------
> bash-3.2$ ls *.JPG |wc -l
> 61
> -------------------
> bash-3.2$ cat test.sh
> #!/bin/sh
>
> # From Directory listing itself == Directory Write CAP
> FW="URI:DIR2:tuz27wvy27ua4mt5lyotllbyke:phzv6ilb5gssi3zy33nki62zcudqjzyv7v7w4qaavwn5kuh2hawa"
>
>
> X=3456
> for I in `ls *.JPG`
> do
> curl -T $I http://10.20.0.151:$X/uri/$FW/$I &
> #echo "curl -T $I http://10.20.0.151:$X/uri/$FW/$I & "
> X=`expr $X + 1`
> if [ $X -le 3500 ] ; then
> echo "Submitting $I"
> else X=3456;
> fi
> done
>
> -------------------------------------------
>
> Error returned from curl
>
> UnrecoverableFileError: the directory (or mutable file) could not be
> retrieved, because there were insufficient good shares. This might
> indicate that no servers were connected, insufficient servers were
> connected, the URI was corrupt, or that shares have been lost due to
> server departure, hard drive failure, or disk corruption. You should
> perform a filecheck on this object to learn more.
>
> ------------------------
>
> Error Generated when trying to retrieve known good URI from child
>
> http://codepad.org/mTFYmfxf
>
> -------------------------
New description:
Multiple simultaneous Tahoe PUTs via the web API cause directory-level
corruption, resulting in no recoverable data.
To recreate:
Step 1.) Create a directory; get the directory writecap.
Step 2.) Using 61 .JPG files (avg 1.7MB), use the test.sh script to put the
files to Tahoe-LAFS.
-------------------
bash-3.2$ du -sh .
102m
-------------------
bash-3.2$ ls *.JPG |wc -l
61
-------------------
{{{
bash-3.2$ cat test.sh
#!/bin/sh
# From Directory listing itself == Directory Write CAP
FW="URI:DIR2:tuz27wvy27ua4mt5lyotllbyke:phzv6ilb5gssi3zy33nki62zcudqjzyv7v7w4qaavwn5kuh2hawa"
X=3456
for I in `ls *.JPG`
do
curl -T $I http://10.20.0.151:$X/uri/$FW/$I &
#echo "curl -T $I http://10.20.0.151:$X/uri/$FW/$I & "
X=`expr $X + 1`
if [ $X -le 3500 ] ; then
echo "Submitting $I"
else X=3456;
fi
done
}}}
-------------------------------------------
Error returned from curl
{{{
UnrecoverableFileError: the directory (or mutable file) could not be
retrieved, because there were insufficient good shares. This might
indicate that no servers were connected, insufficient servers were
connected, the URI was corrupt, or that shares have been lost due to
server departure, hard drive failure, or disk corruption. You should
perform a filecheck on this object to learn more.
}}}
------------------------
Error Generated when trying to retrieve known good URI from child
http://codepad.org/mTFYmfxf
-------------------------
--
Comment(by warner):
Wow, it sounds like you were actually able to provoke a real UCWE
(UncoordinatedWriteError)! Well, a real collision, at least.
So, did none of the PUT commands really result in an error? I would have
expected at least one of them to emit a UCWE. Re-running the test and
sending the output of each curl instance to a separate logfile would help
answer this question. It would also be worth double-checking that curl
emits errors to stdout when it gets a 500, or whatever HTTP error code
UCWE maps to.
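A minimal sketch of such a re-run, assuming the same host and port range
as test.sh. The names HOST, LOGDIR, next_port and put_one are made up for
illustration. By default curl writes the response body to stdout and exits
0 even on an HTTP error response, so the sketch also records the status
code with -w '%{http_code}' instead of relying on the exit code.

```sh
#!/bin/sh
# Sketch only: same PUT fan-out as test.sh, one set of logfiles per curl.
# HOST, LOGDIR and the helper names are assumptions, not part of Tahoe.
HOST="${HOST:-10.20.0.151}"
LOGDIR="${LOGDIR:-./put-logs}"
FIRST_PORT=3456
LAST_PORT=3500

# Cycle through the node ports the same way test.sh does (3456..3500).
next_port() {
    if [ "$1" -ge "$LAST_PORT" ]; then
        echo "$FIRST_PORT"
    else
        echo $(( $1 + 1 ))
    fi
}

# PUT one file; keep the response body, the HTTP status and curl's own
# error messages in three separate per-file logs.
put_one() {
    file=$1
    port=$2
    curl -sS -T "$file" \
         -o "$LOGDIR/$file.body" \
         -w '%{http_code}\n' \
         "http://$HOST:$port/uri/$FW/$file" \
         > "$LOGDIR/$file.status" 2> "$LOGDIR/$file.stderr"
}

# Only upload when a directory writecap is supplied in FW.
if [ -n "${FW:-}" ]; then
    mkdir -p "$LOGDIR"
    port=$FIRST_PORT
    for f in *.JPG; do
        [ -e "$f" ] || continue
        put_one "$f" "$port" &
        port=$(next_port "$port")
    done
    wait
    # Report every upload whose HTTP status was not 2xx.
    for s in "$LOGDIR"/*.status; do
        [ -e "$s" ] || continue
        case $(cat "$s") in
            2*) ;;
            *)  echo "non-2xx response: $s" ;;
        esac
    done
fi
```

Run it as FW="URI:DIR2:..." sh ./test-logged.sh (the script name is
arbitrary). The .body file of any non-2xx upload should then contain
whatever error text the node returned.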
If you could, please do a file-check (with --verify) on the directory in
question. With the dircap you show, the command would be
"{{{tahoe check --verify --raw $FW}}}". I'm expecting to see a small
number of shares of each version, for several different versions.
The file-check output will tell us, but what were the encoding parameters
in use when you ran this test? I know from another ticket that you were
experimenting with parameters on the order of 40-of-50. If the dirnodes
(and other mutable files) were created with these same parameters, they'd
be much more vulnerable to UCWE than with the normal 3-of-10 encoding. If
that was a factor here, we might want to consider separate
encoding-parameter configs for dirnodes (or perhaps for all mutable
files), so that you can use the safer 3-of-10 for them and the more
efficient 40-of-50 for immutable bulk data. (Note that protection from
UCWE comes from a small "k", whereas the usual reliability against server
problems comes from having a large N-k.)
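To make the trade-off between the two encodings concrete, here is a small
arithmetic sketch. describe_encoding is a hypothetical helper, not a Tahoe
command. It prints, for a k-of-N encoding, how many shares can be lost
(N-k) and the storage expansion (N/k, shown as an integer percentage).

```sh
#!/bin/sh
# Sketch only: compare k-of-N encodings by spare shares and expansion.
# describe_encoding is a made-up helper for illustration.
describe_encoding() {
    k=$1
    n=$2
    spare=$(( n - k ))                 # shares that can be lost
    expansion_pct=$(( n * 100 / k ))   # stored bytes per 100 bytes of data
    echo "$k-of-$n: needs $k shares, tolerates $spare lost, expansion ${expansion_pct}%"
}

describe_encoding 3 10
# 3-of-10: needs 3 shares, tolerates 7 lost, expansion 333%
describe_encoding 40 50
# 40-of-50: needs 40 shares, tolerates 10 lost, expansion 125%
```

So 40-of-50 stores far less redundant data than 3-of-10, but it needs 40
shares of one version to be retrievable, compared with 3.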
(Also, incidentally, when pasting large shell transcripts into a Trac page
like this one, you should wrap the block with triple-curlies, so that Trac
will not try to interpret the comments as WikiFormatting. And please
attach other things as Trac attachments instead of e.g. codepad links,
because a few months from now, when somebody comes back to look at this
ticket, the pastebin will have expired and its contents will be lost.)
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/970#comment:2>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid