[tahoe-dev] [tahoe-lafs] #970: webapi PUT via multiple nodes can cause directory corruption but does not report UncoordinatedWriteError

tahoe-lafs trac at allmydata.org
Tue Feb 23 02:00:16 PST 2010


#970: webapi PUT via multiple nodes can cause directory corruption but does not
report UncoordinatedWriteError
-------------------------------+--------------------------------------------
 Reporter:  stott              |           Owner:  nobody   
     Type:  defect             |          Status:  new      
 Priority:  minor              |       Milestone:  undecided
Component:  code-frontend-web  |         Version:  1.5.0    
 Keywords:  error usability    |   Launchpad_bug:           
-------------------------------+--------------------------------------------

New description:

 Multiple simultaneous Tahoe PUTs via the web API cause directory-level
 corruption, resulting in no recoverable data.

 To recreate:

 Step 1.) Create a directory; get the directory writecap.

 Step 2.) Using 61 .JPG files (avg. 1.7 MB each), use the test.sh script
 below to PUT the files to Tahoe-LAFS.

 {{{
 bash-3.2$ du -sh .
 102m
 bash-3.2$ ls *.JPG | wc -l
 61
 }}}

 {{{
 bash-3.2$ cat test.sh
 #!/bin/sh

 # From the directory listing itself == directory write cap
 FW="URI:DIR2:tuz27wvy27ua4mt5lyotllbyke:phzv6ilb5gssi3zy33nki62zcudqjzyv7v7w4qaavwn5kuh2hawa"

 # Spread the uploads across webapi ports 3456..3500 (a different node per
 # port) and launch the curl PUTs in parallel.
 X=3456
 for I in `ls *.JPG`
 do
     curl -T $I http://10.20.0.151:$X/uri/$FW/$I &
     #echo "curl -T $I http://10.20.0.151:$X/uri/$FW/$I & "
     X=`expr $X + 1`
     if [ $X -le 3500 ] ; then
         echo "Submitting $I"
     else
         X=3456
     fi
 done
 }}}
 -------------------------------------------
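 Each pass through the loop above is an independent webapi PUT of the form
 shown below (a single-request illustration using the same writecap and
 gateway address as the script; the filename IMG_0001.JPG is only a
 placeholder). Because successive requests go to different webapi ports,
 the same directory is modified through several nodes with no write
 coordination.
 {{{
 # One request in isolation: upload IMG_0001.JPG (placeholder name) as a
 # child of the directory named by $FW, via the webapi node on port 3456.
 FW="URI:DIR2:tuz27wvy27ua4mt5lyotllbyke:phzv6ilb5gssi3zy33nki62zcudqjzyv7v7w4qaavwn5kuh2hawa"
 curl -T IMG_0001.JPG "http://10.20.0.151:3456/uri/$FW/IMG_0001.JPG"
 }}}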

 Error returned from curl
 {{{
 UnrecoverableFileError: the directory (or mutable file) could not be
 retrieved, because there were insufficient good shares. This might
 indicate that no servers were connected, insufficient servers were
 connected, the URI was corrupt, or that shares have been lost due to
 server departure, hard drive failure, or disk corruption. You should
 perform a filecheck on this object to learn more.
 }}}

 ------------------------

 Error Generated when trying to retrieve known good URI from child

 http://codepad.org/mTFYmfxf

 -------------------------

--

Comment(by warner):

 Wow, it sounds like you were actually able to provoke a real UCWE
 (UncoordinatedWriteError)! Well, a real collision, at least.

 So, did none of the PUT commands really result in an error? I would have
 expected at least one of them to emit a UCWE. Re-running the test and
 sending the output of each curl instance to a separate logfile would help
 answer this question. It would also be worth double-checking that curl
 reports errors on stdout when it gets a 500 (or whatever HTTP error code
 UCWE maps to).
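 For example, a minimal variation of test.sh along those lines (same
 writecap, gateway address, and port cycling as the original; the per-file
 "put-$I.log" names are just an illustration):
 {{{
 #!/bin/sh
 # Same uncoordinated-PUT loop, but each curl records its HTTP status and
 # response body in its own logfile so per-request errors are not lost.
 FW="URI:DIR2:tuz27wvy27ua4mt5lyotllbyke:phzv6ilb5gssi3zy33nki62zcudqjzyv7v7w4qaavwn5kuh2hawa"
 X=3456
 for I in `ls *.JPG`
 do
     curl -sS -w "HTTP status: %{http_code}\n" -T $I \
         http://10.20.0.151:$X/uri/$FW/$I > put-$I.log 2>&1 &
     X=`expr $X + 1`
     if [ $X -gt 3500 ] ; then
         X=3456
     fi
 done
 wait
 }}}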

 If you could, please do a file-check (with --verify) on the directory in
 question. With the dircap you show, the command would be "{{{tahoe check
 --verify --raw $FW}}}". I'm expecting to see a small number of shares of
 each version, for several different versions.
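 As a concrete invocation (the check command is the one quoted above;
 saving the raw JSON output to a file is just a convenience):
 {{{
 # Verify every share of the directory and keep the raw JSON report, which
 # should show how many shares exist for each version that was written.
 FW="URI:DIR2:tuz27wvy27ua4mt5lyotllbyke:phzv6ilb5gssi3zy33nki62zcudqjzyv7v7w4qaavwn5kuh2hawa"
 tahoe check --verify --raw "$FW" > dircheck.json
 }}}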

 The file-check output will tell us, but what encoding parameters were in
 use when you ran this test? I know from another ticket that you were
 experimenting with parameters on the order of 40-of-50. If the dirnodes
 (and other mutable files) were created with those same parameters, they
 would be much more vulnerable to UCWE than with the normal 3-of-10
 encoding. If that was a factor here, we might want to consider separate
 encoding-parameter configs for dirnodes (or perhaps for all mutable
 files), so that you can use the safer 3-of-10 for them and the more
 efficient 40-of-50 for immutable bulk data. (Note that protection from
 UCWE comes from a small "k", whereas the usual reliability against server
 problems comes from having a large N-k.)
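 For reference, a sketch of where the k-of-N encoding parameters live,
 assuming the usual [client] knobs in the node's tahoe.cfg (the values
 shown are just the normal 3-of-10 defaults, used as an example):
 {{{
 # tahoe.cfg on the uploading node -- illustrative values only
 [client]
 # k: shares needed to recover a file (a smaller k resists UCWE better)
 shares.needed = 3
 # N: total shares written (a larger N-k resists server loss better)
 shares.total = 10
 }}}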

 (Also, incidentally: when pasting large shell transcripts into a Trac page
 like this one, please wrap the block in triple curly braces, so that Trac
 will not try to interpret the contents as WikiFormatting. And please
 attach other material as Trac attachments instead of, e.g., codepad links;
 a few months from now, when somebody comes back to look at this ticket,
 the pastebin will have expired and its contents will be lost.)

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/970#comment:2>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid


More information about the tahoe-dev mailing list