#686 assigned defect

Search for lost share resulted in a directory popping up at unexpected place

Reported by: [4-tea-2] Owned by: daira
Priority: major Milestone: soon
Component: code-frontend-web Version: 1.4.1
Keywords: integrity error Cc:
Launchpad Bug:

Description (last modified by daira)

I'm currently running a private test grid which, over the last few weeks, grew to 20 nodes. As test data, I'm using my audio folder, which I backed up in a few stages using "tahoe backup .../audio media:audio". The grid is running "3-of-5" encoding, since all of the nodes are pretty reliable and under my control.

A couple of days ago, I ran "tahoe deep-check --add-lease media:" and got a summary indicating an unhealthy file. I ran a few more deep-checks until I found the affected file ("tahoe deep-check media:" did not give the file name; "tahoe deep-check -v media:" gave the filename, but at the time I didn't see it because "grep -v Healthy" also matched the "Not Healthy" message ;) - finally, running deep-check from the WUI gave me the filename and the storage index).

Local file: .../audio/untagged or incomplete/Music/AIM/Aim - Fabriclive 17 (FLAC - CUE - EAC)/Aim - Fabriclive 17.wav

Affected file in grid: media:audio/Archives/2009-04-17_23:04:36Z/untagged or incomplete/Music/AIM/Aim - Fabriclive 17 (FLAC - CUE - EAC)/Aim - Fabriclive 17.wav

Message from "tahoe deep-check -v media:": audio/Archives/2009-04-17_23:04:36Z/untagged or incomplete/Music/AIM/Aim - Fabriclive 17 (FLAC - CUE - EAC)/Aim - Fabriclive 17.wav: Not Healthy: 4 shares (enc 3-of-5)

Checking the file from the WUI gave me the list of the available shares, 1-4. Share 0 was gone.

Since I wanted to find out why the share vanished, zooko recommended searching the .flog files for the storage index. I found 35 incident reports; most of those I checked were caused by connectivity problems (e.g. the introducer not being reachable, because I opened the firewall on the introducer only after installing and starting the tahoe node), but none of the .flog files contained the storage index of the unhealthy file.

The file <storage idx>/0 wasn't physically present in any of the storage/ folders on any of the nodes (/1, /2, /3, /4 were).

Well, it seems one of my nodes lost a share without good reason - could that happen when a node is restarted while a share is being uploaded?

But here's the real weird thing:

marc@bong:~$ tahoe ls -l media:audio
drwx - Apr 13 00:02 Archives
dr-x - Apr 13 00:05 Latest
drwx - Apr 25 00:59 untagged or incomplete
marc@bong:~$ tahoe manifest media:audio/"untagged or incomplete"
URI:DIR2:...
URI:DIR2:... Music
URI:DIR2:... Music/AIM
URI:DIR2:... Music/AIM/Aim - Fabriclive 17 (FLAC - CUE - EAC)

For reasons which are a complete mystery to me, part of the directory structure of the file with the lost share appeared in the target folder of "tahoe backup .../audio media:audio".

The whole directory tree was not duplicated, only the folders leading to the affected file; the directory Music/ contains many more files and directories. Sadly, some of the filenames contain UTF-8 diacritics, triggering a "UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 7: ordinal not in range(128)" when I try to "tahoe ls" the directory. I can access the files from the WUI, though.

I have not tried to repair the unhealthy file yet, because I didn't want to spoil the chance of finding the original problem.

I can supply additional info (incident reports etc.) if needed.

Attachments (1)

json.txt (1.1 KB) - added by [4-tea-2] at 2009-04-26T14:14:15Z.


Change History (11)

Changed at 2009-04-26T14:14:15Z by [4-tea-2]

comment:1 Changed at 2009-04-26T14:39:18Z by [4-tea-2]

I tried to narrow down what I might have been doing at the moment when "audio/untagged or incomplete" was created (4/25 0:59 MEST). I'm pretty sure that neither "tahoe backup" nor "tahoe deep-check" was running at the time, but according to my IRC log and bash history, I did run the following command at some point between 0:45 and 1:10:

for i in $(cat ~/config/tahoe/tahoe-nodes | grep -v '#') ; do echo "$i: " ; ssh root@$i '(su - tahoe -c "tahoe debug find-shares hvr24tfh6iju5vimdgrzlpm7pa ~tahoe/.tahoe/")' ; done

(This found the four available shares for the file with the lost share.)

comment:2 Changed at 2009-04-26T14:48:41Z by zooko

By the way, my initial assumption when [4-tea-2] reported this missing share was just that the server which was supposed to receive that share was disconnected from the client at the moment of upload, but [4-tea-2] assures me that random TCP disconnections are quite rare on his network. Anyway, this wouldn't explain the mysterious appearance of one of his directories in his root dir!

comment:3 Changed at 2009-04-26T16:36:57Z by zooko

[4-tea-2] told me on IRC that he upgraded his nodes from 1.3 to 1.4.1 while "tahoe backup" was running, which could explain the missing share: if the upload of an individual share fails, the upload as a whole continues and the file is treated as successfully uploaded, unless so many shares fail that fewer than "shares of happiness" shares were successfully uploaded. If a storage server is stopped and restarted after the client has started uploading a share to that server but before it has finished, then that share will not be successfully uploaded.
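
A minimal sketch of that success criterion, for illustration only; the function name, the set of placed shares, and the happiness threshold of 4 are assumptions, not the actual Tahoe-LAFS upload code:

# Illustrative sketch; the names and the threshold value are assumptions.
def upload_succeeded(shares_placed, shares_of_happiness):
    # The upload continues even if individual share uploads fail (e.g. a
    # storage server restarted mid-upload); it only counts as a failure if
    # fewer than "shares of happiness" shares were placed.
    return len(shares_placed) >= shares_of_happiness

# Example matching this ticket: 3-of-5 encoding with share 0 lost.
# With a hypothetical happiness threshold of 4, the upload still reports
# success even though only shares 1-4 were stored.
assert upload_succeeded({1, 2, 3, 4}, shares_of_happiness=4)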

comment:4 Changed at 2009-04-26T17:56:44Z by [4-tea-2]

I found another command that I executed in that time frame, and it revealed the problem: "tahoe deep-check <alias>:<path>" creates folders on the fly:

tahoe deep-check <alias>:<non-existing path>/<any filename>

will create <alias>:<non-existing path>, e.g.

tahoe deep-check media:audio/test1/test2/test3

will report "ERROR 404 Not Found No such child: test3" and create media:audio/test1/test2

Bug or feature?

comment:5 Changed at 2009-04-26T20:42:27Z by warner

It's a bug. In the webapi code that handles path traversal, there is a piece that sometimes creates a missing directory (so the traversal can continue). This is meant for use by an operation like PUT /uri/DIRCAP/parent/subdir/file.txt, and the loop looks vaguely like:

here = get_dirnode_from_dircap(DIRCAP)
for name in path[:-1]:
    if here.has_child(name):
        here = here.get_child(name)
    else:
        here = here.create_empty_directory(name)
here.add_file(path[-1], DATA)

But, of course, this create-missing-directory code should only run when we're actually going to PUT or POST a file into that spot. Otherwise it should return a 404 to report the missing directory.

I think the bug is that this code is not examining the operation closely enough (i.e. not paying attention to the t= query argument), and creating directories for all POSTs, not just the specific kind that will upload a file.
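
A minimal sketch of that check, under assumptions: the child-manipulation methods follow the pseudocode above, and method and t stand for the HTTP method and the t= query argument. This is illustrative, not the actual webapi code:

class NotFoundError(Exception):
    # Stand-in for however the webapi reports a 404 to the client.
    pass

def traverse(here, path, method, t=""):
    # Only auto-create missing intermediate directories for requests that
    # will actually place a child at the final path component, e.g. a PUT
    # of a file or a POST with t=upload or t=mkdir (assumed values).
    may_create = (method == "PUT" and t == "") or \
                 (method == "POST" and t in ("upload", "mkdir"))
    for name in path[:-1]:
        if here.has_child(name):
            here = here.get_child(name)
        elif may_create:
            here = here.create_empty_directory(name)
        else:
            # Read-only operations such as deep-check get a 404 here
            # instead of silently creating the directory.
            raise NotFoundError("No such child: %s" % name)
    return here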

comment:6 Changed at 2009-07-11T11:34:14Z by warner

  • Component changed from unknown to code-frontend-web

comment:7 Changed at 2010-06-12T22:22:16Z by davidsarah

  • Keywords integrity error added
  • Milestone changed from undecided to soon

comment:8 Changed at 2013-05-14T22:18:43Z by daira

  • Description modified (diff)
  • Milestone changed from soon to 1.11.0
  • Owner changed from nobody to daira
  • Status changed from new to assigned

Ugh, this is a horrible bug that should have had more attention; how did this slip through the cracks?

comment:9 Changed at 2013-05-15T17:16:56Z by zooko

I agree that this would be a good bug to fix, and that it should go into the 1.11.0 Milestone.

I'm going to play a game. I see that there are 146 open tickets in the 1.11.0 Milestone. The game is: whenever we add a ticket into the 1.11.0 Milestone, I have to either close an open ticket or move a ticket out of that Milestone. Maybe two! Okay, here I go...

comment:10 Changed at 2013-05-15T17:20:44Z by zooko

Okay, done playing my game. I moved #1382 and #1394 out of Milestone 1.11 to make room for this ticket. ☺

Last edited at 2013-05-15T17:23:13Z by zooko