Opened at 2009-04-26T13:36:11Z
Last modified at 2013-05-15T17:23:13Z
#686 assigned defect
Search for lost share resulted in a directory popping up at unexpected place
Reported by: | [4-tea-2] | Owned by: | daira |
---|---|---|---|
Priority: | major | Milestone: | soon |
Component: | code-frontend-web | Version: | 1.4.1 |
Keywords: | integrity error | Cc: | |
Launchpad Bug: |
Description (last modified by daira)
I'm currently running a private test grid which, over the last few weeks, grew to 20 nodes. As test data, I'm using my audio folder, I backed it up in a few stages using "tahoe backup .../audio media:audio". The grid is running "3-of-5", since all of the nodes are pretty reliable and under my control.
A couple of days, I ran a "tahoe deep-check --add-lease media:" and got a summary indicating an unhealthy file. I ran a few more deep-checks until I found the affected file ("tahoe deep-check media:" did not give the file name, "tahoe deep-check -v media:" gave the filename but at that time I didn't see it because "grep -v Healthy" also matched the "Not Healthy" message ;) - finally running deep-check from the WUI gave me the filename and the storage index).
Local file: .../audio/untagged or incomplete/Music/AIM/Aim - Fabriclive 17 (FLAC - CUE - EAC)/Aim - Fabriclive 17.wav
Affected file in grid: media:audio/Archives/2009-04-17_23:04:36Z/untagged or incomplete/Music/AIM/Aim - Fabriclive 17 (FLAC - CUE - EAC)/Aim - Fabriclive 17.wav
Message from "tahoe deep-check -v media:": audio/Archives/2009-04-17_23:04:36Z/untagged or incomplete/Music/AIM/Aim - Fabriclive 17 (FLAC - CUE - EAC)/Aim - Fabriclive 17.wav: Not Healthy: 4 shares (enc 3-of-5)
Checking the file from the WUI gave me the list of the available shares, 1-4. Share 0 was gone.
Since I wanted to find out why the share vanished, zooko recommended to search .flog files for the storage index. I found 35 incident reports, most of those I checked were caused by connectivity problems (e.g. introducer not reachable, because I opened the firewall on the introducer only after installing and starting the tahoe node), none of the .flog files contained the storage index of the unhealthy file.
The file <storage idx>/0 wasn't physically present in any of the storage/ folders on any of the nodes (/1, /2, /3, /4 were).
Well, it seems one of my nodes lost a share without good reason - could that happen when a node is restarted while a share is uploading?
But here's the real weird thing:
marc@bong:~$ tahoe ls -l media:audio drwx - Apr 13 00:02 Archives dr-x - Apr 13 00:05 Latest drwx - Apr 25 00:59 untagged or incomplete marc@bong:~$ tahoe manifest media:audio/"untagged or incomplete" URI:DIR2:... URI:DIR2:... Music URI:DIR2:... Music/AIM URI:DIR2:... Music/AIM/Aim - Fabriclive 17 (FLAC - CUE - EAC)
For reasons which are a complete mystery to me, part of the directory structure of the file with the lost share appeared in the target folder of "tahoe backup .../audio media:audio".
Not the whole directory tree was duplicated, only the folders leading to the affected file. The directory Music/ contains many more files and directories. Sadly, some of the filenames contain UTF-8 diacritics, triggering a "UnicodeEncodeError?: 'ascii' codec can't encode character u'\xe4' in position 7: ordinal not in range(128)" when I try to "tahoe ls" the directory. I can access the files from the WUI, though.
I did not try to repair the unhealthy file yet, I didn't want to spoil the chance to find the original problem.
I can supply additional info (incident reports etc.) if needed.
Attachments (1)
Change History (11)
Changed at 2009-04-26T14:14:15Z by [4-tea-2]
comment:1 Changed at 2009-04-26T14:39:18Z by [4-tea-2]
comment:2 Changed at 2009-04-26T14:48:41Z by zooko
By the way, my initial assumption when [4-tea-2] reported this missing share was just that the server which was supposed to receive that share was disconnected from the client at the moment of upload, but [4-tea-2] assures me that random TCP disconnections are quite rare on his network. Anyway, this wouldn't explain the mysterious appearance of one of his directories in his root dir!
comment:3 Changed at 2009-04-26T16:36:57Z by zooko
[4-tea-2] told me on IRC that he upgraded nodes from 1.3 to 1.4.1 while running "tahoe backup", which could explain the missing share -- if the upload of a share fails then the upload will continue and will treat the file as successfully uploaded, unless so many shares fail that fewer than "shares of happiness" shares were successfully uploaded. If a storage server is stopped and restarted after the client has started uploading a share to that server but before it has finished, then that share will not be successfully uploaded.
comment:4 Changed at 2009-04-26T17:56:44Z by [4-tea-2]
I found another command I executed in the time frame and found the problem: "tahoe deep-check <alias>:<path>" creates folders on the fly:
tahoe deep-check <alias>:<non-existing path>/<any filename>
will create <alias>:<non-existing path>, e.g.
tahoe deep-check media:audio/test1/test2/test3
will report "ERROR 404 Not Found No such child: test3" and create media:audio/test1/test2
Bug or feature?
comment:5 Changed at 2009-04-26T20:42:27Z by warner
it's a bug. In the webapi code that handles path traversal, there is a piece that sometimes creates a missing directory (so the traversal can continue). This is meant for use by an operation like PUT /uri/DIRCAP/parent/subdir/file.txt, and the loop looks vaguely like:
here = get_dirnode_from_dircap(DIRCAP) for name in path[:-1]: if here.has_child(name): here = here.get_child(name) else: here = here.create_empty_directory(name) here.add_file(path[-1], DATA)
But, of course, this create-missing-directory code should only be run when we're actually going to PUT or POST a file into that spot. Otherwise it should use a 404 to report the missing directory.
I think the bug is that this code is not examining the operation closely enough (i.e. not paying attention to the t= query argument), and creating directories for all POSTs, not just the specific kind that will upload a file.
comment:6 Changed at 2009-07-11T11:34:14Z by warner
- Component changed from unknown to code-frontend-web
comment:7 Changed at 2010-06-12T22:22:16Z by davidsarah
- Keywords integrity error added
- Milestone changed from undecided to soon
comment:8 Changed at 2013-05-14T22:18:43Z by daira
- Description modified (diff)
- Milestone changed from soon to 1.11.0
- Owner changed from nobody to daira
- Status changed from new to assigned
Ugh, this is a horrible bug that should have had more attention; how did this slip through the cracks?
comment:9 Changed at 2013-05-15T17:16:56Z by zooko
I agree that this would be a good bug to fix, and that it should go into the 1.11.0 Milestone.
I'm going to play a game. I see that there are 146 open tickets in the 1.11.0 Milestone. The game is: whenever we add a ticket into the 1.11.0 Milestone, I have to either close an open ticket or move a ticket out of that Milestone. Maybe two! Okay, here I go...
I tried to narrow down what I could have been doing in the moment when "audio/untagged or incomplete" was created (4/25 0:59 MEST). I'm pretty sure that neither "tahoe backup" nor "tahoe deep-check" was running at the time, but according to IRC log and bash history, I did run the following command at some point between 0:45 and 1:10:
for i in $(cat ~/config/tahoe/tahoe-nodes | grep -v '#') ; do echo "$i: " ; ssh root@$i '(su - tahoe -c "tahoe debug find-shares hvr24tfh6iju5vimdgrzlpm7pa ~tahoe/.tahoe/")' ; done
(This found the four available shares for the file with the lost share.)