[tahoe-dev] Help uploading when file exists but needs repair
Brian Warner
warner at lothar.com
Thu Dec 2 00:37:30 UTC 2010
On 11/30/10 10:58 PM, Kyle Markley wrote:
> Brian et al,
>
>> Huh? Shouldn't the new upload just put new shares in place? I know
>> our uploader isn't particularly clever in the face of existing shares
>> (it will put multiple shares on one server, and in general not
>> achieve the ideal diversity), but it shouldn't just fail.
>
> Ok; maybe I'm misunderstanding the failure. Let's do a more robust
> diagnosis.
Huh, yeah, I think you've correctly identified the problem. That upload
should have succeeded.
Let's see, first, to confirm that you're doing a check on the right
file, start by looking at your webapi server's "Recent Uploads and
Downloads" right after the 'tahoe backup' finishes. The most recent
upload ought to be marked as failed, and should indicate a size equal to
what 'ls -l /storage/_buildbot/.login' tells you. Then do a 'tahoe put
/storage/_buildbot/.login' which ought to fail in the same way, and also
add an even more recent entry on that Recent page.
Then compare the "Storage Index" listed on that page with the output of
your "tahoe check" command (i.e. "dumi26otgmnemrypt3zlesxm5y").
> allmydata.interfaces.UploadUnhappinessError: shares could be placed on
> only 3 server(s) such that any 2 of them have enough shares to recover
> the file, but we were asked to place shares on at least 4 such
> servers. (placed all 4 shares, want to place shares on at least 4
> servers such that any 2 of them have enough shares to recover the
> file, sent 4 queries to 4 peers, 4 queries placed some shares, 0
> placed none (of which 0 placed none due to the server being full and 0
> placed none due to an error))
> So it appears it's failing to upload the .login file. The specific
> error message doesn't make sense to me -- if all 4 queries placed some
> shares, and 0 queries placed none, then why hasn't the file become healthy?
There are two confusing things going on here. The first is that I think
(but I'd have to check the code to be sure) the "4 queries placed some
shares" message is including any "I already have a share" responses. The
second is that the UploadUnhappinessError criterion is stricter than
simply getting all four shares into the grid: it wants the arrangement
of those shares to meet the "servers-of-happiness" criterion. The "at
least 4 such servers" means s-o-h (aka tahoe.cfg's misnamed
"shares.happy") is equal to 4.
Uploading consists of two phases: share placement, then share upload. If
the proposed arrangement that comes out of the placement phase does not
meet the s-o-h criteria, the upload stops before any shares are placed.
The share-placement algorithm is usually expecting the
file-doesn't-exist-in-grid-yet case. It sends "please accept share X
(and by the way do you have any other shares?)" messages to each server
in permuted order, all in parallel (I think), with shnums chosen to get
exactly one share per server if everything goes well (i.e. each server
accepts the share offered it, and no preexisting shares were found).
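The permuted order itself can be sketched like this (an illustration
only, not the real selection code: the actual hash function and byte
layout Tahoe uses may differ in detail):

```python
import hashlib

def permuted_order(storage_index, server_ids):
    # Sketch of Tahoe's "permuted ring": for each file, every server
    # gets a position derived from hashing the file's storage index
    # together with the server id, and placement queries go out in
    # sorted-position order.
    def position(server_id):
        return hashlib.sha256(storage_index + server_id).digest()
    return sorted(server_ids, key=position)
```

The order is stable for a given file, but different files see different
server orderings, which is what spreads shares across the grid.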
I suspect that something in the share-placement algorithm is getting
stuck: the particular placement of preexisting shares, combined with the
order in which the queries are sent and received, is causing the
placement algorithm to terminate with an arrangement that doesn't pass
the s-o-h test.
David-Sarah, you know more than I do about s-o-h and the new placement
algorithm.. could you take a look? Given the serverids and SI described
here, I think the permuted order should have been (xxaj,juwm,vjqc,47cs),
but I'd like to confirm that (maybe with a flog trace), because I can't
make that order fit with the other evidence.
Kyle: after confirming that you're checking the correct file (using the
hints above.. if it turns out we've got the wrong one, please post a new
'tahoe check --raw'), please try to upload that file while capturing the
event log:
1: make sure you have the 'flogtool' executable (from Foolscap)
available
2: start your tahoe node, let it connect to everybody
3: start "flogtool tail -s flog.out $NODEDIR/private/logport.furl" .
That will print events to stdout as well as saving them to the
flog.out file. Leave it running.
4: in another shell, use 'tahoe put /storage/_buildbot/.login' to
upload the file, which should fail as before
5: wait a few seconds, then terminate the "flogtool" process
6: confirm that you got the trace with "flogtool dump flog.out"
7: compress and attach flog.out in an email or to a ticket
That should give us some useful information on what decisions the
share-placement algorithm made.
> "sharemap": {
> "0": [
> "juwmgssmwnhrhfdcpxxmrz3bghh37esx"
> ],
> "1": [
> "vjqcroalrgmft66mgiwfjug667fl6qjd"
> ],
> "3": [
> "juwmgssmwnhrhfdcpxxmrz3bghh37esx"
> ]
> },
> "servers-responding": [
> "vjqcroalrgmft66mgiwfjug667fl6qjd",
> "juwmgssmwnhrhfdcpxxmrz3bghh37esx",
> "47cslusczp3uu2kygodi3nlalcruscif",
> "xxaj2tgmnl7debjdpn4mgv2oks6pjjnx"
> ],
> "count-good-share-hosts": 2,
> "count-shares-good": 3,
> "recoverable": true,
> "storage-index": "dumi26otgmnemrypt3zlesxm5y",
> "summary": "Not Healthy: 3 shares (enc 2-of-4)"
That says that this file is recoverable, but two of the shares are
doubled up:
juwm: sh0, sh3
vjqc: sh1
With 2-of-4 encoding and s-o-h=4, we need to have at least one share on
all four servers. But if the algorithm sees those two shares on juwm and
stops early (perhaps after allocating one share on xxaj and noticing the
other three shares), then (xxaj=sh2,juwm=sh0+sh3,vjqc=sh1) won't satisfy
the s-o-h test.
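To make the s-o-h test concrete: happiness is the size of a maximum
matching in the bipartite graph pairing servers with the shares they
hold, so doubled-up shares on one server don't add to it. A minimal
sketch of that definition (not Tahoe's actual code; the sharemap format
and abbreviated server ids here are invented for illustration):

```python
def happiness(sharemap):
    """Servers-of-happiness: size of a maximum matching between shares
    and the servers holding them, via Kuhn's augmenting-path algorithm.
    sharemap maps share number -> set of server ids holding that share."""
    match = {}  # server id -> share number currently matched to it

    def augment(share, seen):
        for server in sharemap[share]:
            if server in seen:
                continue
            seen.add(server)
            # Take a free server, or re-route its current share elsewhere.
            if server not in match or augment(match[server], seen):
                match[server] = share
                return True
        return False

    return sum(1 for share in sharemap if augment(share, set()))

# The arrangement from the check report above: happiness is 2, which
# agrees with the report's count-good-share-hosts.
before = {0: {"juwm"}, 1: {"vjqc"}, 3: {"juwm"}}

# The hypothesized post-placement arrangement: all four shares are in
# the grid, but happiness is only 3, short of shares.happy = 4.
after = {0: {"juwm"}, 1: {"vjqc"}, 2: {"xxaj"}, 3: {"juwm"}}
```

With happiness(after) coming out at 3 &lt; 4, the upload raises
UploadUnhappinessError even though every share landed somewhere.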
So, next steps:
1: confirm we're looking at the right file
2: get a flog trace of the failing upload
3: get David-Sarah or somebody who understands the current
share-placement to take a look
thanks!
-Brian