"backup" behavior and corrupted file

Brian Warner warner at lothar.com
Sun Jul 26 19:20:20 UTC 2015


On 7/1/15 10:57 PM, droki wrote:

> In fact, this behavior isn't limited to "backup", when I run "tahoe
> check URI:CHK:..." on the URI in question I get the same result - it
> just hangs.

Huh. So, FYI, "tahoe backup" will (sometimes) perform a "tahoe check" on
a file that it backed up in some previous run. The decision is
probabilistic: if the file was last uploaded or checked within the past
4 weeks, there's a 0% chance that it will be checked, and that
probability rises linearly until it hits 100% at 8 weeks or more. If the
file-check shows problems, the file is re-uploaded. The idea is that a
recently-uploaded file is probably fine, and checking it every time
would be a waste of time, but we get less certain as the file gets older.
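
In case it helps to picture it, here's a rough sketch of that decision
in Python. This is just an illustration for this email, not the actual
Tahoe code, and the names are made up:

    import random

    FOUR_WEEKS = 4 * 7 * 24 * 60 * 60    # seconds
    EIGHT_WEEKS = 8 * 7 * 24 * 60 * 60

    def should_check(age_seconds):
        # 0% chance below 4 weeks, 100% at 8 weeks or more,
        # rising linearly in between
        if age_seconds <= FOUR_WEEKS:
            probability = 0.0
        elif age_seconds >= EIGHT_WEEKS:
            probability = 1.0
        else:
            probability = (age_seconds - FOUR_WEEKS) / float(EIGHT_WEEKS - FOUR_WEEKS)
        return random.random() < probability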

If there's something broken with the checker (or with the uploaded
shares of that file), and this try-to-avoid-a-re-upload filecheck hangs,
that'll hang the overall backup. It might be an unusual response from a
storage server (answering one request, then hanging on a subsequent
one), combined with a bad assumption in the downloader/checker logic.

You can capture some log data from your client/gateway node with the
"Report an Incident" button on the main webui Welcome Page (usually at
http://localhost:3456/). The Tahoe node records log events into a
circular buffer in memory all the time, including notes about requests
sent and responses received. When something seriously weird happens,
these events are archived into a bundle called a "flogfile", and written
to disk. The "Report an Incident" button triggers a level=WEIRD log
event, creating a flogfile after 5 or 10 seconds. These files are stored
in $NODEDIR/logs/incidents/ (with a timestamp in the filename) and can
be read by a Foolscap tool named "flogtool".
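
If you want to peek at the contents yourself, the "flogtool dump"
subcommand will pretty-print one of these files (flogtool is installed
along with Foolscap), something like:

    flogtool dump $NODEDIR/logs/incidents/incident-TIMESTAMP.flog.bz2

(substitute the real filename; the timestamp and suffix will vary).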

When you get a chance, try this: restart the node, wait a few seconds,
run that "tahoe check" command, let it hang for a few seconds, then
trigger an incident. Grab the incident file and find a way to get it to
us: find me ("warner") or zooko or daira on IRC, and maybe transfer the
file to us[1]. You might attach the flogfile to the ticket, but there
are some lingering questions about how much information is revealed in
these files (they shouldn't contain any filenames or secrets, but we
might have missed one somewhere), so I'd understand if you'd rather not.
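
From a shell, the sequence would look roughly like this (assuming
"tahoe restart" works on your install, and with your real read-cap in
place of the elided URI):

    tahoe restart
    sleep 10
    tahoe check URI:CHK:...        # let it hang for a bit, then Ctrl-C it
    # click "Report an Incident" at http://localhost:3456/
    ls $NODEDIR/logs/incidents/    # the newest file is the one to send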

From the flogfile we should be able to see which queries were sent to
the servers, what their responses were, and at what point the checker
state machine stopped making progress.

As zooko mentioned, another option would be to connect your node to a
log-gatherer process run by one of us. That'd basically deliver all the
log events, in real time, to the gatherer. To configure this you just
add a "log_gatherer.furl =" line to the [node] section of tahoe.cfg
(example below), pointing at the gatherer, and restart the node. The
flogfile/incident approach is probably easier, though.
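
For reference, that tahoe.cfg stanza would look something like this
(with the placeholder replaced by the gatherer's real FURL):

    [node]
    log_gatherer.furl = pb://<tubid>@<host>:<port>/<swissnum>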

hope that helps,
 -Brian

[1]: shameless plug: you could transfer it to us with a tool I just
built. "pip install magic-wormhole", then "wormhole send-file FILENAME".
It creates a PAKE-based "wormhole code" which you can DM to someone, who
can then run "wormhole receive-file CODE" to securely transfer the file.

