[tahoe-lafs-trac-stream] [tahoe-lafs] #1875: Hanging on dead reference?
tahoe-lafs
trac at tahoe-lafs.org
Thu Nov 22 15:07:10 UTC 2012
#1875: Hanging on dead reference?
----------------------+----------------------------
Reporter: nejucomo | Owner: davidsarah
Type: defect | Status: new
Priority: normal | Milestone: undecided
Component: unknown | Version: 1.9.2
Keywords: | Launchpad Bug:
----------------------+----------------------------
**Symptoms**:
I left a {{{tahoe backup --verbose $LOCAL_PATH tahoe:$GRID_PATH}}} process
running last night. This was a child of a logging script I wrote called
{{{logcmd}}}; please see Footnote 1 below for important stdio buffering
details.
1. In the morning, the output appeared to have stalled, but I wasn't
certain.
1. In a separate terminal, I ran {{{tahoe ls tahoe:}}}. It appeared to
hang.
1. I killed it with {{{^C}}} then reran it, and it appeared to hang, so I
killed that.
1. I examined the backup process terminal and saw no updates. [Footnote 2]
1. I ran {{{tahoe list-aliases}}} to verify that it does not hang.
After those steps, I did these things, but do not remember the order:
* I ran {{{tahoe ls tahoe:}}} a third time and it gave an
{{{UnrecoverableFileError}}}.
* I examined the {{{backup}}} terminal to see an embedded exception
(inside a string literal of another exception) mentioning
{{{UploadUnhappinessError}}}, {{{PipelineError}}}, and
{{{DeadReferenceError}}}.
After all of the above, I tried {{{tahoe ls tahoe:}}} again and it
immediately gave the correct listing of names I expected.
**Hypothesis 1**:
In this case, all of the following are hypothesized to be true:
* The stdio buffering and process management scheme of {{{logcmd}}} (see
Footnote 1) kept an exception traceback in memory instead of flushing it
to the terminal.
* Also, {{{logcmd}}} did not detect that the {{{backup}}} process had
exited. (Otherwise it would have flushed the output.)
* Some networking issue triggered the exception in the {{{tahoe backup}}}
process.
* The same networking issue caused the first two {{{tahoe ls}}} processes
to hang.
* The same networking issue, or a slightly different one, caused the
third invocation of {{{tahoe ls}}} to exit with the
{{{UnrecoverableFileError}}}.
* The networking issue or issues (possibly more than one distinct
networking state) resolved, and the fourth {{{tahoe ls}}} invocation
succeeded.
This hypothesis would fit especially well if my laptop disabled networking
after a period of inactivity, or if the network was disabled by an access
point and my laptop did not automatically renew a DHCP lease, //and//
networking resumed when I started poking at things in the morning.
One piece of evidence against this hypothesis is that I had successfully
browsed for a bit before running the above commands.
**Hypothesis 2**:
Assume the following:
* {{{logcmd}}} (see Footnote 1) did not hold onto exception output for any
notable period of time, but flushed the traceback soon after it was
generated.
* running {{{tahoe ls}}} was related to, or even a cause of, the
exception in the {{{tahoe backup}}} process.
* some networking condition in {{{tahoe}}} or {{{foolscap}}} will not
timeout on its own, but requires other activity before an exception is
triggered.
If {{{logcmd}}} did not introduce stdio buffering problems, then it seems
unlikely that the {{{tahoe backup}}} exception would have appeared //just
as// I was running {{{tahoe ls}}} commands, given that it had been running
for ~6 hours.
In other words, there's a strong correlation between the {{{tahoe ls}}}
invocations and the {{{tahoe backup}}} exception. The hypothesis is that
the former somehow triggered the latter.
The last bullet point implies that some kinds of networking errors (maybe
{{{DeadReferenceError}}} or something about pipelining) do not time out
on their own, but instead require some other activity before an exception
is raised. If this hypothesis is true, I consider this a bug.
**Footnote 1**: The {{{backup}}} process was a child of a logging-utility
Python script I wrote, named {{{logcmd}}}, which generally has these
features (sorry, no public source yet):
* It uses {{{subprocess.Popen(...)}}}.
* It loops on {{{Popen.poll}}} and continues running as long as there is
no child {{{returncode}}} //or// as long as that code indicates the child
has been stopped or resumed. (See
[http://docs.python.org/2/library/os.html#os.WIFCONTINUED os.WIF* docs])
* Inside the loop it does a {{{select()}}} on the child's stdout and
stderr with //no timeout//.
* It then reads with {{{file.readline}}} from the file objects that
{{{select()}}} reported as readable.
* It buffers the read data, then splits it on '\n' and writes all but the
last chunk of the split (which may be an incomplete line) to multiple
destinations.
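One classic way a loop shaped like the above can appear to hang is the combination of {{{select()}}} on the raw descriptor with buffered {{{file.readline}}}: a single {{{readline()}}} call can slurp several lines into its internal buffer, after which {{{select()}}} sees an empty pipe even though complete lines are already waiting. A hypothetical sketch of the pitfall (not {{{logcmd}}}'s actual source):

```python
import select
import subprocess
import sys
import textwrap

# Hypothetical child: emits two lines at once, then goes quiet.
child_code = textwrap.dedent("""
    import sys, time
    sys.stdout.write("line 1\\nline 2\\n")
    sys.stdout.flush()
    time.sleep(5)        # no further output for a while
""")

p = subprocess.Popen([sys.executable, "-c", child_code],
                     stdout=subprocess.PIPE)
fd = p.stdout.fileno()

select.select([fd], [], [])       # wait until the pipe is readable

first = p.stdout.readline()       # returns the first line, but the
                                  # buffered reader has also consumed
                                  # the second line into its buffer

# select() on the raw fd now reports nothing readable, even though a
# complete line is sitting in p.stdout's buffer -- a loop that only
# reads when select() fires would stall here.
ready, _, _ = select.select([fd], [], [], 0.5)

second = p.stdout.readline()      # returns immediately from the buffer
p.kill()
p.wait()
```

A common workaround is to call unbuffered {{{os.read()}}} directly on the descriptors that {{{select()}}} reports and to split lines manually, which is close to what the last bullet describes.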
**Footnote 2**: This was based on the belief that {{{list-aliases}}} does
no networking; I wanted to distinguish between a networking error and
some more general tahoe hanging issue.
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1875>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage