[tahoe-lafs-trac-stream] [tahoe-lafs] #1875: Hanging on dead reference?
tahoe-lafs
trac at tahoe-lafs.org
Thu Nov 22 15:07:10 UTC 2012
#1875: Hanging on dead reference?
----------------------+----------------------------
Reporter: nejucomo | Owner: davidsarah
Type: defect | Status: new
Priority: normal | Milestone: undecided
Component: unknown | Version: 1.9.2
Keywords: | Launchpad Bug:
----------------------+----------------------------
**Symptoms**:
I left a {{{tahoe backup --verbose $LOCAL_PATH tahoe:$GRID_PATH}}} process
running last night. This was a child of a logging script I wrote called
{{{logcmd}}}; please see Footnote 1 below for important stdio buffering
details.
1. In the morning, the output appeared to have stalled, but I wasn't
certain.
1. In a separate terminal, I ran {{{tahoe ls tahoe:}}}. It appeared to
hang.
1. I killed it with {{{^C}}} then reran it, and it appeared to hang, so I
killed that.
1. I examined the backup process terminal and saw no updates. [Footnote 2]
1. I ran {{{tahoe list-aliases}}} to verify that it does not hang.
After those steps, I did these things, but do not remember the order:
* I ran {{{tahoe ls tahoe:}}} a third time and it gave an
{{{UnrecoverableFileError}}}.
* I examined the {{{backup}}} terminal to see an embedded exception
(inside a string literal of another exception) mentioning
{{{UploadUnhappinessError}}}, {{{PipelineError}}}, and
{{{DeadReferenceError}}}.
After all of the above, I tried {{{tahoe ls tahoe:}}} again and it
immediately gave the correct listing of names I expected.
**Hypothesis 1**:
In this case, all of the following are hypothesized to be true:
* The stdio buffering and process management scheme of {{{logcmd}}} (see
Footnote 1) kept an exception traceback in memory instead of flushing it
to the terminal.
* Also, {{{logcmd}}} did not detect that the {{{backup}}} process had
exited. (Otherwise it would have flushed the output.)
* Some networking issue triggered the exception in the {{{tahoe backup}}}
process.
* The same networking issue caused the first two {{{tahoe ls}}} processes
to hang.
* The same networking issue, or a slightly different one, caused the
third invocation of {{{tahoe ls}}} to exit with the
{{{UnrecoverableFileError}}}.
* The networking issue or issues (possibly more than one distinct
networking state) resolved, and the fourth {{{tahoe ls}}} invocation
succeeded.
This hypothesis would fit especially well if my laptop disabled networking
after a period of inactivity, or if the network was disabled by an access
point and my laptop did not automatically renew a DHCP lease, //and//
networking resumed when I started poking at things in the morning.
One piece of evidence against this hypothesis is that I had successfully
browsed for a bit before running the above commands.
**Hypothesis 2**:
Assume the following:
* {{{logcmd}}} (see Footnote 1) did not hold onto exception output for any
notable period of time, but flushed the traceback soon after it was
generated.
* running {{{tahoe ls}}} was related to, or even a cause of, the
exception in the {{{tahoe backup}}} process.
* some networking condition in {{{tahoe}}} or {{{foolscap}}} will not
timeout on its own, but requires other activity before an exception is
triggered.
If {{{logcmd}}} did not introduce stdio buffering problems, then it seems
unlikely that the {{{tahoe backup}}} exception would have appeared //just
as// I was running {{{tahoe ls}}} commands, given that it had been running
for ~6 hours.
In other words, there's a strong correlation between the {{{tahoe ls}}}
invocations and the {{{tahoe backup}}} exception. The hypothesis is that
the former somehow triggered the latter.
The last bullet point implies that some kinds of networking errors (maybe
{{{DeadReferenceError}}} or something about pipelining) do not time out
on their own, but instead require some other activity before an exception
is raised. If this hypothesis is true, I consider this a bug.
**Footnote 1**: The {{{backup}}} process was a child of a logging-utility
Python script I wrote, named {{{logcmd}}}, which generally has these
features (sorry, no public source yet):
* It uses {{{subprocess.Popen(...)}}}.
* It loops on {{{Popen.poll}}} and continues running as long as there is
no child {{{returncode}}} //or// as long as that code indicates the child
has been stopped or resumed. (See
[http://docs.python.org/2/library/os.html#os.WIFCONTINUED os.WIF* docs])
* Inside the loop it does a {{{select()}}} on the child's stdout and
stderr with //no timeout//.
* It then reads with {{{file.readline}}} from the file objects that
{{{select()}}} reported as readable.
* It buffers the read data, then splits it on '\n' and writes all but the
last chunk of the split (which may be an incomplete line) to multiple
destinations.
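One classic way a loop shaped like the above can appear to hang is the combination of {{{select()}}} on the raw descriptor with buffered {{{file.readline}}}: a single {{{readline()}}} call can slurp several lines into its internal buffer, after which {{{select()}}} sees an empty pipe even though complete lines are already waiting. A hypothetical sketch of the pitfall (not {{{logcmd}}}'s actual source):

```python
import select
import subprocess
import sys
import textwrap

# Hypothetical child: emits two lines at once, then goes quiet.
child_code = textwrap.dedent("""
    import sys, time
    sys.stdout.write("line 1\\nline 2\\n")
    sys.stdout.flush()
    time.sleep(5)        # no further output for a while
""")

p = subprocess.Popen([sys.executable, "-c", child_code],
                     stdout=subprocess.PIPE)
fd = p.stdout.fileno()

select.select([fd], [], [])       # wait until the pipe is readable

first = p.stdout.readline()       # returns the first line, but the
                                  # buffered reader has also consumed
                                  # the second line into its buffer

# select() on the raw fd now reports nothing readable, even though a
# complete line is sitting in p.stdout's buffer -- a loop that only
# reads when select() fires would stall here.
ready, _, _ = select.select([fd], [], [], 0.5)

second = p.stdout.readline()      # returns immediately from the buffer
p.kill()
p.wait()
```

A common workaround is to call unbuffered {{{os.read()}}} directly on the descriptors that {{{select()}}} reports and to split lines manually, which is close to what the last bullet describes.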
**Footnote 2**: This was based on the belief that {{{list-aliases}}} does
no networking; I wanted to distinguish between a networking error and
some more general tahoe hanging issue.
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1875>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage