#1875 new defect

Hanging on dead reference?

Reported by: nejucomo
Owned by: davidsarah
Priority: normal
Milestone: undecided
Component: code
Version: 1.9.0-s3branch
Keywords: hang crash subprocess foolscap s3-backend LeastAuthority.com
Cc:
Launchpad Bug:

Description

Symptoms:

I left a tahoe backup --verbose $LOCAL_PATH tahoe:$GRID_PATH process running last night. This was a child of a logging script I wrote called logcmd; please see Footnote 1 below for important stdio buffering details.

  1. In the morning, the output appeared to be hanging, but I wasn't certain.
  2. In a separate terminal, I ran tahoe ls tahoe:. It appeared to hang.
  3. I killed it with ^C, then reran it; it appeared to hang again, so I killed that too.
  4. I examined the backup process terminal and saw no updates.
  5. I ran tahoe list-aliases to verify that it does not hang. [Footnote 2]

After those steps, I did these things, but do not remember the order:

  • I ran tahoe ls tahoe: a third time and it gave an UnrecoverableFileError.
  • I examined the backup terminal and found an embedded exception (inside a string literal of another exception) mentioning UploadUnhappinessError, PipelineError, and DeadReferenceError.

After all of the above, I tried tahoe ls tahoe: again, and it immediately gave the correct listing of names I expected.

Hypothesis 1:

In this case, all of the following are hypothesized to be true:

  • The stdio buffering and process management scheme of logcmd (see Footnote 1) kept an exception traceback in memory instead of flushing it to the terminal.
  • Also, logcmd did not detect that the backup process had exited. (Otherwise it would have flushed the output.)
  • Some networking issue triggered the exception in the tahoe backup process.
  • The same networking issue caused the first two tahoe ls processes to hang.
  • The same networking issue, or a slightly different one, caused the third invocation of tahoe ls to exit with the UnrecoverableFileError.
  • The networking issue or issues (possibly more than one distinct networking state) resolved, and the fourth tahoe ls invocation succeeded.

This hypothesis would fit especially well if my laptop disabled networking after a period of inactivity, or if an access point disabled the network and my laptop did not automatically renew a DHCP lease, and networking resumed when I started poking at the machine in the morning.

One piece of evidence against this hypothesis is that I had been browsing successfully for a bit before running the above commands.

Hypothesis 2:

Assume the following:

  • logcmd (see Footnote 1) did not hold onto exception output for any notable period of time, but flushed the traceback soon after it was generated.
  • running tahoe ls was related to, or even a cause of, the exception in the tahoe backup process.
  • some networking condition in tahoe or foolscap does not time out on its own, but requires other activity before an exception is triggered.

If logcmd did not introduce stdio buffering problems, then it seems unlikely to be a coincidence that the tahoe backup exception appeared *just as* I was running the tahoe ls commands, given that the backup had been running for ~6 hours.

In other words, there is a strong correlation between the tahoe ls invocations and the tahoe backup exception. The hypothesis is that the former somehow triggered the latter.

The last bullet point implies that some kinds of networking errors (perhaps DeadReferenceError or something related to pipelining) do not time out on their own, but instead require some other activity before an exception is raised. If this hypothesis is true, I consider that a bug.
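To illustrate that failure mode concretely, here is a hypothetical Twisted sketch (not tahoe or foolscap code): a Deferred that is only ever fired by *other* activity hangs indefinitely unless something arranges a timeout.

from twisted.internet import defer, reactor

# Hypothetical illustration: a pending operation whose Deferred is only
# errbacked when some *other* activity notices the dead connection.
d = defer.Deferred()

def on_unrelated_activity():
    # Until something like this runs (e.g. another request tripping over
    # the dead reference), anything waiting on d simply hangs.
    d.errback(RuntimeError("connection was lost"))

# One common mitigation is an explicit timeout: cancel() fires a
# CancelledError errback if d is still pending after 60 seconds.
reactor.callLater(60, d.cancel)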

Footnote 1: The backup process was a child of a logging-utility Python script I wrote, named logcmd, which generally has these features (sorry, no public source yet; a sketch follows this list):

  • It uses subprocess.Popen(...).
  • It loops on Popen.poll and continues running as long as there is no child returncode, or as long as that code indicates the child has merely been stopped or resumed. (See the os.WIF* docs.)
  • Inside the loop it does a select() on the child's stdout and stderr with no timeout.
  • It then reads with file.readline from whichever streams select() reports as readable.
  • It buffers the read data, splits it on '\n', and writes all but the last chunk of the split (which may be an incomplete line) to multiple destinations.
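Here is a minimal sketch of that loop, reconstructed from the description above (hypothetical, not the actual logcmd source; the names logcmd and destinations are assumptions):

import select
import subprocess

def logcmd(argv, destinations):
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    buffers = {proc.stdout: b'', proc.stderr: b''}
    # Loop while the child has not produced a returncode. (The real
    # script also tolerates stopped/resumed children; see the os.WIF*
    # docs.)
    while proc.poll() is None:
        # Block with no timeout until the child writes something.
        readable, _, _ = select.select([proc.stdout, proc.stderr], [], [])
        for stream in readable:
            # Note: readline() can block even after select() wakes up if
            # the child emitted only a partial line -- one plausible
            # source of the buffering effects discussed above.
            buffers[stream] += stream.readline()
            chunks = buffers[stream].split(b'\n')
            # The last chunk may be an incomplete line; keep it buffered.
            buffers[stream] = chunks.pop()
            for chunk in chunks:
                for dest in destinations:  # binary-mode file objects
                    dest.write(chunk + b'\n')
                    dest.flush()
    # Bug (see comment:2 below): the buffered tails are never flushed
    # here, so output can be lost or delayed when the child exits.
    return proc.returncode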

Footnote 2: This was based on a belief that list-aliases does no networking; I wanted to distinguish between a networking error and some more general tahoe hanging issue.

Change History (6)

comment:1 Changed at 2012-11-22T15:15:02Z by nejucomo

Commands and Output:

Here are some sanitized cut'n'pastes of the commands described above.

tahoe backup

The end of the stderr stream of tahoe backup is:

Traceback (most recent call last):
  File "/home/n/virtualenvs/default/bin/tahoe", line 8, in <module>
    load_entry_point('allmydata-tahoe==1.9.2', 'console_scripts', 'tahoe')()
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/scripts/runner.py", line 113, in run
    rc = runner(sys.argv[1:], install_node_control=install_node_control)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/scripts/runner.py", line 99, in runner
    rc = cli.dispatch[command](so)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/scripts/cli.py", line 569, in backup
    rc = tahoe_backup.backup(options)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 325, in backup
    return bu.run()
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 118, in run
    new_backup_dircap = self.process(options.from_dir)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 194, in process
    childcap, metadata = self.upload(childpath)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 305, in upload
    raise HTTPError("Error during file PUT", resp)
allmydata.scripts.common_http.HTTPError: Error during file PUT: 500 Internal Server Error
"Traceback (most recent call last):\x0a  File \"/home/n/virtualenvs/default/lib/python2.7/site-packages/twisted/internet/defer.py\", line 542, in _runCallbacks\x0a    current.result = callback(current.result, *args, **kw)\x0a  File \"/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/util/pipeline.py\", line 121, in _call_finished\x0a    d.errback(self.failure)\x0a  File \"/home/n/virtualenvs/default/lib/python2.7/site-packages/twisted/internet/defer.py\", line 388, in errback\x0a    self._startRunCallbacks(fail)\x0a  File \"/home/n/virtualenvs/default/lib/python2.7/site-packages/twisted/internet/defer.py\", line 455, in _startRunCallbacks\x0a    self._runCallbacks()\x0a--- <exception caught here> ---\x0a  File \"/home/n/virtualenvs/default/lib/python2.7/site-packages/twisted/internet/defer.py\", line 542, in _runCallbacks\x0a    current.result = callback(current.result, *args, **kw)\x0a  File \"/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/immutable/encode.py\", line 501, in _remove_shareholder\x0a    raise UploadUnhappinessError(msg)\x0aallmydata.interfaces.UploadUnhappinessError: shares could be placed or found on only 0 server(s). We were asked to place shares on at least 1 server(s) such that any 1 of them have enough shares to recover the file.: [Failure instance: Traceback (failure with no frames): <class 'allmydata.util.pipeline.PipelineError'>: <PipelineError error=([Failure instance: Traceback (failure with no frames): <class 'foolscap.ipb.DeadReferenceError'>: Connection was lost (to tubid=(*REDACTED*)) (during method=RIBucketWriter:write)\x0a])>\x0a]\x0a"

Using Python to print the packed string literal (a sketch of the unpacking follows the output below) shows:

Traceback (most recent call last):
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/twisted/internet/defer.py", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/util/pipeline.py", line 121, in _call_finished
    d.errback(self.failure)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/twisted/internet/defer.py", line 388, in errback
    self._startRunCallbacks(fail)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/twisted/internet/defer.py", line 455, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/twisted/internet/defer.py", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/immutable/encode.py", line 501, in _remove_shareholder
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed or found on only 0 server(s). We were asked to place shares on at least 1 server(s) such that any 1 of them have enough shares to recover the file.: [Failure instance: Traceback (failure with no frames): <class 'allmydata.util.pipeline.PipelineError'>: <PipelineError error=([Failure instance: Traceback (failure with no frames): <class 'foolscap.ipb.DeadReferenceError'>: Connection was lost (to tubid=(*REDACTED*)) (during method=RIBucketWriter:write)
])>
]
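For reference, one way to do that unpacking (a sketch; packed stands in for the full quoted literal above):

import ast

# `packed` is the quoted string literal from the 500 response above,
# including its surrounding double quotes (abbreviated here).
packed = r'"Traceback (most recent call last):\x0a  File \"...\" ..."'
print(ast.literal_eval(packed))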

tahoe ls

The first two invocations just show a KeyboardInterrupt where I ^C'd them. The third invocation looks like this:

$ tahoe ls tahoe:
Error during GET: 410 Gone
UnrecoverableFileError: the directory (or mutable file) could not be retrieved, because there were insufficient good shares. This might indicate that no servers were connected, insufficient servers were connected, the URI was corrupt, or that shares have been lost due to server departure, hard drive failure, or disk corruption. You should perform a filecheck on this object to learn more.

The fourth invocation looks like:

$ tahoe ls tahoe:
(*REDACTED: The names listed are what I expect. *)

comment:2 Changed at 2012-11-22T17:09:26Z by nejucomo

I've published the source code of logcmd as a GitHub project. The version actually used in this ticket is linked here.

There are no unit tests yet, and I've discovered one bug: buffered output may not be written when the child process exits.
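In terms of the hypothetical logcmd sketch after Footnote 1 above, a plausible fix (an assumption, not the actual published patch) is to drain and flush the remaining buffers once the poll loop exits:

# After proc.poll() returns a returncode, drain anything the child wrote
# before exiting and flush the buffered tails the main loop never emitted.
for stream, buf in buffers.items():
    buf += stream.read()
    if buf:
        for dest in destinations:
            dest.write(buf if buf.endswith(b'\n') else buf + b'\n')
            dest.flush()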

comment:3 Changed at 2012-11-22T23:32:58Z by davidsarah

Was this on your LeastAuthority account? If so, the storage server crashed at around 03:19 UTC (2012-11-22), and I restarted it at 04:39 UTC.

I suspect that the storage server crashed because it ran out of memory while processing a large file. It is possible that the reads performed by the tahoe ls required memory in addition to that needed for the backup, pushing total usage over the threshold and causing the crash (or, alternatively, the backup may just have happened to be processing a large file at about the time of the ls).

comment:4 Changed at 2012-11-22T23:34:02Z by davidsarah

tahoe list-aliases indeed does no networking.
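For context, aliases are resolved from a local file under the node directory, so listing them requires no network access. A rough sketch (paths and format as I understand them, not the actual tahoe implementation):

import os

def list_aliases(node_dir=os.path.expanduser("~/.tahoe")):
    # Aliases live in a plain-text file, one "name: cap" pair per line;
    # reading it touches only the local filesystem.
    path = os.path.join(node_dir, "private", "aliases")
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                name, cap = line.split(":", 1)
                print("%s: %s" % (name, cap.strip()))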

comment:5 Changed at 2012-12-11T22:16:57Z by davidsarah

  • Component changed from unknown to code
  • Keywords hang crash subprocess foolscap s3-backend lae added
  • Version changed from 1.9.2 to 1.9.0-s3branch

comment:6 Changed at 2013-01-04T20:33:06Z by zooko

  • Keywords LeastAuthority.com added; lae removed