#1262 closed defect (fixed)

'tahoe start -m' no longer works

Reported by: warner Owned by: zooko
Priority: major Milestone: 1.8.1
Component: code-nodeadmin Version: 1.8.0
Keywords: reviewed regression Cc:
Launchpad Bug:

Description

The recent import-and-call-twistd change (ac3b26ecf29c08cb) unfortunately breaks the -m/--multiple option, which is what lets me do ./bin/tahoe start -m ../MY-TESTNET/node-* to spin up a 5-node local test grid in a single command. It also breaks tahoe restart -m, which I use even more frequently as I iterate over a fix.

The symptom is that the first node-startup terminates the 'tahoe start' command, so that none of the subsequent nodes are started. The call to twistd.run() is terminating the program, so it never makes it to the second pass through the loop.

I really prefer import-and-call over fork-and-exec, so I want to keep it and find some other way to fix -m. The fix will probably be to make the -m path use fork() just inside the loop over all target directories, something like:

for d in args:
  if os.fork() == 0:
    os.chdir(d)
    setup_sys_argv()
    twistd.run()
    # this line is never reached
  else:
    # we're the parent, just keep looping
    pass
  os.exit(0)

One downside is that any errors which occur in twistd.run() (like ImportErrors, or problems in tahoe.cfg) will be eaten by the child, so the overall 'tahoe start' exit code won't reflect the error. The stderr would still be emitted (although interleaved painfully). Another potential downside is that I've heard of one OS (freebsd? OS-X?) which allows fork()-then-exec() but forbids fork()-then-anything-else, and this would fail badly on such a system.

Attachments (4)

start-m.diff (2.0 KB) - added by warner at 2010-11-17T22:50:39Z.
preliminary patch: use fork() to start/restart all nodes but the last
start-m-with-test.diff (10.5 KB) - added by warner at 2010-11-25T08:13:59Z.
better patch: includes test, removes -m from --help when unavailable
remove-m.diff (9.5 KB) - added by warner at 2010-11-27T09:08:19Z.
remove --multiple/-m from all CLI commands, clean up basedir processing
many (108 bytes) - added by warner at 2010-11-28T07:59:28Z.
shell script to replace "-m": run like "many ./bin/tahoe start ../GRID/node*"

Download all attachments as: .zip

Change History (26)

Changed at 2010-11-17T22:50:39Z by warner

preliminary patch: use fork() to start/restart all nodes but the last

comment:1 Changed at 2010-11-17T22:55:08Z by warner

That patch only uses fork() when there are multiple targets, and doesn't use fork on the last target (i.e. it calls twistd.run() inline, as it did before). So it ought to behave exactly the same when start/restart is invoked with just a single target. It seems to work for my testgrid.

Writing a unit test might be a slight nuisance: I think it'd require a fork or a spawn within the test case. But we're probably already doing that somewhere, so maybe we've got a good starting point.

comment:2 Changed at 2010-11-18T02:18:28Z by zooko

  • Keywords test-needed regression added
  • Milestone changed from undecided to 1.8.1

test needed! regression!

comment:3 Changed at 2010-11-20T06:23:47Z by zooko

This is a regression from 1.8.0, and a serious one, but there is no unit test. I will see if I can figure out how to write a unit test for this.

comment:4 Changed at 2010-11-20T07:39:28Z by zooko

Ran out of time for tonight. If anybody wants to have a crack at unit-testing this patch, that would be great!

comment:5 Changed at 2010-11-20T21:11:53Z by davidsarah

os.fork() is unsupported on Windows. Either we need to:

  • spawn a subprocess, e.g. using twisted.internet.utils.getProcessOutputAndValue, which does work on Windows; or
  • document that -m is only supported on other platforms (which would still be a regression relative to 1.8.0) and remove that option from the help on Windows.

comment:6 Changed at 2010-11-20T21:14:23Z by davidsarah

Doc for twisted.internet.utils: http://twistedmatrix.com/documents/10.1.0/api/twisted.internet.utils.html . We use it already in test_runner.

comment:7 Changed at 2010-11-20T23:18:16Z by zooko

I would be reluctant to adopt a patch which makes the functionality different on different platforms.

comment:8 Changed at 2010-11-22T14:53:49Z by zooko

The 1.8.1 release is past due, and it is blocked by this issue and by #1264.

Perhaps we should go ahead and release 1.8.1 as-is, with the -m option broken (and, re #1264, with download performance when K is large worse than v1.7).

comment:9 Changed at 2010-11-22T17:33:26Z by warner

I'm cool with leaving this regression until later.

FWIW, I've been happy with the os.fork() patch locally. Also, -m is really just for developers (since normal users should only be running one node per disk, and -m is really just a shortcut to spin up a whole local testgrid at once). I'd be surprised (but pleased) if anyone other than me had ever used it. So documenting that -m doesn't work on non-os.fork() platforms would be fine with me (probably best done by conditionally inserting the argument into the Usage class depending upon hasattr(os, "fork")).

comment:10 Changed at 2010-11-24T06:17:48Z by warner

Incidentally, it was OS-X (10.6 "Snow Leopard") which sometimes complains about fork-without-immediate-exec. See here (in particular the part that says "Warning: When launching separate processes using the fork function, you must always follow a call to fork with a call to exec or a similar function. Applications that depend on the Core Foundation, Cocoa, or Core Data frameworks (either explicitly or implicitly) must make a subsequent call to an exec function or those frameworks may behave improperly."). Also see the Twisted discussion here and maybe the Adium discussion here. Something to do with protecting apps against weird thread behavior.

I think the consensus is that programs launched from the Dock under 10.6 are not allowed to use most OS-X-specific APIs (Cocoa, Quartz, etc) after a fork but before an exec. Fortunately, Tahoe isn't launched from the Dock (until someone builds a launcher app), and I don't think it's using any of those OS-X APIs at all.

So I don't think we need to worry about this. I've been testing the patch on my OS-X 10.6 box without problems.

Changed at 2010-11-25T08:13:59Z by warner

better patch: includes test, removes -m from --help when unavailable

comment:11 Changed at 2010-11-25T08:14:18Z by warner

  • Keywords review-needed added; test-needed removed

comment:12 Changed at 2010-11-26T05:23:18Z by davidsarah

  • Keywords reviewed added; review-needed removed

_node_has_started -> _nodes_have_started. Otherwise LGTM.

comment:13 Changed at 2010-11-26T23:17:17Z by warner

  • Resolution set to fixed
  • Status changed from new to closed

Cool, pushed in f3adb037ae0d22eb. One thing I caught at the last minute: the new test needed a distinct workdir ("test_multiple_clients" instead of the cut-and-paste leftover "test_client"), otherwise the two tests could interfere with each other. (I observed a test_multiple_clients failure at the very end, where it threw an exception while trying to delete a non-existent hotline file: the whole c2/ directory was missing).

comment:14 Changed at 2010-11-27T01:17:26Z by warner

  • Resolution fixed deleted
  • Status changed from closed to reopened

Drat, there was a problem with the windows buildslave (--multiple was being removed for 'tahoe create-client' too, which can handle it just fine), and the fix was getting messy, so I reverted that patch in f36bda278014589a.

If David-Sarah's work to come up with an os.fork replacement for windows works out, that will make things much easier (the actual 'tahoe start' code is working fine, it's just the "don't advertise --multiple on platforms that can't accept it" code that broke).

If not, I'll look into refactoring BasedirMixin to make this easier to get right. For reference, we can accept --multiple on all platforms for tahoe create-client and other create-node commands. We do not accept --multiple ever for tahoe run. And we accept --multiple only on platforms that have os.fork (i.e. on non-windows) for tahoe start and tahoe restart.

comment:15 Changed at 2010-11-27T01:20:12Z by warner

For reference, here's one of the windows buildslave errors which happened with f3adb037ae0d22eb:

[ERROR]: allmydata.test.test_runner.CreateNode.test_client

Traceback (most recent call last):
  File "c:\users\buildslave\builds\tahoe-lafs\dcoder_win7-64_py2.6\build\src\allmydata\test\test_runner.py", line 281, in test_client
    self.do_create("client")
  File "c:\users\buildslave\builds\tahoe-lafs\dcoder_win7-64_py2.6\build\src\allmydata\test\test_runner.py", line 200, in do_create
    rc, out, err = self.run_tahoe(argv)
  File "c:\users\buildslave\builds\tahoe-lafs\dcoder_win7-64_py2.6\build\src\allmydata\test\test_runner.py", line 189, in run_tahoe
    rc = runner.runner(argv, stdout=out, stderr=err)
  File "c:\users\buildslave\builds\tahoe-lafs\dcoder_win7-64_py2.6\build\src\allmydata\scripts\runner.py", line 65, in runner
    config.parseOptions(argv)
  File "C:\Python26\lib\site-packages\twisted\python\usage.py", line 231, in parseOptions
    self.subOptions.parseOptions(rest)
  File "C:\Python26\lib\site-packages\twisted\python\usage.py", line 237, in parseOptions
    self.parseArgs(*args)
  File "c:\users\buildslave\builds\tahoe-lafs\dcoder_win7-64_py2.6\build\src\allmydata\scripts\common.py", line 80, in parseArgs
    if self.allow_multiple and self['multiple']:
exceptions.KeyError: 'multiple'

comment:16 Changed at 2010-11-27T08:07:06Z by warner

Sigh. Alright, Zooko has convinced me that keeping -m working (at least for tahoe start -m) is not worth needing to manage different --help texts for windows. I've agreed to remove -m support and write a local shell script or something to do the equivalent function (zooko suggests function tahoestartm { for NODE in ${*} ; do tahoe start $NODE ; done } ).

I'll be bitter about it, though :-). I should keep a list of useful features that I've had to remove because they weren't easy to implement under windows, especially when nobody would ever have noticed that they were missing under windows. I'm pretty sure that I'm the only person who has ever used tahoe start -m, and it will take me some time to retrain my fingers to type something else.

The code will be cleanest if we remove -m from everything (did you know it was accepted by tahoe create-client, tahoe create-keygen, and tahoe create-stats-gatherer? yeah, didn't think so :-). I don't think anyone will miss it.

I'll put together a patch to remove -m from everything and attach it here.

Changed at 2010-11-27T09:08:19Z by warner

remove --multiple/-m from all CLI commands, clean up basedir processing

comment:17 Changed at 2010-11-27T09:13:19Z by warner

  • Keywords review-needed added; reviewed removed

comment:18 Changed at 2010-11-27T17:27:56Z by secorp

I use the -m option frequently from the command line to start/stop/create-client etc. I can get around it, but I too will briefly mourn its passing.

comment:19 Changed at 2010-11-27T21:23:00Z by zooko

  • Owner set to zooko
  • Status changed from reopened to new

comment:20 Changed at 2010-11-27T21:23:07Z by zooko

  • Status changed from new to assigned

comment:21 Changed at 2010-11-27T21:25:50Z by zooko

  • Keywords reviewed added; review-needed removed

Looks good! Hooray for less code!

comment:22 Changed at 2010-11-27T21:39:43Z by Brian Warner <warner@…>

  • Resolution set to fixed
  • Status changed from assigned to closed

In 69b42c6cb7ec5d4a:

remove --multiple/-m option from all CLI commands: closes #1262

I personally used "tahoe start/restart -m ../MY-TESTNET/node*" all the time,
to spin up or update a local testgrid while iterating over new code. However,
with the recent switch from "subprocess.Popen(/bin/twistd)" to "import and
call twistd.run()" in scripts/startstop_node.py (yay fewer processes!),
"start -m" broke, and fixing it requires os.fork, which is unavailable on
windows (boo windows!). And I was probably the only one using -m. So in the
interests of uniformity among platforms and simpler code (yay negative code
days!), we're just removing -m from everything. I will start using a little
shell script or something to simulate the removed functionality.

This patch also cleans up CLI-function calling a bit: get the basedir from
the config dict (instead of sometimes from a separate argument), and always
return a numeric exit code.

Changed at 2010-11-28T07:59:28Z by warner

shell script to replace "-m": run like "many ./bin/tahoe start ../GRID/node*"

Note: See TracTickets for help on using tickets.