Opened at 2010-11-17T22:35:42Z
Closed at 2010-11-27T21:39:43Z
#1262 closed defect (fixed)
'tahoe start -m' no longer works
Reported by: | warner | Owned by: | zooko |
---|---|---|---|
Priority: | major | Milestone: | 1.8.1 |
Component: | code-nodeadmin | Version: | 1.8.0 |
Keywords: | reviewed regression | Cc: | |
Launchpad Bug: |
Description
The recent import-and-call-twistd change (ac3b26ecf29c08cb) unfortunately breaks the -m/--multiple option, which is what lets me do ./bin/tahoe start -m ../MY-TESTNET/node-* to spin up a 5-node local test grid in a single command. It also breaks tahoe restart -m, which I use even more frequently as I iterate over a fix.
The symptom is that the first node-startup terminates the 'tahoe start' command, so that none of the subsequent nodes are started. The call to twistd.run() is terminating the program, so it never makes it to the second pass through the loop.
I really prefer import-and-call over fork-and-exec, so I want to keep it and find some other way to fix -m. The fix will probably be to make the -m path use fork() just inside the loop over all target directories, something like:
for d in args: if os.fork() == 0: os.chdir(d) setup_sys_argv() twistd.run() # this line is never reached else: # we're the parent, just keep looping pass os.exit(0)
One downside is that any errors which occur in twistd.run() (like ImportErrors, or problems in tahoe.cfg) will be eaten by the child, so the overall 'tahoe start' exit code won't reflect the error. The stderr would still be emitted (although interleaved painfully). Another potential downside is that I've heard of one OS (freebsd? OS-X?) which allows fork()-then-exec() but forbids fork()-then-anything-else, and this would fail badly on such a system.
Attachments (4)
Change History (26)
Changed at 2010-11-17T22:50:39Z by warner
comment:1 Changed at 2010-11-17T22:55:08Z by warner
That patch only uses fork() when there are multiple targets, and doesn't use fork on the last target (i.e. it calls twistd.run() inline, as it did before). So it ought to behave exactly the same when start/restart is invoked with just a single target. It seems to work for my testgrid.
Writing a unit test might be a slight nuisance: I think it'd require a fork or a spawn within the test case. But we're probably already doing that somewhere, so maybe we've got a good starting point.
comment:2 Changed at 2010-11-18T02:18:28Z by zooko
- Keywords test-needed regression added
- Milestone changed from undecided to 1.8.1
test needed! regression!
comment:3 Changed at 2010-11-20T06:23:47Z by zooko
This is a regression from 1.8.0, and a serious one, but there is no unit test. I will see if I can figure out how to write a unit test for this.
comment:4 Changed at 2010-11-20T07:39:28Z by zooko
Ran out of time for tonight. If anybody wants to have a crack at unit-testing this patch, that would be great!
comment:5 Changed at 2010-11-20T21:11:53Z by davidsarah
os.fork() is unsupported on Windows. Either we need to:
- spawn a subprocess, e.g. using twisted.internet.utils.getProcessOutputAndValue, which does work on Windows; or
- document that -m is only supported on other platforms (which would still be a regression relative to 1.8.0) and remove that option from the help on Windows.
comment:6 Changed at 2010-11-20T21:14:23Z by davidsarah
Doc for twisted.internet.utils: http://twistedmatrix.com/documents/10.1.0/api/twisted.internet.utils.html . We use it already in test_runner.
comment:7 Changed at 2010-11-20T23:18:16Z by zooko
I would be reluctant to adopt a patch which makes the functionality different on different platforms.
comment:8 Changed at 2010-11-22T14:53:49Z by zooko
comment:9 Changed at 2010-11-22T17:33:26Z by warner
I'm cool with leaving this regression until later.
FWIW, I've been happy with the os.fork() patch locally. Also, -m is really just for developers (since normal users should only be running one node per disk, and -m is really just a shortcut to spin up a whole local testgrid at once). I'd be surprised (but pleased) if anyone other than me had ever used it. So documenting that -m doesn't work on non-os.fork() platforms would be fine with me (probably best done by conditionally inserting the argument into the Usage class depending upon hasattr(os, "fork")).
comment:10 Changed at 2010-11-24T06:17:48Z by warner
Incidentally, it was OS-X (10.6 "Snow Leopard") which sometimes complains about fork-without-immediate-exec. See here (in particular the part that says "Warning: When launching separate processes using the fork function, you must always follow a call to fork with a call to exec or a similar function. Applications that depend on the Core Foundation, Cocoa, or Core Data frameworks (either explicitly or implicitly) must make a subsequent call to an exec function or those frameworks may behave improperly."). Also see the Twisted discussion here and maybe the Adium discussion here. Something to do with protecting apps against weird thread behavior.
I think the consensus is that programs launched from the Dock under 10.6 are not allowed to use most OS-X-specific APIs (Cocoa, Quartz, etc) after a fork but before an exec. Fortunately, Tahoe isn't launched from the Dock (until someone builds a launcher app), and I don't think it's using any of those OS-X APIs at all.
So I don't think we need to worry about this. I've been testing the patch on my OS-X 10.6 box without problems.
Changed at 2010-11-25T08:13:59Z by warner
better patch: includes test, removes -m from --help when unavailable
comment:11 Changed at 2010-11-25T08:14:18Z by warner
- Keywords review-needed added; test-needed removed
comment:12 Changed at 2010-11-26T05:23:18Z by davidsarah
- Keywords reviewed added; review-needed removed
_node_has_started -> _nodes_have_started. Otherwise LGTM.
comment:13 Changed at 2010-11-26T23:17:17Z by warner
- Resolution set to fixed
- Status changed from new to closed
Cool, pushed in f3adb037ae0d22eb. One thing I caught at the last minute: the new test needed a distinct workdir ("test_multiple_clients" instead of the cut-and-paste leftover "test_client"), otherwise the two tests could interfere with each other. (I observed a test_multiple_clients failure at the very end, where it threw an exception while trying to delete a non-existent hotline file: the whole c2/ directory was missing).
comment:14 Changed at 2010-11-27T01:17:26Z by warner
- Resolution fixed deleted
- Status changed from closed to reopened
Drat, there was a problem with the windows buildslave (--multiple was being removed for 'tahoe create-client' too, which can handle it just fine), and the fix was getting messy, so I reverted that patch in f36bda278014589a.
If David-Sarah's work to come up with an os.fork replacement for windows works out, that will make things much easier (the actual 'tahoe start' code is working fine, it's just the "don't advertise --multiple on platforms that can't accept it" code that broke).
If not, I'll look into refactoring BasedirMixin to make this easier to get right. For reference, we can accept --multiple on all platforms for tahoe create-client and other create-node commands. We do not accept --multiple ever for tahoe run. And we accept --multiple only on platforms that have os.fork (i.e. on non-windows) for tahoe start and tahoe restart.
comment:15 Changed at 2010-11-27T01:20:12Z by warner
For reference, here's one of the windows buildslave errors which happened with f3adb037ae0d22eb:
[ERROR]: allmydata.test.test_runner.CreateNode.test_client Traceback (most recent call last): File "c:\users\buildslave\builds\tahoe-lafs\dcoder_win7-64_py2.6\build\src\allmydata\test\test_runner.py", line 281, in test_client self.do_create("client") File "c:\users\buildslave\builds\tahoe-lafs\dcoder_win7-64_py2.6\build\src\allmydata\test\test_runner.py", line 200, in do_create rc, out, err = self.run_tahoe(argv) File "c:\users\buildslave\builds\tahoe-lafs\dcoder_win7-64_py2.6\build\src\allmydata\test\test_runner.py", line 189, in run_tahoe rc = runner.runner(argv, stdout=out, stderr=err) File "c:\users\buildslave\builds\tahoe-lafs\dcoder_win7-64_py2.6\build\src\allmydata\scripts\runner.py", line 65, in runner config.parseOptions(argv) File "C:\Python26\lib\site-packages\twisted\python\usage.py", line 231, in parseOptions self.subOptions.parseOptions(rest) File "C:\Python26\lib\site-packages\twisted\python\usage.py", line 237, in parseOptions self.parseArgs(*args) File "c:\users\buildslave\builds\tahoe-lafs\dcoder_win7-64_py2.6\build\src\allmydata\scripts\common.py", line 80, in parseArgs if self.allow_multiple and self['multiple']: exceptions.KeyError: 'multiple'
comment:16 Changed at 2010-11-27T08:07:06Z by warner
Sigh. Alright, Zooko has convinced me that keeping -m working (at least for tahoe start -m) is not worth needing to manage different --help texts for windows. I've agreed to remove -m support and write a local shell script or something to do the equivalent function (zooko suggests function tahoestartm { for NODE in ${*} ; do tahoe start $NODE ; done } ).
I'll be bitter about it, though :-). I should keep a list of useful features that I've had to remove because they weren't easy to implement under windows, especially when nobody would ever have noticed that they were missing under windows. I'm pretty sure that I'm the only person who has ever used tahoe start -m, and it will take me some time to retrain my fingers to type something else.
The code will be cleanest if we remove -m from everything (did you know it was accepted by tahoe create-client, tahoe create-keygen, and tahoe create-stats-gatherer? yeah, didn't think so :-). I don't think anyone will miss it.
I'll put together a patch to remove -m from everything and attach it here.
Changed at 2010-11-27T09:08:19Z by warner
remove --multiple/-m from all CLI commands, clean up basedir processing
comment:17 Changed at 2010-11-27T09:13:19Z by warner
- Keywords review-needed added; reviewed removed
comment:18 Changed at 2010-11-27T17:27:56Z by secorp
I use the -m option frequently from the command line to start/stop/create-client etc. I can get around it, but I too will briefly mourn its passing.
comment:19 Changed at 2010-11-27T21:23:00Z by zooko
- Owner set to zooko
- Status changed from reopened to new
comment:20 Changed at 2010-11-27T21:23:07Z by zooko
- Status changed from new to assigned
comment:21 Changed at 2010-11-27T21:25:50Z by zooko
- Keywords reviewed added; review-needed removed
Looks good! Hooray for less code!
comment:22 Changed at 2010-11-27T21:39:43Z by Brian Warner <warner@…>
- Resolution set to fixed
- Status changed from assigned to closed
In 69b42c6cb7ec5d4a:
Changed at 2010-11-28T07:59:28Z by warner
shell script to replace "-m": run like "many ./bin/tahoe start ../GRID/node*"
preliminary patch: use fork() to start/restart all nodes but the last