Opened at 2008-12-23T16:25:35Z
Closed at 2010-08-08T00:37:52Z
#565 closed defect (fixed)
unicode arguments on the command-line
Reported by: | zooko | Owned by: | davidsarah |
---|---|---|---|
Priority: | major | Milestone: | 1.8β |
Component: | code-frontend-cli | Version: | 1.2.0 |
Keywords: | unicode windows | Cc: | |
Launchpad Bug: |
Description
How do we know what encoding was used to encode the filenames or other arguments that are passed in via Python 2's sys.argv? If we don't know, do we assume that it is utf-8, thus making it incompatible with platforms that don't encode arguments with utf-8? Or do we leave it undecoded, thus making it impossible to correctly inspect the string for the presence of '/' chars?
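A minimal sketch of the ambiguity, written in Python 3 syntax. The byte strings are hand-built stand-ins for the undecoded arguments that Python 2 exposes in sys.argv, and the filename and the `try_decode` helper are made up for illustration:

```python
# Illustration only: raw_latin1 and raw_utf8 stand in for the undecoded
# bytes that Python 2 would place in sys.argv for the same filename.
name = "caf\u00e9/notes.txt"          # "café/notes.txt"
raw_latin1 = name.encode("latin-1")   # bytes a Latin-1 terminal would pass
raw_utf8 = name.encode("utf-8")       # bytes a UTF-8 terminal would pass


def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is not valid in encoding."""
    try:
        return raw.decode(encoding, "strict")
    except UnicodeDecodeError:
        return None


# Assuming UTF-8 rejects the Latin-1 bytes outright...
latin1_as_utf8 = try_decode(raw_latin1, "utf-8")
# ...while assuming Latin-1 accepts the UTF-8 bytes but yields a wrong name.
utf8_as_latin1 = try_decode(raw_utf8, "latin-1")

print(latin1_as_utf8)
print(utf8_as_latin1 == name)
```

The first case fails loudly, which is at least detectable. The second succeeds silently with the wrong characters, which is why guessing the encoding is risky.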
Attachments (1)
Change History (18)
comment:1 Changed at 2008-12-28T00:43:18Z by francois
comment:2 Changed at 2009-12-07T04:57:53Z by davidsarah
- Keywords windows added
[Windows-only]
http://bugs.python.org/issue2128 suggests that on Python 2.6.x for Windows, any non-ASCII characters will have been irretrievably mangled to question-marks in sys.argv. Unfortunately win32api.GetCommandLine seems to call GetCommandLineA, not GetCommandLineW. The bzr project solved this problem by using ctypes to call GetCommandLineW: https://bugs.launchpad.net/bzr/+bug/375934 . (bzr is GPL'd, so we can use that code.)
Note that this would require passing the correct unicode argv into twisted.python.usage.Options.parseOptions from source:src/allmydata/scripts/runner.py , i.e. change source:windows/tahoe.py to do
    argv = get_cmdline_unicode()  # from bzr patch
    rc = runner(argv[1:], install_node_control=False)
    sys.exit(rc)
(assuming that twisted.python.usage.Options handles Unicode correctly, which I haven't tested).
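To illustrate why arguments mangled to question marks cannot be recovered, here is a small sketch in Python 3 syntax. It assumes the narrow API behaves roughly like encoding to an ANSI code page with replacement; the `mangle` helper, the choice of cp1252 and the sample filenames are all assumptions for illustration, not what Windows literally does:

```python
# Illustration only: cp1252 and errors="replace" are assumptions standing in
# for whatever lossy conversion the narrow (ANSI) command-line API performs.
def mangle(arg, codepage="cp1252"):
    """Simulate a lossy conversion of a Unicode argument to a narrow code page."""
    return arg.encode(codepage, "replace")


a = mangle("\u65e5\u672c.txt")   # two CJK characters followed by ".txt"
b = mangle("\u4e2d\u6587.txt")   # two different CJK characters, same suffix

print(a)
print(a == b)
```

Two distinct filenames collapse to the same bytes, so no later decoding step can tell them apart. This is why the wide-character API has to be called before any such conversion happens.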
comment:3 Changed at 2010-02-02T00:15:29Z by davidsarah
- Milestone changed from undecided to 1.7.0
Needed for #534 which has milestone 1.7.0.
comment:4 Changed at 2010-04-05T23:42:25Z by francois
- Owner set to francois
comment:5 Changed at 2010-04-30T18:24:04Z by davidsarah
Here's some code to get Unicode argv that should work on both Windows (including cygwin) and Unix. On Unix, it assumes that arguments are encoded according to the current locale encoding (or UTF-8 if that could not be determined by Python).
    import sys, locale

    if sys.platform == "win32":
        from ctypes import WINFUNCTYPE, POINTER, byref, c_wchar_p, c_int, windll

        def get_unicode_argv():
            GetCommandLineW = WINFUNCTYPE(c_wchar_p)(("GetCommandLineW", windll.kernel32))
            CommandLineToArgvW = WINFUNCTYPE(POINTER(c_wchar_p), c_wchar_p, POINTER(c_int)) \
                                    (("CommandLineToArgvW", windll.shell32))
            argc = c_int(0)
            argv = CommandLineToArgvW(GetCommandLineW(), byref(argc))
            return [argv[i] for i in xrange(1, argc.value)]
    else:
        def get_unicode_argv():
            encoding = locale.getpreferredencoding()
            if not encoding:
                encoding = "utf-8"
            # This throws UnicodeError if any argument cannot be decoded.
            return [arg.decode(encoding, 'strict') for arg in sys.argv]

    print get_unicode_argv()
comment:6 Changed at 2010-05-11T15:53:36Z by zooko
I really want to see this patch in trunk in the next 48 hours for Tahoe-LAFS v1.7, but I can't contribute to it myself right now.
comment:7 Changed at 2010-05-11T19:28:57Z by davidsarah
- Owner changed from francois to davidsarah
- Status changed from new to assigned
comment:8 Changed at 2010-05-11T19:29:35Z by davidsarah
- Keywords review-needed added
comment:9 Changed at 2010-06-08T08:57:35Z by davidsarah
- Keywords review-needed removed
Getting this working on Windows is more difficult than I thought. I have successfully got it to work by hacking the setuptools-generated entry script like this:
    # EASY-INSTALL-ENTRY-SCRIPT: 'allmydata-tahoe==1.6.1-r4452','console_scripts','tahoe'
    __requires__ = 'allmydata-tahoe==1.6.1-r4452'
    import sys
    from pkg_resources import load_entry_point

    ### start extra code
    from ctypes import WINFUNCTYPE, POINTER, byref, c_wchar_p, c_int, windll
    GetCommandLineW = WINFUNCTYPE(c_wchar_p)(("GetCommandLineW", windll.kernel32))
    CommandLineToArgvW = WINFUNCTYPE(POINTER(c_wchar_p), c_wchar_p, POINTER(c_int)) \
                            (("CommandLineToArgvW", windll.shell32))
    argc = c_int(0)
    argv = CommandLineToArgvW(GetCommandLineW(), byref(argc))
    sys.argv = [argv[i].encode('utf-8') for i in xrange(1, argc.value)]
    ### end extra code

    sys.exit(
        load_entry_point('allmydata-tahoe==1.6.1-r4452', 'console_scripts', 'tahoe')()
    )
but only by invoking this script directly from the command line, not via the tahoe.exe wrapper. The latter mangles the arguments beyond hope of recovery.
comment:10 Changed at 2010-06-08T18:25:38Z by davidsarah
It isn't necessary for the extra code to be in the entry script; it could be in source:allmydata/scripts/runner.py . However, Zooko and I decided that changing how the CLI entry works on Windows would be too disruptive for 1.7, so we're dropping support for Unicode args on Windows until the next release.
This ticket is fixed for other platforms in 1.7.
Changed at 2010-06-09T00:20:35Z by davidsarah
Back out Windows-specific Unicode argument support for v1.7.
comment:11 Changed at 2010-06-09T00:21:06Z by davidsarah
- Keywords review-needed added
- Owner changed from davidsarah to zooko
- Status changed from assigned to new
comment:12 Changed at 2010-06-09T02:28:57Z by zooko
- Keywords reviewed added; review-needed removed
- Owner changed from zooko to davidsarah
The patch looks correct.
comment:13 Changed at 2010-06-12T20:48:23Z by davidsarah
- Keywords reviewed removed
- Milestone changed from 1.7.0 to 1.7.1
- Status changed from new to assigned
back-out-windows-specific-unicode-argv.dpatch was applied in 32d9deace3d82637.
See #1074 for a patch that reenables Unicode argument support on Windows (but requires further discussion and refinement).
comment:14 Changed at 2010-07-14T02:44:22Z by davidsarah
The #1074 patch is now finished.
comment:15 Changed at 2010-07-17T03:50:28Z by davidsarah
- Milestone changed from 1.7.1 to 1.8β
comment:16 Changed at 2010-08-02T07:23:26Z by david-sarah@…
In [4627/ticket798]:
comment:17 Changed at 2010-08-08T00:37:52Z by davidsarah
- Resolution set to fixed
- Status changed from assigned to closed
Fixed; see ticket:1074#comment:29 for changesets.
As a data point, here's how it is handled in Python 3.0.
Some system APIs like os.environ and sys.argv can also present problems when the bytes made available by the system is not interpretable using the default encoding. Setting the LANG variable and rerunning the program is probably the best approach.
Source: What's new in Python 3.0
We should probably implement something that works in a similar way for Python 2.
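One way this suggestion might look is sketched below in Python 3 syntax, operating on explicit byte strings such as those Python 2 provides. `ArgumentDecodingError` and `decode_args` are hypothetical names, not part of Tahoe-LAFS, and the wording of the hint is only an example:

```python
import locale


class ArgumentDecodingError(Exception):
    """Hypothetical error type for this sketch; not part of Tahoe-LAFS."""


def decode_args(raw_args, encoding=None):
    """Decode byte-string arguments, or fail with a hint about LANG."""
    if encoding is None:
        # Use the locale's encoding, falling back to UTF-8 if it is unknown.
        encoding = locale.getpreferredencoding(False) or "utf-8"
    decoded = []
    for raw in raw_args:
        try:
            decoded.append(raw.decode(encoding, "strict"))
        except UnicodeDecodeError:
            raise ArgumentDecodingError(
                "Argument %r is not valid %s. Check that LANG matches your "
                "terminal's encoding and rerun the command." % (raw, encoding))
    return decoded


print(decode_args([b"caf\xc3\xa9.txt"], "utf-8"))
```

Decoding strictly and pointing the user at LANG avoids silently guessing, at the cost of refusing arguments that do not match the locale.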