#534 closed defect (fixed)

CLI (e.g. tahoe cp) does not correctly handle or display Unicode file/directory names, either in arguments or the local filesystem

Reported by: francois
Owned by: nobody
Priority: major
Milestone: 1.7.0
Component: code-frontend-cli
Version: 1.2.0
Keywords: reviewed tahoe-cp unicode filename forward-compatibility gsoc
Cc: francois@…, tahoe-dev@…
Launchpad Bug:

Description

"tahoe cp" doesn't correctly handle unicode filenames.

Steps to reproduce:

francois@korn:~/tmp$ touch Motörhead
francois@korn:~/tmp$ tahoe cp Motörhead tahoe:
Traceback (most recent call last):
  File "/home/francois/dev/tahoe/support/bin/tahoe", line 8, in <module>
    load_entry_point('allmydata-tahoe==1.2.0-r3183', 'console_scripts', 'tahoe')()
  File "/home/francois/dev/tahoe/src/allmydata/scripts/runner.py", line 84, in run
    rc = runner(sys.argv[1:])
  File "/home/francois/dev/tahoe/src/allmydata/scripts/runner.py", line 73, in runner
    rc = cli.dispatch[command](so)
  File "/home/francois/dev/tahoe/src/allmydata/scripts/cli.py", line 261, in cp
    rc = tahoe_cp.copy(options)
  File "/home/francois/dev/tahoe/src/allmydata/scripts/tahoe_cp.py", line 722, in copy
    return Copier().do_copy(options)
  File "/home/francois/dev/tahoe/src/allmydata/scripts/tahoe_cp.py", line 478, in do_copy
    return self.copy_to_directory(sources, target)
  File "/home/francois/dev/tahoe/src/allmydata/scripts/tahoe_cp.py", line 594, in copy_to_directory
    target.populate(False)
  File "/home/francois/dev/tahoe/src/allmydata/scripts/tahoe_cp.py", line 343, in populate
    urllib.quote(name)])
  File "/usr/lib/python2.5/urllib.py", line 1205, in quote
    res = map(safe_map.__getitem__, s)
KeyError: u'\u0129'
francois@korn:~/tmp$ 
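
For illustration, a minimal sketch (Python 2) of what goes wrong and the conventional fix: urllib.quote() looks each character up in a table keyed by single bytes, so a unicode argument containing non-ASCII code points raises KeyError; encoding to UTF-8 first yields a quotable bytestring.

import urllib

name = u'Mot\xf6rhead'
try:
    urllib.quote(name)                       # raises KeyError: u'\xf6'
except KeyError, e:
    print 'urllib.quote() failed on %r' % (e,)

print urllib.quote(name.encode('utf-8'))     # prints Mot%C3%B6rhead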

Attachments (45)

cp-encoding.patch (84.5 KB) - added by francois at 2008-11-13T11:30:50Z.
Bugfix + test
fix-failing-test.2.patch (85.3 KB) - added by francois at 2008-11-14T14:04:15Z.
This patch should hopefully fix the failing test on systems with LC_ALL=C
credit.patch (84.3 KB) - added by francois at 2008-11-14T14:07:56Z.
Add credit entry
test-unicode.py (205 bytes) - added by francois at 2008-11-16T00:11:47Z.
Test script for unicode filename issues
require-simplejson.patch (90.2 KB) - added by francois at 2008-11-25T17:19:39Z.
Require simplejson version >= 2.0.5
stringutils.dpatch (190.3 KB) - added by francois at 2009-02-17T11:57:45Z.
unicode.dpatch (37.1 KB) - added by francois at 2009-02-24T11:50:41Z.
unicode-v2.dpatch (39.2 KB) - added by francois at 2009-02-25T09:31:55Z.
unicode-v3.dpatch (53.6 KB) - added by francois at 2009-04-01T19:41:52Z.
unicode-v3-minus-the-stdout-parts.patch.txt (30.7 KB) - added by zooko at 2009-04-09T03:59:02Z.
unicode-v3-minus-the-stdout-and-aliases-parts.patch.txt (28.7 KB) - added by zooko at 2009-04-09T04:02:14Z.
unicode-v3-minus-the-stdout-and-aliases-and-argv-parts.patch.txt (23.0 KB) - added by zooko at 2009-04-09T04:09:10Z.
unicode-v3-minus-the-stdout-and-aliases-and-argv-and-url-parts.patch.txt (16.5 KB) - added by zooko at 2009-04-09T04:56:05Z.
fsencoding.py (2.9 KB) - added by zooko at 2009-04-28T18:11:37Z.
PEP 383'ish implementation of listdir()
fsencode.py (3.0 KB) - added by zooko at 2009-04-29T06:11:26Z.
tahoe-534-bundle.dpatch (68.8 KB) - added by francois at 2009-04-29T16:36:10Z.
fsencode.2.py (3.0 KB) - added by zooko at 2009-05-01T04:54:47Z.
fsencode.3.py (2.8 KB) - added by zooko at 2009-05-01T05:09:31Z.
plumbing for unicode support.darcspatch (11.6 KB) - added by zooko at 2009-06-09T19:49:58Z.
tahoe manifest unicode support.darcspatch (9.3 KB) - added by zooko at 2009-06-09T19:50:08Z.
tahoe ls unicode support.darcspatch (9.4 KB) - added by zooko at 2009-06-09T19:50:17Z.
tahoe get unicode support.darcspatch (8.9 KB) - added by zooko at 2009-06-09T19:51:37Z.
tahoe webopen unicode support.darcspatch (8.8 KB) - added by zooko at 2009-06-09T19:51:46Z.
tahoe rm unicode support.darcspatch (8.9 KB) - added by zooko at 2009-06-09T19:51:57Z.
tahoe ln unicode support.darcspatch (8.9 KB) - added by zooko at 2009-06-09T19:52:09Z.
tahoe status unicode support.darcspatch (8.8 KB) - added by zooko at 2009-06-09T19:52:17Z.
tahoe check unicode support.darcspatch (8.8 KB) - added by zooko at 2009-06-09T19:52:27Z.
tahoe deep-check unicode support.darcspatch (8.8 KB) - added by zooko at 2009-06-09T19:52:35Z.
tahoe mkdir unicode support.darcspatch (10.1 KB) - added by zooko at 2009-06-09T19:52:45Z.
explain the current unicode support in CLI.darcspatch (9.7 KB) - added by zooko at 2009-06-09T19:52:58Z.
tahoe cp unicode support.darcspatch (15.2 KB) - added by zooko at 2009-06-09T19:53:07Z.
tahoe put unicode support.darcspatch (11.2 KB) - added by zooko at 2009-06-09T19:53:18Z.
aliases unicode support.darcspatch (12.5 KB) - added by zooko at 2009-06-09T19:53:26Z.
previous cli related patches should.darcspatch (8.9 KB) - added by zooko at 2009-06-09T19:53:39Z.
tahoe backup unicode support.darcspatch (14.7 KB) - added by zooko at 2009-06-09T19:53:48Z.
unicode-helper-functions.diff (10.1 KB) - added by francois at 2010-04-24T00:11:27Z.
This patch contains Unicode helper functions (stringutils.py) and associated tests
unicode-filenames-handling.diff (28.6 KB) - added by francois at 2010-05-04T07:27:42Z.
unicode-helper-functions-v2.diff (12.6 KB) - added by francois at 2010-05-17T08:07:19Z.
unicode-filenames-handling-v2.diff (30.3 KB) - added by francois at 2010-05-17T08:07:44Z.
unicode-filenames-handling-v3.2.diff (30.2 KB) - added by francois at 2010-05-18T23:26:26Z.
additionnal-tests.diff (1.1 KB) - added by francois at 2010-05-19T00:41:41Z.
unicode-helper-functions-v4.diff (13.6 KB) - added by francois at 2010-05-20T00:54:48Z.
unicode-filenames-handling-v4.diff (31.7 KB) - added by francois at 2010-05-20T00:55:07Z.
unicode-bundle-v4.darcspatch (59.6 KB) - added by francois at 2010-05-20T00:55:19Z.
misc-update.dpatch (45.4 KB) - added by zooko at 2010-06-09T04:22:53Z.


Change History (167)

comment:1 Changed at 2008-11-11T20:38:51Z by francois

Attached patch seems to fix this issue.

comment:2 Changed at 2008-11-12T13:47:34Z by zooko

  • Cc tahoe-dev@… added

Dear Francois:

Thank you very much for the bug report and patch. It looks good to me. Is there any chance you can write a unit test for this bug? The test module for the command-line functions is here:

http://allmydata.org/trac/tahoe/browser/src/allmydata/test/test_cli.py

It looks like there are tests for get and put but not for cp in there.

Please let me know. Thanks!

--Zooko

Changed at 2008-11-13T11:30:50Z by francois

Bugfix + test

comment:3 Changed at 2008-11-13T11:33:01Z by francois

This newer patch contains the bugfix itself and associated test in test_cli.py.

Thanks for your advice, Zooko!

comment:4 Changed at 2008-11-13T13:52:06Z by zooko

  • Resolution set to fixed
  • Status changed from new to closed

comment:5 Changed at 2008-11-13T20:32:59Z by warner

  • Resolution fixed deleted
  • Status changed from closed to reopened

cygwin, solaris, and yukyu-hardy buildslaves are all failing this test.

Also, the test should probably read back from those files (with a 'tahoe get') to make sure the cp actually worked (instead of perhaps failing silently).

comment:6 Changed at 2008-11-13T21:03:14Z by warner

the test is also failing on my personal workstation (linux, debian/sid), in the same way as the other buildslaves:

[ERROR]: allmydata.test.test_cli.Cp.test_unicode_filename

Traceback (most recent call last):
  File "/home/warner/trees/tahoe/src/allmydata/test/test_cli.py", line 554, in test_unicode_filename
    open(fn1, "wb").write("unicode file content")
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xc4' in position 56: ordinal not in range(128)

comment:7 Changed at 2008-11-14T10:24:41Z by francois

Ok, I can reproduce this failing test in my environment (Ubuntu hardy, 64-bit, LC_ALL=en_US.UTF8).

$ env -i PATH=$PATH make test TEST="allmydata.test.test_cli.Cp"

It is related to the locale environment settings.

Changed at 2008-11-14T14:04:15Z by francois

This patch should hopefully fix the failing test on systems with LC_ALL=C

Changed at 2008-11-14T14:07:56Z by francois

Add credit entry

comment:8 follow-up: Changed at 2008-11-15T00:01:00Z by warner

Thanks, I'll apply those.

What do you think about the idea of using escaped binary strings instead of native UTF-8 in those filenames, i.e. removing the "coding=utf-8" line from the beginning of the file and using "\xc3\x84rtonwall" instead of embedding the A-with-umlaut in the string? I'm wondering if it might make our intentions clearer. At the moment we have a unicode name expressed as a non-unicode type (without the leading u", that string is just a bytestring), and the casual observer might not realize that we intended to get the binary strings into the stuff on disk.

comment:9 Changed at 2008-11-15T00:10:35Z by warner

Oops, Zooko must have beat me to it.. those patches were already applied.

But, the buildbot is still indicating test failures: see http://allmydata.org/buildbot/waterfall . This time they're on dapper and feisty. That's really weird, I don't know what would cause those to fail but not the others.

comment:10 in reply to: ↑ 8 Changed at 2008-11-16T00:11:21Z by francois

Replying to warner:

What do you think about the idea of using escaped binary strings instead of native UTF-8 in those filenames, i.e. removing the "coding=utf-8" line from the beginning of the file and using "\xc3\x84rtonwall" instead of embedding the A-with-umlaut in the string? I'm wondering if it might make our intentions clearer. At the moment we have a unicode name expressed as a non-unicode type (without the leading u", that string is just a bytestring), and the casual observer might not realize that we intended to get the binary strings into the stuff on disk.

Yes, this would have been a good idea, but the encoding declaration appears to be mandatory for both "\xc3\x84rtonwall" and "Ärtonwall".

exceptions.SyntaxError: Non-ASCII character '\xc3' in file /home/francois/dev/tahoe/src/allmydata/test/test_cli.py on line 567, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details (test_cli.py, line 567)

Replying to warner:

But, the buildbot is still indicating test failures: see http://allmydata.org/buildbot/waterfall . This time they're on dapper and feisty. That's really weird, I don't know what would cause those to fail but not the others.

Unfortunately, I'm not able to reproduce those failures on my dapper and feisty virtual machines. Can you please try running the attached test script (test-unicode.py) on your machines?

Changed at 2008-11-16T00:11:47Z by francois

Test script for unicode filename issues

comment:11 Changed at 2008-11-16T01:48:10Z by zooko

test-unicode.py prints "OK" on our dapper machine. I'm investigating in more detail why the unit test allmydata.test.test_cli.Cp.test_unicode_filename fails on that same machine and user.

comment:12 Changed at 2008-11-16T02:11:29Z by zooko

Well I've gotten as far as determining that the "tahoe_get" spawned by src/allmydata/test/test_cli.py@20081114134458-c0148-c011dbf3b2ce743b0237e192c3f5ed33c1b81c49#L569 is getting a 404 result:

2008-11-16 02:04:50.702Z [-] xxx 4 url: 'http://127.0.0.1:44984/uri/URI%3ADIR2%3Apobw3tvboe5ztqvt472hvnlsbm%3Apyunlfwglp5lziw4dlyybqsclplm2komshqvcqa3vp6nnrpnmzda/%C3%84rtonwall', resp: <httplib.HTTPResponse instance at 0xb0772ac> status 404

Now I need to read a bed-time story to a 7-year-old.

comment:13 Changed at 2008-11-18T20:21:02Z by zooko

So I switched from working on the dapper buildslave to my own workstation -- yukyuk, running Ubuntu Hardy.

The first thing I see is that the current trunk gives a coding error:

$ python setup.py trial -a " allmydata.test.test_cli.Cp.test_unicode_filename"
running trial
(run setup.py with -vv for trial command-line details)
allmydata.test.test_cli
  Cp
    test_unicode_filename ...                                           [ERROR]

===============================================================================
[ERROR]: allmydata.test.test_cli.Cp.test_unicode_filename

Traceback (most recent call last):
  File "/home/zooko/playground/allmydata/tahoe/trunk/pristine-really/src/allmydata/test/test_cli.py", line 554, in test_unicode_filename
    open(fn1, "wb").write("unicode file content")
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xc4' in position 56: ordinal not in range(128)
-------------------------------------------------------------------------------
Ran 1 tests in 0.032s

FAILED (errors=1)

The next thing I see is that if I replace the unicode object with the string "\xc3\x84rtonwall" then the test never completes:

$ python setup.py trial -a " allmydata.test.test_cli.Cp.test_unicode_filename"
running trial
(run setup.py with -vv for trial command-line details)
allmydata.test.test_cli
  Cp
    test_unicode_filename ...                                           [ERROR]
                                          [ERROR]
                                          [ERROR]

===============================================================================
[ERROR]: allmydata.test.test_cli.Cp.test_unicode_filename

Traceback (most recent call last):
Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_cli.Cp testMethod=test_unicode_filename> (test_unicode_filename) still running at 120.0 secs
===============================================================================
[ERROR]: allmydata.test.test_cli.Cp.test_unicode_filename

Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 31004360 [0.0012834072113s] called=0 cancelled=0 LoopingCall<0.01>(Cp._poll, *(<bound method Cp._check_connections of <allmydata.test.test_cli.Cp testMethod=test_unicode_filename>>, 1227039706.426708), **{})()>
<DelayedCall 36538560 [0.0546979904175s] called=0 cancelled=0 _resetLogDateTime()>
<DelayedCall 31579816 [0.0570547580719s] called=0 cancelled=0 LoopingCall<60>(CPUUsageMonitor.check, *(), **{})()>
<DelayedCall 31579672 [0.23085975647s] called=0 cancelled=0 LoopingCall<60>(CPUUsageMonitor.check, *(), **{})()>
<DelayedCall 26005304 [0.253265857697s] called=0 cancelled=0 LoopingCall<60>(CPUUsageMonitor.check, *(), **{})()>
<DelayedCall 25395856 [0.192253112793s] called=0 cancelled=0 LoopingCall<60>(CPUUsageMonitor.check, *(), **{})()>
<DelayedCall 25384432 [0.211790084839s] called=0 cancelled=0 LoopingCall<60>(CPUUsageMonitor.check, *(), **{})()>
<DelayedCall 30849576 [76.1554584503s] called=0 cancelled=0 Reconnector._timer_expired()>
<DelayedCall 29971880 [171.961197615s] called=0 cancelled=0 Reconnector._timer_expired()>
<DelayedCall 30836280 [254.607980013s] called=0 cancelled=0 Reconnector._timer_expired()>
<DelayedCall 34387584 [143.276460648s] called=0 cancelled=0 Reconnector._timer_expired()>
<DelayedCall 26019816 [3480.25343394s] called=0 cancelled=0 LoopingCall<3600>(CacheDirectoryManager.check, *(), **{})()>
<DelayedCall 26005448 [3480.21197605s] called=0 cancelled=0 LoopingCall<3600>(CacheDirectoryManager.check, *(), **{})()>
<DelayedCall 31697592 [18.9836809635s] called=0 cancelled=0 Reconnector._timer_expired()>
<DelayedCall 25383712 [3480.19285607s] called=0 cancelled=0 LoopingCall<3600>(CacheDirectoryManager.check, *(), **{})()>
<DelayedCall 33256872 [135.306014299s] called=0 cancelled=0 Reconnector._timer_expired()>
<DelayedCall 25397800 [3480.056885s] called=0 cancelled=0 LoopingCall<3600>(CacheDirectoryManager.check, *(), **{})()>
<DelayedCall 26013568 [3480.23269224s] called=0 cancelled=0 LoopingCall<3600>(CacheDirectoryManager.check, *(), **{})()>
===============================================================================
[ERROR]: allmydata.test.test_cli.Cp.test_unicode_filename

Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
Selectables:
<<class 'twisted.internet.tcp.Port'> of foolscap.pb.Listener on 41841>
<<class 'twisted.internet.tcp.Port'> of nevow.appserver.NevowSite on 45178>
<<class 'twisted.internet.tcp.Port'> of foolscap.pb.Listener on 60650>
<<class 'twisted.internet.tcp.Port'> of nevow.appserver.NevowSite on 33634>
<<class 'twisted.internet.tcp.Port'> of foolscap.pb.Listener on 38985>
<<class 'twisted.internet.tcp.Port'> of foolscap.pb.Listener on 57046>
<<class 'twisted.internet.tcp.Port'> of foolscap.pb.Listener on 60445>
<<class 'twisted.internet.tcp.Port'> of nevow.appserver.NevowSite on 58303>
<<class 'twisted.internet.tcp.Port'> of foolscap.pb.Listener on 54338>
-------------------------------------------------------------------------------
Ran 1 tests in 120.079s

FAILED (errors=3)

comment:14 Changed at 2008-11-18T20:22:12Z by zooko

The next thing I notice is that if I replace it with the string "thisisjustastring" then the test still hangs.

comment:15 Changed at 2008-11-18T21:25:53Z by zooko

So, it looks like the problems on yukyuk don't necessarily have anything to do with unicode, but it's just that the test_cli.Cp test is the first one that tries to construct a TLS connection, or something:

local#247 14:08:37.958: TubConnector created from bduf2tsgfciu2j4zrs7oxvt52h5ikgtj to njeaxgfqknd2as37fxdkl4skyybywey6
local#248 14:08:37.958: connectTCP to ('127.0.0.1', 57617)
local#249 14:08:37.958: Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from bduf2tsg] [to njeaxgfq] at 0x19c8b50>
local#250 14:08:37.958: connectTCP to ('192.168.1.126', 57617)
local#251 14:08:37.958: Starting factory <foolscap.negotiate.TubConnectorClientFactory object [from bduf2tsg] [to njeaxgfq] at 0x19c8c90>
local#252 14:08:37.959: want to subscribe, but no introducer yet
local#253 14:08:37.959: want to publish, but no introducer yet
local#254 14:08:37.960: want to publish, but no introducer yet
local#255 14:08:37.962: dataReceived(isClient=False,phase=0,options={}): 'GET /id/njeaxgfqknd2as37fxdkl4skyybywey6 HTTP/1.1\r\nHost: 127.0.0.1\r\nUpgrade: TLS/1.0\r\nConnection: Upgrade\r\n\r\n'
local#256 14:08:37.962: handlePLAINTEXTServer: targetTubID='njeaxgfqknd2as37fxdkl4skyybywey6'
local#257 14:08:37.962: handlePLAINTEXTServer: wantEncrypted=True
local#258 14:08:37.963: startENCRYPTED(isClient=False, encrypted=True)
local#259 14:08:37.963: startTLS, client=False
local#260 14:08:37.966: negotiation had exception
 FAILURE:
 [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/python/context.py", line 59, in callWithContext
     return self.currentContext().callWithContext(ctx, func, *args, **kw)
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/python/context.py", line 37, in callWithContext
     return func(*args,**kw)
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/internet/pollreactor.py", line 186, in _doReadOrWrite
     why = selectable.doRead()
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/internet/tcp.py", line 362, in doRead
     return self.protocol.dataReceived(data)
 --- <exception caught here> ---
   File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 402, in dataReceived
     
   File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 527, in handlePLAINTEXTServer
     
   File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 554, in sendPlaintextServerAndStartENCRYPTED
     
   File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 589, in startENCRYPTED
     
   File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 1139, in startTLS
     
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/internet/tcp.py", line 660, in startTLS
     holder = Connection.startTLS(self, ctx)
 exceptions.AttributeError: type object 'Connection' has no attribute 'startTLS'
 ]
local#261 14:08:37.971: negotiation had internal error:
 FAILURE:
 [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/python/context.py", line 59, in callWithContext
     return self.currentContext().callWithContext(ctx, func, *args, **kw)
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/python/context.py", line 37, in callWithContext
     return func(*args,**kw)
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/internet/pollreactor.py", line 186, in _doReadOrWrite
     why = selectable.doRead()
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/internet/tcp.py", line 362, in doRead
     return self.protocol.dataReceived(data)
 --- <exception caught here> ---
   File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 402, in dataReceived
     
   File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 527, in handlePLAINTEXTServer
     
   File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 554, in sendPlaintextServerAndStartENCRYPTED
     
   File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 589, in startENCRYPTED
     
   File "build/bdist.linux-x86_64/egg/foolscap/negotiate.py", line 1139, in startTLS
     
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/internet/tcp.py", line 660, in startTLS
     holder = Connection.startTLS(self, ctx)
 exceptions.AttributeError: type object 'Connection' has no attribute 'startTLS'
 ]
local#262 14:08:37.976: Tub location set to 192.168.1.126:33701,127.0.0.1:33701
local#263 14:08:37.977: client running
local#264 14:08:37.977: dataReceived(isClient=False,phase=0,options={}): 'GET /id/njeaxgfqknd2as37fxdkl4skyybywey6 HTTP/1.1\r\nHost: 192.168.1.126\r\nUpgrade: TLS/1.0\r\nConnection: Upgrade\r\n\r\n'
local#265 14:08:37.977: handlePLAINTEXTServer: targetTubID='njeaxgfqknd2as37fxdkl4skyybywey6'
local#266 14:08:37.977: handlePLAINTEXTServer: wantEncrypted=True
local#267 14:08:37.978: startENCRYPTED(isClient=False, encrypted=True)
local#268 14:08:37.978: startTLS, client=False
local#269 14:08:37.986: negotiation had exception
 FAILURE:
 [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
   File "/usr/lib/python2.5/site-packages/Twisted-8.1.0-py2.5-linux-x86_64.egg/twisted/python/context.py", line 59, in callWithContext

comment:16 Changed at 2008-11-18T23:34:31Z by zooko

Okay, that problem on yukyuk was that I had deliberately introduced an uncaught exception into my system-wide installation of Twisted when testing something else. Please disregard all that.

So now to debug the unicode issue I'm moving back to our dapper buildslave..

comment:17 Changed at 2008-11-20T00:42:57Z by warner

So, as an extra datapoint, sys.getfilesystemencoding() on the two buildslaves that are failing reports "UTF-8". My local workstation (on which this test passes) reports "ANSI_X3.4-1968". However, on our gutsy and hardy buildslaves (which are also passing), it reports UTF-8 too.

Some differential analysis shows that one early sign of trouble is during the webapi t=set_children call that 'tahoe cp' makes. This under-documented call (from what I can tell) accepts a JSON-encoded POST body that maps each child name to a description of that child (including its URI). I see a difference between the working platform and the failing platform in the JSON that is sent with this request.

The working platform (my workstation 'fluxx') sends:

{"\u00c4rtonwall": ["filenode", {"rw_uri": "URI:LIT:ovxgsy3pmrssaztjnrssay3pnz2gk3tu"}]}

The failing platform (our feisty buildslave) sends:

{"\u00c3\u0084rtonwall": ["filenode", {"rw_uri": "URI:LIT:ovxgsy3pmrssaztjnrssay3pnz2gk3tu"}]}

Which suggests that the failing platform is double-encoding the filename.

Still investigating.. so far this is pointing at the client-side code (in source:src/allmydata/scripts/tahoe_cp.py).
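
A quick sketch (Python 2) of that double-encoding hypothesis: if the UTF-8 bytes of the filename are mistakenly decoded as latin-1 before being JSON-encoded, each UTF-8 byte becomes its own code point, which is exactly the \u00c3\u0084 seen above.

# -*- coding: utf-8 -*-
name = u'\xc4rtonwall'                   # u'Ärtonwall'
utf8_bytes = name.encode('utf-8')        # '\xc3\x84rtonwall'
mojibake = utf8_bytes.decode('latin-1')  # one code point per UTF-8 byte
assert mojibake == u'\u00c3\u0084rtonwall'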

comment:18 Changed at 2008-11-24T11:20:52Z by francois

Zooko, Brian:

If you give me a copy of your virtual machine disk, I could have a look at this failing test. What do you think?

comment:19 Changed at 2008-11-24T13:56:23Z by zooko

That's a nice offer, Francois. I don't know how to get a copy of that vm image and I suspect that it would take hours to transfer over the net (or, heh, upload to a Tahoe grid).

I have an idea! I'll give you an account on the dapper buildslave. There. Please send me an ssh public key to zooko@… and I'll put it into francois@slave3.allmydata.com:.ssh/authorized_keys and then you should be able to log in!

comment:20 Changed at 2008-11-25T11:28:56Z by francois

I can see a difference in simplejson's behavior between the failing platform (the dapper buildslave) and a working one (my workstation).

Test script:

#!/usr/bin/env python
# encoding=utf-8

import simplejson

a = "Ärtonwall"
b = simplejson.dumps(a)

print type(a), a
print type(b), b

Working platform:

<type 'str'> Ärtonwall
<type 'str'> "\u00c4rtonwall"

Failing platform:

<type 'str'> Ärtonwall
<type 'str'> "\uffc3\uff84rtonwall"

However, in both cases, having \u escape sequences in str objects seems weird.

What is the current policy in tahoe regarding string handling? It looks like UTF-8 encoded str objects are used in some places while unicode objects are used in others.

More on that later...
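
For comparison, a sketch showing that handing simplejson a unicode object sidesteps the bytestring guessing entirely, so every simplejson version has an unambiguous input to serialize:

# -*- coding: utf-8 -*-
import simplejson

a = u"Ärtonwall"              # a unicode object, not a UTF-8 bytestring
print type(a), a
print simplejson.dumps(a)     # expected: "\u00c4rtonwall" on every version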

comment:21 Changed at 2008-11-25T15:15:05Z by zooko

  • Cc tahoe-dev@… removed

Oh, perhaps this has something to do with the version of simplejson!

The failing dapper has

2008-11-24 21:03:29.998Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.4.3, platform: Linux-Ubuntu_6.06-i686-32bit, simplejson: 1.8.1, pyopenssl: 0.7, setuptools: 0.7a1

The failing feisty2.5 has

2008-11-24 20:47:45.960Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.2-1, zfec: 1.4.0-4, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.1, platform: Linux-Ubuntu_7.04-i686-32bit, simplejson: 1.4, pyopenssl: 0.6, setuptools: 0.7a1

Hm.

Passing buildslaves:

2008-11-24 20:53:04.924Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.18, python: 2.5.1, platform: CYGWIN_NT-5.1-1.5.25-0.156-4-2-i686-32bit-WindowsPE, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.7a1
2008-11-24 20:52:39.742Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.4.4c1, platform: Linux-Ubuntu_6.10-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-24 09:54:08.473Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.1, twisted: 8.1.0, nevow: 0.9.32, python: 2.4.4, platform: Linux-debian_4.0-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-24 20:44:52.993Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.2-1, zfec: 1.4.0-4, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.1, platform: Linux-Ubuntu_7.10-i686-32bit, simplejson: 1.7.1, pyopenssl: 0.6, setuptools: 0.7a1
2008-11-24 20:51:45.037Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.1, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.2, platform: Linux-Ubuntu_8.04-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.6c8
2008-11-24 20:52:14.527Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.6.0, platform: Linux-Ubuntu_8.04-i686-32bit_ELF, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-24 20:46:17.288Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.4, zfec: 1.4.2, twisted: 8.0.1, nevow: 0.9.32, python: 2.4.3, platform: SunOS-5.11-i86pc-i386-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-24 20:49:20.293Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.1, platform: Windows-XP-5.1.2600-SP2, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.7a1
2008-11-24 20:46:38.028Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.5.1, platform: Darwin-9.5.0-i386-32bit, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.6c8
2008-11-24 20:45:18.952Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.1, zfec: 1.4.1-2, twisted: 8.1.0, nevow: 0.9.31, python: 2.5.2, platform: Linux-Ubuntu_8.04-x86_64-64bit, simplejson: 2.0.3, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-19 23:36:30.603Z [-] tahoe versions: allmydata: 1.2.0-r3225, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.6.0, platform: Linux-_-x86_64-64bit_ELF, simplejson: 2.0.4, pyopenssl: 0.8, setuptools: 0.7a1

comment:22 Changed at 2008-11-25T15:16:27Z by zooko

Sigh. Forgot wiki markup

2008-11-24 20:53:04.924Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.18, python: 2.5.1, platform: CYGWIN_NT-5.1-1.5.25-0.156-4-2-i686-32bit-WindowsPE, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.7a1
2008-11-24 20:52:39.742Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.4.4c1, platform: Linux-Ubuntu_6.10-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-24 09:54:08.473Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.1, twisted: 8.1.0, nevow: 0.9.32, python: 2.4.4, platform: Linux-debian_4.0-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-24 20:44:52.993Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.2-1, zfec: 1.4.0-4, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.1, platform: Linux-Ubuntu_7.10-i686-32bit, simplejson: 1.7.1, pyopenssl: 0.6, setuptools: 0.7a1
2008-11-24 20:51:45.037Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.1, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.2, platform: Linux-Ubuntu_8.04-i686-32bit, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.6c8
2008-11-24 20:52:14.527Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.6.0, platform: Linux-Ubuntu_8.04-i686-32bit_ELF, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-24 20:46:17.288Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.4, zfec: 1.4.2, twisted: 8.0.1, nevow: 0.9.32, python: 2.4.3, platform: SunOS-5.11-i86pc-i386-32bit, simplejson: 2.0.5, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-24 20:49:20.293Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 2.5.0, nevow: 0.9.32, python: 2.5.1, platform: Windows-XP-5.1.2600-SP2, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.7a1
2008-11-24 20:46:38.028Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.5.1, platform: Darwin-9.5.0-i386-32bit, simplejson: 2.0.5, pyopenssl: 0.6, setuptools: 0.6c8
2008-11-24 20:45:18.952Z [-] tahoe versions: allmydata: 1.2.0-r3235, foolscap: 0.3.2, pycryptopp: 0.5.1, zfec: 1.4.1-2, twisted: 8.1.0, nevow: 0.9.31, python: 2.5.2, platform: Linux-Ubuntu_8.04-x86_64-64bit, simplejson: 2.0.3, pyopenssl: 0.7, setuptools: 0.7a1
2008-11-19 23:36:30.603Z [-] tahoe versions: allmydata: 1.2.0-r3225, foolscap: 0.3.2, pycryptopp: 0.5.12, zfec: 1.4.2, twisted: 8.1.0, nevow: 0.9.32, python: 2.6.0, platform: Linux-_-x86_64-64bit_ELF, simplejson: 2.0.4, pyopenssl: 0.8, setuptools: 0.7a1

comment:23 Changed at 2008-11-25T16:37:57Z by francois

Yes, seems plausible.

I tried patching setup.py to require simplejson 2.0.5.

francois@slave3:~/tahoe$ darcs diff
diff -rN old-tahoe/setup.py new-tahoe/setup.py                
210a211,212
> setup_requires.append('simplejson >= 2.0.5')
> 

But this fails because simplejson-1.8.1 is already installed system-wide.

python setup.py build_tahoe
Traceback (most recent call last):
  File "setup.py", line 451, in ?
    zip_safe=False, # We prefer unzipped for easier access.
  File "/usr/lib/python2.4/distutils/core.py", line 110, in setup
    _setup_distribution = dist = klass(attrs)
  File "/home/francois/tahoe/support/lib/python2.4/site-packages/setuptools-0.6c10dev.egg/setuptools/dist.py", line 219, in __init__ 
    self.fetch_build_eggs(attrs.pop('setup_requires'))
  File "/home/francois/tahoe/support/lib/python2.4/site-packages/setuptools-0.6c10dev.egg/setuptools/dist.py", line 242, in fetch_build_eggs
    for dist in working_set.resolve(
  File "/home/francois/tahoe/support/lib/python2.4/site-packages/setuptools-0.6c10dev.egg/pkg_resources.py", line 528, in resolve
    raise VersionConflict(dist,req) # XXX put more info here
pkg_resources.VersionConflict: (simplejson 1.8.1 (/usr/lib/python2.4/site-packages/simplejson-1.8.1-py2.4-linux-i686.egg), Requirement.parse('simplejson>=2.0.5'))
make[1]: *** [build-once] Error 1

Would it be possible to remove this system-wide egg ?

comment:24 Changed at 2008-11-25T17:19:00Z by francois

I can confirm that upgrading simplejson to version 2.0.5 does fix this bug on slave3.

This was verified by installing a "clean" python environment with "virtualenv --no-site-packages" after applying the attached patch.

Changed at 2008-11-25T17:19:39Z by francois

Require simplejson version >= 2.0.5

comment:25 Changed at 2008-11-25T20:07:34Z by zooko

Regarding the versioning conflict, see #530 (use setuptools's --multi-version mode).

Thanks for the fix! I'll apply your patch post haste!

comment:26 Changed at 2008-11-25T20:09:45Z by zooko

But waitasecond, how is it that one of the successful buildbots was using simplejson 2.0.4 and another was using simplejson 1.7.1?

comment:27 Changed at 2008-11-25T20:11:44Z by zooko

And one of the failing buildbots (dapper) was using simplejson 1.8.1?

comment:28 Changed at 2008-11-25T22:05:39Z by zooko

Ok, 5ebd7319822836ed, which raises the simplejson requirement from >= 1.4 to > 1.8.1, made the buildbots turn green. I don't really know why, but I'm closing this ticket as "fixed". Thanks a lot, Francois!

(P.S. Please pursue your earlier line of thought of "What exactly is the policy for handling strings within the Tahoe codebase?".)

comment:29 Changed at 2008-11-25T22:05:46Z by zooko

  • Resolution set to fixed
  • Status changed from reopened to closed

comment:30 Changed at 2008-11-25T22:59:00Z by zooko

For anybody who is investigating the history of which version of simplejson different versions of tahoe have required:

f76a81fda216e4fb 49dccbd4668a2b12 1f2e3fc912bb897e b0dd88158a3e5b0f 44c73492704144e8

comment:31 Changed at 2008-11-26T02:45:37Z by warner

So, our policy for handling strings should be:

  • each variable holds either a bytestring or a unicode string
  • no variables are uncertain
  • all function arguments accept one or the other, never both

and comments, variable names, and argument names should help us make this distinction.

I have a vague feeling that our code was somehow asking simplejson to do something unreasonable.

In particular, that test code above (http://allmydata.org/trac/tahoe/ticket/534#comment:20) is asking simplejson to take a bytestring with non-ASCII characters and JSONify it, which is an unreasonable thing to ask of it (JSON can encode unicode, but not arbitrary bytestrings). The input to simplejson.dumps() should always be unicode objects or bytestrings with only ASCII characters.

The tahoe_cp code must take responsibility for transforming the sys.argv bytestring (which, hopefully, is UTF-8 encoded) into a unicode object before doing anything else with it. When it is then put into the webapi set_children body (which is defined to be a JSON-encoded dictionary), simplejson.dumps() should be able to handle it. Likewise, if the sys.argv name gets used in a URL (maybe for tahoe_cp, more likely for tahoe_put), it must be encoded into UTF-8 and then URL-encoded.
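
A minimal sketch (Python 2) of that boundary discipline, with hypothetical helper names rather than Tahoe's actual API: decode argv bytes to unicode once on the way in, and encode back to UTF-8 only where bytes are required again, as in URLs.

import urllib

def argv_to_unicode(arg):
    # sys.argv items arrive as bytestrings; hopefully UTF-8-encoded on Linux
    return arg.decode('utf-8')

def unicode_to_url(u):
    # encode to UTF-8, then percent-escape
    return urllib.quote(u.encode('utf-8'))

print unicode_to_url(argv_to_unicode('Mot\xc3\xb6rhead'))  # Mot%C3%B6rhead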

I'm glad this fix seems to improve matters, but I'd like to make sure that our code is doing the right thing.

comment:32 Changed at 2008-12-08T04:33:55Z by zooko

I'm re-opening this ticket because, per Brian's most recent comment, it seems like the test code is asking simplejson to do something unreasonable, and because per the new #555 (tahoe .deb cannot be installed on hardy: simplejson dependency is too new), it is problematic for Tahoe to require simplejson > 1.7.1.

comment:33 Changed at 2008-12-08T04:34:44Z by zooko

See also 9d729109d2d4c24a where I change the requirement to simplejson >= 1.7.1 (again).

comment:34 Changed at 2008-12-08T10:59:40Z by francois

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:35 Changed at 2008-12-08T14:33:40Z by zooko

So there is something about our Ubuntu Dapper buildslave that makes it fail allmydata.test.test_cli.Cp.test_unicode_filename even when other buildslaves using the same version of simplejson pass. Anyway, I just went and experimented on the Dapper buildslave, by easy_install'ing different versions of simplejson from here: http://pypi.python.org/simple/simplejson

With simplejson v1.7.5, v1.8, v1.8.1, and v1.9 this unit test fails. With simplejson v2.0 it passes.

I still haven't understood what this test does or why Brian thinks it is an invalid test.

comment:36 follow-up: Changed at 2008-12-09T02:13:30Z by warner

I'm sorry if it sounded like I was accusing test_unicode_filename of doing something invalid. I think test_unicode_filename is ok (although I haven't looked at it very closely yet).

My comment about "that test code above .. asking .. an unreasonable thing" was specifically about the test script which Francois posted to this ticket:

#!/usr/bin/env python
# encoding=utf-8

import simplejson

a = "Ärtonwall"   # <- LATIN CAPITAL LETTER A WITH DIAERESIS (U+00C4)
b = simplejson.dumps(a)

print type(a), a
print type(b), b

This code is passing a bytestring with non-ASCII characters to simplejson.dumps, but JSON is not capable of encoding non-ASCII bytestrings. So it's pushing simplejson.dumps beyond its normal territory, into the range where things are not usually as well tested (i.e. how does dumps() respond to invalid inputs?). I'm not surprised that its response will vary a lot from one minor version to the next: error cases are usually driven by bug reports.

I'd find it more reasonable if that code instead used:

a = u"Ärtonwall"

The tahoe_cp.py code should be careful to never send non-ASCII bytestrings to simplejson.dumps. In fact, it should never send bytestrings to dumps() at all, but instead should be sending unicode objects (well, at least for filenames. for other tokens it's pretty benign).

So the analysis that I want to do on this issue is to trace the code in tahoe_cp.py, from the point where a filename is read off the local disk (using os.listdir), to the point where it is passed to simplejson.dumps. This is an area of considerable disagreement within the Python community (the python-dev list has been very busy in the last week with a discussion of how unicode filenames and environment variables ought to be handled in the just-released Python 3.0). I fear we're going to need some platform-specific code here, but I think it's safe for us to declare that on linux at least we expect os.listdir to give us bytestrings with UTF-8 encoded names.

Note that this raises the question of what 'tahoe cp' ought to do when a filename on local disk has non-UTF8 non-ASCII bytes in it: apparently (according to one of Glyph's recent posts, IIRC) some systems (KDE?) stuff the high-bit-set bytes into some magical reserved unicode space, so that they can at least survive a roundtrip. It might not be a bad idea to use this same technique: if filename.decode("utf-8") on the response from os.listdir raises UnicodeDecodeError, then map it into this special space instead.
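
A sketch (Python 2) of that fallback idea, with a hypothetical helper and latin-1 standing in as the byte-preserving escape hatch; a KDE-style reserved-space mapping would slot into the same place.

import os

def listdir_unicode(path):
    result = []
    for name in os.listdir(path):    # bytestrings on Linux
        try:
            result.append(name.decode('utf-8'))
        except UnicodeDecodeError:
            # latin-1 maps every byte to a code point, so the name
            # at least survives a round-trip
            result.append(name.decode('latin-1'))
    return result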

comment:37 in reply to: ↑ 36 Changed at 2008-12-22T11:39:48Z by francois

I fully agree with Brian's comment on my test script being broken.

However, the test_unicode_filename test looks reasonable because it calls do_cli with UTF-8 bytestrings, in much the same way as if directly called on the command line under Linux.

I'm going to investigate to find if and where bytestrings are being sent to simplejson, following up on warner's comment.

comment:38 Changed at 2008-12-24T20:43:40Z by zooko

Per recent discussion http://allmydata.org/pipermail/tahoe-dev/2008-December/000948.html , I've reverted some recent changes in 883e51b02dae732a, and marked the cli unicode test as todo in 9f117dbe8f7260a0.

Changed at 2009-02-17T11:57:45Z by francois

comment:39 Changed at 2009-02-17T18:02:10Z by zooko

  • Cc tahoe-dev@… added

Francois:

Thanks for the patch! I'm glad that it adds a test of mkdir as well.

Two things: 1. We prefer functions which are stricter about their inputs, so could you change to_unicode() so that it raises an exception if the argument is not a str? Like this:

from allmydata.util.assertutil import precondition

def to_unicode(s):
    precondition(isinstance(s, str), s)
    return s.decode('utf-8')  # then convert as before
  2. Could you please add some doc to CLI.txt explaining that it currently works only for utf-8 strings on argv or from the filesystem, and that it tries to detect if your filesystem is providing utf-8 and raise an exception if not. Oh, and also there is a third thing:
  3. What happens if I run a command-line like "tahoe cp A B" and sys.getdefaultencoding() returns something other than utf-8? Do I see a Python backtrace on stdout? Please make it so that this causes an explanatory "USAGE"-style string.

Oh, and

  4. Isn't sys.getdefaultencoding() just telling how the Python interpreter is configured, not the underlying operating system? Oh, actually look: there is a "sys.getfilesystemencoding()":

http://docs.python.org/library/sys.html#sys.getfilesystemencoding

So I guess 4 is to use sys.getfilesystemencoding() instead of sys.getdefaultencoding() when trying to determine what encoding we get back from os.listdir(), and 5 is to document in CLI.txt the distinction between filesystem encoding and command-line-arguments encoding.

Phewf. Too bad this isn't easier.

Thanks a lot for your help, François!

Regards,

Zooko

comment:40 Changed at 2009-02-17T18:23:05Z by zooko

See also François's post to tahoe-dev, which also mentions #565 (unicode arguments on the command-line) and #629 ('tahoe backup' doesn't tolerate 8-bit filenames).

comment:41 Changed at 2009-02-24T01:09:43Z by francois

Ok, the latest patch bundle (unicode.dpatch) will hopefully fix all those unicode issues (including #629) in a much cleaner way.

All encoding conversion code is now located in stringutils.py. It provides a single location where we might be able to use better encoding detection methods such as those discussed on the mailing-list.

Methods currently in use are:

  • for command line arguments (sys.argv)
    • convert from UTF-8 to unicode
    • if it fails, raise a UsageError
  • for text display (sys.stdout)
    • convert from unicode to sys.stdout.encoding and replace non-representable characters by '?'
  • for filename encoding on the filesystem (os.listdir(), open())
    • convert between sys.getfilesystemencoding() and unicode
    • if it fails, raise a UsageError

Many precondition checks have been added to ensure that filenames are treated as unicode objects.
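
A rough sketch (Python 2) of the three conversions described above; the exact names and signatures in stringutils.py may differ.

import os
import sys

from twisted.python import usage

def argv_to_unicode(s):
    try:
        return s.decode('utf-8')
    except UnicodeDecodeError:
        raise usage.UsageError("Argument %r is not valid UTF-8" % (s,))

def unicode_to_stdout(s):
    return s.encode(sys.stdout.encoding or 'ascii', 'replace')

def listdir_unicode(path):
    enc = sys.getfilesystemencoding()
    try:
        return [n.decode(enc) for n in os.listdir(path)]
    except UnicodeDecodeError:
        raise usage.UsageError("Filename not in expected encoding %r" % (enc,))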

comment:42 Changed at 2009-02-24T01:13:10Z by francois

  • Owner set to francois
  • Status changed from reopened to new

comment:43 Changed at 2009-02-24T01:13:38Z by francois

  • Status changed from new to assigned

Changed at 2009-02-24T11:50:41Z by francois

comment:44 Changed at 2009-02-24T11:53:37Z by francois

Uploaded a new version of unicode.dpatch which includes a patch to fix a conflict introduced by 5d57da93fd336692.

Changed at 2009-02-25T09:31:55Z by francois

comment:45 Changed at 2009-02-25T09:32:27Z by francois

Uploaded a new version of my patch (unicode-v2.dpatch) which fixes a conflict introduced by 7e8958671b2e07f4.

comment:46 Changed at 2009-03-29T14:10:43Z by francois

  • Keywords unicode filename utf-8 added
  • Milestone changed from undecided to 1.3.1

Ok, we had plenty of discussion on the mailing-list. Zooko posted a summary available at http://allmydata.org/pipermail/tahoe-dev/2009-March/001379.html

Basically, there are two cases:

  1. Filenames which, based on locale settings, can be decoded into unicode strings
  2. Filenames which, based on locale settings, cannot be decoded into unicode strings

Case 1 is what's currently implemented by my patches.

Case 2 requires optionally adding two attributes to each file, a bytestring and a guessed encoding. As Brian pointed out, the bytestring attribute must be encoded as base32 to survive a JSON round-trip.

Now, I first want to focus on case 1 while raising a helpful error in case 2.

Does that sound good?
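
A sketch of the case 2 idea, with hypothetical field names; the stdlib's base64.b32encode stands in for whatever base32 helper would actually be used.

import base64
import simplejson

raw = '\xc4rtonwall'                   # latin-1 bytes; not valid UTF-8
entry = {"filename": u"?rtonwall",     # lossy display form
         "base32_bytes": base64.b32encode(raw),
         "guessed_encoding": "iso-8859-1"}
print simplejson.dumps(entry)          # the raw bytes survive the JSON trip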

comment:47 Changed at 2009-03-31T03:48:44Z by zooko

Francois: thanks for working on this! I was planning to amend your patch myself, but I'll let you do it.

Here is my most recent idea about how this should be done:

http://allmydata.org/pipermail/tahoe-dev/2009-March/001379.html

Except that this *isn't* my most recent idea after all. I amended my intent a little, as prompted by pointed questions from nejucomo on IRC, and by looking at the actual source code where directories are processed:

http://allmydata.org/trac/tahoe/browser/src/allmydata/dirnode.py?rev=6e57576f2eeedcb8#L168

Then I tried to write down my ideas in detail and this forced me to realize that they were incomplete and wrong and I had to amend them a whole lot more in order to finish this letter. Finally, I asked JP Calderone for help, and he helped me understand how to write filenames back into a local Linux filesystem without risking that the user will accidentally overwrite their local files with tahoe files (because the tahoe files were written out under different representation than they were displayed), and how to do normalization, and how to cheaply ensure that silent misdecodings could be repaired in some future generation.

Okay, here's the best design yet:

I think that the unicode representation of the filename should continue to be the unique key in the directory (which current Tahoe 1.3.0 requires).

So there should be a data structure with a required "filename" part, and a required "failed_decode" flag, and an optional "alleged_encoding" part. The "filename" part is the canonical value of the filename, but we recognize that sometimes we can't actually get the *real* filename into unicode form. If our attempt to interpret the filename into unicode fails, then we set the "failed_decode" flag and put the iso-8859-1-decoding of it into the "filename" part.

Here are the steps of reading a filename from the filesystem and adding that filename into an existing Tahoe directory.

  1. On Windows or Mac read the filename with the unicode APIs. Normalize the string with filename = unicodedata.normalize('NFC', filename). Leave out the "alleged_encoding" part. Set the "failed_decode" flag to False.
  2. On Linux read the filename with the string APIs to get "bytes" and call sys.getfilesystemencoding() to get "alleged_encoding". Then, call bytes.decode(alleged_encoding, 'strict') to try to get a unicode object.

2.a. If this decoding succeeds then normalize the unicode filename with filename = unicodedata.normalize('NFC', filename), store the resulting filename and the alleged_encoding, and set the "failed_decode" to False. (Storing the alleged_encoding is for the benefit of future generations, who may discover that the decoding was actually wrong even though it didn't raise an error, and who could then use the alleged_encoding to undo the damage. For example Shawn Willden has a prototype tool which lets a human examine the filename as decoded with different encodings and pick the one that means something in a language they know.)

2.b. If this decoding fails, then we decode it again with bytes.decode('iso-8859-1', 'strict'). Do not normalize it. Put the resulting unicode object into the "filename" part, set the "failed_decode" flag to True, and leave the "alleged_encoding" field out. This is a case of mojibake:

http://en.wikipedia.org/wiki/Mojibake

The reason to go the mojibake route is that it preserves the information, and in theory someone could later decode it and figure out the original filename. This has actually happened at least once, as shown by the photograph on that wikipedia page of the package which was delivered to the Russian recipient. Mojibake! (footnote 1)
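
A sketch (Python 2, Linux branch only) of steps 2, 2.a and 2.b; the returned dict stands in for the proposed {filename, failed_decode, alleged_encoding} structure.

import sys
import unicodedata

def read_filename(name_bytes):
    alleged_encoding = sys.getfilesystemencoding()
    try:
        filename = name_bytes.decode(alleged_encoding, 'strict')
    except UnicodeDecodeError:
        # 2.b: mojibake fallback -- latin-1 never fails; do not normalize
        return {"filename": name_bytes.decode('iso-8859-1', 'strict'),
                "failed_decode": True}
    # 2.a: normalize, and remember what we decoded with
    return {"filename": unicodedata.normalize('NFC', filename),
            "failed_decode": False,
            "alleged_encoding": alleged_encoding}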

How does that sound?

Phewf. Okay, now for the trip in the other direction. Suppose you have a Tahoe filename object, and you need to create a file in the local filesystem, because for example the user runs "tahoe cp -r $DIRCAP/subdir .". There are four cases:

Case 1: You are using a unicode-safe filesystem such as Windows or Mac, and you have a unicode object with failed_decode=False.

This is easy: use the Python unicode filesystem APIs to create the file and be happy.

Case 2: You are using a unicode-safe filesystem and you have a unicode object with failed_decode=True.

This is easy: use the Python unicode filesystem APIs to create the file, passing the latin-1-decoded filename (mojibake!).

Case 3: You are using a plain-bytes filesystem such as Linux, and you have a unicode object with failed_decode=False.

This is easy: use the Python unicode filesystem APIs to create the file.

Case 4: You are using a plain-bytes filesystem such as Linux, and you have a unicode object with failed_decode=True.

Now we should *encode* the filename using iso-8859-1 to get a sequence of bytes, and then write those bytes into the filesystem using the Python string filesystem API. This is no worse than any alternative, and in the case that the target filesystem has the same encoding as the original filesystem (such as because it is the *same* as the original filesystem, or because it is owned by a friend of the owner of the original filesystem), then this will restore the file to its proper name.
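
And a sketch of the four write-back cases, continuing the structure from the previous sketch; it returns the name to hand to the filesystem API (a unicode object in cases 1 to 3, the original raw bytes in case 4).

import sys

def local_name(entry):
    unicode_safe_fs = sys.platform in ('win32', 'darwin')
    if unicode_safe_fs or not entry["failed_decode"]:
        return entry["filename"]                   # cases 1, 2 and 3
    return entry["filename"].encode('iso-8859-1')  # case 4: original bytes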

By the way, please see David Wheeler's recent proposal to start enforcing filename constraints in Linux: http://lwn.net/Articles/325304 . His proposals include changing Linux to require utf-8-encoding of all filenames.

Regards,

Zooko

footnote 1: I know that Alberto Berti has previously argued on tahoe-dev and on IRC that mojibake is less clean than the alternative of using bytes.decode(alleged_encoding, 'replace'). The latter is lossy, but it more clearly shows to the user that some or all of the filename couldn't be decoded. Alberto and others had convinced me of the wisdom of this, and I actually wrote this entire document specifying the 'decode-with-replace' approach instead of the mojibake approach, but I eventually realized that it wouldn't work. For one thing it was rather complicated to decide how to handle multiple filenames that all decode-with-replace to the same unicode name (you could imagine a whole directory full of files all named '????' because the locale is wrong). But the real killer is what to do when you are going to write the file back into the local filesystem. If you write a decoded-with-replace file back, then this means a round-trip from linux to tahoe and back can mess up all of your filenames. If you write the actual original bytes into the filesystem, then this means that you might accidentally overwrite files with a "tahoe cp", since "tahoe ls" just shows files with "???" in their names, but "tahoe cp" writes files out with actual characters instead of question marks.

comment:48 Changed at 2009-04-01T07:26:51Z by warner

Zooko: this sounds great to me!

comment:49 Changed at 2009-04-01T14:20:40Z by zooko

  • Milestone changed from 1.4.0 to 1.5.0

We're going to go ahead and release tahoe-1.4.0 ASAP, but this will be one of the first improvements to land in tahoe-1.5.0, I hope! :-)

Changed at 2009-04-01T19:41:52Z by francois

comment:50 Changed at 2009-04-01T19:44:26Z by francois

Ok, here is an amended version of all the previous patches.

It would be great if people could give it a try under Windows and MacOS X and publish their experiences here.

It currently only handles the simple case where the filename encoding is consistent with what's detected by Python (sys.getfilesystemencoding()).

comment:51 Changed at 2009-04-08T23:26:31Z by zooko

I'm reviewing your most recent patch, François.

I'll be posting my observations in separate comments as I understand more of the patch.

Here's the first observation:

The patch seems to assume that the terminal handles either ascii or utf-8 on stdout, but what about terminals that handle a different encoding, such as Windows cmd.exe (which presumably handles whatever the current Windows codepage is, or else utf-16le)? Apparently sys.stdout.encoding will tell us what python thinks it should use if you pass a unicode string to it with print myunicstr or sys.stdout.write(myunicstr).

In any case the documentation should explain this -- that what you see when you run tahoe ls will depend on the configuration of your terminal. Hm, this also suggests that it isn't correct for tahoe to have a unicode_to_stdout() function and instead we should just rely on the python sys.stdout encoding behavior. What do you think?

I guess one place where I would be willing to second-guess python on this is, if the sys.stdout.encoding says the encoding is ascii or says that it doesn't know what the encoding is, then pre-encode your unicode strings with utf-8 (or, if on Windows, with utf-16le), before printing them or sys.stdout.write()'ing them. This is because of the following set of reasons:

  1. A misconfigured environment will result in python defaulting to ascii when utf-8 will actually work better (I just now discovered that my own Mac laptop on which I am writing this was so misconfigured, and when I tried to fix it I then misconfigured it in a different way that had the same result! The first was: LANG and LC_ALL were being cleared out in my .bash_profile, the second was: I set LANG and LC_ALL to en_DK.UTF-8, but this version of Mac doesn't support that locale, so I had to change it to en_US.UTF-8.)
  2. Terminals that actually can't handle utf-8 and can only handle ascii are increasingly rare.
  3. If there is something that can handle only ascii and you give it utf-8, you'll be emitting garbage instead of raising an exception, which might be better in some cases. On the other hand I suppose it could be worse in others. (Especially when it happens to produce control characters and screws up your terminal emulator...)

I'm not entirely sure that this second-guessing of python is really going to yield better results more often than it yields worse results, and it is certainly more code, so I would also be happy with just emitting unicode objects to stdout and letting python and the local system config do the work from there.
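
For concreteness, a sketch (Python 2) of that second-guessing, with hypothetical helper names: trust sys.stdout.encoding unless it is missing or effectively ascii, in which case fall back to UTF-8 (UTF-16LE being the suggested Windows variant).

import sys

def stdout_encoding():
    enc = getattr(sys.stdout, 'encoding', None)
    if enc is None or enc.lower() in ('ascii', 'us-ascii', 'ansi_x3.4-1968'):
        return 'utf-8'
    return enc

def write_unicode(u):
    sys.stdout.write(u.encode(stdout_encoding(), 'replace'))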

small details and English spelling and editing: s/Tahoe v1.3.1/Tahoe v1.5.0/ s/aliase/alias/ s/commande/command/ s/moderns/modern/

comment:52 Changed at 2009-04-09T02:54:59Z by zooko

Oh, I see that your unicode_to_stdout function is exactly like the builtin python one except that it uses 'replace' mode. That's interesting. I guess maybe that is an improvement over the builtin python behavior (for our particular uses), and it also provides a function where we can add further second-guessing of python such as falling back to utf-8 or utf-16le.

comment:53 Changed at 2009-04-09T03:02:29Z by zooko

I'm trying to understand this patch better by splitting it up into smaller patches. Here is a version of it which strips out all the unicode_to_stdout() (which I subsequently realized is not a bad idea). The next step probably ought to be splitting this one into yet smaller patches -- such as everything to do with aliases in one, and everything to do with files in another, and everything to do with cmdline args in a third. Not sure if these can be cleanly separated (since filenames and aliases can come in through the cmdline args, for example), but I might try and see if I understand it better by the attempt. :-)

comment:54 Changed at 2009-04-09T04:01:34Z by zooko

And here is a version of it where I stripped out the aliases parts. (Note: I definitely want the aliases part, I'm just trying to separate them out to understand it better, and I'm posting the intermediate work here just in case François or anyone wants to see what I'm doing.)

comment:55 Changed at 2009-04-09T04:04:46Z by zooko

Okay and here I've stripped out the argv parts:

comment:56 Changed at 2009-04-09T04:54:37Z by zooko

Here I've stripped out the parts having to do with encoding for URLs. Also I noticed some kludgy broken tweaks that I had put in during earlier steps of this process and stripped those out too.

comment:57 Changed at 2009-04-10T16:51:16Z by zooko

Hm. I just learned that the windows-1252 encoding is a superset of the iso-8859-1 a.k.a. latin-1 encoding:

http://en.wikipedia.org/wiki/Windows-1252

The difference is that some bytes which are mapped to control characters in iso-8859-1 are mapped to characters in windows-1252. (Also maybe some of the characters are in a different order but that doesn't matter for this purpose.)

Does that mean that when doing the mojibake fallback when decoding fails, if we decode with windows-1252 instead of iso-8859-1 then we'll have fewer control characters in the resulting unicode string? That sounds like an improvement.
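
A quick check of that claim (Python 2 byte-string literals; byte 0x85 is the C1 control character NEL in iso-8859-1 but a printable ellipsis in windows-1252):

print repr('\x85'.decode('iso-8859-1'))    # u'\x85' -- a control character
print repr('\x85'.decode('windows-1252'))  # u'\u2026' -- '…', printable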

comment:58 Changed at 2009-04-28T16:08:48Z by zooko

Hm, so there is this idea by Markus Kuhn called utf-8b. utf-8b decoding is just like utf-8 decoding, except that if the input string turns out not to be valid utf-8 encoding, then utf-8b stores the invalid bytes of the string as invalid code points in the resulting unicode object. This means that utf8b_encode(utf8b_decode(x)) == x for any x (not just for x's which are utf-8-encodings of a unicode string).
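
(Python 3.1 later shipped exactly this idea as PEP 383's 'surrogateescape' error handler -- see the link below; in Python 3 syntax the round-trip property looks like this:)

raw = b'caf\xe9'                                     # latin-1 bytes, not valid utf-8
s = raw.decode('utf-8', 'surrogateescape')           # invalid byte -> lone surrogate U+DCE9
assert s.encode('utf-8', 'surrogateescape') == raw   # lossless round trip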

I wonder if utf-8b provides a simpler/cleaner way to accomplish the above. It would look like this. Take the design written in http://allmydata.org/trac/tahoe/ticket/534#comment:47 and change step 2 to be like this:

  2. On Linux read the filename with the string APIs to get "bytes" and call sys.getfilesystemencoding() to get "alleged_encoding". If the alleged encoding is ascii or utf-8, or if it is absent or invalid or denotes a codec that we don't have an implementation for, then set alleged_encoding = 'utf-8b' instead. Then, call bytes.decode(alleged_encoding, 'strict') to try to get a unicode object.

2.a. If this decoding succeeds then normalize the unicode filename with filename = unicodedata.normalize('NFC', filename), store the resulting filename and if the encoding that was used was not utf-8b then store the alleged_encoding. (If the encoding that was used was utf-8b, then don't store the alleged_encoding -- utf-8b is the default and we can save space by omitting it.)

2.b. If this decoding fails, then we decode it with bytes.decode('utf-8b'). Do not normalize it. Put the resulting unicode object into the "filename" part. Do not store an "alleged_encoding".

Using utf-8b to store bytes from a failed decoding instead of iso-8859-1 means that if the name or part of the name is actually ascii or utf-8, then it will be (at least partially) legible. It also means that we can omit the "failed_decode" flag, because it makes no difference whether the filename was originally alleged to be in koi8-r, but failed to decode using the koi8-r codec, and so was instead decoded using utf-8b, or whether the filename was originally alleged to be in ascii or utf-8, and was decoded using utf-8b. (Right? I think that's right.)

An implementation, including a Python codec module, by Eric S. Tiedemann (1966-2008; I miss him): http://hyperreal.org/~est/utf-8b

An implementation for GNU iconv by Ben Sittler: http://bsittler.livejournal.com/10381.html

A PEP by Martin v. Löwis to automatically use utf-8b whenever you would otherwise use utf-8: http://www.python.org/dev/peps/pep-0383

comment:59 Changed at 2009-04-28T18:00:54Z by zooko

I wrote: """Using utf-8b to store bytes from a failed decoding instead of iso-8859-1 means ... that we can omit the "failed_decode" flag, because it makes no difference whether the filename was originally alleged to be in koi8-r, but failed to decode using the koi8-r codec, and so was instead decoded using utf-8b, or whether the filename was originally alleged to be in ascii or utf-8, and was decoded using utf-8b. (Right? I think that's right.)"""

Oh no, I am wrong about this because of the existence of byte-oriented systems where the filesystem encoding is not utf-8. When outputting a filename into such a system, you ought to check the "failed_decode" flag and, if it is set, reconstitute the original bytes before proceeding to emit the name using the byte-oriented API.

Here is some code which attempts to explain what I mean. It is an untested sketch -- the 'utf-8b' codec and the 'python-escape'/'python-replace' error handlers are hypothetical, following the PEP 383 discussion -- but writing it helped me think this through a little more:

# A wrapper around the Python Standard Library's filename access functions to
# provide a uniform API for all platforms and to prevent lossy en/de-coding.

import os, platform, sys, unicodedata

from allmydata.util.assertutil import precondition  # assumed import paths
from twisted.python import usage

class Fname:
    def __init__(self, name, failed_decode=False, alleged_encoding=None):
        self.name = name
        self.failed_decode = failed_decode
        self.alleged_encoding = alleged_encoding

if platform.system() in ('Linux', 'Solaris'):
    # on byte-oriented filesystems, such as Linux and Solaris

    def unicode_to_fs(fn):
        """ Encode an unicode object to bytes. """
        precondition(isinstance(fn, Fname), fn)
        precondition(isinstance(fn.name, unicode), fn.name)

        if fn.failed_decode:
            # This means that the unicode string in .name is not actually the
            # result of a successful decoding with a suggested codec, but is
            # instead the result of stuffing the bytes into a unicode by dint
            # of the utf-8b trick.  This means that on a byte-oriented system,
            # you shouldn't treat the .name as a unicode string containing
            # chars, but instead you should get the original bytes back out of
            # it.
            # Note: the 'utf-8b' codec and the 'python-replace'/'python-escape'
            # error handlers are hypothetical, per the PEP 383 drafts.
            return fn.name.encode('utf-8b', 'python-replace')
        else:
            fsencoding = sys.getfilesystemencoding()
            if fsencoding in (None, '', 'ascii', 'utf-8'):
                fsencoding = 'utf-8b'
            try:
                return fn.name.encode(fsencoding, 'python-escape')
            except UnicodeEncodeError:
                raise usage.UsageError("Filename %r cannot be encoded using \
the current encoding of your filesystem (%s). Please configure your locale \
correctly or rename this file." % (fn.name, sys.getfilesystemencoding()))

    def fs_to_unicode(bytesfn):
        """ Decode bytes from the filesystem to a unicode object. """
        precondition(isinstance(bytesfn, str), bytesfn)

        alleged_encoding = sys.getfilesystemencoding()
        if alleged_encoding in (None, '', 'ascii', 'utf-8'):
            alleged_encoding = 'utf-8b'
            
        try:
            unicodefn = bytesfn.decode(alleged_encoding, 'strict')
        except UnicodeDecodeError:
            unicodefn = bytesfn.decode('utf-8b', 'python-escape')
            return Fname(unicodefn)
        else:
            unicodefn = unicodedata.normalize('NFC', unicodefn)
            if alleged_encoding == 'utf-8b':
                return Fname(unicodefn)
            else:
                return Fname(unicodefn, alleged_encoding=alleged_encoding)

    def listdir(fn):
        assert isinstance(fn, Fname), fn
        assert isinstance(fn.name, unicode), fn.name
        bytesfn = unicode_to_fs(fn)
        res = os.listdir(bytesfn)
        return [fs_to_unicode(n) for n in res]

else:
    # on unicode-oriented filesystems, such as Mac and Windows
    def listdir(fn):
        assert isinstance(fn, Fname), fn
        assert isinstance(fn.name, unicode), fn.name
        return [Fname(n) for n in os.listdir(fn.name)]

Also attached to this ticket...

Changed at 2009-04-28T18:11:37Z by zooko

PEP 383'ish implementation of listdir()

comment:60 Changed at 2009-04-29T06:11:02Z by zooko

I was referencing this ticket and my (untested) example code here in the PEP 383 thread on python-dev (http://mail.python.org/pipermail/python-dev/2009-April/089170.html ), and I realized that I forgot to set the "failed_decode" flag in my example code. Here is a new version of it that sets the "failed_decode" flag, and that uses utf-8 for the attempted decode and only uses utf-8b when doing the fallback (for clarity -- should make no difference since the attempted decode uses error handling mode 'strict').

I will also attach it.

# A wrapper around the Python Standard Library's filename access functions to
# provide a uniform API for all platforms and to prevent lossy en/de-coding.

import os, platform, sys, unicodedata

from allmydata.util.assertutil import precondition  # assumed import paths
from twisted.python import usage

class Fname:
    def __init__(self, name, failed_decode=False, alleged_encoding=None):
        self.name = name
        self.failed_decode = failed_decode
        self.alleged_encoding = alleged_encoding

if platform.system() in ('Linux', 'Solaris'):
    # on byte-oriented filesystems, such as Linux and Solaris

    def unicode_to_fs(fn):
        """ Encode an unicode object to bytes. """
        precondition(isinstance(fn, Fname), fn)
        precondition(isinstance(fn.name, unicode), fn.name)

        if fn.failed_decode:
            # This means that the unicode string in .name is not
            # actually the result of a successful decoding with a
            # suggested codec, but is instead the result of stuffing the
            # bytes into a unicode by dint of the utf-8b trick.  This
            # means that on a byte-oriented system, you shouldn't treat
            # the .name as a unicode string containing chars, but
            # instead you should get the original bytes back out of it.
            # Note: the 'utf-8b' codec and the 'python-replace'/'python-escape'
            # error handlers are hypothetical, per the PEP 383 drafts.
            return fn.name.encode('utf-8b', 'python-replace')
        else:
            fsencoding = sys.getfilesystemencoding()
            if fsencoding in (None, '', 'ascii', 'utf-8'):
                fsencoding = 'utf-8b'
            try:
                return fn.name.encode(fsencoding, 'python-escape')
            except UnicodeEncodeError:
                raise usage.UsageError("Filename %r cannot be \
encoded using the current encoding of your filesystem (%s). Please \
configure your locale correctly or rename this file." %
                                       (fn.name, sys.getfilesystemencoding()))

    def fs_to_unicode(bytesfn):
        """ Decode bytes from the filesystem to a unicode object. """
        precondition(isinstance(bytesfn, str), bytesfn)

        alleged_encoding = sys.getfilesystemencoding()
        if alleged_encoding in (None, '', 'ascii'):
            alleged_encoding = 'utf-8'
            
        try:
            unicodefn = bytesfn.decode(alleged_encoding, 'strict')
        except UnicodeDecodeError:
            unicodefn = bytesfn.decode('utf-8b', 'python-escape')
            return Fname(unicodefn, failed_decode=True)
        else:
            unicodefn = unicodedata.normalize('NFC', unicodefn)
            if alleged_encoding == 'utf-8':
                return Fname(unicodefn)
            else:
                return Fname(unicodefn, alleged_encoding=alleged_encoding)

    def listdir(fn):
        assert isinstance(fn, Fname), fn
        assert isinstance(fn.name, unicode), fn.name
        bytesfn = unicode_to_fs(fn)
        res = os.listdir(bytesfn)
        return [fs_to_unicode(n) for n in res]

else:
    # on unicode-oriented filesystems, such as Mac and Windows
    def listdir(fn):
        assert isinstance(fn, Fname), fn
        assert isinstance(fn.name, unicode), fn.name
        return [Fname(n) for n in os.listdir(fn.name)]

Changed at 2009-04-29T06:11:26Z by zooko

comment:61 Changed at 2009-04-29T16:35:39Z by francois

As promised, here's a darcs patch bundle containing the 17 patches which implement *basic* Unicode filename support, for review purposes.

Changed at 2009-04-29T16:36:10Z by francois

comment:62 Changed at 2009-04-29T16:55:48Z by zooko

Hooray! Thanks!

comment:63 Changed at 2009-05-01T04:53:47Z by zooko

Fixed another small error that I discovered while writing to python-dev. Attaching latest version.

Changed at 2009-05-01T04:54:47Z by zooko

comment:64 Changed at 2009-05-01T05:09:15Z by zooko

Fixed a couple more bugs and unnecessary bits found by inspection while writing to tahoe-dev.

Changed at 2009-05-01T05:09:31Z by zooko

comment:65 Changed at 2009-05-01T15:33:16Z by zooko

Yet again I realize that my previous understanding was incomplete or subtly flawed. This encoding stuff is very tricky! Here's my post to python-dev where I postulate that PEP 383 actually does solve most of our problems after all:

http://mail.python.org/pipermail/python-dev/2009-May/089318.html

Please read!

comment:66 Changed at 2009-05-03T15:14:28Z by zooko

Okay, my current design is at the end of this message, including rationale:

http://allmydata.org/pipermail/tahoe-dev/2009-May/001670.html

Here is the summary:

To copy an entry from a local filesystem into Tahoe:

  1. On Windows or Mac read the filename with the unicode APIs. Normalize the string with filename = unicodedata.normalize('NFC', filename). Leave the "original_bytes" key and the "failed_decode" flag out of the metadata.

  2. On Linux or Solaris read the filename with the string APIs, and store the result in the "original_bytes" part of the metadata. Call sys.getfilesystemencoding() to get an alleged_encoding. Then, call bytes.decode(alleged_encoding, 'strict') to try to get a unicode object.

2.a. If this decoding succeeds then normalize the unicode filename with filename = unicodedata.normalize('NFC', filename), store the resulting filename and leave the "failed_decode" flag out of the metadata.

2.b. If this decoding fails, then we decode it again with bytes.decode('latin-1', 'strict'). Do not normalize it. Store the resulting unicode object into the "filename" part, set the "failed_decode" flag to True. This is mojibake!

  3. (handling collisions) In either case 2.a or 2.b the resulting unicode string may already be present in the directory. If so, check the failed_decode flags on the current entry and the new entry. If they are both set or both unset then the new entry overwrites the old entry -- they had the same name. If the failed_decode flags differ then this is a case of collision -- the old entry and the new entry had (as far as we are concerned) different names that accidentally generated the same unicode. Alter the new entry's name, for example by appending "~1" and then trying again and incrementing the number until it doesn't match any extant entry.

To copy an entry from Tahoe into a local filesystem:

Always use the Python unicode API. The original_bytes field and the failed_decode field in the metadata are not consulted.
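
A minimal sketch of the renaming rule in step 3 (a hypothetical helper, not taken from any attached patch; the caller would first compare the failed_decode flags to decide between overwriting and renaming):

def uniquify(name, entries):
    # entries maps unicode filename -> directory entry; append "~1",
    # "~2", ... until the proposed name matches no extant entry.
    if name not in entries:
        return name
    i = 1
    while u'%s~%d' % (name, i) in entries:
        i += 1
    return u'%s~%d' % (name, i)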

comment:67 Changed at 2009-05-04T23:02:46Z by zooko

  • Keywords review added

comment:68 Changed at 2009-05-04T23:04:16Z by zooko

  • Owner changed from francois to nobody
  • Status changed from assigned to new

comment:69 Changed at 2009-06-09T19:48:05Z by zooko

François nicely broke his patch up into many small patches and attached the whole set as "tahoe-534-bundle.dpatch" ((68.8 KB) - added by francois 6 weeks ago). Here I'm attaching them as separate attachments so that I can link to them. Also I re-recorded one in order to avoid a conflict in test_cli.py.

comment:70 Changed at 2009-06-21T20:01:35Z by zooko

  • Keywords review removed
  • Owner changed from nobody to francois

Unsetting the "review" keyword because this patch is not ready to review until François (or someone) reworks it to use os.listdir(a_unicode_thing) on Windows and Mac and os.listdir(a_string_thing) on other platforms.

comment:71 Changed at 2009-06-30T18:08:07Z by zooko

  • Milestone changed from 1.5.0 to eventually

comment:72 Changed at 2009-11-23T02:45:09Z by davidsarah

  • Keywords forward-compatibility added

comment:73 Changed at 2009-11-23T03:22:23Z by davidsarah

Presumably these are the patches #734 is referring to.

comment:74 Changed at 2009-12-03T18:11:52Z by zooko

  • Keywords encoding utf-8 removed

comment:75 follow-up: Changed at 2009-12-06T21:18:06Z by davidsarah

To fix #734, unicode_to_stdout in the patched util/stringutils.py should be something like:

def unicode_to_stdout(s):
    """
    Encode a unicode object for representation on stdout.
    """

    if s is None:
        return None
    precondition(isinstance(s, unicode), s)

    try:
        return s.encode(sys.stdout.encoding, 'replace')
    catch LookupError:
        return s.encode('utf-8', 'replace')

(This doesn't explain why tahoe_ls.py was attempting to use None for the name it was trying to convert, but that's presumably a separate issue not caused by the patch.)

The reason for the catch LookupError is that Python can sometimes set sys.stdout.encoding to an encoding that it does not recognize. For example, if you have Windows cmd.exe set to use UTF-8 by chcp 65001, sys.stdin.encoding and sys.stdout.encoding will be 'cp65001', which is not recognized as being the same as UTF-8. (If the encoding was actually something other than UTF-8, I think that producing mojibake on stdout is better than throwing an exception. Note that terminals that do bad things when they receive control characters are already broken; this isn't making it worse, because plain ASCII can include such control characters.)
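
The cp65001 problem is easy to demonstrate on the Pythons of this era (a cp65001 codec was only added much later, in Python 3.3):

import codecs
try:
    codecs.lookup('cp65001')
except LookupError:
    print "no codec for cp65001, hence the fallback above"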

There are other issues with the patch; I just wanted to comment on this part while it's swapped into my brain (and people are now calling me to go and play scrabble, which sounds more fun :-)

comment:76 in reply to: ↑ 75 Changed at 2009-12-06T22:29:25Z by davidsarah

Replying to davidsarah:

catch LookupError:

Python was obviously not fully swapped in -- s/catch/except/.

comment:77 Changed at 2009-12-07T03:18:12Z by davidsarah

  • Priority changed from minor to major
  • Summary changed from "tahoe cp" command encoding issue to CLI (e.g. tahoe cp) does not handle Unicode file/directory names, either in arguments or the local filesystem

comment:78 Changed at 2009-12-07T03:19:21Z by davidsarah

  • Summary changed from CLI (e.g. tahoe cp) does not handle Unicode file/directory names, either in arguments or the local filesystem to CLI (e.g. tahoe cp) does not correctly handle or display Unicode file/directory names, either in arguments or the local filesystem

comment:79 Changed at 2009-12-07T03:52:53Z by davidsarah

Sigh. I misread the exception message in #734. It is complaining about sys.stdout.encoding being None when stdout is redirected, not about the input string being None. So the fix for that should be something like

def unicode_to_stdout(s):
    """
    Encode a unicode object for representation on stdout.
    """

    precondition(isinstance(s, unicode), s)

    enc = sys.stdout.encoding
    if enc is None or enc == 'cp65001':
        enc = 'utf-8'

    try:
        return s.encode(enc, 'replace')
    except LookupError:
        return s.encode('utf-8', 'replace')  # maybe

It might also be a good idea to write a BOM at the start of the output. That would allow the encoding to be autodetected if a file containing redirected output is edited, which is helpful at least on Windows, and should be harmless on other platforms.
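
A sketch of that suggestion, assuming we only emit the BOM when stdout is redirected (i.e. when its encoding is None):

import codecs, sys

if sys.stdout.encoding is None:        # stdout is redirected to a file/pipe
    sys.stdout.write(codecs.BOM_UTF8)  # '\xef\xbb\xbf'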

At least I did better at scrabble (won by 3 points :-)

comment:80 Changed at 2010-01-27T06:07:41Z by zooko

  • Milestone changed from eventually to 1.7.0

comment:81 Changed at 2010-02-12T05:03:49Z by davidsarah

  • Keywords tahoe-cp added; cp removed

comment:82 Changed at 2010-03-12T17:49:01Z by davidsarah

  • Keywords gsoc added

Changed at 2010-04-24T00:11:27Z by francois

This patch contains Unicode helper functions (stringutils.py) and associated tests

comment:83 Changed at 2010-04-24T00:22:43Z by francois

I'd be very grateful if someone could review patch attachment:unicode-helper-functions.diff

Then, I'll post a second patch which will modify the CLI code to actually use those helper functions.

comment:84 Changed at 2010-04-24T04:37:10Z by zooko

  • Owner changed from francois to zooko
  • Status changed from new to assigned

I will review it. If I understand correctly, François's strategy is to first write patches to handle the case where the bytes decode without error using the nominal encoding. This avoids all the tricky problems with bytes that don't decode according to the nominal encoding.

I'm excited about having this mocking tool so we can conveniently express platform-specific behavior in a deterministic, automated, cross-platform test.

comment:85 Changed at 2010-04-25T20:30:19Z by francois

I plan to add the following text to the NEWS file to tell users the current state of Unicode support.

** Bugfixes

*** Unicode filenames handling

Tahoe CLI commands working on local files, for instance 'tahoe cp' or 'tahoe
backup', have been improved to correctly handle filenames containing non-ASCII
characters.

In the case where Tahoe encounters a filename which cannot be decoded using the
system encoding, an error will be returned and the operation will fail.  Under
Linux, this typically happens when the filesystem contains filenames encoded in
a different encoding, for instance latin1, than the system locale, for instance
UTF-8.  In such cases, you'll need to fix your filesystem with tools such as
'convmv' before using the Tahoe CLI.

All CLI commands have been improved to support non-ASCII parameters such as
filenames and aliases on all supported operating systems.

Changed at 2010-05-04T07:27:42Z by francois

comment:86 Changed at 2010-05-04T07:33:06Z by francois

Patch attachment:unicode-filenames-handling.diff makes use of helper functions from attachment:unicode-helper-functions.diff to handle Unicode filenames in the CLI.

For now, only filenames which are encoded consistently with the system locale are handled. In any other case, an error message is displayed and the current operation is stopped.

Unicode CLI arguments in Windows cmd.exe don't work. See ticket #565 about this issue.

comment:87 Changed at 2010-05-11T15:52:49Z by zooko

  • Keywords review-needed added

I really want to see this patch in trunk in the next 48 hours for Tahoe-LAFS v1.7, but I can't review it myself right now. Unassigning myself and writing a note to tahoe-dev pleading for someone else to review it.

comment:88 Changed at 2010-05-11T15:53:02Z by zooko

  • Owner changed from zooko to nobody
  • Status changed from assigned to new

comment:89 Changed at 2010-05-11T17:51:13Z by davidsarah

  • Owner changed from nobody to davidsarah
  • Status changed from new to assigned

comment:90 follow-up: Changed at 2010-05-14T04:36:13Z by zooko

  • Milestone changed from 1.7.0 to 1.8.0

Here is my review of attachment:unicode-filenames-handling.diff and attachment:unicode-helper-functions.diff .

Overall this is a good start. I'm glad we have a way to do real thorough tests of encoding (by mocking up stdout, sys, and other components of the system). This patch isn't ready to go into trunk so we won't ship it in v1.7 but at this rate we can definitely get it all polished up for v1.8.

Here are my comments:

I like the NEWS file text. :-)

The unit tests of the helper functions look good! Please add tests of the two cases where listdir_unicode_unix() would raise FilenameEncodingError.

     # If you force Windows cmd.exe set to use UTF-8 by typing 'chcp 65001',
     # sys.stdin.encoding and sys.stdout.encoding will be set to 'cp65001',
     # which is not recognized as being the same as UTF-8.

What does this comment mean? Recognized by whom? The code looks like it detects this and corrects it, so I guess you don't mean recognized by Tahoe-LAFS.

     if enc == 'cp850':
         enc = 'ISO-8859-1'

This looks wrong! cp850 and iso-8859-1 are different encodings. If I am right that this is wrong then it should be possible to write a test case that causes Tahoe-LAFS to incorrectly decode an input as long as this statement is still in the code.

Oh and look what I just found! It appears that cp65001 and utf-8 are also different: http://bugs.python.org/issue6058 (specifically http://bugs.python.org/msg97731 ). Let's see if we can write a unit test that goes red when Tahoe-LAFS (or perhaps the underlying Python libraries) incorrectly equate cp65001 and utf-8.

def argv_to_unicode(s):
...
         return unicode(s, get_stdout_encoding())
...
         raise usageError("Argument '%s' cannot be decoded as UTF-8." % s)

Is stdout encoding guaranteed to be the same as argv encoding? Maybe we should rename get_stdout_encoding() or at least add a comment to argv_to_unicode() explaining that we use stdout encoding for this. Also the UsageError has the wrong text. Also it has the wrong name of the exception class! This proves that the line is not covered by unit tests. Please add a unit test of what happens with something that can't be decoded as the stdout encoding. :-)

def unicode_to_stdout(s):
...
    except LookupError:
        return s.encode('utf-8', 'replace')  # maybe

If we're going to do this fallback let's have a unit test of it. This means a test of the case that get_stdout_encoding() returns an encoding that Python doesn't know how to do, right!? Maybe let's not have a test of that and also not have that code at all and let it raise LookupError in that case. :-)

"encoutered" -> "encountered"

Finish the test for open_unicode().

Add a test that the normalization is useful, i.e. a test that is red if we skip the unicodedata.normalize() step.
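
Such a test only needs a string containing a combining character, since NFC folds it into a single precomposed code point:

import unicodedata
nfd = u'A\u0308rtonwall'                  # 'A' + COMBINING DIAERESIS
nfc = unicodedata.normalize('NFC', nfd)   # u'\xc4rtonwall' -- precomposed 'Ä'
assert nfc != nfd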

There are a couple of hard-coded 'utf-8' and a literal 'ascii' in attachment:unicode-filenames-handling.diff . Some of them may be correct, for example if it is encoding to a URL, but in that case maybe we should invoke unicode_to_url() instead (mostly for documentation purposes). Some of them appear to be wrong to me -- aliases files shouldn't be hard-coded to utf-8... Oh wait, yes they should, right? Then please add doc stating so. Finally the encoding for stdout in tahoe_manifest.py definitely seems wrong to be hard-coded to 'utf-8'. Add a unit test showing that the current code with the hard-coded 'utf-8' fails on a system where stdout needs a different encoding.

In test_cli.py, if the test_listdir_unicode_bad() test doesn't make sense on the current platform, then raise SkipTest instead of returning. Finally, use a code coverage tool (I recommend trialcoverage) to see if any of the lines changed by these patches are not-fully-tested by these tests.

comment:91 in reply to: ↑ 90 Changed at 2010-05-16T22:06:49Z by francois

Zooko, thank you very much for the review.

Replying to zooko:

Here is my review of attachment:unicode-filenames-handling.diff and attachment:unicode-helper-functions.diff .

Overall this is a good start. I'm glad we have a way to do real thorough tests of encoding (by mocking up stdout, sys, and other components of the system). This patch isn't ready to go into trunk so we won't ship it in v1.7 but at this rate we can definitely get it all polished up for v1.8.

Please find an updated version of the patches which takes all your comments into account and now has 100% test coverage.

The unit tests of the helper functions look good! Please add tests of the two cases where listdir_unicode_unix() would raise FilenameEncodingError.

Done

     # If you force Windows cmd.exe set to use UTF-8 by typing 'chcp 65001',
     # sys.stdin.encoding and sys.stdout.encoding will be set to 'cp65001',
     # which is not recognized as being the same as UTF-8.

What does this comment mean? Recognized by whom? The code looks like it detects this and corrects it, so I guess you don't mean recognized by Tahoe-LAFS.

     if enc == 'cp850':
         enc = 'ISO-8859-1'

This looks wrong! cp850 and iso-8859-1 are different encodings. If I am right that this is wrong then it should be possible to write a test case that causes Tahoe-LAFS to incorrectly decode an input as long as this statement is still in the code.

Oh and look what I just found! It appears that cp65001 and utf-8 are also different: http://bugs.python.org/issue6058 (specifically http://bugs.python.org/msg97731 ). Let's see if we can write a unit test that goes red when Tahoe-LAFS (or perhaps the underlying Python libraries) incorrectly equate cp65001 and utf-8.

The idea behind this piece of code came from comments 75 and 79. Anyway, I removed it completely because getting correct Unicode command-line arguments under Windows will require calling the Unicode Windows API function GetCommandLineW; see ticket #565.

In the meantime, I've marked all Windows test cases as todo.

def argv_to_unicode(s):
...
         return unicode(s, get_stdout_encoding())
...
         raise usageError("Argument '%s' cannot be decoded as UTF-8." % s)

Is stdout encoding guaranteed to be the same as argv encoding? Maybe we should rename get_stdout_encoding() or at least add a comment to argv_to_unicode() explaining that we use stdout encoding for this. Also the UsageError has the wrong text. Also it has the wrong name of the exception class! This proves that the line is not covered by unit tests. Please add a unit test of what happens with something that can't be decoded as the stdout encoding. :-)

The function was renamed get_term_encoding(), the wrong comment was rewritten and the error message now contains the guessed terminal encoding.

def unicode_to_stdout(s):
...
    except LookupError:
        return s.encode('utf-8', 'replace')  # maybe

If we're going to do this fallback let's have a unit test of it. This means a test of the case that get_stdout_encoding() returns an encoding that Python doesn't know how to do, right!? Maybe let's not have a test of that and also not have that code at all and let it raise LookupError in that case. :-)

According to the email message linked from ticket #734, users probably prefer having slightly modified output in the terminal (unknown characters replaced by '?') rather than an exception.

"encoutered" -> "encountered"

Fixed :)

Finish the test for open_unicode().

Done.

Add a test that the normalization is useful, i.e. a test that is red if we skip the unicodedata.normalize() step.

Done in test_unicode_normalization().

There are a couple of hard-coded 'utf-8' and a literal 'ascii' in attachment:unicode-filenames-handling.diff . Some of them may be correct, for example if it is encoding to a URL, but in that case maybe we should invoke unicode_to_url() instead (mostly for documentation purposes).

I already invoke unicode_to_url() instead of .encode('utf-8') in every place. The two remaining hard-coded 'utf-8' occurrences are for the alias file encoding.

Hard-coded 'ascii' is used for representing capabilities, it might be better to create Capability objects instead of using str but that's most likely out of the scope of this patch.

Some of them appear to be wrong to me -- aliases files shouldn't be hard-coded to utf-8... Oh wait, yes they should, right? Then please add doc stating so. Finally the encoding for stdout in tahoe_manifest.py definitely seems wrong to be hard-coded to 'utf-8'. Add a unit test showing that the current code with the hard-coded 'utf-8' fails on a system where stdout needs a different encoding.

Yes, the alias file is read and written as UTF-8. A note was added in frontends/CLI.txt.
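
A hypothetical sketch of what reading an alias file as UTF-8 looks like (the real parser in scripts/common.py may differ in detail):

import codecs

def read_aliases(path):
    aliases = {}
    for line in codecs.open(path, 'r', 'utf-8'):
        line = line.strip()
        if line and not line.startswith(u'#'):
            name, cap = line.split(u':', 1)   # the cap follows the first colon
            aliases[name.strip()] = cap.strip()
    return aliases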

In test_cli.py, if the test_listdir_unicode_bad() test doesn't make sense on the current platform, then raise SkipTest instead of returning.

Done.

Finally, use a code coverage tool (I recommend trialcoverage) to see if any of the lines changed by these patches are not-fully-tested by these tests.

I had trouble making use of the test-coverage Makefile target, so I finally used coverage.py directly.

Changed at 2010-05-17T08:07:19Z by francois

Changed at 2010-05-17T08:07:44Z by francois

comment:92 follow-up: Changed at 2010-05-17T14:35:46Z by zooko

Let's define capabilities as being utf-8 encoded for forward-compatibility purposes. (No new unit test is required to test this change.)

Can you (or have you already) written a unit test which exercises the fallback: except LookupError?: return s.encode('utf-8', 'replace')?

As we discussed on IRC, this version doesn't do any lossy decoding from local filesystem into unicode -- if a filename can't be decoded with the declared sys.getfilesystemencoding() then an exception is raised. Is this exercised by a unit test?

I can't think of any other issues -- we just need someone to review the latest patch.

comment:93 in reply to: ↑ 92 ; follow-up: Changed at 2010-05-17T23:14:23Z by francois

Replying to zooko:

Let's define capabilities as being utf-8 encoded for forward-compatibility purposes. (No new unit test is required to test this change.)

Shouldn't capabilities be defined as plain ASCII?

Can you (or have you already) written a unit test which exercises the fallback: except LookupError?: return s.encode('utf-8', 'replace')?

Yes, in test_stringutils.py line 70:

  @patch('sys.stdout') 
  def test_unicode_to_stdout(self, mock): 
    # Encoding koi8-r cannot represent 'è' 
    mock.encoding = 'koi8-r' 
    self.failUnlessEqual(unicode_to_stdout(u'lumière'), 'lumi?re') 

As we discussed on IRC, this version doesn't do any lossy decoding from local filesystem into unicode -- if a filename can't be decoded with the declared sys.getfilesystemencoding() then an exception is raised. Is this exercised by a unit test?

Yes, line 98 of test_stringutils.py:

  @patch('sys.getfilesystemencoding')
  def test_open_unicode(self, mock):
    mock.return_value = 'ascii'

    self.failUnlessRaises(FilenameEncodingError,
                          open_unicode,
                          u'lumière')

I can't think of any other issues -- we just need someone to review the latest patch.

comment:94 in reply to: ↑ 93 ; follow-up: Changed at 2010-05-18T03:15:10Z by zooko

Replying to francois:

Replying to zooko:

Let's define capabilities as being utf-8 encoded for forward-compatibility purposes. (No new unit test is required to test this change.)

Shouldn't capabilities be defined as plain ASCII?

Future caps might include non-ascii characters. And some of the current code might be able to process caps of a future version. (Forward-compatibility.) There's no reason not to define the current ones as utf-8, is there?

comment:95 follow-up: Changed at 2010-05-18T12:57:28Z by zooko

USSJoin and drewp had this to say on IRC:

<USSJoin> Zooko: Still need someone to look at that?		        [22:19]
<drewp> that does -not- look like anything i'd commit 10 hours before a
	release (unless it has already had a long testing period). Maybe with
	an env var guard, i would
<drewp> oh i see, it's had 1+ year of work already		        [22:20]
<davidsarah> the latest patches haven't, though			        [22:21]
<davidsarah> the approach changed quite a bit
<USSJoin> Zooko, Oh Hudson Master: Do the tests run? :-)	        [22:23]
<USSJoin> Zooko: Only other thing I'd add is that I'm concerned about the
	  preconditions— does that library fail more gracefully than a simple
	  assert()? If not, the whole pancake house is going to fall down,
	  very very hard.
<USSJoin> Other than that, it's been incredibly well-harassed by the whole
	  Tahoe team, I see. So I'd +1.				        [22:27]

comment:96 in reply to: ↑ 94 Changed at 2010-05-18T23:19:29Z by francois

Replying to zooko:

Future caps might include non-ascii characters. And some of the current code might be able to process caps of a future version. (Forward-compatibility) There's no reason not to define the current ones as utf-8 is there?

Ok, I see your point. I'll modify the two .encode('ascii') into .encode('utf-8') in file src/allmydata/scripts/common.py and write a unit test which adds an alias pointing to a capability from the future ('fi-蜔쳨欝遃䝦舜琇襇邤䍏㵦☚✸킾궑蒴犏띎냔㳆㼿졨浴䒉ΐ屝稜퍙鉧迴') as proposed, and tries to get it back.

Changed at 2010-05-19T00:41:41Z by francois

comment:97 in reply to: ↑ 95 Changed at 2010-05-19T00:52:37Z by francois

Replying to zooko:

<USSJoin> Zooko: Only other thing I'd add is that I'm concerned about the

preconditions— does that library fail more gracefully than a simple assert()? If not, the whole pancake house is going to fall down, very very hard.

The preconditions are only there for extra safety, to prevent programmers from using the library in unexpected ways. In normal usage, they should not be triggered except by a bug.

The only user-visible exception for now is FilenameEncodingError. It will probably get caught and logged in the future, to avoid breaking a complete backup session because of a single badly encoded file.

comment:98 follow-up: Changed at 2010-05-19T00:56:31Z by francois

  • Milestone changed from 1.8.0 to 1.7.0

Ok, those are the patches which should be applied to trunk if the review is considered sufficient.

comment:99 Changed at 2010-05-19T02:58:21Z by zooko

This patch seems to be a little bit messed up by the "*END OF DESCRIPTION*" line being accidentally included in the description part, apparently because of whitespace at the beginning of the "*END OF DESCRIPTION*": http://tahoe-lafs.org/trac/tahoe-lafs/attachment/ticket/534/unicode-filenames-handling-v3.2.diff

comment:100 Changed at 2010-05-19T03:39:39Z by zooko

Also that patch doesn't apply cleanly to current trunk. Could you please attach your darcs patch (the result of darcs send -o darcs.patch.txt)? http://tahoe-lafs.org/trac/tahoe-lafs/wiki/Patches

comment:101 Changed at 2010-05-19T04:23:41Z by zooko

Make that darcs send -o unicode-filenames-handling-v3.2.dpatch.

comment:102 Changed at 2010-05-19T04:28:18Z by zooko

I applied the patch attachment:unicode-helper-functions-v2.diff and ran the tests and got some failures:

Ran 42 tests in 0.155s

FAILED (expectedFailures=6, failures=2, unexpectedSuccesses=12, successes=22)

The failures were:

[FAIL]: allmydata.test.test_stringutils.StringUtilsErrors.test_listdir_unicode

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/mock-0.6.0-py2.5.egg/mock.py", line 196, in patched
    return func(*args, **keywargs)
  File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/test/test_stringutils.py", line 89, in test_listdir_unicode
    u'/dummy')
twisted.trial.unittest.FailTest: <type 'exceptions.TypeError'> raised instead of FilenameEncodingError:
 Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-10.0.0-py2.5-macosx-10.3-i386.egg/twisted/internet/defer.py", line 117, in maybeDeferred
    result = f(*args, **kw)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-10.0.0-py2.5-macosx-10.3-i386.egg/twisted/internet/utils.py", line 191, in runWithWarningsSuppressed
    result = f(*a, **kw)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/mock-0.6.0-py2.5.egg/mock.py", line 196, in patched
    return func(*args, **keywargs)
  File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/test/test_stringutils.py", line 89, in test_listdir_unicode
    u'/dummy')
--- <exception caught here> ---
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-10.0.0-py2.5-macosx-10.3-i386.egg/twisted/trial/unittest.py", line 260, in failUnlessRaises
    result = f(*args, **kwargs)
  File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/util/stringutils.py", line 117, in listdir_unicode
    return [unicodedata.normalize('NFC', fname) for fname in dirlist]
exceptions.TypeError: normalize() argument 2 must be unicode, not str

and

[FAIL]: allmydata.test.test_stringutils.StringUtilsErrors.test_open_unicode

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/mock-0.6.0-py2.5.egg/mock.py", line 196, in patched
    return func(*args, **keywargs)
  File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/test/test_stringutils.py", line 104, in test_open_unicode
    u'lumière')
twisted.trial.unittest.FailTest: <type 'exceptions.IOError'> raised instead of FilenameEncodingError:
 Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-10.0.0-py2.5-macosx-10.3-i386.egg/twisted/internet/defer.py", line 117, in maybeDeferred
    result = f(*args, **kw)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-10.0.0-py2.5-macosx-10.3-i386.egg/twisted/internet/utils.py", line 191, in runWithWarningsSuppressed
    result = f(*a, **kw)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/mock-0.6.0-py2.5.egg/mock.py", line 196, in patched
    return func(*args, **keywargs)
  File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/test/test_stringutils.py", line 104, in test_open_unicode
    u'lumière')
--- <exception caught here> ---
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-10.0.0-py2.5-macosx-10.3-i386.egg/twisted/trial/unittest.py", line 260, in failUnlessRaises
    result = f(*args, **kwargs)
  File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/util/stringutils.py", line 128, in open_unicode
    return open(path, mode)
exceptions.IOError: [Errno 2] No such file or directory: u'lumi\xe8re'

This is on Mac OS 10.4.11.

comment:103 Changed at 2010-05-19T04:48:48Z by zooko

I applied attachment:additionnal-tests.diff and ran the tests and got a failure:

allmydata.test.test_cli
  CreateAlias
    test_create ...                                                     [ERROR]

===============================================================================
[ERROR]: allmydata.test.test_cli.CreateAlias.test_create

Traceback (most recent call last):
  File "/Users/wonwinmcbrootles/playground/tahoe-lafs/trunk/ticket534/src/allmydata/test/test_cli.py", line 600, in <lambda>
    get_aliases(self.get_clientdir())[u"études"] + "/lumière.txt"))
exceptions.KeyError: u'\xe9tudes'
-------------------------------------------------------------------------------
Ran 1 tests in 1.064s

FAILED (errors=1)

comment:104 Changed at 2010-05-19T06:30:16Z by zooko

The issue of caps with non-ASCII chars has been spun out into #1051 (capabilities from the future could have non-ascii characters).

comment:105 in reply to: ↑ 98 ; follow-up: Changed at 2010-05-19T07:02:35Z by zooko

Replying to francois:

Ok, those are the patches which should be applied to trunk if the review is considered sufficient enough.

Reviewing these three now.

  • Some tests fail on my Mac (see comment:102 and comment:103). Can we use the magic of mockery to make the same tests fail in the same way on every platform where they are run? (Oh yeah and also then fix them. :-))
  • In unicode_to_stdout() I think that if we get UnicodeEncodeError then we should add the 'replace' flag but using the same encoding (the return value of get_term_encoding()), but we should not catch and try to handle LookupError. Oh, in fact that means that we should omit the try: … except: block entirely and just do return s.encode(get_term_encoding(), 'replace'). What do you think? If you think we should do something different then construct a unit test that we fail unless we do something different. :-)
  • Consider this code:
            # Use unicode strings when calling os functions 
            if sys.getfilesystemencoding() == "ANSI_X3.4-1968": 
                fn1 = os.path.join(self.basedir, u"Artonwall") 
            else: 
                fn1 = os.path.join(self.basedir, u"Ärtonwall") 
    

Is this if statement obsolete now that we have proper mockery-based tests? Or is do_cli() a higher layer of functionality which should be tested in addition to the current tests? Or should do_cli() itself get mockerified so that it will run the same on all platforms? (Maybe it is good to test the actual non-mockerified underlying platform sometimes?) What is the intent of this if -- to test how do_cli() handles non-ASCII chars on those filesystems which can handle them? If that is the case then add a comment to that effect and make the test actually test whether the underlying system can encode Ä rather than whether the underlying system is ANSI_X3.4-1968. (For example, I guess a Japanese system in SHIFT-JIS isn't ANSI_X3.4-1968 and also can't represent Ä.) A sketch of such a check follows this list.

  • Way to go on adding a test whose sole job is to go red if we remove the call to unicodedata.normalize(). I just went and removed that call to see that test go red and finally gained a dim understanding of what that call is for. :-)
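
A sketch of the capability-based check suggested in the third bullet (fs_can_represent is a hypothetical name):

import os, sys

def fs_can_represent(u):
    # True if the nominal filesystem encoding can encode this string.
    try:
        u.encode(sys.getfilesystemencoding() or 'ascii')
        return True
    except (UnicodeEncodeError, LookupError):
        return False

# inside the test method:
if fs_can_represent(u'\xc4'):
    fn1 = os.path.join(self.basedir, u'\xc4rtonwall')  # u'Ärtonwall'
else:
    fn1 = os.path.join(self.basedir, u'Artonwall')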

comment:106 Changed at 2010-05-19T07:03:44Z by zooko

  • Keywords review-needed removed
  • Owner changed from davidsarah to francois
  • Status changed from assigned to new

Unsetting the "review-needed" flag until François addresses the comments so far.

comment:107 Changed at 2010-05-19T07:14:07Z by zooko

To test this feature on your system apply François's most recent patches (from comment:98 unless he has posted new ones since then) and try to upload, download, move, rename, and backup files with non-ASCII characters in their names. Also directories. Use the WUI, the CLI, the WAPI, or tahoe backup as you prefer. For bonus points try installing the SFTP patches from #1037 and test filenames with non-ASCII characters through SFTP, sshfs, Nautilus, etc. :-)

comment:108 Changed at 2010-05-19T23:37:53Z by freestorm

I did some tests on my Windows box.

The directory TEST contains this filename (in one line):
LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñѪº¿®¬½¼¡«»ÁÂÀ©¢¥ãäðÐÊËÈÍÎϦÌÓßÔÒõÕµþÞÚÛÙýݯ´­±¾¶§÷¸°¨·¹³² _1.txt

Test 1:

D:\rdfc>D:\tahoe-fdz_534\bin\tahoe.exe cp --recursive TEST ROOT:test_534

Success: files copied

=> I copied back this file on my disk and the md5sum was the same as the source file.

Test 2:

D:\rdfc\TEST>D:\tahoe-fdz_534\bin\tahoe.exe mkdir ROOT:à_testé

URI:DIR2:wskq3uymjffqq6zidua7xrc3oq:crnp2xhec76wwjwghx2ddae5nenxv365id7nx56tepwdzwlet3ha => no errors, but the next test fails.

Test 3:

D:\rdfc\TEST>D:\tahoe-fdz_534\bin\tahoe.exe ls ROOT:
Ó_testÚ
=> Ó_testÚ <> à_testé
PS: same output with the WUI
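
This looks like classic ANSI/OEM codepage mojibake -- bytes written using windows-1252 but rendered by a cp850 console. Assuming those two codepages, a one-liner reproduces the Test 3 output:

print u'\xe0_test\xe9'.encode('cp1252').decode('cp850')   # -> Ó_testÚ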

Test 4:

D:\rdfc\TEST>D:\tahoe-fdz_534\bin\tahoe.exe mkdir "ROOT:LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñѪº¿®¬½¼¡«»ÁÂÀ©¢¥ãäðÐÊËÈÍÎϦÌÓßÔÒõÕµþÞÚÛÙýݯ´­±¾¶§÷ ¸°¨·¹³²"
URI:DIR2:64cxbqxkd2rj4lxpuragkltrpu:5rjnyv3cnvdavg7uprmkb2f2fvncasdi6t6k3ert3t5htbpffaca
=> no errors, but the next test fails.

Test 5:

D:\rdfc\TEST>D:\tahoe-fdz_534\bin\tahoe.exe ls ROOT:
LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~'''óÚÔõÓÕþÛÙÞ´¯ý─┼╔µã¶÷‗¹¨ Í▄°úÏÎßݾ·±Ð¬║┐«¼¢╝í½╗┴┬└®óÑÒ├ñ­ð╩╦╚═╬¤ª╠Ë▀ÈʧıÁ■Ì┌█┘²¦»┤¡▒¥Âº¸©░¿À╣│▓'''

=> the result is the same as in test 3: not the same name as the source.

Test 6:

D:\rdfc>D:\tahoe-fdz_534\bin\tahoe.exe backup TEST ROOT:test_534

1 files uploaded (0 reused), 0 files skipped, 1 directories created (0 reused), 0 directories skipped backup done, elapsed time: 0:00:15

Test 7:
D:\rdfc\TEST>D:\tahoe-fdz_534\bin\tahoe.exe ls ROOT:test_534/Archives/2010-05-19_22:08:36Z

LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñѪº¿®¬½¼¡«»ÁÂÀ©¢¥ãäðÐÊËÈÍÎϦÌÓßÔÒõÕµþÞÚÛÙýݯ´­±¾¶§÷¸°¨·¹³² _1.txt

=> the filename is the same as source.

Fred

comment:109 in reply to: ↑ 105 Changed at 2010-05-20T00:43:46Z by francois

Replying to zooko:

Some tests fail on my Mac (see comment:102 and comment:103). Can we use the magic of mockery to make the same tests fail in the same way on every platform where they are run? (Oh yeah and also then fix them. :-))

IMHO, tests in test_cli.py should exercise the actual usage of the CLI tools and thus should not use mocking at all, whereas tests in test_stringutils.py exercise the stringutils library and use mocking to simulate all possible platforms.

In unicode_to_stdout() I think that if we get UnicodeEncodeError then we should add the 'replace' flag but using the same encoding (the return value of get_term_encoding()), but we should not catch and try to handle LookupError. Oh, in fact that means that we should omit the try: … except: block entirely and just do return s.encode(get_term_encoding(), 'replace'). What do you think? If you think we should do something different then construct a unit test that we fail unless we do something different. :-)

You're right, this exception handling is not very pretty. I chose your proposition and modified get_term_encoding() to return 'ascii' when sys.stdout.encoding is None. A new unit test has been written as well.

def get_term_encoding():
    """
    Returns expected encoding for writing to the terminal and reading
    arguments from the command-line.
    """

    if sys.stdout.encoding is None:
        return 'ascii'
    else:
        return sys.stdout.encoding

Consider this code:

        # Use unicode strings when calling os functions 
        if sys.getfilesystemencoding() == "ANSI_X3.4-1968": 
            fn1 = os.path.join(self.basedir, u"Artonwall") 
        else: 
            fn1 = os.path.join(self.basedir, u"Ärtonwall") 

Is this if statement obsolete now that we have proper mockery-based tests? Or is do_cli() a higher layer of functionality which should be tested in addition to the current tests? Or should do_cli() itself get mockerified so that it will run the same on all platforms? (Maybe it is good to test the actual non-mockerified underlying platform sometimes?) What is the intent of this if -- to test how do_cli() handles non-ASCII chars on those filesystems which can handle them? If that is the case then add a comment to that effect and make the test actually test whether the underlying system can encode Ä rather than whether the underlying system is ANSI_X3.4-1968. (For example, I guess a Japanese system in SHIFT-JIS isn't ANSI_X3.4-1968 and also can't represent Ä.)

Yes, this statement is obsolete and has been removed. I'd say that the test should be skipped if it is not possible to write arbitrary filenames on the filesystem or pass arbitrary arguments on the command line on the platform on which the tests are being run.

Changed at 2010-05-20T00:54:48Z by francois

Changed at 2010-05-20T00:55:07Z by francois

Changed at 2010-05-20T00:55:19Z by francois

comment:110 Changed at 2010-05-20T01:00:14Z by francois

The latest patches which take comment 109 into account have been uploaded. They run without failure under Ubuntu 10.04 and Windows XP.

And here is a darcs patch which correctly applies to trunk and contains the two previous patches plus a third one which adds a dependency on the mock library (#1016).

comment:111 Changed at 2010-05-20T05:31:15Z by zooko

The latest attachment:unicode-bundle-v4.darcspatch still fails unit tests on my Mac. It occurred to me that I can program the buildbots to run your branch and report! This could be useful! Here is your branch on a new trac instance:

http://tahoe-lafs.org/trac/tahoe-lafs-ticket534/log/

and here is the result of me forcing all the buildslaves to try to build and test your branch at 22:12:09 just now:

http://tahoe-lafs.org/buildbot/waterfall?show_events=true&branch=ticket534&reload=none

:-)

Now you can see how your branch fares on all our buildslaves!

Send me your ssh pub key and I'll set it up so that you can push patches directly to that branch on http://tahoe-lafs.org and then push a button on the buildbot web page to force all buildslaves to build your latest patches.

comment:112 Changed at 2010-05-22T14:07:43Z by francois

  • Keywords review-needed added

Ok, all tests now pass on all supported platforms or are skipped if irrelevant to the platform.

So, please review the 7 new patches which have been added after unicode-helper-functions-v4.diff and unicode-filenames-handling-v4.diff.

comment:113 Changed at 2010-05-22T14:08:21Z by francois

  • Owner changed from francois to nobody

comment:114 Changed at 2010-06-04T22:39:08Z by francois

The patches were committed to trunk by Zooko: b2542b87085be095 -> 0eb4d83937a97094.

I'm now reviewing the three patches for this functionality which were added by Zooko.

About patch 442008a6905d7374, the change in tahoe_manifest.py makes sense.

However, I don't agree with the changes made in test_cli.py, because calling the method self.do_cli with UTF-8 encoded arguments only makes sense if the platform on which the test is run also encodes CLI arguments in UTF-8. This assumption is wrong in all other cases, such as terminals which only support ASCII characters. This is probably why the tests are failing on many buildbots.

About patch db8a6f3aa6d78122, using locale.getpreferredencoding() instead of sys.stdout.encoding looks like a good idea. The changes in the functions skip_non_unicode_fs and skip_non_unicode_stdout also look good. Same comment as before on the usage of self.do_cli.

Patch 5bcca5151eb897e6: hey, this one is easy. OK ;)

comment:115 Changed at 2010-06-07T04:51:43Z by zooko

David-Sarah added some nice extensions to the tests. They and Brian Warner and I chatted on IRC and agreed that tahoe ls ought to report the fact that a filename was unrepresentable if it was, and which filename (as represented by an escaped/quoted representation such as \u1234 for each unicode code point). David-Sarah implemented that. David-Sarah posted their new patch here and I just reviewed it: http://allmydata.org/trac/tahoe-lafs-ticket534/changeset/4434

François: please by all means review this patch and any other new patches from David-Sarah on the topic of unicode!

Oh. I found one thing in my review: open_unicode() should not do anything that the builtin open() function doesn't already do. Fortunately now that we have such thorough tests we should be able to verify that by redefining it like this:

def open_unicode(filename, mode):
    return open(filename, mode)

and see if the tests pass. In David-Sarah's latest patches, they added a call to os.path.expanduser() into open_unicode(). I think that putting expanduser() into some central location so that we'll never forget to use it makes sense, so perhaps we'll end up renaming open_unicode() to open_xu():

def open_xu(filename, mode):
    return open(os.path.expanduser(filename), mode)

comment:116 follow-up: Changed at 2010-06-08T10:12:56Z by francois

I'm currently reviewing changeset 80252debcd94fc28.

In util/stringutils.py:

- def open_unicode(path, mode='r'): 
+ def open_unicode(path, mode): 

Why shouldn't open_unicode behave just like open by default?

I see that a bunch of preconditions were left out, especially in to_str but that's hopefully not too dangerous.

comment:117 follow-up: Changed at 2010-06-08T10:52:20Z by francois

OK, I reviewed 65b6f4e3ce4cdf13, 529c9f673a89f7fa, 8b014372b1ab9460, 731e3d68dff0f8c9 and 3c883e6e44e82d1e.

I don't really understand why test_listdir_unicode_bad was removed in 3c883e6e44e82d1e.

Changed at 2010-06-09T04:22:53Z by zooko

comment:118 in reply to: ↑ 117 Changed at 2010-06-16T04:13:54Z by davidsarah

Replying to francois:

I don't really understand why test_listdir_unicode_bad was removed in 3c883e6e44e82d1e.

The test was failing on Mac OS X, which allows badly-encoded filenames to be written, but then lists them using %-escaping. This is reasonable, but couldn't have been predicted, and it's impossible to predict what behaviour other filesystems might have that would break the test. So Zooko and I decided that the test was invalid.

Note that the test_listdir_unicode test in StringUtilsNonUnicodePlatform mocks the return from os.listdir and checks that listdir_unicode raises a FilenameEncodingError when it should. We decided that this was a sufficient test, without actually trying to create the badly encoded filenames on the local filesystem.
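
The shape of that test, sketched with the mock library (names follow the patched test_stringutils.py; details may differ):

from mock import patch

@patch('sys.getfilesystemencoding')
@patch('os.listdir')
def test_listdir_unicode_bad(self, mock_listdir, mock_getfse):
    mock_getfse.return_value = 'utf-8'
    mock_listdir.return_value = ['lumi\xe8re']   # latin-1 bytes, invalid utf-8
    self.failUnlessRaises(FilenameEncodingError,
                          listdir_unicode, u'/dummy')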

comment:119 in reply to: ↑ 116 Changed at 2010-06-16T04:17:26Z by davidsarah

Replying to francois:

I'm currently reviewing changeset 80252debcd94fc28.

In util/stringutils.py:

- def open_unicode(path, mode='r'): 
+ def open_unicode(path, mode): 

Why shouldn't open_unicode behave just like open by default?

We should never be opening files in text mode.

I see that a bunch of preconditions were left out, especially in to_str but that's hopefully not too dangerous.

to_str is intended to accept either a Unicode or (UTF-8) byte string. It is used to coerce strings from the output of simplejson.loads, which might be either.

comment:120 Changed at 2010-06-16T04:18:21Z by davidsarah

  • Resolution set to fixed
  • Status changed from new to closed

comment:121 Changed at 2010-06-16T04:19:38Z by davidsarah

Fixed modulo the normalization issue in ticket #1076, which we intend to address for 1.7final.

comment:122 Changed at 2010-06-17T05:11:47Z by zooko

  • Keywords reviewed added; review-needed removed