[tahoe-dev] String encoding in tahoe

Francois Deppierraz francois at ctrlaltdel.ch
Tue Dec 23 14:22:27 PST 2008


zooko wrote:

> The fact that the ArchLinux buildbot was failing and then started  
> working when you set your system locale to UTF-8 is evidence that  
> François's and my recent patches are wrong.  :-)

Mmmh, I don't really agree here.

I think that we're trying to test something here that doesn't make
sense, namely using filenames containing non-ASCII characters on
platforms which don't support them.

We certainly need to find a way to determine the encoding used in
different parts of the system: sys.argv and the filesystem.
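
For what it's worth, Python already exposes a few guesses about these
encodings. None of them is authoritative for the bytes that actually end
up in sys.argv, but they are a starting point (Python 2):

import sys, locale

# Encoding Python will use when converting unicode filenames for the OS,
# e.g. 'UTF-8' under en_US.UTF-8 or 'ANSI_X3.4-1968' under a C locale.
print sys.getfilesystemencoding()
# Encoding derived from the locale settings.
print locale.getpreferredencoding()
# Terminal encoding; None when stdin is not a tty.
print sys.stdin.encoding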

> So, for example, the test that fails on Dan's ArchLinux when it has a  
> "C" locale accepts a filename argument on the command-line and then  
> later passes that filename to os.path.exists().

It raises an exception instead of checking for the existence of a file
whose name is gibberish, which sounds reasonable to me.

> I don't know how to find out what encoding was used to produce the  
> bytestring which appears in sys.argv[2].  I'm not even sure that it  
> is possible to find the answer to that in general.  So we could  
> assume that the encoding is utf-8.  That's what our recent patches do  
> and that's what works on our Ubuntu Feisty buildslave and on Dan's  
> ArchLinux buildslave when it has a UTF-8 locale.  However, if the  
> argument was not encoded with utf-8, then this will fail or even will  
> think that it succeeded but will mangle the filename into gibberish.

We can probably assume utf-8 encoding for the moment and try to add a
better detection mechanism later on, when a solution to #565 is found.
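
Just to make that interim assumption concrete, here is a minimal sketch
(the helper name is mine, not what Tahoe actually uses) of decoding
command-line arguments as utf-8 and failing loudly instead of silently
producing mojibake (Python 2):

import sys

def argv_to_unicode(arg):
    # Assume the shell handed us utf-8 bytes; refuse anything else.
    try:
        return arg.decode('utf-8')
    except UnicodeDecodeError:
        raise AssertionError("argument %r is not valid utf-8" % (arg,))

if __name__ == '__main__':
    for arg in sys.argv[1:]:
        print repr(argv_to_unicode(arg))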

> So another approach would be:
> 
>     1. don't decode
>     2. Python 2 "str" type everywhere
>     3. don't encode
> 
> This would work perfectly for the use case of accepting an argument  
> on the command-line and then calling os.path.exists() and passing  
> that argument, assuming that each system uses the same encoding for  
> command-line arguments as it does for names in its filesystem.

This is more or less how it was done previously, which led to the issue
of double encoding when simplejson.dumps was called with a utf-8 encoded str.
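
For the record, the kind of mangling double encoding produces is easy to
reproduce outside of simplejson; this is only an illustration of the
general problem, not the exact code path that was involved (Python 2):

name = u"tést"
once = name.encode('utf-8')                     # 't\xc3\xa9st', correct utf-8
twice = once.decode('latin-1').encode('utf-8')  # 't\xc3\x83\xc2\xa9st', mangled
print repr(once), repr(twice)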

> So in preparation for the imminent Tahoe-LAFS v1.3.0 release, I need  
> to do *something* quick and easy.  I'm tempted to assume that all  
> sys.argv arguments are utf-8 encoded, and to utf-8 encode outputs  
> before passing them to the filesystem in calls like "os.path.exists()".  
> This will probably make it work on Dan's ArchLinux system with  
> UTF-8 locale, probably make it work on Ubuntu feisty, probably make  
> it fail on ArchLinux systems with C locale (??), and probably make it  
> fail on Windows.

I'm not sure that utf-8 encoding outputs to these functions will do any
good. This is already what happens behind the scenes, but only when the
locale supports it:

$ LANG=C python -c 'import os; a = u"tést"; os.path.exists(a); print "ok"'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python2.5/posixpath.py", line 171, in exists
    st = os.stat(path)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)

$ LANG=en_US.UTF-8 python -c 'import os; a = u"tést"; os.path.exists(a); print "ok"'
ok
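Doing the encoding explicitly, with whatever encoding Python thinks the
filesystem uses, shows the same limitation: under a C locale it still
fails, just at our own call site instead of deep inside os.stat(). This
is only a sketch, and the helper name is hypothetical (Python 2):

import os, sys

def exists_unicode(path_u):
    # Encode with the detected filesystem encoding; under LANG=C this is
    # ascii and raises UnicodeEncodeError for "é" just like the implicit case.
    encoding = sys.getfilesystemencoding() or 'utf-8'
    return os.path.exists(path_u.encode(encoding))

print exists_unicode(u"tést")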

> Then there would have to be a known issue documented for the 1.3.0  
> release saying that non-ascii chars on the command-line are known to  
> fail on some platforms.

Yes, this sounds good as a short-term solution. I'll try to figure out
how to handle those encoding issues in a cleaner way.

François

