[tahoe-dev] String encoding in tahoe

Tue Dec 23 08:50:49 PST 2008

Dan:

Thanks for your help!

The fact that the ArchLinux buildbot was failing and then started  
working when you set your system locale to UTF-8 is evidence that  
François's and my recent patches are wrong.  :-)

I'm beginning to think that the advice from Kumar McMillan that  
François posted is not right for us.  That advice was:

    1. Decode early
    2. Unicode everywhere
    3. Encode late

This would work fine, of course, if we knew what encoding the input  
arrives in, for step 1, and what encoding the recipient expects, for  
step 3.  But currently we know neither.

So, for example, the test that fails on Dan's ArchLinux when it has a  
"C" locale accepts a filename argument on the command-line and then  
later passes that filename to os.path.exists().
I don't know how to find out what encoding was used to produce the  
bytestring which appears in sys.argv[2].  I'm not even sure that it  
is possible to find the answer to that in general.  So we could  
assume that the encoding is utf-8.  That's what our recent patches do  
and that's what works on our Ubuntu Feisty buildslave and on Dan's  
ArchLinux buildslave when it has a UTF-8 locale.  However, if the  
argument was not encoded with utf-8, then decoding will either fail  
outright or appear to succeed while mangling the filename into gibberish.
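A minimal sketch of those two failure modes, for illustration only. It is written for Python 3, where "bytes" plays the role of the Python 2 "str" type discussed here. The sample filenames and the helper name decode_assuming_utf8 are made up, and latin-1 is just one example of a non-utf-8 encoding.

```python
def decode_assuming_utf8(raw):
    """Decode a command-line bytestring on the assumption that it is utf-8."""
    return raw.decode("utf-8")

# Failure mode 1: the argument was really latin-1 and is not valid utf-8,
# so the decode fails loudly.
latin1_arg = u"caf\u00e9".encode("latin-1")        # b'caf\xe9'
try:
    decode_assuming_utf8(latin1_arg)
    loud_failure = False
except UnicodeDecodeError:
    loud_failure = True

# Failure mode 2: the argument was really latin-1 but happens to be valid
# utf-8, so the decode "succeeds" and silently yields a different name.
original = u"\u00c3\u00a9.txt"                     # two non-ASCII chars + ".txt"
silently_mangled = decode_assuming_utf8(original.encode("latin-1"))
```

The second case is the dangerous one: no exception is raised, but the decoded name no longer matches what the user typed.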

So another approach would be:

    1. don't decode
    2. Python 2 "str" type everywhere
    3. don't encode

This would work perfectly for the use case of accepting an argument  
on the command-line and then calling os.path.exists() and passing  
that argument, assuming that each system uses the same encoding for  
command-line arguments as it does for names in its filesystem.
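A rough sketch of this pass-through approach, assuming Python 3, where "bytes" stands in for the Python 2 "str" type. The helper name exists_undecoded and the sample filename are hypothetical; the temporary directory only exists to make the example self-contained.

```python
import os
import tempfile

def exists_undecoded(raw_arg):
    """Hand the bytestring to the filesystem as-is: no decode, no encode."""
    return os.path.exists(raw_arg)

# Build a path purely out of bytes, as it might arrive on the command line.
workdir = os.fsencode(tempfile.mkdtemp())
raw_name = os.path.join(workdir, b"caf\xc3\xa9.txt")

# Create the file using the same undecoded bytes.
with open(raw_name, "wb"):
    pass

found = exists_undecoded(raw_name)
missing = exists_undecoded(os.path.join(workdir, b"no-such-file"))
```

Nothing here ever needs to know which encoding produced the bytes, which is the whole appeal of the approach.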

A drawback to this approach is that we can't then safely inspect or  
change the string!  For example, we need to split the string on slash  
chars or assert that the string has no slash chars.  Without knowing  
what encoding was used, or without assuming that the encoding was  
utf-8, there is no way to do that.  (In fact, I should probably open  
a ticket about that issue.  There: #565 (unicode arguments on the  
command-line).)
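To illustrate why inspecting undecoded bytes is unsafe, here is a small sketch in Python 3. UTF-16 is chosen only as an example of an encoding in which the slash byte can occur inside another character; this is not a claim that any particular platform delivers arguments that way. The helper name has_slash_byte is hypothetical.

```python
def has_slash_byte(raw):
    """Naive check: does the bytestring contain the byte 0x2F ('/')?"""
    return b"/" in raw

# U+012F (LATIN SMALL LETTER I WITH OGONEK) is a single character and
# is not a slash.
text = u"\u012f"

utf16_bytes = text.encode("utf-16-le")   # b'/\x01': low byte is 0x2F
utf8_bytes = text.encode("utf-8")        # b'\xc4\xaf': no 0x2F byte

false_positive = has_slash_byte(utf16_bytes)
correct_negative = has_slash_byte(utf8_bytes)
```

The same byte-level test gives different answers for the same character depending on the encoding, so the result is only meaningful once the encoding is known or assumed.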

So in preparation for the imminent Tahoe-LAFS v1.3.0 release, I need  
to do *something* quick and easy.  I'm tempted to assume that all  
sys.argv arguments are utf-8 encoded, and to utf-8 encode filenames  
before passing them to filesystem calls like "os.path.exists()".  
This will probably make it work on Dan's ArchLinux system with  
UTF-8 locale, probably make it work on Ubuntu Feisty, probably make  
it fail on ArchLinux systems with C locale (??), and probably make it  
fail on Windows.
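A sketch of what that stopgap might look like, written for Python 3 with "bytes" standing in for the Python 2 "str" type. The names ASSUMED_ENCODING, decode_arg, encode_for_fs and path_exists are hypothetical and are not taken from the Tahoe source; the encoding is hard-coded precisely because it is an assumption rather than something detected.

```python
import os
import tempfile

# The assumption under discussion, not a value discovered from the system.
ASSUMED_ENCODING = "utf-8"

def decode_arg(raw_arg):
    """Decode a command-line bytestring, assuming it is utf-8."""
    return raw_arg.decode(ASSUMED_ENCODING)

def encode_for_fs(path):
    """Encode a unicode path as utf-8 before handing it to the filesystem."""
    return path.encode(ASSUMED_ENCODING)

def path_exists(raw_arg):
    """Decode on the way in, encode again on the way out to os.path.exists()."""
    return os.path.exists(encode_for_fs(decode_arg(raw_arg)))

# Self-contained demonstration with a utf-8 encoded, non-ASCII filename.
workdir = os.fsencode(tempfile.mkdtemp())
raw_arg = os.path.join(workdir, u"caf\u00e9.txt".encode("utf-8"))
with open(raw_arg, "wb"):
    pass
found = path_exists(raw_arg)
```

When the argument really is utf-8 the round trip is lossless; when it is not, decode_arg() either raises UnicodeDecodeError or returns a different name, which is the known issue that would need documenting.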

Then there would have to be a known issue documented for the 1.3.0  
release saying that non-ascii chars on the command-line are known to  
fail on some platforms.

Any better ideas?  I really can't spend any more time on this myself,  
so hopefully someone who cares about this issue and knows how to  
improve it (like François) will jump in.

Regards,

Zooko
---
Tahoe, the Least-Authority Filesystem -- http://allmydata.org
backup all your files for $10/month -- http://allmydata.com/?tracking=zsig