[tahoe-dev] String encoding in tahoe
zooko
zooko at zooko.com
Tue Dec 23 08:50:49 PST 2008
Dan:
Thanks for your help!
The fact that the ArchLinux buildbot was failing and then started
working when you set your system locale to UTF-8 is evidence that
François's and my recent patches are wrong. :-)
I'm beginning to think that the advice from Kumar McMillan that
François posted is not right for us. That advice was:
1. Decode early
2. Unicode everywhere
3. Encode late
This would work fine, of course, if we knew what encoding the input
comes in, for step 1, and what encoding the recipient expects, for
step 3. But we currently don't.
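For concreteness, here is a minimal sketch of those three steps. It is written in Python 3 syntax, where "bytes" plays the role of the Python 2 "str" type. The function name and both encoding parameters are hypothetical, not anything in Tahoe. The point is only that steps 1 and 3 each need an encoding that somebody has to supply.

```python
def roundtrip_filename(raw, input_encoding, output_encoding):
    """Hypothetical helper illustrating decode early / encode late."""
    # 1. Decode early: only correct if input_encoding really is the
    #    encoding that produced these bytes.
    text = raw.decode(input_encoding)
    # 2. Unicode everywhere: the text can now be inspected and changed.
    text = text.rstrip(u"/")
    # 3. Encode late: only correct if output_encoding is what the
    #    recipient (e.g. the filesystem) expects.
    return text.encode(output_encoding)
```

If either encoding argument is a guess, the result is a guess too.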
So, for example, the test that fails on Dan's ArchLinux when it has a
"C" locale accepts a filename argument on the command-line and then
later passes that filename to os.path.exists().
I don't know how to find out what encoding was used to produce the
bytestring which appears in sys.argv[2]. I'm not even sure that it
is possible to find the answer to that in general. So we could
assume that the encoding is utf-8. That's what our recent patches do
and that's what works on our Ubuntu Feisty buildslave and on Dan's
ArchLinux buildslave when it has a UTF-8 locale. However, if the
argument was not encoded with utf-8, then the decode will either fail
or, worse, appear to succeed while mangling the filename into gibberish.
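Both failure modes can be reproduced in a few lines. This is only an illustration in Python 3 syntax; the helper name is hypothetical, and Latin-1 is an assumed stand-in for "some encoding other than utf-8".

```python
def decode_assuming_utf8(raw):
    """Hypothetical helper: return (ok, text) for a utf-8 decode attempt."""
    try:
        return True, raw.decode("utf-8")
    except UnicodeDecodeError:
        return False, None

# "cafe" with an acute accent, encoded as Latin-1: these bytes are not
# valid utf-8, so the decode fails loudly.
latin1_cafe = u"caf\u00e9".encode("latin-1")
failed = decode_assuming_utf8(latin1_cafe)

# The two Latin-1 characters U+00C3 U+00A9 happen to encode to bytes that
# form a valid utf-8 sequence, so the decode "succeeds" but yields a
# different, one-character name.
latin1_other = u"\u00c3\u00a9".encode("latin-1")
mangled = decode_assuming_utf8(latin1_other)
```

The second case is the worrying one, because nothing signals that the name has changed.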
So another approach would be:
1. don't decode
2. Python 2 "str" type everywhere
3. don't encode
This would work perfectly for the use case of accepting an argument
on the command-line and then calling os.path.exists() and passing
that argument, assuming that each system uses the same encoding for
command-line arguments as it does for names in its filesystem.
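A rough sketch of that pass-through idea, again in Python 3 syntax with "bytes" standing in for the Python 2 "str". The helper name is hypothetical, and the temporary directory and file name exist only to make the example self-contained.

```python
import os
import tempfile

def exists_undecoded(raw_path):
    """Hypothetical helper: check a path without decoding or re-encoding it."""
    # raw_path is handed to the OS exactly as it was received.
    return os.path.exists(raw_path)

with tempfile.TemporaryDirectory() as tmpdir:
    raw_dir = os.fsencode(tmpdir)
    # Create a file whose name contains non-ascii bytes.
    raw_name = os.path.join(raw_dir, b"caf\xc3\xa9.txt")
    with open(raw_name, "wb") as f:
        f.write(b"")
    found = exists_undecoded(raw_name)
    missing = exists_undecoded(os.path.join(raw_dir, b"nope.txt"))
```

No encoding is named anywhere in the helper, which is exactly why it cannot go wrong in the ways a decode can.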
A drawback to this approach is that we can't then safely inspect or
change the string! For example, we need to split the string on slash
chars or assert that the string has no slash chars. Without knowing
what encoding was used, or without assuming that the encoding was
utf-8, there is no way to do that. (In fact, I should probably open
a ticket about that issue. There: #565 (unicode arguments on the
command-line).)
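To see why looking for slashes in an undecoded string is unsafe, consider an encoding in which the byte 0x2F can be part of some other character. UTF-16 is used below purely as an assumed illustration; I am not claiming any particular platform passes command-line arguments that way. (In utf-8 this cannot happen, since every byte of a multi-byte sequence is 0x80 or above.)

```python
# One character (U+012F), which is not a slash.
name = u"\u012f"
raw = name.encode("utf-16-le")

# The decoded text contains no slash...
has_slash_text = u"/" in name
# ...but the raw bytes contain 0x2F, so a byte-level check is fooled.
has_slash_bytes = b"/" in raw
# A byte-level split cuts the character in half.
pieces = raw.split(b"/")
```

So a byte-level split or "no slash" assertion is only trustworthy once the encoding is known or assumed.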
So in preparation for the imminent Tahoe-LAFS v1.3.0 release, I need
to do *something* quick and easy. I'm tempted to assume that all
sys.argv arguments are utf-8 encoded, and to utf-8 encode outputs
before passing them to the filesystem in calls like
"os.path.exists()". This will probably make it work on Dan's
ArchLinux system with UTF-8 locale, probably make it work on Ubuntu
Feisty, probably make it fail on ArchLinux systems with C locale (??),
and probably make it fail on Windows.
Then there would have to be a known issue documented for the 1.3.0
release saying that non-ascii chars on the command-line are known to
fail on some platforms.
Any better ideas? I really can't spend any more time on this myself,
so hopefully someone who cares about this issue and knows how to
improve it (like François) will jump in.
Regards,
Zooko
---
Tahoe, the Least-Authority Filesystem -- http://allmydata.org
backup all your files for $10/month -- http://allmydata.com/?tracking=zsig