[tahoe-dev] #534: "tahoe cp" command encoding issue
zooko at zooko.com
Mon Mar 2 21:29:34 PST 2009
> from a "meaning" perspective in the case of 2.d is wrong to publish
> in the child name a some characters that have an unknown meaning
> and that so are wrongly mapped to unicode entities hoping that the
> client will know how to handle this situation.
Okay. I'm starting to understand this better. I've re-read various
posts to this thread, and read a few web pages on the topic.
*** my realization: Unicode is not enough
Until last weekend, I thought that the canonical internal
representation of strings in tahoe could be unicode (and the
canonical serialization be utf-8). Now I realize that unicode is not
enough. We need to be able to accept, store, and emit strings which
we cannot translate into unicode. This means that the real canonical
definition of a string has to be a 2-tuple: string of bytes, and a
suggested encoding. We don't always have to explicitly store that
2-tuple, but we have to think of it as being the canonical information.
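A minimal sketch of that 2-tuple, written for Python 3 (where bytes
and str play the roles of the "string" and "unicode" types discussed
here). The names FilenameRecord and can_decode are hypothetical and
are not part of tahoe.

```python
from collections import namedtuple

# Hypothetical type for illustration; not an actual tahoe data structure.
FilenameRecord = namedtuple("FilenameRecord", ["raw_bytes", "suggested_encoding"])

# The bytes are the ground truth; the encoding is only a suggestion.
record = FilenameRecord(raw_bytes=b"r\xe9sum\xe9.txt", suggested_encoding="latin-1")


def can_decode(rec):
    """Return True if the bytes decode cleanly under the suggested encoding."""
    try:
        rec.raw_bytes.decode(rec.suggested_encoding)
        return True
    except (UnicodeDecodeError, LookupError):
        # LookupError covers an unknown encoding name.
        return False
```

Keeping the bytes alongside the encoding means a wrong guess about the
encoding never destroys the original name.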
Now I think I understand why Brian's proposal and Alberto's
included a complete separate copy of the name. (My earlier idea of
storing either the unicode string *or* the "original bytes" in the
same slot using "decode-as-latin-1" is a clever hack to save space,
but prevents us from using a lossy decoding, the better to indicate
decode errors to the user.)
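The trade-off in that parenthetical can be seen in a few lines. This
is only a sketch in Python 3, and the sample bytes are made up:
decoding as latin-1 always succeeds and round-trips, while a lossy
decode with replacement characters cannot be reversed.

```python
# Made-up bytes that are not valid utf-8.
raw = b"\xff\xfe name \x80"

# latin-1 maps every byte 0x00..0xFF to the code point of the same number,
# so this decode never fails and is fully reversible.
as_latin1 = raw.decode("latin-1")
round_tripped = as_latin1.encode("latin-1")

# A lossy decode marks bad bytes with U+FFFD, which is helpful to show to a
# user, but the original bytes can no longer be recovered from the result.
lossy = raw.decode("utf-8", "replace")
lossy_back = lossy.encode("utf-8")
```

So a single slot can hold either a reversible latin-1 decoding or a
user-friendly lossy decoding, but not both, which is why a separate
copy of the original bytes is needed.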
So here is my attempt to synthesize Brian's, Alberto's, François's,
and Shawn's proposals, plus my own discoveries.
When reading in a filename, what we really want is to get the unicode
of that filename without risk of corruption due to false decoding.
If we can do this, then we can optimize out the storage of the
2-tuple of (original bytes, suggested encoding) and just store the
unicode. Unfortunately this isn't possible except on Windows and
Macintosh [footnote 1].
If not, then we get the original bytes of the filename, get the
suggested encoding, and store that 2-tuple for optimal fidelity.
Then, we attempt to decode those bytes using the suggested encoding
and the mode which replaces unrecognized bytes with U+FFFD (as
suggested by Alberto). We put the result of that (which is a unicode
object) into the child name field.
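The steps above can be sketched as follows, in Python 3. The function
name make_child_entry and the dictionary keys "name" and "original"
are assumptions made for illustration; they are not the actual tahoe
directory format.

```python
def make_child_entry(raw_bytes, suggested_encoding):
    """Build a directory entry from a filename's bytes and suggested encoding."""
    # Lossy decode: bytes that do not decode are replaced with U+FFFD.
    display_name = raw_bytes.decode(suggested_encoding, "replace")
    return {
        # Unicode child name, possibly containing U+FFFD markers.
        "name": display_name,
        # The 2-tuple kept for optimal fidelity.
        "original": (raw_bytes, suggested_encoding),
    }


# A latin-1 filename that was wrongly suggested to be utf-8.
entry = make_child_entry(b"caf\xe9", "utf-8")
```

Here entry["name"] ends in U+FFFD rather than a wrongly guessed
character, and entry["original"] still holds the exact bytes.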
One remaining wrinkle is that there could be multiple entries with
the same unicode child name but different (string-of-bytes,
suggested-encoding) 2-tuples. Newer tahoe clients could be required
to understand that the unique key is the (string-of-bytes,
suggested-encoding) 2-tuple *if* it is present, else the unique key
is the unicode string. However, what will older tahoe clients do if
they get
multiple children in the same directory with the same unicode child
name? Another solution would be to detect these collisions and
further mangle the already mangled unicode names by appending "-1",
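A rough sketch of that collision-mangling idea, in Python 3. The
function name mangle_collisions is hypothetical, and the assumption
that the suffix counts upward ("-1", then "-2") when more than two
names collide is a guess, not something stated in this message.

```python
def mangle_collisions(entries):
    """Give every (display_name, original) pair a unique display name.

    `entries` is a list of (display_name, original) pairs, where `original`
    is the (bytes, suggested encoding) 2-tuple. Returns a dict mapping unique
    display names to their originals.
    """
    unique = {}
    for display_name, original in entries:
        candidate = display_name
        counter = 0
        # Assumed scheme: keep appending "-N" until the name is unused.
        while candidate in unique:
            counter += 1
            candidate = "%s-%d" % (display_name, counter)
        unique[candidate] = original
    return unique


# Two different byte strings that decode (lossily) to the same unicode name.
children = mangle_collisions([
    ("caf\ufffd", (b"caf\xe9", "utf-8")),
    ("caf\ufffd", (b"caf\xe8", "utf-8")),
])
```

With this scheme an older client sees two distinct child names, while
a newer client can still use the stored 2-tuples as the real keys.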
footnote 1: how to get unicode filenames with Python on different
On Python on Windows/NTFS, if you invoke os.getcwdu() or you pass a
unicode object to os.listdir(), such as "os.listdir(u'.')", then
you'll get back a list of unicode objects which are guaranteed to
contain the correct values. Be happy! Unless you forget and
accidentally invoke os.getcwd() or pass a string to os.listdir(),
such as "os.listdir('.')". Then you'll get something that is
probably corrupted. Don't do that; use the unicode Python APIs.
On Python on MacOSX/HFS+, if you invoke os.getcwdu() or you pass a
unicode object to os.listdir(), then you'll get back a list of
unicode objects which are guaranteed to contain the correct values.
Be happy! If you forget and invoke os.getcwd() or os.listdir('.')
then you'll get a set of strings which are utf-8 encodings of the
unicode objects. You could then recover by utf-8-decoding them all,
but why? Just use the unicode APIs in the first place.
On Python on other Unix, if you invoke os.getcwdu(), then it will
attempt to decode the cwd using the current locale. If it fails it
will raise a UnicodeDecodeError. If it succeeds then you'll get a
unicode object. If the current locale doesn't indicate the right
encoding for all the elements of the cwd, but they do accidentally
decode, then you'll get a unicode object with a corrupted value in
it. If you pass a unicode object to os.listdir(), then you'll get back
the result of attempting to decode the items using the current
locale. If it didn't get a decode error, then the resulting item
will be type unicode. If it did get a decode error, then the
resulting item will be the original bytes in a string. As before, if
the current locale doesn't indicate the right encoding for all the
items in this directory, then some of them may be corrupted.
Conclusion: never use the unicode APIs on Linux. Use the flat
bytestring APIs to get something which is at least guaranteed not to
be corrupted, and use sys.getfilesystemencoding() to get Python's
best guess about the suggested encoding and proceed from there.
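A minimal sketch of that conclusion. It is written for Python 3,
where passing a bytes path to os.listdir() returns bytes names (the
counterpart of passing a plain string in the Python 2 described
above). The function name list_raw_names is hypothetical.

```python
import os
import sys


def list_raw_names(directory=b"."):
    """Return (raw_bytes, suggested_encoding) pairs for a directory's entries."""
    # Python's best guess at the encoding; it is only a suggestion.
    suggested_encoding = sys.getfilesystemencoding()
    # A bytes argument makes os.listdir return the names as undecoded bytes.
    return [(raw, suggested_encoding) for raw in os.listdir(directory)]


pairs = list_raw_names(b".")
```

Each pair can then be stored as-is and lossily decoded for display,
so a wrong locale setting never corrupts the stored name.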
I don't know about other filesystems under Windows or Mac such as
VFAT (isn't that the filesystem typically used on thumb drives?) or