[tahoe-dev] #534: "tahoe cp" command encoding issue
Shawn Willden
shawn-tahoe at willden.org
Thu Feb 26 10:54:53 PST 2009
On Thursday 26 February 2009 10:56:02 am zooko wrote:
> Strategy 2.c. If it fails, encode the bytes in some magical way that
> a later utf-8 decoding of them will get the same bytes back.
I don't think there can be any such magical encoding, and this isn't what the
KDE folks do.
For it to work, if U represents the Unicode space, B represents the byte
space, D:B->U is the UTF-8 decoding function, and g:U->B is the destination
file system encoding function, then you need a function f:B->B such that:
    g(D(f(name_bytes))) = name_bytes
But given that g is unknown, what can you choose for f that will always work?
In many cases there simply isn't a Unicode string which g will encode into
the byte string you want.
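To make that concrete, here is a minimal Python sketch that assumes, purely for illustration, that g is the strict UTF-8 encoder (the real g is unknown, which is the whole problem). The helper name has_preimage_under_utf8 is hypothetical. It checks whether any Unicode string at all encodes to a given byte string:

```python
def has_preimage_under_utf8(name_bytes: bytes) -> bool:
    """Return True if some Unicode string u satisfies
    u.encode("utf-8") == name_bytes.

    Assumption: g is Python's strict UTF-8 encoder. Everything g emits
    is valid UTF-8, so bytes that fail strict decoding are outside the
    image of g.
    """
    try:
        text = name_bytes.decode("utf-8")  # strict decoding
    except UnicodeDecodeError:
        return False
    # Confirm the round trip explicitly rather than assume it.
    return text.encode("utf-8") == name_bytes


reachable = has_preimage_under_utf8(b"hello.txt")
unreachable = has_preimage_under_utf8(b"caf\xe9.txt")  # Latin-1 bytes
```

For the second name no choice of f can help: whatever Unicode string comes out of D(f(name_bytes)), this g will never turn it back into those bytes.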
What Brian said the KDE guys do is encode such names into an unused region of
the Unicode space. Then they provide a special 'g' that recognizes
characters from that unused region and acts appropriately. Essentially,
they're encoding the "this is invalid even though it looks valid" flag into
the Unicode.
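I have not looked at the KDE code, so the following is only an analogous technique, not their implementation: Python 3's "surrogateescape" error handler maps each undecodable byte 0xNN to the lone surrogate U+DCNN, a range that never appears in well-formed text. The matching encoder plays the role of the special 'g'. A sketch:

```python
raw = b"caf\xe9.txt"  # Latin-1 bytes; not valid UTF-8

# Decode, smuggling the bad byte 0xE9 into the string as U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")

# An ordinary strict encoder does not know about the escape range
# and refuses the string outright.
try:
    name.encode("utf-8")
    strict_encoder_accepts = True
except UnicodeEncodeError:
    strict_encoder_accepts = False

# The "special g": an encoder that recognizes the escape range and
# turns each escaped character back into the original byte.
restored = name.encode("utf-8", errors="surrogateescape")
```

The round trip only works when both ends agree on the escape convention, which is exactly the interoperability cost of this approach.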
My code, BTW, uses something essentially equivalent to 2.a -- though I don't
have the interoperability constraints, since no one is using my code, not
even me :-)
Oh, a nice way to handle raw strings and still pass them through code that
expects Unicode is (as suggested by Kevin Reid) to take the undecodable bytes
and decode them with the latin1 codec, since any byte string is valid latin1.
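A minimal sketch of that latin1 fallback in Python 3 (the function name bytes_to_text is hypothetical). Latin-1 maps byte 0xNN to code point U+00NN, so decoding never fails and encoding with the same codec returns the original bytes:

```python
def bytes_to_text(name_bytes: bytes) -> tuple[str, str]:
    """Decode a name as UTF-8 if possible, else as Latin-1.

    Returns (text, codec) so the caller can record which codec was
    used; without that flag the two cases cannot be told apart later.
    """
    try:
        return name_bytes.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return name_bytes.decode("latin-1"), "latin-1"


raw = b"caf\xe9.txt"  # not valid UTF-8
text, codec = bytes_to_text(raw)
restored = text.encode(codec)  # same bytes back

# Every possible byte value survives a Latin-1 round trip.
all_bytes = bytes(range(256))
latin1_is_lossless = all_bytes.decode("latin-1").encode("latin-1") == all_bytes
```

Note that the resulting string is only a carrier for the bytes; it may display as mojibake if the name was really in some other encoding.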
Shawn