[tahoe-dev] File naming on POSIX and Windows clients

David-Sarah Hopwood david-sarah at jacaranda.org
Mon May 11 09:29:23 PDT 2009


Glenn Linderman wrote:
> Without knowing the correct encoding, the result is, unfortunately, 
> mojibake, and no additional encoding solution will make that clearer.
> 
> Since it is mojibake anyway, one could use a mojibake encoding algorithm 
> such as
> 
> 1) If the name decodes to Unicode successfully using the current file 
> system encoding, use that name.
> 
> 2) Obtain the bytes, and create a Unicode name that starts with ^^ and 
> is followed by one codepoint per byte, where the codepoint is 
> numerically calculated from each byte value, as  
> 
> (bytevalue is < 128 and legal in Windows filenames) ? bytevalue : 
> bytevalue + 256
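
[A rough sketch of the quoted proposal, for concreteness. This is not
Tahoe code; the exact set of "legal in Windows filenames" bytes is an
assumption here, as are the function names.]

```python
import sys

# Assumed approximation of bytes illegal in Windows filenames:
# the reserved punctuation plus control characters.
WINDOWS_ILLEGAL = set(b'<>:"/\\|?*') | set(range(0x20))

def encode_name(raw: bytes, fs_encoding: str = sys.getfilesystemencoding()) -> str:
    try:
        # Step 1: if the name decodes cleanly, use it as-is.
        return raw.decode(fs_encoding)
    except UnicodeDecodeError:
        # Step 2: "^^" prefix, then one codepoint per byte, shifting
        # problematic bytes up by 256 (into U+0100..U+01FF).
        return "^^" + "".join(
            chr(b) if (b < 128 and b not in WINDOWS_ILLEGAL) else chr(b + 256)
            for b in raw)

def decode_name(name: str) -> bytes:
    # Inverse of step 2: every codepoint maps back to its byte mod 256.
    assert name.startswith("^^")
    return bytes(ord(c) % 256 for c in name[2:])
```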

This solves the problem of representing undecodable names only when
the destination filesystem is a Windows filesystem, or at best a
filesystem that supports a Unicode API. If it does not support a
Unicode API, then this proposal is unreliable because the alleged
filesystem encoding for the destination filesystem might be wrong,
or it might not support characters U+0100..U+01FF.

The hex encoding scheme I originally proposed, by restricting encoded
names to the POSIX subset plus the escape character, works for all
destination filesystems even when their encoding is not known reliably.
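[For readers who missed the original post, a hex escaping of this shape
would look roughly like the following. The choice of '@' as the escape
character and the exact pass-through set are illustrative assumptions,
not necessarily the scheme as originally proposed.]

```python
import string

# POSIX portable filename character set: letters, digits, '.', '_', '-'.
SAFE = set((string.ascii_letters + string.digits + "._-").encode())

def hex_escape(raw: bytes) -> str:
    # Safe bytes pass through unchanged; everything else (including the
    # escape character '@' itself, which is not in SAFE) becomes
    # '@' followed by two hex digits.
    return "".join(chr(b) if b in SAFE else "@%02x" % b for b in raw)

def hex_unescape(name: str) -> bytes:
    out, i = bytearray(), 0
    while i < len(name):
        if name[i] == "@":
            out.append(int(name[i+1:i+3], 16))
            i += 3
        else:
            out.append(ord(name[i]))
            i += 1
    return bytes(out)
```

Since the output uses only the POSIX subset plus '@', it survives any
destination filesystem regardless of that filesystem's encoding.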

> It would be extremely simple to code, encode, and decode, and would have 
> only displayable characters.

U+0100..U+01FF (part of the Latin Extended ranges) are not necessarily
displayable in all fonts, even in supposed "Unicode fonts"
(http://en.wikipedia.org/wiki/Unicode_typefaces#0000-077F).
I believe some of these characters are not used in any modern language,
and most cannot be entered on normal keyboards. Some have canonical
decompositions, which might cause problems if they are copied to a Mac
filesystem (using NFD) and then to a POSIX one.
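[The decomposition concern is easy to demonstrate with Python's
unicodedata module:]

```python
import unicodedata

name = "\u0100"                           # U+0100 (Ā): one precomposed codepoint
nfd = unicodedata.normalize("NFD", name)  # becomes 'A' + U+0304 combining macron
# The NFD form is two codepoints, so after a round trip through an
# NFD-normalizing filesystem the name no longer matches the original
# codepoint-for-codepoint; recomposing with NFC is needed to recover it.
```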

Overall, the only significant advantage of this over the hex encoding
is in the length of encoded names (measured in Unicode code points).
It is true that some filesystems have quite short maximum filename
lengths and so length is an issue, but that does not outweigh the
disadvantages discussed above, IMHO.

An encoding that would mostly solve the length issue while staying
within the POSIX subset is Punycode (RFC 3492, used without
Nameprep). However, Punycode is substantially more complex, and could
not quite be used as-is because its encoding algorithm may overflow
for inputs longer than 63 characters.
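[As an illustration of the length advantage, Python happens to ship a
bare Punycode codec (RFC 3492 without Nameprep); its output collapses
non-ASCII names into ASCII letters, digits, and '-', well within the
POSIX subset:]

```python
# Encode a non-ASCII name; the ASCII part is carried through literally,
# and the non-ASCII codepoints are packed after the '-' delimiter.
encoded = "b\u00fccher".encode("punycode")   # "bücher"
decoded = encoded.decode("punycode")
```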

[ISTR that there were other algorithms proposed during the IDN
standardization process that were much simpler, almost as length-
efficient, and did not have this input length limitation. I gave up
on the IDN standards process in disgust; it was the most defective
IETF Working Group I've ever been involved with. But I digress.]

> A lookup table would work well in both  directions.
> 
> A variation might notice any . characters in the name, and encode/decode 
> each part of the name between . characters independently.  This might 
> help preserve file extensions that might be in ASCII.

This is unnecessary for the hex encoding because file extensions in
the POSIX subset would be preserved in any case ('.', [a-z] and [A-Z]
are in that subset). Although I didn't mention it in the original post,
this is partly why the encoding indicator is a prefix and not a suffix.

It would be necessary if Punycode or some similar encoding were used,
though.

-- 
David-Sarah Hopwood ⚥


