[tahoe-dev] Unicode issues review

Wed Feb 18 03:42:16 PST 2009

On Feb 17, 2009, at 23:56, Shawn Willden wrote:
> On Tuesday 17 February 2009 09:12:51 pm Kevin Reid wrote:
>> What I'm thinking is:
>>
>> Will supporting unknown-bunch-of-bytes filenames be used sufficiently
>> often to be worth the systemwide complexity in handling them (being
>> not Just Strings), within Tahoe and all client software?
>>
>> If someone knows they have various-encodings filenames then they can
>> just pretend they're Latin-1 -- no information will be lost.
>
> Hmmm.  That is certainly a very simple solution.
>
> Just to make sure I understand you, you're suggesting that Tahoe  
> clients who are uploading files do the following:

Actually, I wasn't so much suggesting a specific behavior for
clients as trying to avoid complicating Tahoe's internals. But I do
have a plan for your scenario:
>
> (2) If the locale decoder can't parse the name, convert it to
> Unicode using the latin1 decoder.  This will always work because
> latin1 allows all values from 0x00 to 0xFF.

No. (2) is not automatic; rather, the user sets the locale, or
tells Tahoe "pretend my locale is latin1".
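
A minimal sketch of that explicit override, in Python 3. The function
name decode_filename and the pretend_latin1 parameter are hypothetical,
made up for illustration; they are not Tahoe's API. The point it shows
is that nothing falls back to Latin-1 silently: an undecodable name is
an error unless the user asked for the override.

```python
def decode_filename(raw: bytes,
                    locale_encoding: str = "utf-8",
                    pretend_latin1: bool = False) -> str:
    """Turn on-disk filename bytes into a Unicode string for upload.

    Hypothetical helper: `pretend_latin1` stands for the user telling
    the client "pretend my locale is latin1".
    """
    if pretend_latin1:
        # Latin-1 maps every byte 0x00-0xFF to the code point with the
        # same value, so this decode cannot fail.
        return raw.decode("latin-1")
    # Strict decode: raises UnicodeDecodeError instead of guessing.
    return raw.decode(locale_encoding)
```

With the override off, a name such as b"caf\xe9" under a UTF-8 locale
raises UnicodeDecodeError; with it on, the same bytes decode to a
four-character string.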

> Tahoe clients downloading files simply retrieve the UTF-8 name and  
> convert it to the locale encoding.

Yes, but respecting the above override.

> The downside, of course, is that when files with such funky names are
> retrieved, they'll be wrong on EVERY platform.

They will be not-wrong to the original uploader when he downloads with  
the same settings.
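
The round trip can be checked directly. This is a sketch in Python 3,
assuming the name is stored as UTF-8 and that both upload and download
use the "pretend latin1" setting; the function names are hypothetical.

```python
def upload(raw: bytes) -> bytes:
    # Bytes -> code points 0-255 -> stored UTF-8 name.
    return raw.decode("latin-1").encode("utf-8")


def download(stored: bytes) -> bytes:
    # Stored UTF-8 name -> code points -> original bytes.
    return stored.decode("utf-8").encode("latin-1")


# Every possible byte value survives the trip unchanged.
all_bytes = bytes(range(256))
assert download(upload(all_bytes)) == all_bytes
```

A downloader with a different setting (say, a UTF-8 locale) would
instead write the UTF-8 encoding of those code points, which is where
the name comes out wrong.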

There could also be a flag bit on the filenames which says "this was  
uploaded in the byte-preserving mode" and triggers the reverse when  
downloading to a compatible filesystem. The advantage of this over  
having a "byte-or-Unicode-string" type is that it is always acceptable  
for software which just doesn't do raw bytes (e.g. web interfaces) to  
ignore that bit, rather than being required to handle it.  
Disadvantage: I'm downloading to my filesystem and I expect all my  
filenames to be valid UTF-8 and am surprised.
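
One way such a flag might look, sketched in Python 3. StoredName, its
byte_preserving field, and the honor_flag parameter are all
hypothetical names for illustration, not an existing Tahoe format. The
stored name is always an ordinary Unicode string, so software that
ignores the flag still gets something valid.

```python
from dataclasses import dataclass


@dataclass
class StoredName:
    name: str                      # always a valid Unicode string
    byte_preserving: bool = False  # hypothetical flag bit


def upload_name(raw: bytes, locale_encoding: str,
                byte_preserving: bool = False) -> StoredName:
    if byte_preserving:
        # User asked for byte-preserving mode: bytes become code points.
        return StoredName(raw.decode("latin-1"), byte_preserving=True)
    return StoredName(raw.decode(locale_encoding))


def download_name(stored: StoredName, fs_encoding: str,
                  honor_flag: bool = True) -> bytes:
    if stored.byte_preserving and honor_flag:
        # Reverse the upload step to recover the original bytes.
        return stored.name.encode("latin-1")
    # Clients that don't do raw bytes just treat the name as text.
    return stored.name.encode(fs_encoding)
```

Honoring the flag is what produces the surprise mentioned above: the
bytes written to disk may not be valid in the filesystem's encoding.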

Also, I think "some-other-encoding bytes treated as codepoints and
stuffed into UTF-8" is a not-unheard-of encoding failure mode, and so
it might not be too hard to recognize and repair.
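
A rough heuristic for that recognition, sketched in Python 3. The
function name is hypothetical, and the caller has to guess the real
encoding. With UTF-8 as the guess the test is fairly strong, since
ordinary Latin-1 text rarely forms valid UTF-8; with a single-byte
guess such as KOI8-R almost anything decodes, so a match proves little.

```python
from typing import Optional


def repair_bytes_as_codepoints(name: str,
                               real_encoding: str = "utf-8") -> Optional[str]:
    """Return the repaired name, or None if `name` does not look like
    bytes that were treated as code points."""
    if all(ord(c) < 0x80 for c in name):
        return None  # pure ASCII: nothing to repair
    try:
        # Undo the "bytes treated as code points" step.
        raw = name.encode("latin-1")
    except UnicodeEncodeError:
        return None  # has code points above U+00FF, so not this failure
    try:
        return raw.decode(real_encoding)
    except UnicodeDecodeError:
        return None  # the bytes are not valid in the guessed encoding


# UTF-8 bytes for "café" that were misread as Latin-1 code points:
broken = "caf\u00e9".encode("utf-8").decode("latin-1")
assert repair_bytes_as_codepoints(broken) == "caf\u00e9"
```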

-- 
Kevin Reid                            <http://homepage.mac.com/kpreid/>