[tahoe-dev] Unicode issues review
Kevin Reid
kpreid at mac.com
Wed Feb 18 03:42:16 PST 2009
On Feb 17, 2009, at 23:56, Shawn Willden wrote:
> On Tuesday 17 February 2009 09:12:51 pm Kevin Reid wrote:
>> What I'm thinking is:
>>
>> Will supporting unknown-bunch-of-bytes filenames be used sufficiently
>> often to be worth the systemwide complexity in handling them (being
>> not Just Strings), within Tahoe and all client software?
>>
>> If someone knows they have various-encodings filenames then they can
>> just pretend they're Latin-1 -- no information will be lost.
>
> Hmmm. That is certainly a very simple solution.
>
> Just to make sure I understand you, you're suggesting that Tahoe
> clients who are uploading files do the following:
Actually, I wasn't so much suggesting a specific behavior for
clients as trying to avoid complicating Tahoe's internals. But I do
have a plan for your scenario:
>
> (2) If the locale decoder can't parse the name, convert it to Unicode
> using the latin1 decoder. This will always work because latin1 allows
> all values from 0x00 to 0xFF.
No. (2) is not automatic; rather, the user sets the locale, or
tells Tahoe "pretend my locale is latin1".
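A rough sketch of that upload-side rule, in Python. The function name and the pretend_latin1 parameter are made up for illustration; they are not Tahoe APIs, and the sketch assumes names are stored as UTF-8 as described in this thread.

```python
def filename_to_unicode(raw: bytes, locale_encoding: str,
                        pretend_latin1: bool = False) -> str:
    """Decode on-disk filename bytes into the Unicode name to upload.

    Hypothetical helper: `pretend_latin1` stands in for the user telling
    the client "pretend my locale is latin1".
    """
    # Latin-1 maps every byte 0x00..0xFF to the code point of the same
    # value, so decoding with it never fails and loses no information.
    encoding = "latin-1" if pretend_latin1 else locale_encoding
    # Without the override, an undecodable name raises UnicodeDecodeError
    # rather than being silently reinterpreted.
    return raw.decode(encoding)


raw = b"caf\xe9"  # Latin-1 bytes; not valid UTF-8
name = filename_to_unicode(raw, "utf-8", pretend_latin1=True)
stored = name.encode("utf-8")  # the form that would be stored
```

The point of making the override explicit is that a failed decode stays visible to the user instead of quietly turning into a Latin-1 guess.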
> Tahoe clients downloading files simply retrieve the UTF-8 name and
> convert it to the locale encoding.
Yes, but respecting the above override.
> The downside, of course, is that when files with such funky names are
> retrieved, they'll be wrong on EVERY platform.
They will be not-wrong to the original uploader when he downloads with
the same settings.
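To illustrate the round trip, here is a small Python sketch. The helpers upload_name and download_name are hypothetical, and the stored form is assumed to be UTF-8.

```python
def upload_name(raw: bytes, encoding: str) -> bytes:
    """Decode local filename bytes and return the UTF-8 form to store."""
    return raw.decode(encoding).encode("utf-8")


def download_name(stored: bytes, encoding: str) -> bytes:
    """Turn the stored UTF-8 name back into local filename bytes."""
    return stored.decode("utf-8").encode(encoding)


# Every byte value a POSIX filename could contain (NUL excluded).
original = bytes(range(1, 256))

# Same setting on both sides: the original bytes come back unchanged.
same_settings = download_name(upload_name(original, "latin-1"), "latin-1")

# Different setting on download: the name is valid but the bytes differ.
other_settings = download_name(upload_name(b"caf\xe9", "latin-1"), "utf-8")
```

Here same_settings equals original, while other_settings is the UTF-8 encoding of the Latin-1 reading of the name, which is the "wrong on other platforms" case.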
There could also be a flag bit on the filenames which says "this was
uploaded in the byte-preserving mode" and triggers the reverse when
downloading to a compatible filesystem. The advantage of this over
having a "byte-or-Unicode-string" type is that it is always acceptable
for software which just doesn't do raw bytes (e.g. web interfaces) to
ignore that bit, rather than being required to handle it.
Disadvantage: if I'm downloading to my filesystem and expect all my
filenames to be valid UTF-8, I'll be surprised.
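The flag-bit idea might look roughly like this in Python. StoredName, its byte_preserving field, and both functions are invented for the sketch; nothing here reflects Tahoe's actual directory format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredName:
    """A filename as stored: always a Unicode string, plus one flag bit."""
    name: str
    byte_preserving: bool = False  # True if uploaded via the Latin-1 trick


def display_name(entry: StoredName) -> str:
    # Software that doesn't do raw bytes (e.g. a web interface) can
    # ignore the flag entirely and still has a valid string to show.
    return entry.name


def local_filename(entry: StoredName, locale_encoding: str) -> bytes:
    # A downloader writing to a byte-oriented filesystem reverses the
    # Latin-1 decoding to recover the uploader's exact bytes.
    if entry.byte_preserving:
        return entry.name.encode("latin-1")
    return entry.name.encode(locale_encoding)


flagged = StoredName(b"caf\xe9".decode("latin-1"), byte_preserving=True)
plain = StoredName("caf\u00e9")
```

With a UTF-8 locale, local_filename(flagged, "utf-8") gives back the original non-UTF-8 bytes, which is exactly the surprise mentioned above, while plain comes out as ordinary UTF-8.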
Also, I think "some-other-encoding bytes treated as codepoints and
stuffed into UTF-8" is a not-unheard-of encoding failure mode, so
it might not be too hard to recognize and repair.
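One possible heuristic for recognizing and repairing that failure mode, sketched in Python. The function try_repair and its candidate_encoding parameter are hypothetical, and the check is only a guess: it can misfire on names that legitimately contain such characters.

```python
from typing import Optional


def try_repair(name: str, candidate_encoding: str = "utf-8") -> Optional[str]:
    """Return a repaired name, or None if `name` does not look mangled."""
    try:
        # Undo the "bytes treated as code points" step. This only works
        # if every character is in the range U+0000..U+00FF.
        raw = name.encode("latin-1")
    except UnicodeEncodeError:
        return None
    if raw.isascii():
        return None  # pure ASCII reads the same either way
    try:
        # If the recovered bytes are valid in the guessed encoding,
        # assume that was the encoding the uploader really meant.
        return raw.decode(candidate_encoding)
    except UnicodeDecodeError:
        return None


mangled = "caf\u00e9".encode("utf-8").decode("latin-1")
repaired = try_repair(mangled)
```

The candidate encoding has to be guessed or supplied by the user; UTF-8 is a convenient default because random Latin-1 text is rarely valid UTF-8.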
--
Kevin Reid <http://homepage.mac.com/kpreid/>