[tahoe-dev] Unicode issues review

Tue Feb 17 20:56:55 PST 2009

On Tuesday 17 February 2009 09:12:51 pm Kevin Reid wrote:
> What I'm thinking is:
>
> Will supporting unknown-bunch-of-bytes filenames be used sufficiently
> often to be worth the systemwide complexity in handling them (being
> not Just Strings), within Tahoe and all client software?
>
> If someone knows they have various-encodings filenames then they can
> just pretend they're Latin-1 -- no information will be lost.

Hmmm.  That is certainly a very simple solution.

Just to make sure I understand you, you're suggesting that Tahoe clients who 
are uploading files do the following:

(1) Attempt to convert the filename to Unicode using the locale decoder.  If 
it succeeds, fine.

(2) If the locale decoder can't parse the name, convert it to Unicode using 
the latin1 decoder.  This will always work because latin1 maps every byte 
value from 0x00 to 0xFF to a character.

(3) In either case, convert the Unicode name to UTF-8 and store that in the 
dirnode.

Tahoe clients downloading files simply retrieve the UTF-8 name and convert it 
to the locale encoding.
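
To check that I have the steps right, here is a minimal Python sketch of the
scheme as I understand it.  The function names and the explicit
locale_encoding parameter are mine, purely for illustration; none of this is
existing Tahoe code.  It also assumes that a download simply fails if the
name can't be expressed in the local encoding, which the proposal doesn't
address.

```python
import locale
from typing import Optional


def name_for_dirnode(raw_name: bytes, locale_encoding: Optional[str] = None) -> bytes:
    """Upload side: turn an on-disk filename into the UTF-8 bytes to store.

    Hypothetical helper, not part of Tahoe.
    """
    if locale_encoding is None:
        locale_encoding = locale.getpreferredencoding(False)
    try:
        # Step (1): try the locale decoder first.
        name = raw_name.decode(locale_encoding)
    except UnicodeDecodeError:
        # Step (2): fall back to latin1, which accepts every byte value.
        name = raw_name.decode("latin-1")
    # Step (3): in either case, store UTF-8.
    return name.encode("utf-8")


def name_for_local_disk(stored: bytes, locale_encoding: str) -> bytes:
    """Download side: convert the stored UTF-8 name to the locale encoding.

    Assumption: raises UnicodeEncodeError if the locale can't represent it.
    """
    return stored.decode("utf-8").encode(locale_encoding)


# A name that is valid in the locale encoding survives the round trip.
good = name_for_dirnode(b"caf\xc3\xa9", "utf-8")

# A latin1-encoded name on a UTF-8 system takes the fallback path.
fallback = name_for_dirnode(b"caf\xe9", "utf-8")
```

In the second case the stored name happens to come out right, because the
stray bytes really were latin1.  Had they been in some other encoding, the
stored name would be a string of unrelated latin1 characters.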

The downside, of course, is that when files with such funky names are 
retrieved, the names will be wrong on EVERY platform, including the one that 
uploaded them.  The information needed to recover the original byte string 
will still be present -- all that's needed is to take the Unicode string and 
encode it using latin1 -- but there will be no indication that this needs to 
be done.
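
A rough Python illustration of that recovery step, and of why the missing
indication matters.  The helper name is made up for this example, and the
byte strings are sample data, not anything taken from a real dirnode.

```python
def recover_original_bytes(stored: bytes) -> bytes:
    """Undo the latin1 fallback: stored UTF-8 -> Unicode -> latin1 bytes.

    Hypothetical helper.  Raises UnicodeEncodeError if the name contains
    any character above U+00FF, since such a name cannot have come from
    the latin1 fallback.
    """
    return stored.decode("utf-8").encode("latin-1")


# Shift-JIS bytes uploaded from a UTF-8 system went through the fallback
# and were stored as the UTF-8 form of U+0082 U+00A0.
stored_fallback = b"\xc2\x82\xc2\xa0"
recovered = recover_original_bytes(stored_fallback)  # original bytes again

# A name that decoded cleanly as UTF-8 ("e" with acute accent) is stored
# the same way, and nothing marks it as different.  Applying the recovery
# here yields bytes that were never on the uploader's disk.
stored_clean = b"\xc3\xa9"
bogus = recover_original_bytes(stored_clean)
```

Both calls succeed, so a downloader can't tell from the stored name alone
which of the two results is the one the uploader actually had.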

The upside is simple and easy handling of all cases.

Hmmm.

	Shawn.