[tahoe-dev] Unicode issues review
Shawn Willden
shawn-tahoe at willden.org
Tue Feb 17 20:56:55 PST 2009
On Tuesday 17 February 2009 09:12:51 pm Kevin Reid wrote:
> What I'm thinking is:
>
> Will supporting unknown-bunch-of-bytes filenames be used sufficiently
> often to be worth the systemwide complexity in handling them (being
> not Just Strings), within Tahoe and all client software?
>
> If someone knows they have various-encodings filenames then they can
> just pretend they're Latin-1 -- no information will be lost.
Hmmm. That is certainly a very simple solution.
Just to make sure I understand you, you're suggesting that Tahoe clients who
are uploading files do the following:
(1) Attempt to convert the filename to Unicode using the locale decoder. If
it succeeds, fine.
(2) If the locale decoder can't parse the name, convert it to Unicode using
the latin1 decoder. This will always work because the latin1 decoder maps
every byte value from 0x00 to 0xFF to a Unicode character.
(3) In either case, convert the Unicode name to UTF-8 and store that in the
dirnode.
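
A minimal sketch of steps (1)-(3), assuming Python 3 and a filename that
arrives as raw bytes. The function name and the locale_encoding parameter
are hypothetical and are not part of Tahoe:

```python
def filename_to_utf8(raw: bytes, locale_encoding: str) -> bytes:
    """Return the UTF-8 form of a filename to store (hypothetical helper)."""
    try:
        # Step (1): try the locale decoder first (strict by default).
        name = raw.decode(locale_encoding)
    except UnicodeDecodeError:
        # Step (2): fall back to latin1, which accepts every byte value.
        name = raw.decode("latin-1")
    # Step (3): in either case, store the name as UTF-8.
    return name.encode("utf-8")


# Valid UTF-8 under a UTF-8 locale passes through unchanged.
stored_ok = filename_to_utf8(b"caf\xc3\xa9", "utf-8")
# The same word in latin1 bytes is invalid UTF-8, so the fallback is used.
stored_fallback = filename_to_utf8(b"caf\xe9", "utf-8")
```

In this example both inputs end up stored as the same UTF-8 name, so the
stored name alone does not say which path was taken.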
Tahoe clients downloading files simply retrieve the UTF-8 name and convert it
to the locale encoding.
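
The download side might look like the sketch below, again assuming Python 3
and a hypothetical function name. Note that a strict encode raises
UnicodeEncodeError when the locale encoding cannot represent a character;
how to handle that case is not covered here:

```python
def utf8_to_local_name(stored: bytes, locale_encoding: str) -> bytes:
    """Convert a stored UTF-8 name to locale bytes (hypothetical helper)."""
    # Decode the stored UTF-8 name back to a Unicode string.
    name = stored.decode("utf-8")
    # Encode for the local filesystem; raises UnicodeEncodeError if the
    # locale encoding has no representation for some character.
    return name.encode(locale_encoding)


local_name = utf8_to_local_name(b"caf\xc3\xa9", "latin-1")
```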
The downside, of course, is that when files with such funky names are
retrieved, the names will be wrong on EVERY platform, including the one they
were uploaded from. The information needed to
recover the original byte string will still be present -- all that's needed
is to take the Unicode string and encode it using latin1 to recover the
original byte string -- but there will be no indication that this needs to be
done.
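
As an illustration of that recovery step, here is a small Python 3 sketch.
The function name is hypothetical, and the sample bytes are arbitrary values
chosen only because they are not valid UTF-8:

```python
def recover_original_bytes(stored_utf8: bytes) -> bytes:
    """Undo the latin1 fallback for a stored name (hypothetical helper)."""
    # Decode the stored UTF-8 name, then encode with latin1 to get back
    # the exact bytes that were originally decoded with latin1.
    return stored_utf8.decode("utf-8").encode("latin-1")


original = b"\xb1\xe8"  # bytes in some unknown encoding
# What the latin1 fallback would have stored for these bytes.
stored = original.decode("latin-1").encode("utf-8")
recovered = recover_original_bytes(stored)
```

The round trip gives back the original bytes, but nothing in the stored name
marks it as one that needs this treatment, which is the problem described
above.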
The upside is simple and easy handling of all cases.
Hmmm.
Shawn.