[tahoe-dev] Unicode issues review

Shawn Willden shawn-tahoe at willden.org
Tue Feb 17 11:04:23 PST 2009


On Tuesday 17 February 2009 11:09:09 am zooko wrote:
> 3.  If the filesystem guarantees a specific encoding, use that one,
> else if it provides a "default" encoding, then try to decode with
> that one, and if decoding fails then reject the filename and ask the
> user to fix it up.
>
> 3.b.  ... and if decoding fails then treat the filename as an opaque
> blob.

I think this is the best option.

> 3.c.  ... and if decoding fails then try to decode it with a few
> dozen of our favorite encodings in descending order of popularity ...

You could do this as well, with a fallback to 3.b. if none of them work.  I'm 
not sure how useful it is to try different encodings, though, because you're 
going to end up with the first one that decodes it "successfully", rather 
than the first one that produces a sensible result -- unless you ask the user 
which one is right, I guess.
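
To make that concrete, here is a minimal sketch (Python 3; the helper name 
first_successful and the encoding lists are made up for illustration, not 
anything in Tahoe).  A single-byte encoding such as Latin-1 accepts every 
byte sequence, so if it is tried early it "wins" even when the result is 
garbage:

```python
def first_successful(raw, encodings):
    """Return (encoding, text) for the first encoding that decodes raw
    without raising, or (None, None) if none of them do."""
    for enc in encodings:
        try:
            return enc, raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return None, None


# The filename "é.txt" as written by a system that uses UTF-8.
raw = "\u00e9.txt".encode("utf-8")  # b'\xc3\xa9.txt'

# Latin-1 maps every byte to a character, so it never fails. Trying it
# first yields mojibake (two characters where there should be one).
wrong_enc, wrong_text = first_successful(raw, ["latin-1", "utf-8"])

# Trying UTF-8 first happens to give the sensible answer for this input.
right_enc, right_text = first_successful(raw, ["utf-8", "latin-1"])
```

Both calls "succeed"; only the order of the list decides which answer comes 
back, and nothing in the bytes says which one the user meant.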

Maybe rather than trying a bunch of different encodings, just try:

(1)  The encoding for the current locale (whether specified per-file or by the 
environment).
(2)  UTF-8

And if neither of those works, then treat it as an opaque blob.  Perhaps toss 
in UTF-16 as well.  The nice thing about UTF-8 and UTF-16 is that when you 
try to decode crap with them they usually fail, rather than just silently 
giving you crap back.  Usually :-)
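
A rough sketch of that fallback order, in Python 3.  The function name 
decode_filename and its locale_encoding parameter are hypothetical, and 
returning the raw bytes is just one way to represent "opaque blob":

```python
import locale


def decode_filename(raw, locale_encoding=None):
    """Decode raw filename bytes, trying the locale encoding, then UTF-8.

    Returns a str on success. If both attempts fail, returns the original
    bytes unchanged so the caller can treat the name as an opaque blob.
    """
    if locale_encoding is None:
        # Assumption: the environment's preferred encoding stands in for
        # "the encoding for the current locale".
        locale_encoding = locale.getpreferredencoding(False)

    for enc in (locale_encoding, "utf-8"):
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are not valid in this encoding.
            # LookupError: Python does not know the encoding name.
            continue

    return raw  # opaque blob


# Locale says ASCII, but the name is really UTF-8: step (2) rescues it.
name = decode_filename(b"caf\xc3\xa9.txt", "ascii")

# Neither ASCII nor UTF-8 accepts these bytes: the name stays a blob.
blob = decode_filename(b"\xff\xfe\xfd", "ascii")
```

Note that if the locale encoding is a single-byte one such as Latin-1, step 
(1) always succeeds and step (2) is never reached.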

The non-Unicode encodings should gradually disappear, so this should break 
less and less often as time goes on.

> 4.  Any other options?

I don't see any.

	Shawn.

