[tahoe-dev] Unicode issues review
zooko at zooko.com
Tue Feb 17 10:09:09 PST 2009
On Feb 17, 2009, at 11:03 AM, Jan-Benedict Glaw wrote:
> This happens.
> So technically, "encoding" is a per-file property on some
> filesystems (those that don't care about a filename's contents, as
> long as it doesn't contain the directory delimiter (typically '/'
> or '\\') or the '\0' (end of string)).
Ugh. This would be fine if the filesystem stored and provided the
information about what encoding was used for each name, but I'm
betting they don't do that. :-)
So, what should Tahoe do?
1. Always treat filenames as opaque blobs. This means Tahoe is
losing information that some filesystems (e.g. NTFS) provide, and
making it harder for users on the other side of Tahoe to
unambiguously decode those filenames.
2. If the filesystem guarantees a specific encoding, use that one,
else treat the filename as an opaque blob.
3. If the filesystem guarantees a specific encoding, use that one,
else if it provides a "default" encoding, then try to decode with
that one, and if decoding fails then reject the filename and ask the
user to fix it up.
3.b. ... and if decoding fails then treat the filename as an opaque
blob.
3.c. ... and if decoding fails then try to decode it with a few
dozen of our favorite encodings in descending order of popularity ...
4. Any other options?
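
To make the differences between options 2, 3, 3.b and 3.c concrete, here is a rough Python sketch. It is not Tahoe code: the function name `decode_filename`, its parameters, the `("text", ...)` / `("opaque", ...)` return convention, and the short candidate list are all made up for illustration. It also assumes that when the filesystem offers neither a guaranteed nor a default encoding, the name is kept as an opaque blob, which option 3 above does not actually specify.

```python
def decode_filename(raw, guaranteed=None, default=None, on_failure="reject",
                    candidates=("utf-8", "shift_jis", "latin-1")):
    """Classify a raw filename (bytes) as ("text", str) or ("opaque", bytes).

    guaranteed -- encoding the filesystem promises (hypothetical parameter)
    default    -- encoding the platform merely suggests (hypothetical parameter)
    on_failure -- "reject" (option 3), "opaque" (3.b) or "guess" (3.c)
    """
    if guaranteed is not None:
        # Options 2 and 3: trust the filesystem's guarantee.
        return ("text", raw.decode(guaranteed))

    if default is None:
        # Assumption: with no hint at all, keep the bytes untouched.
        return ("opaque", raw)

    try:
        return ("text", raw.decode(default))
    except UnicodeDecodeError:
        pass

    if on_failure == "opaque":
        # Option 3.b: fall back to an opaque blob.
        return ("opaque", raw)

    if on_failure == "guess":
        # Option 3.c: try candidates in (assumed) order of popularity.
        # A guess can succeed and still be wrong: latin-1 accepts any bytes.
        for encoding in candidates:
            try:
                return ("text", raw.decode(encoding))
            except UnicodeDecodeError:
                continue
        return ("opaque", raw)

    # Option 3: reject and let the user fix the name.
    raise ValueError("cannot decode filename %r as %s" % (raw, default))
```

Note that option 3.c almost never fails once a single-byte encoding such as latin-1 is in the candidate list, so a "successful" guess says little about whether the resulting name is the one the user intended.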