[tahoe-lafs-trac-stream] [tahoe-lafs] #731: what to do with filenames that are illegal on some systems
tahoe-lafs
trac at tahoe-lafs.org
Mon Jan 9 20:04:22 UTC 2012
#731: what to do with filenames that are illegal on some systems
-------------------------+-------------------------------------------------
Reporter: zooko | Owner:
Type: defect | Status: new
Priority: major | Milestone: eventually
Component: code-dirnodes | Version: 1.4.1
Resolution: | Keywords: forward-compatibility i18n unicode names
Launchpad Bug: |
-------------------------+-------------------------------------------------
Comment (by davidsarah):
Replying to [comment:17 zooko]:
> stringprep (RFC 3454) seems like a useful standard:
>
> http://www.ietf.org/rfc/rfc3454.txt
stringprep is one of the worst ideas ever to come out of an IETF Working
Group.
Unicode is a semantic character encoding standard; that is, it makes a
valiant attempt to unify or disunify characters based on distinctions in
meaning and usage, as opposed to visual appearance. A simple example of
this is that Latin 'p' looks identical to Cyrillic 'р', but they are
completely different letters that don't even sound the same. Some people
might consider that to be a problem, but actually it's just a fact about
human scripts.
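To make that concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The helper name `describe` is made up for illustration. It shows that the two letters are separate code points and that none of the four Unicode normalization forms unifies them.

```python
import unicodedata

latin_p = "p"            # U+0070
cyrillic_er = "\u0440"   # U+0440, drawn like Latin 'p' in most fonts

def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))

print(describe(latin_p))       # U+0070 LATIN SMALL LETTER P
print(describe(cyrillic_er))   # U+0440 CYRILLIC SMALL LETTER ER

# Check whether any normalization form maps the two letters together.
forms = ("NFC", "NFD", "NFKC", "NFKD")
unified = any(
    unicodedata.normalize(f, latin_p) == unicodedata.normalize(f, cyrillic_er)
    for f in forms
)
print(unified)                 # False
```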
The IETF Internationalized Domain Name (IDN) Working Group got a bee in
their bonnet about it being a problem that some characters are
"confusingly" similar.
Now, given that some commonly used characters are semantically distinct
but look ''identical'' in related fonts, you might think it to be a
quixotic task to somehow deal with the tens of thousands of characters
that only look ''similar'' to some other character, but that didn't stop
the WG arguing about it interminably, and coming up with stringprep in
order to placate the people on one side of the argument -- even though
stringprep doesn't really solve that issue at all.
There are indeed some characters (I call them "junk characters") that we
don't want to use. The polite term for junk characters is "compatibility
characters", most of which are "compatibility composites" as defined in
section 2.3 of the Unicode Standard. These characters are only in Unicode
because some national body insisted on round-tripping between Unicode and
their misdesigned legacy standard (which could have been done in other
ways that would have been more technically elegant than assigning many
ad-hoc character variants, but that's water under the bridge).
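As a rough way to spot such characters, the sketch below treats any character whose Unicode decomposition carries a compatibility tag (such as `<compat>` or `<wide>`) as junk. The function name `is_compat_composite` is hypothetical, and the test is only an approximation: it finds characters with a compatibility decomposition, which is most but not all of what the standard calls compatibility characters.

```python
import unicodedata

def is_compat_composite(ch):
    """True if ch has a tagged (compatibility) decomposition mapping.

    unicodedata.decomposition() returns "" for characters with no
    decomposition, a bare code point list for canonical decompositions,
    and a "<tag> ..." string for compatibility decompositions.
    """
    return unicodedata.decomposition(ch).startswith("<")

samples = [
    "\ufb01",   # LATIN SMALL LIGATURE FI, compatibility
    "\uff21",   # FULLWIDTH LATIN CAPITAL LETTER A, compatibility
    "\u00e9",   # LATIN SMALL LETTER E WITH ACUTE, canonical only
    "a",        # plain letter, no decomposition
]
for ch in samples:
    print("U+%04X %-40s junk=%s"
          % (ord(ch), unicodedata.name(ch), is_compat_composite(ch)))
```

Note that a precomposed letter such as U+00E9 is not junk by this test: it has only a canonical decomposition and is exactly what NFC produces.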
The right place to implement "don't use junk characters" is in input
methods. That is, if a user can never type a junk character, then it's
much less likely that its existence will cause a problem. More
specifically, if a user can only type non-junk characters in some
normalization form (preferably NFC), then name lookups based on exact
matching, as needed for filenames and other identifiers, are more likely
to work.
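The effect of normalization form on exact matching can be sketched as follows. This is an illustration only, not Tahoe-LAFS code: the `directory` dict, the `canonical_name` helper and the `"cap-1"` value are all made up. The same visible name typed in decomposed form misses an exact-match lookup unless both sides are in NFC.

```python
import unicodedata

def canonical_name(name):
    """Put a name into NFC so that exact matching is meaningful."""
    return unicodedata.normalize("NFC", name)

# Hypothetical directory keyed by NFC names.
directory = {canonical_name("caf\u00e9.txt"): "cap-1"}

# The same visible name, but typed as 'e' followed by COMBINING ACUTE ACCENT.
typed = "cafe\u0301.txt"

print(typed in directory)                  # False: raw exact match misses
print(canonical_name(typed) in directory)  # True: both sides are NFC
```

NFC only applies canonical equivalences, so it never merges two names that Unicode considers semantically different.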
The '''wrong''' thing to do is what stringprep tries to do, which is to
map junk characters to somebody's idea of the nearest non-junk characters.
This just causes unintended name collisions and breakage, and doesn't get
any closer to solving the unsolvable issue of confusable characters.
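A small sketch of the collision problem, using NFKC (the normalization step that stringprep profiles such as Nameprep apply) via `unicodedata.normalize` rather than a full stringprep implementation. The file names and the `mapped` helper are invented for illustration.

```python
import unicodedata

def mapped(name):
    """Map compatibility characters to their 'nearest' ordinary ones."""
    return unicodedata.normalize("NFKC", name)

# Four names that are distinct as stored.
names = ["x\u00b2.txt", "x2.txt", "\ufb01le.txt", "file.txt"]

# Group the original names by their mapped form.
by_mapped = {}
for n in names:
    by_mapped.setdefault(mapped(n), []).append(n)

# Any mapped form with more than one original name is a collision.
collisions = {k: v for k, v in by_mapped.items() if len(v) > 1}
for target, originals in collisions.items():
    print(repr(target), "<-", [ascii(o) for o in originals])
```

Here four distinct names collapse into two, so one file in each pair would silently shadow the other. Meanwhile Latin 'p' and Cyrillic 'р' are left untouched by the mapping, so the confusability issue is no closer to being solved.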
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/731#comment:18>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage