[tahoe-lafs-trac-stream] [tahoe-lafs] #731: what to do with filenames that are illegal on some systems

Mon Jan 9 20:04:22 UTC 2012

#731: what to do with filenames that are illegal on some systems
-------------------------+-------------------------------------------------
     Reporter:  zooko    |      Owner:
         Type:  defect   |     Status:  new
     Priority:  major    |  Milestone:  eventually
    Component:  code-    |    Version:  1.4.1
  dirnodes               |   Keywords:  forward-compatibility i18n unicode
   Resolution:           |  names
Launchpad Bug:           |
-------------------------+-------------------------------------------------

Comment (by davidsarah):

 Replying to [comment:17 zooko]:
 > stringprep (RFC 3454) seems like a useful standard:
 >
 > http://www.ietf.org/rfc/rfc3454.txt

 stringprep is one of the worst ideas ever to come out of an IETF Working
 Group.

 Unicode is a semantic character encoding standard; that is, it makes a
 valiant attempt to unify or disunify characters based on distinctions in
 meaning and usage, as opposed to visual appearance. A simple example of
 this is that Latin 'p' looks identical to Cyrillic 'р', but they are
 completely different letters that don't even sound the same. Some people
 might consider that to be a problem, but actually it's just a fact about
 human scripts.
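
 To make the 'p' versus 'р' example concrete, here is a minimal sketch using
 Python's standard unicodedata module. The variable names are made up for
 illustration. It only shows that the two letters are separate code points
 with separate names, and that normalization does not unify them.

```python
import unicodedata

latin_p = "p"            # U+0070
cyrillic_er = "\u0440"   # U+0440, drawn like Latin 'p' in many fonts

# Print the code point and official Unicode name of each letter.
for ch in (latin_p, cyrillic_er):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# They are different characters, before and after normalization.
same_code_point = latin_p == cyrillic_er
same_after_nfkc = (unicodedata.normalize("NFKC", latin_p)
                   == unicodedata.normalize("NFKC", cyrillic_er))
print(same_code_point, same_after_nfkc)
```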

 The Internationalized Domain Name Working Group got a bee in their bonnet
 about some characters being "confusingly" similar. Now, given that some
 commonly used characters are semantically distinct but look ''identical''
 in related fonts, you might think it a quixotic task to deal with the
 tens of thousands of characters that only look ''similar'' to some other
 character. That didn't stop the WG from arguing about it interminably and
 coming up with stringprep to placate the people on one side of the
 argument, even though stringprep doesn't really solve that issue at all.

 There are indeed some characters that we don't want to use; I call them
 "junk characters". The polite term for junk characters is "compatibility
 characters", most of which are "compatibility composites" as defined in
 section 2.3 of the Unicode Standard. These characters are in Unicode only
 because some national body insisted on round-tripping between Unicode and
 its misdesigned legacy standard. (That could have been done in other ways
 that would have been more technically elegant than assigning many ad hoc
 character variants, but that's water under the bridge.)
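
 One rough way to spot many of these characters programmatically is to
 check for a compatibility decomposition. The helper name below is
 hypothetical, and the check is only an approximation: it catches
 characters whose decomposition carries a tag such as <compat> or <wide>,
 not every character the Unicode Standard classes as a compatibility
 character.

```python
import unicodedata

def is_compatibility_character(ch: str) -> bool:
    """Hypothetical helper: True if ch has a compatibility decomposition.

    unicodedata.decomposition() returns a string such as
    '<compat> 0066 0069' for compatibility mappings, '0065 0301' for
    canonical ones, and '' when there is no decomposition at all.
    """
    return unicodedata.decomposition(ch).startswith("<")

samples = ["\ufb01", "\u2460", "\uff21", "\u00e9", "a"]
for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"{is_compatibility_character(ch)}")
```

 The ligature, circled digit and fullwidth letter are flagged; the
 canonically decomposable 'é' and plain 'a' are not.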

 The right place to implement "don't use junk characters" is in input
 methods. That is, if a user can never type a junk character, then it's
 much less likely that its existence will cause a problem. More
 specifically, if a user can only type non-junk characters in some
 normalization form (preferably NFC), then name lookups based on exact
 matching, as needed for filenames and other identifiers, are more likely
 to work.
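
 A small sketch of why a single normalization form matters for exact
 matching. The directory here is just a Python dict standing in for a
 name-to-entry table, and calling normalize() in the lookup is only a
 stand-in for an input method that would produce NFC in the first place.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Store the name in NFC, as an NFC-only input method would have typed it.
directory = {unicodedata.normalize("NFC", composed): "entry"}

# Exact matching fails for the same visible name in a different form...
raw_hit = decomposed in directory
# ...and succeeds once both sides are in NFC.
nfc_hit = unicodedata.normalize("NFC", decomposed) in directory
print(raw_hit, nfc_hit)
```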

 The '''wrong''' thing to do is what stringprep tries to do, which is to
 map junk characters to somebody's idea of the nearest non-junk characters.
 This just causes unintended name collisions and breakage, and doesn't get
 any closer to solving the unsolvable issue of confusable characters.
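
 The collision problem can be sketched with NFKC, the normalization that
 stringprep profiles such as Nameprep apply. This is an illustration under
 that assumption, not a full stringprep implementation, and the filenames
 are invented.

```python
import unicodedata

# Two distinct names; the first starts with U+FB01 LATIN SMALL LIGATURE FI.
names = ["\ufb01le.txt", "file.txt"]

mapped = {}       # mapped name -> original name that claimed it first
collisions = []   # (earlier original, later original) pairs
for name in names:
    key = unicodedata.normalize("NFKC", name)
    if key in mapped:
        collisions.append((mapped[key], name))
    else:
        mapped[key] = name

print(collisions)
```

 Both names map to the same key, so one of them becomes unreachable under
 a mapping scheme even though the two were distinct when created.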

-- 
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/731#comment:18>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage