[tahoe-dev] Handling decoding errors

Wed May 6 19:32:56 PDT 2009

Here is my proposal for handling decoding errors that takes into
account some of the points raised in response to Zooko's approach.

IMHO, it is highly desirable that copying a file to Tahoe,
*then through an arbitrary sequence of intermediate filesystems*,
then back via Tahoe to the original filesystem, should preserve
the original filename.

This should not depend on preserving metadata associated with
the file; such metadata is too fragile.

I believe that the proposal below satisfies this goal. It may
seem complicated, but it solves more problems, as discussed in
the 'Advantages' section below.

Definitions:

 - a "Tahoe filename" is an NFC-normalized Unicode string.
   Tahoe filenames do not require any additional flags or
   other fields.

 - a "Unicode filename" is a Unicode string.

 - a "byte-oriented filename" is a byte string.

 - a "portable byte" is a byte whose value is the ASCII
   code of a character from the POSIX portable filename
   character subset.

 - a "%%-encoding" is an encoding of a byte string into
   an ISO-Latin-1 string, in which each byte value may be
   encoded either as the ISO-Latin-1 character with that
   code, or as a %HH sequence in which HH is a two-digit
   case-insensitive hexadecimal value. The resulting
   encoding is prefixed with %%.

 - a "canonical %%-encoding" is a %%-encoding in which
   all non-portable bytes are encoded, no portable bytes
   are encoded, and all of the hexadecimal digits in %HH
   sequences are uppercase.

 - a "%U-encoding" is an encoding of a (not necessarily
   valid) Unicode string into another (always valid) Unicode
   string, in which each Unicode character may be encoded
   either as itself, or as a %HHHH sequence in which HHHH
   is a 4-digit case-insensitive hexadecimal representation
   of a code point value or an unpaired surrogate code, or
   as a %+HHHHHH sequence in which HHHHHH is a 6-digit
   case-insensitive hexadecimal representation of a
   code point value greater than 0xFFFF. The resulting
   encoding is prefixed with %u or %U.
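
As an illustration of the %%-encoding definitions above, here is a
minimal Python sketch. The function names (canonical_pp_encode,
pp_decode, pp_canonicalize) are hypothetical and not part of Tahoe.
The treatment of a stray '%' that is not followed by two hex digits
is my assumption; the definitions do not cover that case.

```python
import re
import string

# POSIX portable filename character set: A-Z a-z 0-9 . _ -
PORTABLE = frozenset((string.ascii_letters + string.digits + "._-").encode("ascii"))
_HH = re.compile(r"%([0-9A-Fa-f]{2})")


def canonical_pp_encode(raw: bytes) -> str:
    """Canonical %%-encoding: portable bytes stay literal, all others become %HH."""
    return "%%" + "".join(chr(b) if b in PORTABLE else "%%%02X" % b for b in raw)


def pp_decode(name: str) -> bytes:
    """Decode any %%-encoding (canonical or not) back to the original bytes."""
    if not name.startswith("%%"):
        raise ValueError("not a %%-encoding: missing %% prefix")
    # Assumption: a '%' not followed by two hex digits is taken literally.
    body = _HH.sub(lambda m: chr(int(m.group(1), 16)), name[2:])
    return body.encode("latin-1")  # raises ValueError on non-Latin-1 input


def pp_canonicalize(name: str) -> str:
    """Re-encode an arbitrary %%-encoding in canonical form."""
    return canonical_pp_encode(pp_decode(name))
```

Note that '%' itself is not a portable byte, so the canonical form
always writes it as %25, which keeps decoding unambiguous.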

The basic idea is that we use %%-encoded strings to
represent filenames resulting from a failed decoding of
a byte string, and %U-encoded strings to represent Unicode
filenames that could not otherwise be represented by a
particular filesystem. (The %% and %U encodings are never
mixed.)

%U-encoded strings are never needed in a native Tahoe
filesystem, since it can represent arbitrary Unicode
strings without significant restrictions (other than
requiring NFC normalization).
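
The %U-encoding can be sketched in the same way. This is only an
illustration under assumptions: pu_encode, pu_decode and the is_ok
predicate (which says whether a character is representable on the
target filesystem) are hypothetical names. The encoder escapes the
unrepresentable characters, plus any literal '%' that would otherwise
be read back as an escape sequence; it does not claim to find the
shortest encoding in every case.

```python
import re

# %+HHHHHH (6 hex digits) or %HHHH (4 hex digits), case-insensitive.
_ESC = re.compile(r"%\+([0-9A-Fa-f]{6})|%([0-9A-Fa-f]{4})")


def pu_decode(name: str) -> str:
    """Decode a %U-encoding; the result may contain unpaired surrogates."""
    if name[:2] not in ("%u", "%U"):
        raise ValueError("not a %U-encoding: missing %u/%U prefix")

    def unescape(m):
        if m.group(1) is not None:
            cp = int(m.group(1), 16)
            if cp <= 0xFFFF:
                raise ValueError("%+HHHHHH must encode a code point above 0xFFFF")
            return chr(cp)  # raises ValueError above 0x10FFFF
        return chr(int(m.group(2), 16))

    return _ESC.sub(unescape, name[2:])


def pu_encode(name: str, is_ok) -> str:
    """%U-encode name, escaping characters for which is_ok(ch) is false."""
    out = []
    for i, ch in enumerate(name):
        # A literal '%' that looks like an escape must itself be escaped,
        # otherwise decoding would not return the original string.
        looks_like_escape = ch == "%" and _ESC.match(name, i) is not None
        if is_ok(ch) and not looks_like_escape:
            out.append(ch)
        elif ord(ch) <= 0xFFFF:
            out.append("%%%04X" % ord(ch))
        else:
            out.append("%%+%06X" % ord(ch))
    return "%U" + "".join(out)
```

For example, with a predicate that rejects ':' the name "a:b"
becomes "%Ua%003Ab", and decoding that string gives "a:b" back.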

Conversions between Tahoe filenames and Unicode or
byte-oriented filenames would work as follows:

 - to convert a byte-oriented filename to a Tahoe filename,
   first attempt to use the alleged filesystem decoding
   with the strict flag.

   If the filename can be strictly decoded and the result
   starts with %u or %U, then also %U-decode it.

   If the filename cannot be strictly decoded, then use
   a canonical %%-encoding of the original byte string as
   the Tahoe filename.

 - to convert a Tahoe filename starting with %% and
   containing only ISO-Latin-1 characters to a byte-oriented
   filename, first %%-decode it to a byte string. Use that
   byte string as the filename if it is valid for the
   destination filesystem.

   If the resulting filename would not be valid for the
   destination filesystem (for example because the filesystem
   or Tahoe has been configured to enforce a particular
   encoding, or because the filesystem has reserved names
   or characters), then canonicalize the original %%-encoding,
   interpret the result as ASCII, and use the resulting byte
   string as the filename.

 - to convert any other Tahoe filename (i.e. one that does not
   begin with %% or that contains non-ISO-Latin-1 characters)
   to a byte-oriented filename, first attempt to use the
   alleged filesystem encoding with the strict flag.

   If the filename cannot be strictly encoded (either because
   the encoding is not UTF-8 and cannot represent some of
   the characters in the string, or because the filesystem
   has reserved names or characters), then use the shortest
   possible %U-encoding (with only uppercase hexadecimal
   digits) that is a representable and valid filename.

 - to convert a Tahoe filename to a Unicode filename, if
   it is valid for the destination filesystem, then use it
   as-is.

   If the filename would not be valid for the destination
   filesystem, then use the shortest possible %U-encoding
   (with only uppercase hexadecimal digits) that is a valid
   filename.

   If the Tahoe filename starts with %u or %U, then treat
   the initial % as being unrepresentable. This is necessary
   for correct round-tripping of such names.

 - to convert a Unicode filename to a Tahoe filename,
    - if it starts with %u or %U, then apply %U-decoding.
    - apply NFC normalization.

 - whenever a Tahoe filename is converted to a name for a
   particular filesystem, if the result is too long for
   that filesystem, then fail the operation.
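
The byte-oriented rules above could look roughly like the following
Python sketch. It covers only the byte-to-Tahoe direction and the
%%-prefixed branch of the Tahoe-to-byte direction; the %U fallback
for other names is left out. All names here (bytes_to_tahoe,
pp_name_to_bytes, is_valid, max_len and its default of 255) are
hypothetical. Applying NFC after a successful decode is my reading
of the definition of a Tahoe filename, not an explicit step in the
list.

```python
import re
import string
import unicodedata

PORTABLE = frozenset((string.ascii_letters + string.digits + "._-").encode("ascii"))
_HH = re.compile(r"%([0-9A-Fa-f]{2})")
_ESC = re.compile(r"%\+([0-9A-Fa-f]{6})|%([0-9A-Fa-f]{4})")


def canonical_pp_encode(raw: bytes) -> str:
    return "%%" + "".join(chr(b) if b in PORTABLE else "%%%02X" % b for b in raw)


def pp_decode(name: str) -> bytes:
    body = _HH.sub(lambda m: chr(int(m.group(1), 16)), name[2:])
    return body.encode("latin-1")


def pu_decode(name: str) -> str:
    return _ESC.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), name[2:])


def bytes_to_tahoe(raw: bytes, fs_encoding: str) -> str:
    """Byte-oriented filename -> Tahoe filename."""
    try:
        name = raw.decode(fs_encoding, "strict")
    except UnicodeDecodeError:
        # Decoding failed: keep the raw bytes, visibly marked with %%.
        return canonical_pp_encode(raw)
    if name[:2] in ("%u", "%U"):
        name = pu_decode(name)
    return unicodedata.normalize("NFC", name)


def pp_name_to_bytes(name: str, is_valid, max_len: int = 255) -> bytes:
    """%%-prefixed Latin-1 Tahoe filename -> byte-oriented filename."""
    if not (name.startswith("%%") and all(ord(c) < 256 for c in name)):
        raise ValueError("not a %%-encoded Latin-1 name; the other rule applies")
    raw = pp_decode(name)
    if not is_valid(raw):
        # Fall back to the canonical %%-encoding, written out as ASCII.
        raw = canonical_pp_encode(raw).encode("ascii")
    if len(raw) > max_len:
        raise ValueError("filename too long for the destination filesystem")
    return raw
```

With a destination that enforces UTF-8, "%%caf%E9.txt" is written
out as the ASCII bytes of "%%caf%E9.txt", and reading those bytes
back with bytes_to_tahoe yields the same Tahoe filename again.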

Advantages:

 - there is no hidden information in a filename; it is
   just an NFC-normalized valid Unicode string. Therefore,
   Unicode-based APIs may always be used without any
   risk of losing information.

 - broken filenames are *visibly* broken, and distinguishable
   by the first two characters.

 - broken filenames preserve sufficient information to
   fix them, even by hand.

 - broken filenames contain no private-use characters,
   and can easily be typed and displayed. No additional
   mojibake is introduced by *detectable* decoding errors.

   '%' does not need to be escaped in shell commands on
   any common OS, and it is valid in file and directory
   names on all common OSes, despite not being in the
   POSIX portable subset.

 - using the %U encoding, there is a way to losslessly
   encode Unicode filenames that would otherwise not be
   representable, either because:

    - they contain unpaired surrogates or noncharacters, or

    - the filesystem uses a byte-oriented encoding that
      can only represent a subset of Unicode characters, or

    - the filesystem has reserved names or characters
      (for example, the filename "prn" can be represented
      on a Windows system as "%Uprn").

   The last point is important for security, because it
   permits blindly copying files from Tahoe into any filesystem,
   without risking that this will have undesirable effects
   such as outputting to a device (assuming, of course,
   that the Tahoe client is sufficiently paranoid about
   the set of reserved names).

   Original filenames beginning with %% cannot be
   losslessly represented, but that restriction would be
   the same across all Tahoe filesystems, and causes no
   security problem. It is the fact that every filesystem
   has a different set of usable filenames, and that some
   filenames can have insecure effects, that is often
   problematic for portable software.

 - ASCII characters are always preserved even when the
   assumed encoding is incorrect [*]. Because %HH sequences are
   ASCII, these sequences will also be reliably preserved,
   that is, a filename will not be mangled by incorrect
   decoding and then further mangled by a subsequent copying
   step.

   [*] Note that Shift-JIS is the only commonly used 8-bit
       encoding (excluding EBCDIC-based encodings) that is
       *sometimes* implemented with encoding tables that do
       not map all of the ASCII codes in the same way
       as US-ASCII. Don't Do That; use the Cp932 tables for
       Shift-JIS. This will almost always be correct in
       practice, because Cp932 is the mapping used by
       Windows systems to implement Shift-JIS.

 - % is rare in real-world filenames, and initial %% is even
   rarer. Therefore, it is unlikely that a valid Unicode
   filename would be *accidentally* clobbered by being copied
   over with a name generated by incorrect decoding. It is
   much more likely that such a filename resulted from a
   previous incorrect decoding from the same byte sequence --
   in which case it arguably *should* be clobbered, as it
   would be in Zooko's proposal. One difference from Zooko's
   proposal here is that the information about the filename
   resulting from incorrect decoding is represented in
   the filename itself, rather than in a metadata bit that
   is liable to get lost when the file is copied out of
   Tahoe.
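
To make the reserved-name point concrete (the "prn" example above),
here is a rough Python sketch of the conversions between Tahoe
filenames and Unicode filenames on a Windows-like filesystem. The
function names and the RESERVED_NAMES / RESERVED_CHARS lists are
assumptions for illustration only, and the lists are deliberately
incomplete; a real client would need to be much more paranoid.

```python
import re
import unicodedata

_ESC = re.compile(r"%\+([0-9A-Fa-f]{6})|%([0-9A-Fa-f]{4})")
# Illustrative, incomplete lists (assumptions, not taken from the proposal).
RESERVED_NAMES = {"con", "prn", "aux", "nul"}
RESERVED_CHARS = set('<>:"/\\|?*')


def _escape(ch: str) -> str:
    return "%%%04X" % ord(ch) if ord(ch) <= 0xFFFF else "%%+%06X" % ord(ch)


def tahoe_to_unicode(name: str) -> str:
    """Tahoe filename -> Unicode filename for the sketched filesystem."""
    has_prefix = name[:2] in ("%u", "%U")
    if (name.lower() not in RESERVED_NAMES and not has_prefix
            and not any(c in RESERVED_CHARS for c in name)):
        return name  # valid as-is

    def must_escape(i, ch):
        if ch in RESERVED_CHARS:
            return True
        # The initial '%' of a %u/%U name is treated as unrepresentable;
        # any other '%' that would read as an escape is escaped as well.
        return ch == "%" and ((i == 0 and has_prefix)
                              or _ESC.match(name, i) is not None)

    return "%U" + "".join(_escape(c) if must_escape(i, c) else c
                          for i, c in enumerate(name))


def unicode_to_tahoe(name: str) -> str:
    """Unicode filename -> Tahoe filename (%U-decode if prefixed, then NFC)."""
    if name[:2] in ("%u", "%U"):
        name = _ESC.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), name[2:])
    return unicodedata.normalize("NFC", name)
```

Under these assumptions "prn" is stored as "%Uprn" and a Tahoe
filename that itself starts with %U, such as "%Ufoo", is stored as
"%U%0025Ufoo"; both convert back to the original name.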

-- 
David-Sarah Hopwood ⚥