[tahoe-dev] Handling decoding errors
David-Sarah Hopwood
david-sarah at jacaranda.org
Wed May 6 19:32:56 PDT 2009
Here is my proposal for handling decoding errors that takes into
account some of the points raised in response to Zooko's approach.
IMHO, it is highly desirable that copying a file to Tahoe,
*then through an arbitrary sequence of intermediate filesystems*,
then back via Tahoe to the original filesystem, should preserve
the original filename.
This should not depend on preserving metadata associated with
the file; such metadata is too fragile.
I believe that the proposal below satisfies this goal. It may
seem complicated, but it solves more problems, as discussed in
the 'Advantages' section below.
Definitions:
- a "Tahoe filename" is an NFC-normalized Unicode string.
Tahoe filenames do not require any additional flags or
other fields.
- a "Unicode filename" is a Unicode string.
- a "byte-oriented filename" is a byte string.
- a "portable byte" is a byte whose value is the ASCII
code of a character from the POSIX portable filename
character subset.
- a "%%-encoding" is an encoding of a byte string into
an ISO-Latin-1 string, in which each byte value may be
encoded either as the ISO-Latin-1 character with that
code, or as a %HH sequence in which HH is a
case-insensitive hexadecimal value. The resulting
encoding is prefixed with %%.
- a "canonical %%-encoding" is a %%-encoding in which
all non-portable bytes are encoded, no portable bytes
are encoded, and all of the hexadecimal digits in %HH
sequences are uppercase.
- a "%U-encoding" is an encoding of a (not necessarily
valid) Unicode string into another (always valid) Unicode
string, in which each Unicode character may be encoded
either as itself, or as a %HHHH sequence in which HHHH
is a 4-digit case-insensitive hexadecimal representation
of a code point value or an unpaired surrogate code, or
as a %+HHHHHH sequence in which HHHHHH is a 6-digit
case-insensitive hexadecimal representation of a
code point value greater than 0xFFFF. The resulting
encoding is prefixed with %u or %U.
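To make the %%-encoding definitions concrete, here is a minimal Python sketch. The function names are illustrative and not part of the proposal. It also assumes that a '%' not followed by two hexadecimal digits is kept as a literal byte when decoding, which the definitions above leave unspecified.

```python
import re
import string

# POSIX portable filename character set: A-Z a-z 0-9 . _ -
PORTABLE = frozenset((string.ascii_letters + string.digits + "._-").encode("ascii"))
_HH = re.compile(r"%([0-9A-Fa-f]{2})")


def pp_encode_canonical(raw: bytes) -> str:
    """Canonical %%-encoding: portable bytes stay literal, all others become uppercase %HH."""
    return "%%" + "".join(chr(b) if b in PORTABLE else "%%%02X" % b for b in raw)


def pp_decode(name: str) -> bytes:
    """Decode any %%-encoding (canonical or not) back to the original byte string."""
    if not name.startswith("%%"):
        raise ValueError("not a %%-encoding")
    # Turn each %HH into the ISO-Latin-1 character with that code (hex is case-insensitive).
    # Assumption: a '%' that does not start a %HH sequence is left as a literal '%'.
    body = _HH.sub(lambda m: chr(int(m.group(1), 16)), name[2:])
    # Every remaining character must be ISO-Latin-1; otherwise this raises UnicodeEncodeError.
    return body.encode("latin-1")


def pp_canonicalize(name: str) -> str:
    """Re-encode an arbitrary %%-encoding in canonical form."""
    return pp_encode_canonical(pp_decode(name))
```

For example, pp_encode_canonical(b"caf\xe9 1.txt") gives "%%caf%E9%201.txt", and pp_decode accepts both that form and non-canonical variants such as "%%caf%e9%201.txt".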
The basic idea is that we use %%-encoded strings to
represent filenames resulting from a failed decoding of
a byte string, and %U-encoded strings to represent Unicode
filenames that could not otherwise be represented by a
particular filesystem. (The %% and %U encodings are never
mixed.)
%U-encoded strings are never needed in a native Tahoe
filesystem, since it can represent arbitrary Unicode
strings without significant restrictions (other than
requiring NFC normalization).
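As an illustration of the %U-encoding defined earlier, decoding could look roughly like the following Python sketch. The name u_decode is hypothetical, and treating a '%' that does not begin a well-formed %HHHH or %+HHHHHH sequence as a literal character is an assumption of the sketch.

```python
import re

# %+HHHHHH (6 hex digits) or %HHHH (4 hex digits), case-insensitive.
_U_ESCAPE = re.compile(r"%\+([0-9A-Fa-f]{6})|%([0-9A-Fa-f]{4})")


def u_decode(name: str) -> str:
    """Decode a %U-encoded string; the result may contain unpaired surrogates."""
    if name[:2] not in ("%u", "%U"):
        raise ValueError("not a %U-encoding")

    def unescape(match):
        # chr() raises ValueError for values above 0x10FFFF.
        return chr(int(match.group(1) or match.group(2), 16))

    # Assumption: any '%' that does not start a valid escape is kept as-is.
    return _U_ESCAPE.sub(unescape, name[2:])
```

With this sketch, u_decode("%Uprn") returns "prn" and u_decode("%Ua%003Fb") returns "a?b".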
Conversions between Tahoe filenames and Unicode or
byte-oriented filenames would work as follows:
- to convert a byte-oriented filename to a Tahoe filename,
first attempt to use the alleged filesystem decoding
with the strict flag.
If the filename can be strictly decoded and the result
starts with %u or %U, then also %U-decode it.
If the filename cannot be strictly decoded, then use
a canonical %%-encoding of the original byte string as
the Tahoe filename.
- to convert a Tahoe filename starting with %% and
containing only ISO-Latin-1 characters to a byte-oriented
filename, first %%-decode it to a byte string. Use that
byte string as the filename if it is valid for the
destination filesystem.
If the resulting filename would not be valid for the
destination filesystem (for example because the filesystem
or Tahoe has been configured to enforce a particular
encoding, or because the filesystem has reserved names
or characters), then canonicalize the original %%-encoding,
interpret the result as ASCII, and use the resulting byte
string as the filename.
- to convert any other Tahoe filename (i.e. one that does not
begin with %% or that contains non-ISO-Latin-1 characters)
to a byte-oriented filename, first attempt to use the
alleged filesystem encoding with the strict flag.
If the filename cannot be strictly encoded (either because
the encoding is not UTF-8 and cannot represent some of
the characters in the string, or because the filesystem
has reserved names or characters), then use the shortest
possible %U-encoding (with only uppercase hexadecimal
digits) that is a representable and valid filename.
- to convert a Tahoe filename to a Unicode filename, if
it is valid for the destination filesystem, then use it
as-is.
If the filename would not be valid for the destination
filesystem, then use the shortest possible %U-encoding
(with only uppercase hexadecimal digits) that is a valid
filename.
If the Tahoe filename starts with %u or %U, then treat
the initial % as being unrepresentable. This is necessary
for correct round-tripping of such names.
- to convert a Unicode filename to a Tahoe filename:
    - if it starts with %u or %U, then apply %U-decoding.
    - apply NFC normalization.
- whenever a Tahoe filename is converted to a name for a
particular filesystem, if the result is too long for
that filesystem, then fail the operation.
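The byte-oriented rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names and the is_valid callback (a stand-in for whatever validity check the destination filesystem needs) are hypothetical, and the sketch NFC-normalizes the result of a successful strict decoding because a Tahoe filename is defined as NFC-normalized.

```python
import re
import string
import unicodedata

# POSIX portable filename character set: A-Z a-z 0-9 . _ -
PORTABLE = frozenset((string.ascii_letters + string.digits + "._-").encode("ascii"))
_HH = re.compile(r"%([0-9A-Fa-f]{2})")
_U_ESCAPE = re.compile(r"%\+([0-9A-Fa-f]{6})|%([0-9A-Fa-f]{4})")


def _pp_canonical(raw: bytes) -> str:
    return "%%" + "".join(chr(b) if b in PORTABLE else "%%%02X" % b for b in raw)


def bytes_to_tahoe(raw: bytes, fs_encoding: str) -> str:
    """Byte-oriented filename -> Tahoe filename."""
    try:
        name = raw.decode(fs_encoding, "strict")
    except UnicodeDecodeError:
        # Strict decoding failed: keep the original bytes, visibly marked.
        return _pp_canonical(raw)
    if name[:2] in ("%u", "%U"):
        # The name was %U-encoded when it was written out; undo that.
        name = _U_ESCAPE.sub(
            lambda m: chr(int(m.group(1) or m.group(2), 16)), name[2:])
    # Assumption: normalize here so the result is a valid Tahoe filename.
    return unicodedata.normalize("NFC", name)


def pp_tahoe_to_bytes(name: str, is_valid) -> bytes:
    """Tahoe filename starting with %% (ISO-Latin-1 only) -> byte-oriented filename.

    is_valid(raw) is a hypothetical callback reporting whether the destination
    filesystem accepts the byte string raw as a filename.
    """
    body = _HH.sub(lambda m: chr(int(m.group(1), 16)), name[2:])
    raw = body.encode("latin-1")
    if is_valid(raw):
        return raw
    # Fall back to the canonical %%-encoding itself, stored as ASCII bytes.
    return _pp_canonical(raw).encode("ascii")
```

Note that the fallback name round-trips: reading b"%%caf%E9.txt" back with any ASCII-compatible encoding yields the Tahoe filename "%%caf%E9.txt" again, from which the original bytes can still be recovered.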
Advantages:
- there is no hidden information in a filename; it is
just an NFC-normalized valid Unicode string. Therefore,
Unicode-based APIs may always be used without any
risk of losing information.
- broken filenames are *visibly* broken, and distinguishable
by the first two characters.
- broken filenames preserve sufficient information to
fix them, even by hand.
- broken filenames contain no private-use characters,
and can easily be typed and displayed. No additional
mojibake is introduced by *detectable* decoding errors.
'%' rarely needs to be escaped in shell commands on
common OSes (the main exception is Windows batch files,
where it must be doubled), and it is valid in file and
directory names on all common OSes, despite not being in
the POSIX portable subset.
- using the %U encoding, there is a way to losslessly
encode Unicode filenames that would otherwise not be
representable, because:
    - they contain unpaired surrogates or noncharacters, or
    - the filesystem uses a byte-oriented encoding that
      can only represent a subset of Unicode characters, or
    - the filesystem has reserved names or characters
      (for example, the filename "prn" can be represented
      on a Windows system as "%Uprn").
The last point is important for security, because it
permits blindly copying files from Tahoe into any filesystem,
without risking that this will have undesirable effects
such as outputting to a device (assuming, of course,
that the Tahoe client is sufficiently paranoid about
the set of reserved names).
Original filenames beginning with %% cannot be
losslessly represented, but that restriction would be
the same across all Tahoe filesystems, and causes no
security problem. It is the fact that every filesystem
has a different set of usable filenames, and that some
filenames can have insecure effects, that is often
problematic for portable software.
- ASCII characters are always preserved even when the
assumed encoding is incorrect [*]. Because %HH sequences are
ASCII, these sequences will also be reliably preserved,
that is, a filename will not be mangled by incorrect
decoding and then further mangled by a subsequent copying
step.
[*] Note that Shift-JIS is the only commonly used 8-bit
encoding (excluding EBCDIC-based encodings) that is
*sometimes* implemented with encoding tables that do
not map all of the ASCII codes in the same way
as US-ASCII (typically 0x5C as a yen sign and 0x7E as
an overline). Don't Do That; use the Cp932 tables for
Shift-JIS. This will almost always be correct in
practice, because Cp932 is the mapping used by
Windows systems to implement Shift-JIS.
- % is rare in real-world filenames, and initial %% is even
rarer. Therefore, it is unlikely that a valid Unicode
filename would be *accidentally* clobbered by being copied
over with a name generated by incorrect decoding. It is
much more likely that such a filename resulted from a
previous incorrect decoding from the same byte sequence --
in which case it arguably *should* be clobbered, as it
would be in Zooko's proposal. One difference from Zooko's
proposal here is that the information about the filename
resulting from incorrect decoding is represented in
the filename itself, rather than in a metadata bit that
is liable to get lost when the file is copied out of
Tahoe.
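To illustrate the reserved-name case from the list above ("prn" becoming "%Uprn"), here is a rough Python sketch of converting a Tahoe filename to a Unicode filename. The names tahoe_to_unicode, bad_chars and is_valid are hypothetical, the caller supplies the destination's rules, and the sketch does not try to prove that its output is the shortest possible %U-encoding.

```python
import re

_U_ESCAPE = re.compile(r"%\+([0-9A-Fa-f]{6})|%([0-9A-Fa-f]{4})")


def u_decode(name: str) -> str:
    """Inverse operation, used when the name is read back from the filesystem."""
    return _U_ESCAPE.sub(
        lambda m: chr(int(m.group(1) or m.group(2), 16)), name[2:])


def _escape(ch: str) -> str:
    cp = ord(ch)
    return "%%%04X" % cp if cp <= 0xFFFF else "%%+%06X" % cp


def tahoe_to_unicode(name: str, bad_chars, is_valid) -> str:
    """Tahoe filename -> Unicode filename for one destination filesystem.

    bad_chars: characters the destination cannot store (assumed input).
    is_valid(name): hypothetical predicate covering reserved names, length, etc.
    """
    starts_like_u = name[:2] in ("%u", "%U")
    if not starts_like_u and is_valid(name):
        return name  # usable as-is
    out = []
    for i, ch in enumerate(name):
        must_escape = (
            ch in bad_chars
            # The initial '%' of a name starting with %u/%U is treated as unrepresentable.
            or (i == 0 and starts_like_u)
            # A literal '%' that would otherwise be read back as an escape.
            or (ch == "%" and _U_ESCAPE.match(name, i) is not None)
        )
        out.append(_escape(ch) if must_escape else ch)
    encoded = "%U" + "".join(out)
    if not is_valid(encoded):
        # For example, the result is too long for the destination.
        raise ValueError("no valid name for this filesystem")
    return encoded
```

With a Windows-like is_valid that rejects device names and the characters <>:"/\|?*, this maps "prn" to "%Uprn" and "a?b" to "%Ua%003Fb", and u_decode restores the original in both cases.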
--
David-Sarah Hopwood ⚥