[tahoe-dev] [Python-Dev] PEP 383 update: utf8b is now the error handler
Glenn Linderman
v+python at g.nevcal.com
Thu May 7 13:30:33 PDT 2009
On approximately 5/7/2009 8:40 AM, came the following characters from
the keyboard of Zooko O'Whielacronx:
> Dear Glenn Linderman and SJT:
>
> You two encoding experts who have volunteered some ideas for Tahoe
> might also be interested in this post that David-Sarah Hopwood just
> sent:
>
> http://allmydata.org/pipermail/tahoe-dev/2009-May/001717.html
Regarding this proposal, I would assume (but the proposal should
clarify) that the proposal is looking at a filename, not a pathname, and
that each directory name in a path name would be independently processed
by the algorithms in this proposal.
The proposal has a lot of merit; it avoids the use of meta-data that, as
I pointed out in yesterday's comments, could get lost by transitions
between filesystems.
Whereas my comments yesterday suggested a directory into which
transcoded files could be placed, and that that was problematic (for the
unstated reason of separating files into two buckets), this proposal
suggests reserving the %% and %u and %U file prefixes for transcoded
files. While it keeps the files in the same buckets (directories) which
is good, it raises the question of whether the prefix(es) is/are unique
enough to mostly avoid problems with name collisions.
If some prefix can be thought to be rare enough to avoid problematical
collisions, I would think it should be used consistently, just one
prefix, rather than 3 prefixes, which triple the chances for collisions.
It seems the only distinction between the 4-digit and 6-digit Unicode %U
encodings is the + after the %.
The comment that % need not be escaped from shell commands in any common
operating system makes me wonder if the author has ever heard of
Microsoft Windows, or has tried to access a file named
%%my%dear%faraway%Abby.doc
from a Windows command shell that has environment variables named "my",
"dear", and "faraway" defined.
The definitions of %% and %u encodings do not mention escaping the
escape character. While the author seems to think that % is rare in
filenames, it cannot be guaranteed to be non-existent, and so the
presence of a % character in a file name that for other reasons must be
%% or %u encoded would introduce ambiguity in the escape sequences.
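A minimal Python sketch of this ambiguity follows. The %HH byte escaping used here, and the function names, are assumptions made for illustration; they are not taken from the proposal's text. The point is only that two different names collide when a literal % is passed through unescaped, and that escaping % itself (here as %25) removes the collision.

```python
import re

def naive_encode(raw: bytes) -> str:
    """Hypothetical %HH escaping that does NOT escape a literal '%'."""
    out = []
    for byte in raw:
        if byte < 0x80:
            out.append(chr(byte))          # ASCII passes through, including '%'
        else:
            out.append("%%%02X" % byte)    # non-ASCII byte becomes %HH
    return "".join(out)

def safe_encode(raw: bytes) -> str:
    """Same hypothetical scheme, but '%' (0x25) is escaped as well."""
    out = []
    for byte in raw:
        if byte < 0x80 and byte != 0x25:
            out.append(chr(byte))
        else:
            out.append("%%%02X" % byte)
    return "".join(out)

def decode(text: str) -> bytes:
    """Reverse %HH escapes; every other character is taken literally."""
    unescaped = re.sub(
        r"%([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )
    return unescaped.encode("latin-1")

name_a = b"100\xe9"   # contains a byte that must be escaped
name_b = b"100%E9"    # contains a literal '%' followed by "E9"

# Without escaping '%', both names encode to the same string.
print(naive_encode(name_a) == naive_encode(name_b))   # True
# With '%' escaped, each name round-trips to itself.
print(decode(safe_encode(name_a)) == name_a)          # True
print(decode(safe_encode(name_b)) == name_b)          # True
```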
While a %% prefix for a filename may be quite arguably rare, the %u or
%U prefixes would, by the same argument, be less rare, and the
combination of 3 prefixes be even less rare. Perhaps the %% should be
used as a flag that the name has been transcoded, and then followed by
U, or u, or B, or b, to indicate if it is Unicode or Bytes escaping?
Any escaping scheme like this could run into length limits on the
names; some discussion of that issue should be included in such
proposals.
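As a rough illustration of the length concern, the sketch below counts how long a name becomes under a hypothetical scheme that adds a two-character prefix and turns each non-ASCII byte (and each literal %) into a three-character %HH escape. The 255-byte limit is an assumption; actual per-name limits vary by filesystem.

```python
NAME_MAX = 255  # assumed per-name limit in bytes; varies by filesystem

def escaped_length(raw: bytes) -> int:
    """Length of a name after hypothetical '%%' + %HH escaping."""
    length = 2                              # the '%%' prefix
    for byte in raw:
        if byte < 0x80 and byte != 0x25:
            length += 1                     # plain ASCII stays one character
        else:
            length += 3                     # escaped byte becomes '%HH'
    return length

raw_name = b"\xe9" * 100                    # 100 bytes, all needing escapes
print(len(raw_name), escaped_length(raw_name))        # 100 302
print(escaped_length(raw_name) > NAME_MAX)            # True
```

A name that fits comfortably in its original form can therefore exceed the limit once escaped, which is the case a proposal would need to address.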
The description of %% encoding seems unusual... there are no bytes that
do not correspond to ISO Latin-1 characters, except possibly for control
characters between 1 and 31 inclusive, if they are outlawed in Tahoe
file names (are they? Need they be?). So it seems that %% encoding
would only add a %% in front, and then be mojibake, if the byte encoding
was not originally ISO Latin-1. The %HH sequence seems an almost
unnecessary concept, unless the claimed encoding fails to decode, and
only those characters that fail to decode are then encoded via %HH
sequences. And, preexisting % bytes would seem to also need to be %HH
encoded if anything else might be.
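The point about Latin-1 can be checked with a short Python sketch. It relies on Python's "latin-1" codec, which maps all 256 byte values (control characters included) to code points; the sample filename is made up for illustration.

```python
# Every possible byte value decodes under Python's latin-1 codec.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")          # never raises
assert text.encode("latin-1") == all_bytes  # and round-trips exactly

# A name whose bytes are really UTF-8 still "decodes" as Latin-1,
# but the result is mojibake rather than the intended characters.
utf8_name = "na\u00efve.txt".encode("utf-8")
mojibake = utf8_name.decode("latin-1")
print(mojibake)                             # naÃ¯ve.txt

# By contrast, a strict decoder can fail, which is the case where
# escaping individual bytes would actually be needed.
try:
    b"\xff".decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)                        # True
```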
The %U encoding description also suffers from not mentioning preexisting
% characters in the string.
The comment that "The %% and %U encodings are never mixed" seems
impossible. I posit a POSIX file name with a non-decodable sequence in
its original encoding; this forces %% encoding inside Tahoe. If such a
name contains a ":", then when a Windows system wants to access the
file, it must be %U encoded. How is the mixture avoided? There is no
description of how to handle this case.
I think a scheme along these lines is workable, but some refinements
will be needed, and sufficient use cases should be provided to help
explain how the various schemes work together, once they are refined,
and if they do work together.
If some unique prefix can be accepted as rare enough to be used as an
encoding prefix by the Tahoe user community, then the rest of the
problems are solvable, but I think there are cases here that are not
solved yet.
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking