[tahoe-dev] [Python-Dev] PEP 383 update: utf8b is now the error handler

Glenn Linderman v+python at g.nevcal.com
Thu May 7 13:30:33 PDT 2009


On approximately 5/7/2009 8:40 AM, came the following characters from 
the keyboard of Zooko O'Whielacronx:
> Dear Glenn Linderman and SJT:
> 
> You two encoding experts who have volunteered some ideas for Tahoe
> might also be interested in this post that David-Sarah Hopwood just
> sent:
> 
> http://allmydata.org/pipermail/tahoe-dev/2009-May/001717.html


Regarding this proposal, I would assume (but the proposal should 
clarify) that it is looking at a filename, not a pathname, and that 
each directory name in a pathname would be independently processed by 
the algorithms in this proposal.

The proposal has a lot of merit; it avoids the use of meta-data that, as 
I pointed out in yesterday's comments, could get lost by transitions 
between filesystems.

Whereas my comments yesterday suggested a directory into which 
transcoded files could be placed, and that that was problematic (for the 
unstated reason of separating files into two buckets), this proposal 
suggests reserving the %% and %u and %U file prefixes for transcoded 
files.  While it keeps the files in the same buckets (directories) which 
is good, it raises the question of whether the prefix(es) is/are unique 
enough to mostly avoid problems with name collisions.

If some prefix can be thought to be rare enough to avoid problematical 
collisions, I would think it should be used consistently, just one 
prefix, rather than 3 prefixes, which triple the chances for collisions.

Seems like the distinction between 4-digit and 6-digit Unicode %U 
encodings is the + after the %.

The comment that % need not be escaped from shell commands in any common 
operating system makes me wonder if the author has ever heard of 
Microsoft Windows, or has tried to access a file named

%%my%dear%faraway%Abby.doc

from a Windows command shell that has environment variables named "my", 
"dear", and "faraway" defined.
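
To make the hazard concrete, here is a rough Python sketch of %NAME% 
substitution. It is a simplified model only, not cmd.exe's actual 
parser (whose rules differ between interactive and batch use), and the 
variable values are made up for illustration.

```python
import re

def expand_percent_vars(text, env):
    """Simplified model of %NAME% substitution; NOT cmd.exe's exact parser."""
    def substitute(match):
        name = match.group(1)
        # In this model, undefined names are left exactly as written.
        return env.get(name, match.group(0))
    return re.sub(r"%([^%]+)%", substitute, text)

# Hypothetical environment: the three variables from the example above.
env = {"my": "1", "dear": "2", "faraway": "3"}

mangled = expand_percent_vars("%%my%dear%faraway%Abby.doc", env)
print(mangled)
```

Under this model the name comes out as %1dear3Abby.doc, which no longer 
identifies the original file. Which variables actually get expanded 
depends on how the % signs pair up, but the point stands that the name 
does not survive unquoted use.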

The definitions of %% and %u encodings do not mention escaping the 
escape character.  While the author seems to think that % is rare in 
filenames, it cannot be guaranteed to be non-existent, and so the 
presence of a % character in a file name that for other reasons must be 
%% or %u encoded would introduce ambiguity in the escape sequences.
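
The ambiguity can be shown with a small sketch. The encoder below is 
hypothetical (it only approximates the %% scheme as described in this 
message: a %% prefix plus %HH for bytes outside ASCII) and is not the 
proposal's actual algorithm. With escape_percent=False, two different 
byte strings map to the same encoded name; escaping % itself as %25 
removes the collision.

```python
def encode_name(raw: bytes, escape_percent: bool) -> str:
    """Hypothetical %%-style encoder: %% prefix, %HH for non-ASCII bytes."""
    out = []
    for byte in raw:
        if byte >= 0x80 or (escape_percent and byte == 0x25):
            out.append("%{:02X}".format(byte))  # 0x25 is '%'
        else:
            out.append(chr(byte))
    return "%%" + "".join(out)

name_a = b"caf\xe9"   # contains one non-ASCII byte
name_b = b"caf%E9"    # contains a literal '%' followed by "E9"

# Without escaping '%', both names encode identically.
collide_a = encode_name(name_a, escape_percent=False)
collide_b = encode_name(name_b, escape_percent=False)

# Escaping '%' as %25 keeps them distinct.
safe_a = encode_name(name_a, escape_percent=True)
safe_b = encode_name(name_b, escape_percent=True)
print(collide_a, collide_b, safe_a, safe_b)
```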

While a %% prefix for a filename may be quite arguably rare, the %u or 
%U prefixes would, by the same argument, be less rare, and the 
combination of 3 prefixes be even less rare.  Perhaps the %% should be 
used as a flag that the name has been transcoded, and then followed by 
U, or u, or B, or b, to indicate if it is Unicode or Bytes escaping?

Any escaping scheme like this could run into length limits on the 
names.  Some discussion of that issue should be included in such 
proposals.
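
As a back-of-the-envelope sketch of the length problem: the figures 
below assume a 255-unit name limit (common on many filesystems, but an 
assumption here), a 2-character prefix, and %HH escapes that turn each 
escaped byte into 3 characters. None of these numbers come from the 
proposal itself.

```python
NAME_LIMIT = 255  # assumed per-name limit; varies by filesystem

def escaped_length(raw: bytes, prefix_len: int = 2) -> int:
    """Length of a name after hypothetical %HH escaping plus a prefix."""
    # Count bytes that would need escaping: non-ASCII bytes and '%' (0x25).
    escaped = sum(1 for b in raw if b >= 0x80 or b == 0x25)
    unescaped = len(raw) - escaped
    return prefix_len + unescaped + 3 * escaped

# A 104-byte name that fits comfortably before escaping.
raw = b"\xe9" * 100 + b".txt"
before = len(raw)
after = escaped_length(raw)
print(before, after, after > NAME_LIMIT)
```

Here a 104-byte name grows to 306 characters, so a name that was legal 
on the source filesystem may not be representable after escaping.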

The description of %% encoding seems unusual... there are no bytes that 
do not correspond to ISO Latin-1 characters, except possibly for control 
characters between 1 and 31 inclusive, if they are outlawed in Tahoe 
file names (are they?  Need they be?).  So it seems that %% encoding 
would only add a %% in front, and then be mojibake, if the byte encoding 
was not originally ISO Latin-1.  The %HH sequence seems an almost 
unnecessary concept, unless the claimed encoding fails to decode, and 
only those characters that fail to decode are then encoded via %HH 
sequences.  And, preexisting % bytes would seem to also need to be %HH 
encoded if anything else might be.
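
One way to sketch the "escape only what fails to decode" reading is 
with a custom Python codec error handler. Everything here is an 
illustrative assumption: the handler name "percent-hh", the function 
transcode, UTF-8 as the claimed encoding, and the restriction to 
ASCII-compatible encodings (so that % is always the single byte 0x25).

```python
import codecs

def _percent_hh(error):
    """Error handler: replace undecodable bytes with %HH sequences."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    return "".join("%{:02X}".format(b) for b in bad), error.end

# "percent-hh" is a made-up handler name for this sketch.
codecs.register_error("percent-hh", _percent_hh)

def transcode(raw: bytes, claimed: str = "utf-8") -> str:
    """Hypothetical transcoder; assumes an ASCII-compatible encoding."""
    try:
        # Decodes cleanly: no prefix, no escapes. (A clean name that itself
        # starts with "%%" would still collide with the encoded namespace.)
        return raw.decode(claimed)
    except UnicodeDecodeError:
        pass
    # Escape preexisting '%' first so the %HH escapes stay unambiguous.
    protected = raw.replace(b"%", b"%25")
    return "%%" + protected.decode(claimed, errors="percent-hh")

clean = transcode(b"caf\xc3\xa9")   # valid UTF-8
dirty = transcode(b"50%\xff")       # 0xFF is never valid in UTF-8
print(clean, dirty)
```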

The %U encoding description also suffers from not mentioning 
preexisting % characters in the string.

The comment that "The %% and %U encodings are never mixed" seems 
impossible.  I posit a POSIX file name with a non-decodable sequence in 
its original encoding; this forces %% encoding inside Tahoe.  If such a 
name contains a ":", then when a Windows system wants to access the 
file, it must be %U encoded.  How is the mixture avoided?  There is no 
description of how to handle this case.
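
The problem case can be detected mechanically. The sketch below uses a 
hypothetical helper, escaping_needs, and assumes UTF-8 as the claimed 
encoding (ASCII-compatible, so reserved characters can be found by 
scanning bytes). The reserved set is the usual list of characters 
Windows disallows in file names. It only shows that a single name can 
trigger both kinds of escaping; it does not say how the proposal would 
resolve that.

```python
WINDOWS_RESERVED = set('<>:"/\\|?*')

def escaping_needs(raw: bytes, claimed: str = "utf-8") -> set:
    """Report which kinds of escaping a raw POSIX name would require."""
    needs = set()
    try:
        raw.decode(claimed)
    except UnicodeDecodeError:
        needs.add("bytes")  # undecodable: would force %% encoding
    # Latin-1 never fails to decode; used only to scan for ASCII characters.
    text = raw.decode("latin-1")
    if any(ch in WINDOWS_RESERVED for ch in text):
        needs.add("windows-reserved")  # would force %U encoding on Windows
    return needs

both = escaping_needs(b"notes:\xff.txt")
print(sorted(both))
```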

I think a scheme along these lines is workable, but some refinements 
will be needed, and sufficient use cases should be provided to help 
explain how the various schemes work together, once they are refined, 
and if they do work together.

If some unique prefix can be accepted as rare enough to be used as an 
encoding prefix by the Tahoe user community, then the rest of the 
problems are solvable, but I think there are cases here that are not 
solved yet.


-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

