[tahoe-dev] [Python-Dev] PEP 383 update: utf8b is now the error handler

Fri May 8 02:31:57 PDT 2009

Glenn Linderman writes:
 > On approximately 5/7/2009 8:40 AM, came the following characters from 
 > the keyboard of Zooko O'Whielacronx:
 > > Dear Glenn Linderman and SJT:
 > > 
 > > You two encoding experts who have volunteered some ideas for Tahoe
 > > might also be interested in this post that David-Sarah Hopwood just
 > > sent:
 > > 
 > > http://allmydata.org/pipermail/tahoe-dev/2009-May/001717.html
 > 
 > 
 > Regarding this proposal,

I agree with everything Glenn wrote, except that I disagree with

 > I think a scheme along these lines is workable, though, but some 
 > refinements will be needed, and sufficient use cases provided to help 
 > explain how the various schemes work together, once they are refined, 
 > and if they do work together.

While great effort to disambiguate the notation is made, in the end
Tahoe only controls Tahoe filenames ... but there is no problem with
them, since they are well-specified as Unicode.  I think that the %%
notation is going to suffer from the problems that ">From" stuffing
and URL encoding do.  Programs and users are going to get confused
about whether a string has already been decoded, with at best
hilarious results.  Of course a sufficiently complex set of rules will
probably work in theory, but too often it will not be implemented
properly.  Especially not by users.

The choice of "%" as the "escape" character is unfortunate, for the
reasons Glenn gives but also because of the collision with URL
encoding.  Spidering tools and the like regularly produce URL-encoded
filenames, and this will collide with that.  E.g., as a regular visitor
to Japanese sites, I occasionally end up with URL-encoded file names on
my system when I save a page.  And if a URL-encoded filename gets
Tahoe-encoded or vice versa, you'll need to know which order to decode
in; they do not commute IIUC.  Attempting to upload a file with a
%%-encoded name is likely to produce bad results on systems that could
handle the name.
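
To make the "has this already been decoded?" problem concrete, here is a
small Python sketch using ordinary URL encoding only.  It does not model
the proposed %% notation (whose exact rules I am not assuming here); the
file names are made up for illustration.

```python
from urllib.parse import quote, unquote

# A hypothetical original name, URL-encoded once and then (by mistake) twice.
original = "report 100%.txt"
once = quote(original)   # 'report%20100%25.txt'
twice = quote(once)      # 'report%2520100%2525.txt'

# Decoding the wrong number of times silently yields a different name.
assert unquote(once) == original
assert unquote(twice) == once          # one decode is not enough here
assert unquote(unquote(twice)) == original

# A file whose literal name happens to look URL-encoded (e.g. saved by a
# browser).  Nothing in the string says whether it is raw or encoded.
literal = "sale%20price.txt"
print(unquote(literal))  # a tool assuming "encoded" changes the name
print(quote(literal))    # a tool assuming "raw" escapes the '%' again
```

Stacking a second escape layer that also uses "%" on top of this only
adds more states that a name could be in.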

More positive suggestions:

If nonetheless you decide to use such an encoding, a similar
possibility that avoids collision with URL encoding would be to
represent names unrepresentable on the target file system using the
old Mac OS convention of representing a high-bit-set octet with ":XX"
where the Xs are of course uppercase hex digits.  Another possibility
would be simply to use a leading ":" to signal that all of the
characters in the name are hex digits.  Of course both imply that a
file whose name already starts with ":" must be hex-encoded.
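
A minimal sketch of the ":XX" idea in Python, to show the shape of it.
The function names are hypothetical, and escaping every ":" (not just a
leading one) is my assumption to keep decoding unambiguous; it is not
taken from the old Mac OS convention itself.

```python
HEX_DIGITS = "0123456789ABCDEF"

def colon_encode(raw: bytes) -> str:
    """Represent high-bit-set octets (and ':' itself) as ':XX'."""
    out = []
    for b in raw:
        if b >= 0x80 or b == 0x3A:      # 0x3A is ':' (assumed escape rule)
            out.append(":%02X" % b)
        else:
            out.append(chr(b))
    return "".join(out)

def colon_decode(name: str) -> bytes:
    """Invert colon_encode; reject malformed escapes."""
    out = bytearray()
    i = 0
    while i < len(name):
        if name[i] == ":":
            pair = name[i + 1:i + 3]
            if len(pair) != 2 or any(c not in HEX_DIGITS for c in pair):
                raise ValueError("malformed escape at index %d" % i)
            out.append(int(pair, 16))
            i += 3
        else:
            out.append(ord(name[i]))
            i += 1
    return bytes(out)

latin1_name = "caf\u00e9.txt".encode("latin-1")
print(colon_encode(latin1_name))        # caf:E9.txt
print(colon_decode("caf:E9.txt") == latin1_name)
```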

Another possibility would be MIME-word encoding.
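
For reference, a rough Python sketch of what MIME-word (RFC 2047)
encoding of a name looks like, using the standard email package.  The
helper names are hypothetical, and this ignores the folding that the
library applies to long values.

```python
from email.header import Header, decode_header, make_header

def mime_word_encode(name: str) -> str:
    """Return `name` as an RFC 2047 encoded word (pure ASCII)."""
    return Header(name, "utf-8").encode()

def mime_word_decode(encoded: str) -> str:
    """Decode an RFC 2047 encoded word back to a Unicode string."""
    return str(make_header(decode_header(encoded)))

encoded = mime_word_encode("\u30d5\u30a1\u30a4\u30eb.txt")
print(encoded)                    # an '=?utf-8?...?=' token
print(mime_word_decode(encoded))  # the original name again
```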

The Unicode normalization proposed by several of the authors has
(probably solvable) issues, especially since NFC is chosen.  The
problem is that an NFC name may fail to roundtrip *via other
utilities* with a Mac in the middle.  On several occasions I've found
myself looking at two files with the same name on a Linux system
because I copied an NFC file name (as bytes) to the Mac, which
recognized those bytes as a Unicode transformation format, and when an
updated version of the file was copied back, the name went back as
bytes, but of course it was now NFD.  Other utilities are Unicode
conformant and get this right, but I don't think you can count on it
yet.
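
The NFC/NFD mismatch is easy to reproduce in Python.  This sketch only
shows the normalization step; the file name is an arbitrary example, and
it does not touch any real file system.

```python
import unicodedata

# The same visible name in the two normalization forms.
nfc = unicodedata.normalize("NFC", "caf\u00e9.txt")   # precomposed e-acute
nfd = unicodedata.normalize("NFD", nfc)               # 'e' + combining acute

print(nfc.encode("utf-8"))   # b'caf\xc3\xa9.txt'
print(nfd.encode("utf-8"))   # b'cafe\xcc\x81.txt'

# A byte-oriented file system sees two different names, so both can
# coexist in one directory even though they render identically.
print(nfc == nfd)                                 # False
print(unicodedata.normalize("NFC", nfd) == nfc)   # True
```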

Finally, here's a radically different suggestion.  Use a separate
filesystem in a file, such as a zip file, for those files with
unusable names, and provide a utility for browsing it, as well as
extracting file names.  This could implement David-Sarah's suggestion
for automatic extraction of all files as an option.

The UI I envision would be

$ tahoe cp tahoe:mystuff ./
Copying ... done.
There were 17 files with names that cannot be represented on your system.
(B)rowse, (I)nteractively rename, (A)utomatically rename, (Q)uit? Q
16 files were added to undecodable.tahoezip.
1 file was replaced in undecodable.tahoezip.
To access them, use "tahoe zipview undecodable.tahoezip".
$ 

Of course this could all be handled invisibly by a FUSE filesystem,
where FUSE is available.
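
A very rough Python sketch of the zip-file idea, just to show that the
mechanics are simple.  Everything here is hypothetical: the function
names, the "can this name be encoded?" test, and the in-memory dict
standing in for the files being copied.  It is not an existing tahoe
command.

```python
import io
import zipfile

def is_representable(name: str, fs_encoding: str) -> bool:
    """Assumed test: the name is usable if it encodes in fs_encoding."""
    try:
        name.encode(fs_encoding)
    except UnicodeEncodeError:
        return False
    return True

def split_files(files: dict, fs_encoding: str, archive) -> list:
    """Put files with unusable names into a zip; return the other names."""
    direct = []
    with zipfile.ZipFile(archive, "w") as zf:
        for name, data in files.items():
            if is_representable(name, fs_encoding):
                direct.append(name)        # would be written to disk as-is
            else:
                zf.writestr(name, data)    # keeps the full Unicode name
    return direct

files = {"plain.txt": b"a", "\u65e5\u672c\u8a9e.txt": b"b"}
buf = io.BytesIO()                         # stands in for the zip on disk
print(split_files(files, "ascii", buf))    # ['plain.txt']
print(len(zipfile.ZipFile(buf).namelist()), "file(s) in the archive")
```

A browsing utility would then only need to list and extract entries
from that archive under names the user chooses.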

One more point: this problem has been encountered before in ISO 9660.  That
standard has extensions (I believe that these are the so-called "Rock
Ridge extensions") that allow for long and/or internationalized file
names.  Perhaps those conventions (about which I know none of the
details, sorry) could be used.