[tahoe-dev] [Python-Dev] PEP 383 update: utf8b is now the error handler

Fri May 8 10:31:56 PDT 2009

On approximately 5/8/2009 2:31 AM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> Glenn Linderman writes:
>  > On approximately 5/7/2009 8:40 AM, came the following characters from 
>  > the keyboard of Zooko O'Whielacronx:
>  > > Dear Glenn Linderman and SJT:
>  > > 
>  > > You two encoding experts who have volunteered some ideas for Tahoe
>  > > might also be interested in this post that David-Sarah Hopwood just
>  > > sent:
>  > > 
>  > > http://allmydata.org/pipermail/tahoe-dev/2009-May/001717.html
>  > 
>  > 
>  > Regarding this proposal,
> 
> I agree with everything Glenn wrote, except that I disagree with
> 
>  > I think a scheme along these lines is workable, though, but some 
>  > refinements will be needed, and sufficient use cases provided to help 
>  > explain how the various schemes work together, once they are refined, 
>  > and if they do work together.
> 
> While great effort to disambiguate the notation is made, in the end
> Tahoe only controls Tahoe filenames ... but there is no problem with
> them, since they are well-specified as Unicode.  

Well, Stephen, you are correct that there is no problem with Tahoe 
filenames... except that the fact that they are restricted to Unicode, 
and POSIX filenames are not, _is_ a problem.

Tahoe only controls Tahoe filenames, and that will always be true.  But 
Tahoe cannot control POSIX filename rules, nor Windows filename rules, 
nor any other rules that get invented, and that will always be a problem 
for users that have such filenames, and want to use Tahoe for some of 
its perceived benefits.

> I think that the %%
> notation is going to suffer from the problems that ">From" stuffing
> and URL encoding do.  Programs and users are going to get confused
> about whether a string has already been decoded, with at best
> hilarious results.  Of course a sufficiently complex set of rules will
> probably work in theory, but will not be implemented properly too much
> of the time.  Especially not by users.

As presently defined, %% notation has problems, I agree. And if other 
programs get into the act of interpreting the names, and try to 
re-encode them "just like Tahoe would" to obtain the original 
bits of such a name, then it is imperative that Tahoe publish a rigorous 
specification for what it does... or better, also provide an API to 
perform the translations.  To transport names that are not valid 
Unicode over a Unicode-conformant transport mechanism (Web interfaces 
with a Unicode encoding), your idea below of using BASE64 would work 
well for the purpose of such an API.

> The choice of "%" as the "escape" character is unfortunate, for the
> reasons Glenn gives but also because of the collision with URL
> encoding.  Spidering tools and the like regularly produce URL-encoded
> filenames, and this will collide with that.  Eg, as a regular visitor
> to Japanese sites, URL-encoded file names are occasionally produced on
> my system when I save a page.  And if an URL-encoded filename gets
> Tahoe-encoded or vice versa, you'll need to know which order to decode
> in; they do not commute IIUC.  Attempting to upload a file with a
> %%-encoded name is likely to produce bad results on systems that could
> handle the name.

Indeed, % is already used by other encodings, and at best, its use would 
require careful escaping of other % (which I suggested), and at worst 
could be misinterpreted by any programs that think they know what the 
filename encoding is, but don't.  But the latter, which seems to Stephen 
to be particularly anathema, is simply a bug in or misuse of such 
programs.  I question how many programs, faced with apparently 
URL-encoded filenames, actually attempt to URL-decode the name.  Most of 
what I've seen is that the names simply linger, containing their 
URL-encoding, and looking ugly.  It would be appropriate for the 
application downloading such a file to perform the URL-decoding, and 
save a decent name in the filesystem... but not even browsers seem to do 
this as a matter of course, which I consider to be a bug.
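
To make the "has it already been decoded?" hazard concrete, here is a 
minimal Python sketch using the standard urllib.parse functions.  The 
file name is invented for illustration; nothing here is Tahoe code.

```python
from urllib.parse import quote, unquote

# Hypothetical name that legitimately contains a literal "%41".
original = "%41.txt"

# URL-encoding escapes the "%" itself.
encoded = quote(original)

# Decoding exactly once recovers the original name.
once = unquote(encoded)

# A program that wrongly assumes the name is still encoded and
# decodes a second time silently produces a different name.
twice = unquote(once)

print(encoded, once, twice)
```

A program that cannot tell which of these three strings it is holding 
has no safe way to decide whether to decode.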

Of course, there is the situation of a POSIX name that starts as a 
URL-encoded name (which, it should be noted, is strictly Unicode 
conformant), but then is possibly mojibaked by the multiple fs 
encodings used by POSIX programs.  Such mojibaking would not generally 
alter the %, since it is ASCII, but may in some way require Tahoe %% 
encoding when the name is then uploaded to Tahoe.  Then when the Tahoe 
file is downloaded via the Web, there could certainly be a plethora of 
% characters in the resulting name.

For this reason, it might be better to choose a different encoding 
character.  One could speculate about choosing Unicode characters for 
this purpose, but that speculation should be squelched, because they 
would be especially susceptible to encoding problems in POSIX.  The 
encoding character should be a character from the intersection of ASCII 
and legal Windows characters (with hopes that any future system that 
makes character blacklists won't blacklist the chosen character).
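
As a rough way to enumerate the candidates, the following Python 
sketch lists the ASCII punctuation characters that Windows permits in 
file names.  The reserved set is the one Microsoft documents for file 
names (control characters are excluded as well, but none of them are 
punctuation); the function name is hypothetical, and the sketch says 
nothing about which candidate would be the best choice.

```python
import string

# Characters Windows does not allow in file names.
WINDOWS_RESERVED = set('<>:"/\\|?*')

def candidate_escape_chars():
    """ASCII punctuation that is also legal in Windows file names."""
    return [c for c in string.punctuation if c not in WINDOWS_RESERVED]

print("".join(candidate_escape_chars()))
```

Note that % itself survives this filter, so legality alone does not 
rule it out; the collision with URL encoding is a separate objection.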

At this point, it is appropriate to point out that the transcoding 
algorithms between Tahoe and any particular non-Tahoe system need not be 
the same as the transcoding algorithms between Tahoe and any other 
particular non-Tahoe system.

Even in the current proposal for %% encoding, the above was alluded to 
for Windows blacklisted (device) names... %U encoding such names was 
suggested... but only for Windows systems, not for POSIX systems.

Using % as a placeholder for the eventually chosen encoding character, 
it might make sense for the encoding introducer to be %%, followed by a 
set of characters defining the encoding schemes applied, followed by 
one more %, and then the encoded name.

This would cause the proposal's %% encoding, which I called binary, to 
use a prefix of %%b% (for example), and the proposal's %u encoding, 
called Unicode, to use a prefix of %%u%.  If a name needed to be 
encoded using both schemes, and it were first encoded as %%b% and then 
as %%u%, it would produce %%bu% as a prefix.  The other order would 
produce %%ub%.  Other encodings could be invented, and given different 
letters, so up to 26 encodings could be easily differentiated, and the 
Tahoe decoding API could determine the appropriate decodings from the 
prefix.
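
A minimal Python sketch of how a decoder might read such a prefix 
follows.  It assumes % is the escape character and that scheme letters 
are lowercase a-z; the function names are hypothetical, and the actual 
"b" and "u" transformations are not implemented here, only the prefix 
handling.

```python
import re

# Introducer: "%%", one or more scheme letters, then a closing "%".
_PREFIX = re.compile(r"^%%([a-z]+)%")

def split_prefix(name):
    """Return (schemes, rest); schemes are in the order they were applied."""
    m = _PREFIX.match(name)
    if m is None:
        return "", name          # no introducer: name was not encoded
    return m.group(1), name[m.end():]

def decoding_order(name):
    """Schemes must be undone last-applied first."""
    schemes, _ = split_prefix(name)
    return list(reversed(schemes))

print(split_prefix("%%bu%example"))
print(decoding_order("%%bu%example"))
```

For a name encoded first as b and then as u, the prefix is %%bu%, so 
the decoder undoes u first and b second.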

> More positive suggestions:
> 
> If nonetheless you decide to use such an encoding, a similar
> possibility that avoids collision with URL encoding would be to
> represent names unrepresentable on the target file system using the
> old Mac OS convention of representing a high-bit-set octet with ":XX"
> where the Xs are of course uppercase hex digits.  Another possibility
> would be simply to use a leading ":" to signal that all of the
> characters in the name are hex digits.  Of course both imply that a
> file whose name already starts with ":" must be hex-encoded.

: is illegal in Windows filenames, so this specific idea seems doomed, 
at least on that platform (see remarks above regarding multiple encodings).

> Another possibility would be MIME-word encoding.

MIME encoding, while very effective in appropriate contexts, suffers in 
this context from the use of + (conflicts with URL encoding) and / 
(illegal in Windows filenames).  It also suffers from being dramatically 
different than the original text, even for characters that are legal in 
all encodings.  It is hard for humans to recognize the original text by 
looking at the MIME encoded text.
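
Both objections can be seen with Python's standard base64 module.  
This is only an illustration with made-up inputs, and it uses plain 
Base64 rather than the full MIME encoded-word syntax.

```python
import base64

# Two bytes chosen so that the output uses both problem characters.
awkward = base64.b64encode(b"\xfb\xff")
print(awkward)            # b'+/8='

# An ordinary ASCII name is no longer recognizable after encoding.
readable = base64.b64encode(b"report.txt")
print(readable)
```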

> The Unicode normalization proposed by several of the authors has
> (probably solvable) issues, especially since NFC is chosen.  The
> problem is that an NFC name may fail to roundtrip *via other
> utilities* with a Mac in the middle.  On several occasions I've found
> myself looking at two files with the same name on a Linux system
> because I copied an NFC file name (as bytes) to the Mac, which
> recognized those bytes as a Unicode transformation format, and when an
> updated version of the file was copied back, the name goes back as
> bytes, but of course it is now NFD.  Other utilities are Unicode
> conformant and get this right, but I don't think you can count on it
> yet.

The Unicode normalization issues for a specific platform can be solved 
by the Tahoe client programs created for that platform.  In other words, 
NFD names found on Mac OS X can be renormalized to NFC by Tahoe client 
programs, or upon receipt by a Tahoe server that knows it is talking to 
a Mac OS X client.
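
The renormalization step itself is small.  A sketch using Python's 
standard unicodedata module, with a hypothetical helper name and an 
invented file name:

```python
import unicodedata

def to_nfc(name):
    """Renormalize a file name to NFC, e.g. one read from Mac OS X."""
    return unicodedata.normalize("NFC", name)

nfd_name = "e\u0301.txt"   # 'e' followed by COMBINING ACUTE ACCENT (NFD)
nfc_name = "\u00e9.txt"    # precomposed LATIN SMALL LETTER E WITH ACUTE (NFC)

print(nfd_name == nfc_name)
print(to_nfc(nfd_name) == nfc_name)
```

The two names render identically but compare unequal until one of them 
is renormalized, which is how the duplicate files Stephen describes 
come about.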

> Finally, here's a radically different suggestion.  Use a separate
> filesystem in a file, such as a zip file, for those files with
> unusable names, and provide a utility for browsing it, as well as
> extracting file names.  This could implement David-Sarah's suggestion
> for automatic extraction of all files as an option.

This suffers from the same problem as my earlier suggestion of using a 
separate directory, rather than a prefix, for encoded names... the files 
get placed in separate buckets, and globs don't work as uniformly.

> The UI I envision would be
> 
> $ tahoe cp tahoe:mystuff ./
> Copying ... done.
> There were 17 files with names that cannot be represented on yoursystem.
> (B)rowse, (I)nteractively rename, (A)utomatically rename, (Q)uit? Q
> 16 files were added to undecodable.tahoezip.
> 1 file was replaced in undecodable.tahoezip.
> To access them, use "tahoe zipview undecodeable.tahoezip".
> $ 
> 
> Of course this could all be handled invisibly by a FUSE filesystem,
> where FUSE is available.

I'm not familiar with FUSE, but a quick review of its capabilities does 
seem to indicate that a FUSE-for-Tahoe binding might be a nice way of 
implementing Tahoe clients on POSIX-with-FUSE.

> Finally, this problem has been encountered before in ISO 9660.  That
> standard has extensions (I believe that these are the so-called "Rock
> Ridge extensions") that allow for long and/or internationalized file
> names.  Perhaps those conventions (about which I know none of the
> details, sorry) could be used.

I think ISO 9660 limited filenames to A-Z0-9 and 8.3 format.  Rock Ridge 
allows other character sets; I suppose one of the allowable other 
character sets might be Unicode UTF-8, or POSIX bytes, I haven't looked 
that up.  The Joliet (MS) extension allows UCS-2, except for control 
characters and 6 blacklisted characters.

I don't think the problems correspond particularly well.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking