[tahoe-dev] [Python-Dev] PEP 383 update: utf8b is now the error handler

David-Sarah Hopwood david-sarah at jacaranda.org
Fri May 8 06:57:49 PDT 2009


Stephen J. Turnbull wrote:
> Glenn Linderman writes:
>  > On approximately 5/7/2009 8:40 AM, came the following characters from 
>  > the keyboard of Zooko O'Whielacronx:
>  > > Dear Glenn Linderman and SJT:
>  > > 
>  > > You two encoding experts who have volunteered some ideas for Tahoe
>  > > might also be interested in this post that David-Sarah Hopwood just
>  > > sent:
>  > > 
>  > > http://allmydata.org/pipermail/tahoe-dev/2009-May/001717.html
>  > 
>  > Regarding this proposal,
> 
> I agree with everything Glenn wrote, except that I disagree with
> 
>  > I think a scheme along these lines is workable, though, but some 
>  > refinements will be needed, and sufficient use cases provided to help 
>  > explain how the various schemes work together, once they are refined, 
>  > and if they do work together.
> 
> While great effort to disambiguate the notation is made, in the end
> Tahoe only controls Tahoe filenames ... but there is no problem with
> them, since they are well-specified as Unicode.  I think that the %%
> notation is going to suffer from the problems that ">From" stuffing
> and URL encoding do.

I disagree, because only Tahoe clients and servers need to implement
this encoding, and they will use common libraries written by
competent people.

URI encoding is complicated by the different treatment of escapes
in different parts of a URI; the scheme I proposed has no such
complexities.

> The choice of "%" as the "escape" character is unfortunate, for the
> reasons Glenn gives but also because of the collision with URL
> encoding.  Spidering tools and the like regularly produce URL-encoded
> filenames, and this will collide with that.

So use '@' or ':' instead.

[Using '%' would still work unless the URL-encoded filename started
with %% or %U. The potential double-escaping of % as %25 or %0025
(only if a URL-encoded filename is undecodable) might be ugly, but
it would not actually cause any serious problems, and similarly for
double-escaping of any other character used as an escape by another
protocol.]
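
To illustrate the double-escaping point, here is a minimal sketch in
Python. It is not the exact scheme from the earlier post: the "%%"
marker prefix, the two-hex-digit escapes, and the function names
encode_name and decode_name are assumptions made for illustration only.
It shows that a decodable URL-encoded name passes through untouched,
and that '%' is only re-escaped as %25 when the name is undecodable
(or already starts with the marker).

```python
ESC = "%"      # escape character; '@' or ':' would work the same way
MARKER = "%%"  # hypothetical prefix that marks an escaped name


def encode_name(raw: bytes) -> str:
    """Map arbitrary filename bytes to a Unicode name (illustrative only)."""
    try:
        name = raw.decode("utf-8")
    except UnicodeDecodeError:
        pass  # undecodable: fall through to the escaped form
    else:
        # Decodable names are left alone unless they could be mistaken
        # for an escaped name.
        if not name.startswith(MARKER):
            return name
    out = [MARKER]
    for b in raw:
        if b < 0x80 and chr(b) != ESC:
            out.append(chr(b))                # plain ASCII is kept as-is
        else:
            out.append("%s%02X" % (ESC, b))   # escape byte, including '%' itself
    return "".join(out)


def decode_name(name: str) -> bytes:
    """Inverse of encode_name for names it produced."""
    if not name.startswith(MARKER):
        return name.encode("utf-8")
    body = name[len(MARKER):]
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i] == ESC:
            out.append(int(body[i + 1:i + 3], 16))
            i += 3
        else:
            out.extend(body[i].encode("utf-8"))
            i += 1
    return bytes(out)
```

With these assumptions, encode_name(b"a%20b.txt") returns the name
unchanged, while the undecodable b"a%20\xff" becomes "%%a%2520%FF":
ugly, but it round-trips through decode_name without loss.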

> The Unicode normalization proposed by several of the authors has
> (probably solvable) issues, especially since NFC is chosen.  The
> problem is that an NFC name may fail to roundtrip *via other
> utilities* with a Mac in the middle.

This is independent of the undecodable filename issue. Using
unnormalized or NFD-normalized filenames would cause other problems.
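
For readers unfamiliar with the normalization issue, a small Python
sketch of why a name can fail to round-trip byte-for-byte through a
system that stores names in decomposed form (as HFS+ on the Mac does,
roughly NFD). The example name is made up; only the standard
unicodedata module is used.

```python
import unicodedata

nfc_name = "caf\u00e9"  # 'é' as one precomposed code point (NFC)
nfd_name = unicodedata.normalize("NFD", nfc_name)  # 'e' + combining acute

# The two names display identically but differ as code points and as bytes,
# so a byte-wise comparison or lookup treats them as different files.
same_text = nfc_name == nfd_name
same_bytes = nfc_name.encode("utf-8") == nfd_name.encode("utf-8")

# Normalizing back to NFC on the way out restores the original name.
restored = unicodedata.normalize("NFC", nfd_name)
```

Here same_text and same_bytes are both False, and restored equals
nfc_name, which is why normalizing consistently at the Tahoe layer
avoids the mismatch.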

> Finally, here's a radically different suggestion.  Use a separate
> filesystem in a file, such as a zip file, for those files with
> unusable names, and provide a utility for browsing it, as well as
> extracting file names.  This could implement David-Sarah's suggestion
> for automatic extraction of all files as an option.
> 
> The UI I envision would be
> 
> $ tahoe cp tahoe:mystuff ./
> Copying ... done.
> There were 17 files with names that cannot be represented on your system.
> (B)rowse, (I)nteractively rename, (A)utomatically rename, (Q)uit? Q
> 16 files were added to undecodable.tahoezip.
> 1 file was replaced in undecodable.tahoezip.
> To access them, use "tahoe zipview undecodable.tahoezip".
> $ 

This is much more complex to implement, and although it gives users
more options, those options would themselves be another source of
complexity.

-- 
David-Sarah Hopwood ⚥


