[tahoe-dev] File naming on POSIX and Windows clients [was: PEP 383 update: ...]

Sun May 10 01:58:12 PDT 2009

Glenn Linderman writes:

 > This branch of this thread has migrated to tahoe-dev, Stephen, not 
 > python-dev.  So you need to think about their needs if you respond here, 
 > not the needs of Python or python-dev.

Zooko asked for my comments on the protocol for translating from valid
Unicode in Tahoe to whatever on a POSIX system, and the reverse; I
intend to stick to that until there's an explicit suggestion that the
principle that "Tahoe filenames are valid Unicode" be reconsidered.

 > A PU character registry would remove from Tahoe the ability 
 > for Tahoe clients to use PU characters for their own, actual character 
 > purposes, which may also not be acceptable.

Did you read the post where I explained how this could be done in a
way that does *not* interfere with client use of the PUA?  This use of
the PUA would be *entirely* internal to Tahoe (including display of
the file names), and therefore does not encroach on clients' uses.
(OTOH, the clients can "DoS" Tahoe by using whole planes of PU
characters in file names, but this seems kind of unlikely.)

 > >  > I question how many programs, faced with apparently URL-encoded
 > >  > filenames, actually attempt to URL-decode the name.  Most of what
 > >  > I've seen is that the names simply linger, containing their
 > >  > URL-encoding, and looking ugly.
 > >
 > > I decode such on an ad hoc basis all the time.  I suspect other users
 > > in non-Latin locales will do so, too.
 >
 > So if you have an extra layer of encoding, you will either figure out 
 > how it works, and how and when do the appropriate decoding, or you will 
 > do it wrong and be confused.

Yes.  I think the latter case will occur frequently with the
proposed %%/%U/%u encoding, offsetting its useful features to a great
extent.
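
To make the "do it wrong and be confused" case concrete, here is a
minimal sketch using ordinary URL percent-encoding from the Python
standard library.  It is not the proposed %%/%U/%u scheme; the file
names are made up for illustration.  The point is only that a reader
must know exactly how many layers to undo.

```python
from urllib.parse import quote, unquote

name = "\u65e5\u672c\u8a9e.txt"   # a non-Latin file name (illustrative)
once = quote(name)                # one layer of percent-encoding (UTF-8)
twice = quote(once)               # a second layer added by some other tool

# Decoding the right number of times recovers the name.
assert unquote(once) == name
assert unquote(unquote(twice)) == name

# Decoding too few times leaves a name that is still encoded and ugly.
under_decoded = unquote(twice)

# Decoding too many times damages a name that legitimately contains '%'.
literal = "100%25.txt"            # a file really named this, never encoded
over_decoded = unquote(literal)
```

Nothing in the name itself says which of these situations applies, so
the guess has to come from outside knowledge.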

 > If Tahoe enforces a consistent normalization, then it would need a 
 > scheme for dealing with the potential duplications that could result 
 > from file systems that don't.

It does, and it does.  The point of the example is that certain types
of use cases are likely to suffer from this a lot, even if "world
wide" it is extremely uncommon on average.
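
A small sketch of the duplication in question, assuming NFC as the
enforced form (the choice of form and the helper name find_collisions
are assumptions for illustration, not Tahoe's actual code).  Two names
that are distinct on a non-normalizing file system become the same
name once normalized:

```python
import unicodedata

composed = "caf\u00e9.txt"      # 'e with acute' as one code point
decomposed = "cafe\u0301.txt"   # 'e' followed by a combining acute accent

def find_collisions(names, form="NFC"):
    """Group names that become identical under the given normalization."""
    groups = {}
    for n in names:
        groups.setdefault(unicodedata.normalize(form, n), []).append(n)
    # Keep only the normalized names claimed by more than one original.
    return {k: v for k, v in groups.items() if len(v) > 1}

collisions = find_collisions([composed, decomposed, "plain.txt"])
```

Whatever scheme handles the duplicates has to decide what to do with
each group this returns.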

 > The solution for Rock Ridge and Joliet each seem to depend on the 
 > flexibility of the original ISO 9660 system having an "escape" system to 
 > allow alternate names, and each defines a rigid way of using those 
 > alternate names.
 > 
 > Unfortunately, none of the file systems we are talking about do that.  
 > Except, Tahoe _could_.

In fact Tahoe can do it both internally (by adding metadata) and
externally (by convention, e.g., creating a file named TRANS.TBL in the
same directory which maps Unicode names to original bytes).  External
conventions are not terribly reliable, but might work in enough cases.
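
A rough sketch of what such an external table could look like.  The
line layout (hex bytes, a tab, then the Unicode name) and the function
names are hypothetical, chosen only for this example; they are not the
ISO 9660 TRANS.TBL format and not anything Tahoe defines.

```python
def build_table(raw_names, encoding="utf-8"):
    """Map a valid-Unicode name to the original bytes of each file name."""
    table = {}
    for raw in raw_names:
        # errors="replace" always yields valid Unicode but loses
        # information, which is why the original bytes are kept alongside.
        # Two raw names could map to the same key; a real scheme would
        # have to disambiguate them.
        table[raw.decode(encoding, errors="replace")] = raw
    return table

def dump_table(table):
    """Serialize as 'hex-bytes<TAB>unicode-name' lines (hypothetical)."""
    return "".join(f"{raw.hex()}\t{name}\n" for name, raw in table.items())

def load_table(text):
    """Parse the layout written by dump_table back into a dict."""
    table = {}
    for line in text.splitlines():   # breaks on names containing newlines
        hex_bytes, name = line.split("\t", 1)
        table[name] = bytes.fromhex(hex_bytes)
    return table

raw_names = [b"caf\xe9.txt", b"plain.txt"]   # first name is Latin-1 bytes
table = build_table(raw_names)
round_tripped = load_table(dump_table(table))
```

The table only helps while it stays next to the files it describes,
which is the reliability problem with external conventions.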

 > Remember that the %% and %u encoding proposal that we are
 > responding to is intended to avoid the idea of fragile metadata
 > that could get lost;

The problem with the encoding proposal is that we already *have* a
universal encoding, and it's called "Unicode".  If Unicode is not
going to work, inventing a new universal encoding is unlikely to work
very well either.  The best bet is to keep any complexity (such as a
PU character registry) entirely internal to Tahoe, while making the
external interface as simple and unambiguous as possible.

Note that "ambiguity" is not entirely determined by the quality of
your algorithms, but also by the kinds of encoding that are used in
the environment.
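
As an illustration of that last point, a short sketch (the byte string
and the list of encodings are arbitrary choices for the example): the
same bytes decode without error under several encodings, so no
algorithm can tell from the bytes alone which reading was intended.

```python
raw = b"\xc3\xa9t\xc3\xa9.txt"   # some file name as stored on disk

candidates = {}
for enc in ("utf-8", "latin-1", "cp1252", "shift_jis"):
    try:
        candidates[enc] = raw.decode(enc)
    except UnicodeDecodeError:
        # Record None when the bytes are not valid in this encoding.
        candidates[enc] = None

# Every encoding that succeeded offers a valid but different reading.
readings = {text for text in candidates.values() if text is not None}
```

Which reading is right depends on what encodings are actually in use
around the file, not on the decoder.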