[tahoe-dev] File naming on POSIX and Windows clients [was: PEP 383 update: ...]
Glenn Linderman
v+python at g.nevcal.com
Sat May 9 15:24:13 PDT 2009
On approximately 5/9/2009 9:58 AM, came the following characters from
the keyboard of Stephen J. Turnbull:
> Glenn Linderman writes:
>
> > > While great effort to disambiguate the notation is made, in the end
> > > Tahoe only controls Tahoe filenames ... but there is no problem with
> > > them, since they are well-specified as Unicode.
> >
> > Well, Stephen, you are correct that there is no problem with Tahoe
> > filenames... except that the fact that they are restricted to Unicode,
> > and POSIX filenames are not, _is_ a problem.
>
> Sure, but it's a *solved* problem (surrogate-escape coding systems do
> it simply, a PU character registry does it in a more complicated way).
> Tahoe doesn't seem to like those schemes, too bad for Tahoe -- but
> it's not *our* problem in this thread.
>
This branch of this thread has migrated to tahoe-dev, Stephen, not
python-dev. So you need to think about their needs if you respond here,
not the needs of Python or python-dev.
So while one variety of solution to the problem has been proposed and
accepted for Python on POSIX, where invalid Unicode surrogate-escape
sequences have been pronounced acceptable even though they are totally
unreadable, the environment in which Tahoe operates is constrained to
strictly legal Unicode, so Tahoe cannot use that solution. A PU
(private-use) character registry would take away the ability of Tahoe
clients to use PU characters for their own, actual character purposes,
which may also not be acceptable.
So yes, it is a solved problem in the same sense that telling the
peasants to eat cake solved the famine in France.
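To make the strict-Unicode constraint concrete, here is a minimal sketch,
assuming a Python 3 interpreter that provides the "surrogateescape" error
handler described in PEP 383. The sample byte string and the helper name
is_strict_unicode are made up for illustration; nothing here is Tahoe code.

```python
# Illustration only: a POSIX filename whose bytes are not valid UTF-8.
raw = b"caf\xe9.txt"  # 0xE9 is Latin-1 for e-acute, invalid as UTF-8 here

# PEP 383 style decoding: the undecodable byte 0xE9 becomes the lone
# surrogate U+DCE9 instead of raising an error.
name = raw.decode("utf-8", errors="surrogateescape")


def is_strict_unicode(s: str) -> bool:
    """Hypothetical helper: True if s encodes as strict UTF-8."""
    try:
        s.encode("utf-8")  # the default "strict" handler rejects lone surrogates
        return True
    except UnicodeEncodeError:
        return False


# The escaped name round-trips to the original bytes, but only through
# the same error handler.
roundtrip = name.encode("utf-8", errors="surrogateescape")
```

The name survives a round trip inside Python, yet is_strict_unicode(name)
is False, so a store that accepts only strictly legal Unicode would have to
reject or re-encode it.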
> > As presently defined, %% notation has problems, I agree. And if other
> > programs get in the act of interpreting the names, and trying to
> > re-encode them, "just like Tahoe would"
>
> You might have a hope if the intent was to emulate Tahoe. But
> those names may get munged by other transports etc. and people will
> undoubtedly be using ad hoc algorithms.
>
> > I question how many programs, faced with apparently URL-encoded
> > filenames, actually attempt to URL-decode the name. Most of what
> > I've seen is that the names simply linger, containing their
> > URL-encoding, and looking ugly.
>
> I decode such on an ad hoc basis all the time. I suspect other users
> in non-Latin locales will do so, too.
>
So if you have an extra layer of encoding, you will either figure out
how it works, and how and when to do the appropriate decoding, or you will
do it wrong and be confused.
> > At this point, it is appropriate to point out that the transcoding
> > algorithms between Tahoe and any particular non-Tahoe system need not be
> > the same as the transcoding algorithms between Tahoe and any other
> > particular non-Tahoe system.
>
> I don't think you want to go there. That will confuse the heck out of
> multihomed users, who would at least like to see the same mojibake on
> different systems.
>
Every system has its quirks... that is why this thread even exists. It
is not clear that it is beneficial to encode, on all types of systems,
names that are unacceptable to only one type of system (encoding to the
lowest common denominator). For
example, as far as I know, there is no reason a file named "prn.foo"
should be encoded on a Mac, but it certainly needs to be encoded on Windows.
It may be that the encoding _system_ can be uniform, at least on large
groups of platforms, such that the same decoding algorithm will work on
all systems, but it is probably true that the choice of what names must
be encoded, and what names need not be, is platform dependent.
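A rough sketch of that split, with the decision about which names need
encoding kept per-platform. The function name needs_encoding and the
platform labels are hypothetical, and the rules are deliberately
simplified: Windows is checked only for its reserved device names and
forbidden punctuation, and every other platform only for "/" and NUL.

```python
# Windows reserved device names; they stay reserved with an extension,
# so "prn.foo" is affected just like "PRN".
WINDOWS_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)
WINDOWS_FORBIDDEN_CHARS = set('<>:"/\\|?*')


def needs_encoding(name: str, platform: str) -> bool:
    """Hypothetical check: must this name be encoded on this platform?"""
    if platform == "windows":
        # Compare the part before the first dot, case-insensitively.
        stem = name.split(".", 1)[0].strip().upper()
        return stem in WINDOWS_RESERVED or any(
            c in WINDOWS_FORBIDDEN_CHARS for c in name
        )
    # Simplified rule for other platforms: only "/" and NUL are rejected.
    return "/" in name or "\0" in name
```

With these rules, needs_encoding("prn.foo", "windows") is true while
needs_encoding("prn.foo", "mac") is false, and whatever encoding is then
applied could still be the same scheme on both.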
> > > The Unicode normalization proposed by several of the authors has
> > > (probably solvable) issues, especially since NFC is chosen. The
> > > problem is that an NFC name may fail to roundtrip *via other
> > > utilities* with a Mac in the middle. On several occasions I've found
> > > myself looking at two files with the same name on a Linux system
>
> > The Unicode normalization issues for a specific platform can be solved
> > by the Tahoe client programs created for that platform. In other words,
> > NFD names found on Mac OS X can be renormalized to NFC by Tahoe client
> > programs, or upon receipt by a Tahoe server that knows it is talking to
> > a Mac OS X client.
>
> That's true, but it has nothing to do with my example, which shows how
> Tahoe could encounter two names that are identical as Unicode but
> different in POSIX in the same client directory.
>
I'm not a Mac user. If Mac consistently renormalizes to NFD, then
within the Mac, it should be consistent, and could be returned to NFC
when interfacing to Tahoe. But yes, if the Mac talks to a filesystem
that doesn't enforce a consistent Unicode normalization (POSIX), then
that file system could have both styles of normalization... but then
that file system could have both styles of normalization anyway.
If Tahoe enforces a consistent normalization, then it would need a
scheme for dealing with the potential duplications that could result
from file systems that don't.
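As a minimal sketch of detecting such duplications, the following groups
directory entries by their NFC form using Python's standard unicodedata
module. The function name find_normalization_collisions is made up for
illustration, and what to do with a collision once found is left open.

```python
import unicodedata
from collections import defaultdict


def find_normalization_collisions(names):
    """Group names by NFC form; return the groups with more than one member."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].append(name)
    return {nfc: members for nfc, members in groups.items() if len(members) > 1}


# Two spellings of the same visible name: precomposed (NFC) and
# decomposed (NFD, the style described above for the Mac).
nfc_name = "caf\u00e9"
nfd_name = "cafe\u0301"
collisions = find_normalization_collisions([nfc_name, nfd_name, "plain"])
```

Here collisions holds one entry, keyed by the NFC spelling and listing both
original spellings, which is the case a normalizing Tahoe would need a
policy for.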
> > The [zipfile] idea suffers from the same problem as my earlier
> > suggestion of using a separate directory, rather than a prefix, for
> > encoded names... the files get placed in separate buckets, and
> > globs don't work as uniformly.
>
> It's not clear that users will generally want globs to work on broken
> names. If they do, of course a method for "exploding" the file into
> the current directory with some sort of names would be needed. The
> advantage of the zipfile over a directory is precisely that most
> programs that recurse into subdirectories won't do that with the
> zipfile.
>
Clearly that is for the Tahoe users to decide. Encoded names are not
necessarily "broken". I was only pointing out the con. The zip file
idea may or may not be an acceptable solution for them. If it is,
though, so would an extra directory that has the file names encoded
somehow, and the extra directory would be simpler to deal with, having
no need to do unarchiving to access it.
> > I think ISO 9660 limited filenames to A-Z0-9 and 8.3 format. Rock Ridge
> > allows other character sets; I suppose one of the allowable other
> > character sets might be Unicode UTF-8, or POSIX bytes, I haven't looked
> > that up. The Joliet (MS) extension allows UCS-2, except for control
> > characters and 6 blacklisted characters.
> >
> > I don't think the problems correspond particularly well.
>
> Maybe not, but that doesn't mean the solutions won't. This is a hard
> problem, and it's not a new one. Hope springs eternal, but I think it
> unlikely that we'll invent a new scheme that *really works* after all
> these years. At the very least we need to see how people solved
> similar or related problems in the past.
>
The solutions for Rock Ridge and Joliet each seem to depend on the
flexibility of the original ISO 9660 system having an "escape" system to
allow alternate names, and each defines a rigid way of using those
alternate names.
Unfortunately, none of the file systems we are talking about do that.
Except, Tahoe _could_. Remember that the %% and %u encoding proposal
that we are responding to is intended to avoid the idea of fragile
metadata that could get lost; an earlier Tahoe proposal was to keep both
a translated or encoded name (of some sort) together with the original
name from the original system in metadata. As a reminder, the con of
that system is that once the file is processed on and replaced by a
different system, the original name would be lost, and the original
system might not recognize the translated or encoded name. Given that
the original name was illegal Unicode, that may or may not be perceived
as a catastrophe; it appears that Tahoe users are divided over the
issue, some preferring to keep (or translate back to) the original name,
and others preferring to convert to Unicode, and keep the name Unicode
thenceforth.
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking