[tahoe-dev] File naming on POSIX and Windows clients [was: PEP 383 update: ...]
Stephen J. Turnbull
stephen at xemacs.org
Sat May 9 09:58:44 PDT 2009
Glenn Linderman writes:
> > While great effort to disambiguate the notation is made, in the end
> > Tahoe only controls Tahoe filenames ... but there is no problem with
> > them, since they are well-specified as Unicode.
>
> Well, Stephen, you are correct that there is no problem with Tahoe
> filenames... except that the fact that they are restricted to Unicode,
> and POSIX filenames are not, _is_ a problem.
Sure, but it's a *solved* problem (surrogate-escape coding systems
solve it simply; a registry of private-use (PU) characters solves it
in a more complicated way). Tahoe doesn't seem to like those schemes.
Too bad for Tahoe, but it's not *our* problem in this thread.
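
For concreteness, here is a minimal sketch of the surrogate-escape
approach, using the "surrogateescape" error handler that PEP 383 adds
to Python 3. The helper names to_text and to_bytes are made up for
illustration, and UTF-8 is only an assumed filesystem encoding; this
is not anything Tahoe does.

```python
# Round-trip a POSIX filename that is not valid UTF-8 through a str,
# using the "surrogateescape" error handler from PEP 383.

RAW_NAME = b"caf\xe9.txt"  # Latin-1 bytes; 0xE9 is invalid as UTF-8 here


def to_text(name_bytes, encoding="utf-8"):
    """Decode bytes; each undecodable byte 0xNN becomes U+DCNN."""
    return name_bytes.decode(encoding, "surrogateescape")


def to_bytes(name_text, encoding="utf-8"):
    """Encode text; each lone surrogate U+DCNN becomes byte 0xNN again."""
    return name_text.encode(encoding, "surrogateescape")


text = to_text(RAW_NAME)       # 'caf\udce9.txt'
roundtrip = to_bytes(text)     # identical to RAW_NAME

# The escaped name is not well-formed Unicode, so a strict encoder
# refuses it. That is why such names cannot be stored as-is in a
# system that insists on valid Unicode.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The original bytes survive the round trip, but only for programs
that agree to use the same error handler on both sides.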
> As presently defined, %% notation has problems, I agree. And if other
> programs get in the act of interpreting the names, and trying to
> re-encode them, "just like Tahoe would"
You might have a hope if the intent was to emulate Tahoe. But
those names may get munged by other transports etc. and people will
undoubtedly be using ad hoc algorithms.
> I question how many programs, faced with apparently URL-encoded
> filenames, actually attempt to URL-decode the name. Most of what
> I've seen is that the names simply linger, containing their
> URL-encoding, and looking ugly.
I decode such names on an ad hoc basis all the time. I suspect other
users in non-Latin locales will do so, too.
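
A rough sketch of what such ad hoc decoding looks like, and why two
people doing it independently can end up with different names. It
uses urllib.parse.unquote from the Python 3 standard library; the
sample names are invented, and real %-escaped names may follow rules
other than plain URL encoding.

```python
# Ad hoc decoding of a percent-escaped filename. The result depends
# on which charset the decoder assumes for the escaped bytes.
from urllib.parse import unquote

escaped = "%C3%A9t%C3%A9.txt"

as_utf8 = unquote(escaped)                       # assumes UTF-8 (default)
as_latin1 = unquote(escaped, encoding="latin-1")  # assumes Latin-1

# A name escaped from UTF-8 bytes for a CJK filename.
cjk = unquote("%E6%97%A5%E6%9C%AC%E8%AA%9E.txt")

# Same escaped input, two different "decoded" filenames.
names_differ = as_utf8 != as_latin1
```

With the UTF-8 assumption the first name decodes to "été.txt"; with
the Latin-1 assumption it turns into mojibake. Neither decoder can
tell from the name alone which guess was right.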
> At this point, it is appropriate to point out that the transcoding
> algorithms between Tahoe and any particular non-Tahoe system need not be
> the same as the transcoding algorithms between Tahoe and any other
> particular non-Tahoe system.
I don't think you want to go there. That will confuse the heck out of
multihomed users, who would at least like to see the same mojibake on
different systems.
> > The Unicode normalization proposed by several of the authors has
> > (probably solvable) issues, especially since NFC is chosen. The
> > problem is that an NFC name may fail to roundtrip *via other
> > utilities* with a Mac in the middle. On several occasions I've found
> > myself looking at two files with the same name on a Linux system
> The Unicode normalization issues for a specific platform can be solved
> by the Tahoe client programs created for that platform. In other words,
> NFD names found on Mac OS X can be renormalized to NFC by Tahoe client
> programs, or upon receipt by a Tahoe server that knows it is talking to
> a Mac OS X client.
That's true, but it has nothing to do with my example, which shows how
Tahoe could encounter two names that are identical as Unicode but
different in POSIX in the same client directory.
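
To make the example concrete, here is a small sketch using Python's
unicodedata module. The function find_collisions is a hypothetical
helper, not part of Tahoe; it groups names by their NFC form to show
where two distinct POSIX names would collapse into one.

```python
# Two filenames that are canonically equivalent as Unicode but are
# different byte strings, and so can coexist in one POSIX directory.
import unicodedata
from collections import defaultdict

nfc_name = "caf\u00e9.txt"    # precomposed e-acute (NFC)
nfd_name = "cafe\u0301.txt"   # 'e' + combining acute (NFD, as on a Mac)

same_bytes = nfc_name.encode("utf-8") == nfd_name.encode("utf-8")
same_after_nfc = unicodedata.normalize("NFC", nfd_name) == nfc_name


def find_collisions(names):
    """Return {nfc_form: [original names]} for forms seen more than once."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].append(name)
    return {key: found for key, found in groups.items() if len(found) > 1}


collisions = find_collisions([nfc_name, nfd_name, "README"])
```

Here same_bytes is False while same_after_nfc is True, so a client
that normalizes to NFC has to decide what to do with the second file
rather than silently overwrite the first.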
> The [zipfile] idea suffers from the same problem as my earlier
> suggestion of using a separate directory, rather than a prefix, for
> encoded names... the files get placed in separate buckets, and
> globs don't work as uniformly.
It's not clear that users will generally want globs to work on broken
names. If they do, of course a method for "exploding" the file into
the current directory with some sort of names would be needed. The
advantage of the zipfile over a directory is precisely that most
programs that recurse into subdirectories won't do that with the
zipfile.
> I think ISO 9660 limited filenames to A-Z0-9 and 8.3 format. Rock Ridge
> allows other character sets; I suppose one of the allowable other
> character sets might be Unicode UTF-8, or POSIX bytes, I haven't looked
> that up. The Joliet (MS) extension allows UCS-2, except for control
> characters and 6 blacklisted characters.
>
> I don't think the problems correspond particularly well.
Maybe not, but that doesn't mean the solutions won't. This is a hard
problem, and it's not a new one. Hope springs eternal, but I think it
unlikely that we'll invent a new scheme that *really works* after all
these years. At the very least we need to see how people solved
similar or related problems in the past.