[tahoe-dev] [Python-Dev] PEP 383 and Tahoe [was: GUI libraries]

Sun May 3 02:32:38 PDT 2009

This is a resend of a reply which bounced off tahoe-dev because I
am not a member.  Please keep me in CC, at least for now.

Zooko O'Whielacronx writes:

 > However, it is moot because Tahoe is not a new system. It is currently
 > at v1.4.1, has a strong policy of backwards-compatibility, and already
 > has lots of data, lots of users, and programmers building on top of
 > it.

Cool!

Question: is there a way to negotiate versions, or better yet, features?

 > I see I'm not explaining the Tahoe requirements clearly. It's probably
 > that I'm not understanding them clearly myself.

Well, it's a high-dimensional problem.  Keeping track of all the
variables is hard.  That's why something like PEP 383 can be important
to you even though it's only a partial solution; it eliminates one
variable.

 > Suppose you have run "tahoe cp -r myfiles/ tahoe:" on a Linux system
 > and then you inspect the files in the Tahoe filesystem, such as by
 > examining the web interface [1] or by running "tahoe ls", either of
 > which you could do either from the same machine where you ran "tahoe
 > cp" or from a different machine (which could be using any operating
 > system). We have the following requirements about what ends up in your
 > Tahoe directory after that cp -r.

Whoa! Slow down!  Where's "my" "Tahoe directory"?  Do you mean the
directory listing?  A copy to whatever system I'm on?  The bytes that
the Tahoe host has just loaded into a network card buffer to tell me
about it?  The bytes on disk at the Tahoe host?  You'll find it a lot
easier to explain things if you adopt a precise, consistent terminology.

 > Requirement 1 (unicode):  Each filename that you see needs to be valid
 > unicode

What does "see" mean?  In directory listings?  Under what
circumstances, if any, can what I see be different from what I get?

 > Requirement 2 (faithful if unicode):  For each filename (byte string)
 > in your myfiles directory,

My local myfiles directory, or my Tahoe myfiles directory?

 > if that bytestring is the valid encoding of some string in your
 > stated locale,

Who stated the locale?  How?  Are you referring to what
getfilesystemencoding returns?  This is a "(unicode) string", right?

 > then the resulting filename in Tahoe is that (unicode)
 > string. Nobody ever doesn't want this, right?  Well, maybe some
 > people don't want this sometimes, [...]. However, what's the
 > alternative?  Guessing that their locale shouldn't be set to
 > latin-1 and instead decoding their bytes some other way?

Sure.  Emacsen do that, you know.  Of course it's hard to guess
something else if ISO-8859/1 is the preferred encoding, but it does
happen.  This probably cannot be done accurately enough for Tahoe,
though.

 > It seems like we're not going to do better than
 > requirement 2 (faithful if unicode).
 > 
 > Requirement 3 (no file left behind):  For each filename (byte string)
 > in your myfiles directory, whether or not that byte string is the
 > valid encoding of anything in your stated locale, then that file will
 > be added into the Tahoe filesystem under *some* name (a good candidate
 > would be mojibake, e.g. decode the bytes with latin-1, but that is not
 > the only possibility).

That's not even a possibility, actually.  Technically, Latin-1 has a
"hole" from U+0080 to U+009F.  You need to add the C1 controls to fill
in that gap.  (I don't think it actually matters in practice,
everybody seems to implement ISO-8859/1 as though it contained the
control characters ... except when detecting encodings ... but it pays
to be precise in these things ....)
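
To make the "in practice" point concrete, here is a minimal sketch, assuming Python 3. The variable names are only for illustration. It checks that Python's own 'latin-1' codec behaves as though the C1 controls were part of the encoding, so every byte string decodes and round-trips.

```python
import unicodedata

# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Python's 'latin-1' codec maps byte N to code point U+00NN, with no holes.
decoded = all_bytes.decode('latin-1')
code_points = [ord(c) for c in decoded]

# The "hole" 0x80-0x9F comes out as the C1 control characters (category Cc).
c1_categories = {unicodedata.category(c) for c in decoded[0x80:0xA0]}

# The decoding is invertible, which is why it never loses a file name.
round_tripped = decoded.encode('latin-1')
```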

 > Now already we can say that these three requirements mean that there
 > can be collisions -- for example a directory could have two entries,
 > one of which is not a valid encoding in the locale, and whatever
 > unicode string we invent to name it with in order to satisfy
 > requirements 3 (no file left behind) and 1 (unicode) might happen to
 > be the same as the (correctly-encoded) name of the other file.

This is false with rather high probability, but you need some extra
structure to deal with it.  First, claim the Unicode private planes
for Tahoe.  Then allocate characters from the private planes on demand
as encountered, *including* such characters encountered in external
file names to be stored in Tahoe *and* the surrogates used by PEP
383.  "Display names" using these private characters would be valid
Unicode, but not very useful.  However, an algorithmically generated
font (like the 4-hex-digit-square used to give a glyph to unknown code
points in the BMP) could be used by those who care.

Also store mappings from (system encoding, UTF-8b representation) to
private char and back.  For simplicity, that could be global on your
server (IIRC, there are at least two private planes up there, so you'd
need to run into almost 128Ki *unique* such characters to run out).

I guess you'd be subject to a DoS attack where somebody decided to map
all of 80000-odd CNS characters into private space, and then write
80000 files, each with a different 1-character name ....

Note that Martin does *not* do this in PEP 383 because PEP 383 only
cares about the semantics that a filename read from a directory can be
used to access the file associated with it in that directory.  For
that, a private, non-Unicode encoding is perfectly acceptable.  But
you want valid Unicode.  This scheme gives it to you.

The registry of characters is somewhat unpleasant, but it does allow
you to detect filenames that are the same reliably.
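
A rough sketch of such a registry follows, assuming Python 3.1 or later so that PEP 383 is available as the 'surrogateescape' error handler. Everything here (PrivateCharRegistry, to_display_name, to_bytes) is a hypothetical name invented for illustration, not Tahoe code, and the registry is kept in memory only.

```python
# Hypothetical sketch of a Tahoe-private character registry; not Tahoe code.

# Supplementary Private Use Areas A and B (planes 15 and 16).
PRIVATE_RANGES = [(0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def needs_private_char(ch):
    """True for PEP 383 escape surrogates and for private-plane characters."""
    cp = ord(ch)
    if 0xDC80 <= cp <= 0xDCFF:
        return True
    return any(lo <= cp <= hi for lo, hi in PRIVATE_RANGES)


class PrivateCharRegistry:
    def __init__(self):
        self._to_private = {}    # (encoding, original char) -> private char
        self._from_private = {}  # private char -> (encoding, original char)
        self._unused = (cp for lo, hi in PRIVATE_RANGES
                        for cp in range(lo, hi + 1))

    def allocate(self, encoding, ch):
        """Return the private char for (encoding, ch), allocating on demand."""
        key = (encoding, ch)
        if key not in self._to_private:
            private = chr(next(self._unused))  # StopIteration once exhausted
            self._to_private[key] = private
            self._from_private[private] = key
        return self._to_private[key]

    def lookup(self, private):
        return self._from_private[private]


def to_display_name(raw, encoding, registry):
    """Decode bytes to valid Unicode, replacing escapes with private chars."""
    text = raw.decode(encoding, 'surrogateescape')
    return ''.join(registry.allocate(encoding, ch) if needs_private_char(ch)
                   else ch for ch in text)


def to_bytes(display_name, encoding, registry):
    """Invert to_display_name using the registry."""
    chars = [registry.lookup(ch)[1] if needs_private_char(ch) else ch
             for ch in display_name]
    return ''.join(chars).encode(encoding, 'surrogateescape')
```

Because the same (encoding, character) pair always maps to the same private character, two names are equal exactly when their display names are equal, which is the point of keeping the registry.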

 > Possible Requirement 4 (faithful bytes if not unicode, a.k.a.
 > "round-tripping"):

PEP 383 gives you this, but you must store the encoding used for each
such file name.
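
For illustration, a minimal sketch assuming Python 3.1 or later, where PEP 383 is exposed as the 'surrogateescape' error handler. The byte string is made up. It shows both the round trip and why the encoding has to be remembered.

```python
# A name that is partly valid UTF-8 ('é' as C3 A9) followed by a stray byte.
raw = b'\xc3\xa9\xff'

# PEP 383: undecodable bytes become lone surrogates U+DC80..U+DCFF.
name = raw.decode('utf-8', 'surrogateescape')

# Encoding with the SAME codec and handler restores the original bytes.
same_codec = name.encode('utf-8', 'surrogateescape')

# Encoding with a different codec silently yields different bytes,
# which is why the encoding used must be stored alongside the name.
other_codec = name.encode('latin-1', 'surrogateescape')

# The escaped name is not valid Unicode: a strict encode refuses it.
try:
    name.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```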

 > One reason to be skeptical is that about a third of the Russian
 > files will happen to decode cleanly as shift-jis anyway, and will
 > therefore come out as something entirely different if the target
 > filesystem's encoding is something other than shift-jis.

The only way to handle this is to store the encoding used to convert
to Unicode as part of *every* file's metadata.  This could also be
used in Tahoe to warn the user that the current system encoding does
not match the alleged_encoding used to make the backup.  Some users
might prefer to use the alleged_encoding on restore.
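
A sketch of what that check might look like on restore, assuming Python 3 and a metadata dict with "filename" and "alleged_encoding" keys. restore_name and its prefer_alleged parameter are hypothetical names for illustration, not an existing Tahoe API.

```python
import codecs


def restore_name(entry, system_encoding, prefer_alleged=False):
    """Return (bytes to write, mismatch flag) for one stored entry.

    `entry` is assumed to hold the Unicode 'filename' and the
    'alleged_encoding' recorded when the file was backed up.
    """
    alleged = entry['alleged_encoding']
    # Compare canonical codec names so 'UTF8' and 'utf-8' count as equal.
    mismatch = (codecs.lookup(alleged).name
                != codecs.lookup(system_encoding).name)
    target = alleged if (mismatch and prefer_alleged) else system_encoding
    # May still raise UnicodeEncodeError if `target` cannot represent
    # the name; a real tool would have to handle that case.
    return entry['filename'].encode(target, 'surrogateescape'), mismatch


entry = {'filename': 'файл', 'alleged_encoding': 'koi8-r'}
as_system, warn = restore_name(entry, 'UTF8')
as_alleged, _ = restore_name(entry, 'UTF8', prefer_alleged=True)
```

The caller can use the mismatch flag to warn the user, and a user who knows the files came from a KOI8-R home directory can ask for the alleged encoding instead.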

 > But an even worse problem -- the show-stopper for me -- is that I
 > don't want what Tahoe shows when you do "tahoe ls" or view it in a
 > web browser to differ from what it writes out when you do "tahoe cp
 > -r tahoe: newfiles/".

But as a requirement, that's incoherent.  What you are "seeing" is
Unicode, what it will write out is bytes.  That means that if multiple
locales are in use on both the backup and restore systems, and the
nominal system encodings are different, people whose personal default
locales are not the same as the system's will see what they expect on
the backup system (using system ls), mojibake on Tahoe (using tahoe
ls), and *different* mojibake on the restore system (system ls,
again).

Note that "use Tahoe, not system, ls" doesn't help at all (unless the
weirdo has learned to read mojibake, which actually does happen, but
it's not worth betting on).

How likely is that?  Hate to tell you this: if you need the "unknown
bytes" scheme at all, this scenario is *extremely* likely.  How do you
think that KOI8-R got into a directory on a Shift-JIS system in the
first place?  Yup, a Russian visiting professor in Tokyo who set his
personal locale to ru_RU.KOI8-R wrote it there.  And he's very likely
to have the same personal locale on a very up-to-date system with a
UTF-8 system encoding when he gets back to Moscow.  Bingo! it's
mojibake all the way to Moscow.

 > Now about the "metadata" part which is separate from the filename
 > itself. I have another requirement:
 > 
 > Requirement 5 (no loss of information):  I don't want Tahoe to destroy
 > information -- every transformation should be (in principle)
 > reversible by some future computer-augmented archaeologist. For
 > example, if a bytestring decodes cleanly with the locale's suggested
 > encoding, and we use the resulting unicode as the filename, then we
 > also store the original byte string in the metadata since we don't
 > know if the locale's suggested encoding was good.

UTF-8b would be just as good for storing the original bytestring, as
long as you keep the original encoding.  It's actually probably
preferable if PEP 383 can be assumed to be implemented in the versions
of Python you use.

 > This allows the later invention of a tool

It will be called "Emacs", by the way.<wink>

 > which shows the user what the filename would
 > have been with other encodings and let the user choose one that makes
 > sense.

 > To copy an entry from a local filesystem into Tahoe:
 > 
 > 1. On Windows or Mac read the filename with the unicode APIs.
 > Normalize the string with filename = unicodedata.normalize('NFC',
 > filename). Leave the "original_bytes" key and the "failed_decode" flag
 > out of the metadata.

NFD is probably better for fuzzy matching and display on legacy
terminals.
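
To show the difference, a small sketch assuming Python 3 and the standard unicodedata module. The sample name is arbitrary.

```python
import unicodedata

composed = 'caf\u00e9'  # 'é' as one precomposed code point (NFC form)

# NFD splits 'é' into 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize('NFD', composed)

# With NFD, fuzzy matching can simply drop the combining marks.
base = ''.join(c for c in decomposed if not unicodedata.combining(c))

# Both forms normalize back to the same NFC string.
recomposed = unicodedata.normalize('NFC', decomposed)
```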

 > 2. On Linux or Solaris read the filename with the string APIs, and
 > store the result in the "original_bytes" part of the metadata. Call
 > sys.getfilesystemencoding() to get an alleged_encoding. Then, call
 > bytes.decode(alleged_encoding, 'strict') to try to get a unicode
 > object.
 > 
 > 2.a. If this decoding succeeds then normalize the unicode filename
 > with filename = unicodedata.normalize('NFC', filename), store the
 > resulting filename and leave the "failed_decode" flag out of the
 > metadata.

Per the koi8-lucky example, you don't know if it succeeded for the
right reason or the wrong reason.  You really should store the
alleged_encoding used in the metadata, always.

Note that you should *also* store the failed_decode flag, because the
presence of multiple failed_decode flags is a very strong indication that
some of the users had default encoding != system encoding.  If you use
the scheme I propose above, of course you have the same information
by scanning the file name for Tahoe-only private use characters, but
that would be relatively expensive.
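
Putting those two suggestions together with the quoted step 2, a hedged sketch of the import side, assuming Python 3. import_entry is a hypothetical helper name; the metadata keys follow the ones used in the quoted proposal, and both alleged_encoding and failed_decode are always stored.

```python
import sys
import unicodedata


def import_entry(original_bytes, alleged_encoding=None):
    """Build the metadata for one file name read as bytes (step 2)."""
    if alleged_encoding is None:
        alleged_encoding = sys.getfilesystemencoding()
    try:
        # Step 2.a: strict decode with the alleged encoding, then NFC.
        filename = unicodedata.normalize(
            'NFC', original_bytes.decode(alleged_encoding, 'strict'))
        failed_decode = False
    except UnicodeDecodeError:
        # Step 2.b: fall back to Latin-1, which accepts every byte.
        filename = original_bytes.decode('latin-1')
        failed_decode = True
    return {
        'filename': filename,
        'original_bytes': original_bytes,
        'alleged_encoding': alleged_encoding,  # stored always
        'failed_decode': failed_decode,        # stored always
    }


good = import_entry(b'caf\xc3\xa9', 'utf-8')  # valid UTF-8
bad = import_entry(b'caf\xe9', 'utf-8')       # Latin-1 bytes, invalid UTF-8
```

Here both entries end up with the same Unicode filename but different original_bytes and different failed_decode flags, which is exactly the collision case discussed in step 3.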

 > 2.b. If this decoding fails, then we decode it again with
 > bytes.decode('latin-1', 'strict'). Do not normalize it. Store the
 > resulting unicode object into the "filename" part, set the
 > "failed_decode" flag to True. This is mojibake!

Not necessarily.  Most ISO-8859/X names will fail to decode if the
alleged_encoding is UTF-8, for example, but many (even for X != 1)
will be correctly readable because of the policy of trying to share
code points across Latin-X encodings.  Certainly ISO-8859/1 (and
much of ISO-8859/15) will be correct.
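
A few concrete cases, as a sketch assuming Python 3 and its bundled ISO-8859 codecs. The sample strings are arbitrary. Each name is encoded in its "real" encoding and then run through the forced Latin-1 decode of step 2.b.

```python
def forced_latin1(name, real_encoding):
    """Encode with the real encoding, then decode as step 2.b would."""
    return name.encode(real_encoding).decode('latin-1')


# Shared code points survive: 'é' is 0xE9 in Latin-1, Latin-2 and Latin-9.
from_latin2 = forced_latin1('caf\u00e9', 'iso-8859-2')
from_latin9 = forced_latin1('caf\u00e9', 'iso-8859-15')

# Code points that differ become mojibake.
# Latin-2 'ő' (U+0151) is byte 0xF5, which Latin-1 reads as 'õ' (U+00F5).
hungarian = forced_latin1('\u0151', 'iso-8859-2')
# Latin-9 '€' (U+20AC) is byte 0xA4, which Latin-1 reads as '¤' (U+00A4).
euro = forced_latin1('\u20ac', 'iso-8859-15')
```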

 > 3. (handling collisions)  In either case 2.a or 2.b the resulting
 > unicode string may already be present in the directory. If so, check
 > the failed_decode flags on the current entry and the new entry. If
 > they are both set or both unset then the new entry overwrites the old
 > entry -- they had the same name.

If both are set, you're OK, because you are forcing ISO-8859/1.  If
both are unset, however, you don't know for sure because
alleged_encoding is not necessarily a constant.

 > To copy an entry from Tahoe into a local filesystem:
 > 
 > Always use the Python unicode API. The original_bytes field and the
 > failed_decode field in the metadata are not consulted.
 > 
 > Now a question for python-dev people: could utf-8b or PEP 383 be
 > useful for requirements like the four requirements listed above?  If
 > not, what requirements does PEP 383 help with?

By giving you a standard, invertible way to represent anything that
the OS can throw at you, it helps with all of them.

 > I'm not sure that it can help if you are going to store the results
 > of your os.listdir() persistently or if you are going to transmit
 > them over a network.  Indeed, using the results that way could lead
 > to unpleasant surprises.

No more than any other system for giving a canonical Unicode spelling
to the results of an OS call.
