[tahoe-dev] [Python-Dev] PEP 383 and Tahoe [was: GUI libraries]
Stephen J. Turnbull
stephen at xemacs.org
Tue May 5 02:04:22 PDT 2009
This is a resend of a personal reply to Zooko. Please keep me in CC,
at least for now.
I'm going offlist because I'm not on tahoe-dev, and I don't think this
is on-topic for Python-Dev any more.
Zooko O'Whielacronx writes:
> > Question: is there a way to negotiate versions, or better yet,
> > features?
>
> For the peer-to-peer protocol there is, but the persistent storage is
> an inherently one-way communication.
Right, I understand that.
> But, the writer can write down optional information which will be
> invisible to readers that don't know to look for it, by adding it
> into the "metadata" dictionary.
This is what I'm interested in. I'm not sure how to exploit it yet.
> A new version of Tahoe writing entries like this is constrained to
> making the primary key (the filename) be a valid unicode string (if it
> wants older Tahoe clients to be able to read the directory at all).
OK, understood.
> However, it is not constrained about what new keys it may add to the
> "metadata" dict, which is where we propose to add the "failed_decode"
> flag and the "original_bytes".
Right.
> > That's why something like PEP 383 can be important
> > to you even though it's only a partial solution; it eliminates one
> > variable.
>
> Would that it were so! The possibility that PEP 383 could help me or
> others like me is why I am trying so hard to explain what kind of help
> I need. :-)
Well, I'll try to focus on that issue below.
> Okay here's some more detail.
It makes sense, but I'm not really so much interested in internals of
Tahoe. Rather, what's important for this discussion is what appears
at the interfaces (user-client, client-server).
> > > Requirement 1 (unicode): Each filename that you see needs to be valid
> > > unicode
> >
> > What does "see" mean? In directory listings?
>
> Yes, either with "tahoe ls", with a FUSE plugin, or with the web UI.
> Remove the trailing "?t=json" from the URL above to see an example.
>
> > Under what
> > circumstances, if any, can what I see be different from what I get?
>
> This is a good question! In the previous iteration of the Tahoe
> design, you could sometimes get something from "tahoe cp" which is
> different from what you saw with "tahoe ls". In the current design
> -- http://allmydata.org/trac/tahoe/ticket/534#comment:66 , this is
> no longer the case, because we abandon the requirement to have
> "round-trip fidelity of bytes".
But that means that what I saw before with Unix ls is no longer what I
see in Tahoe, *or in the destination Unix system with ls*. It's not
at all obvious to me that consistency in Tahoe is more important than
the ability to recover consistency at the endpoints.
> > > Requirement 3 (no file left behind): For each filename (byte
> > > string) in your myfiles directory, whether or not that byte
> > > string is the valid encoding of anything in your stated locale,
> > > then that file will be added into the Tahoe filesystem under
> > > *some* name (a good candidate would be mojibake, e.g. decode the
> > > bytes with latin-1, but that is not the only possibility).
> >
> > That's not even a possibility, actually. Technically, Latin-1 has a
> > "hole" from U+0080 to U+009F. You need to add the C1 controls to fill
> > in that gap. (I don't think it actually matters in practice,
> > everybody seems to implement ISO-8859/1 as though it contained the
> > control characters ... except when detecting encodings ... but it pays
> > to be precise in these things ....)
>
> Perhaps windows-1252 would be a better codec for this purpose?
No. You need a binary codec, and in most systems (including Python)
ISO 8859 is implemented that way. This is actually sort of forward
compatibility to ISO 10646 (aka "Unicode without the algorithms") which
deliberately included the control-0 and control-1 ranges, more or less
to prevent windows-1252 from claiming to be a Unicode subset. :-)
There's also the problem that, unlike ISO 8859-1, the definition of
windows-1252 has been changed at least once (to add the EURO SIGN),
and maybe more often than that.
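(A quick Python 3 illustration of the difference; nothing Tahoe-specific
here, just how the codecs behave:)

    # Python's latin-1 codec maps every byte 0x00-0xFF to U+0000-U+00FF and
    # back, so it can carry arbitrary bytes losslessly.
    all_bytes = bytes(range(256))
    assert all_bytes.decode('latin-1').encode('latin-1') == all_bytes

    # Python's windows-1252 codec leaves a handful of code points undefined
    # (0x81, 0x8D, 0x8F, 0x90, 0x9D), so the same round trip blows up.
    try:
        all_bytes.decode('windows-1252')
    except UnicodeDecodeError as e:
        print('cannot decode byte %#x at position %d' % (e.object[e.start], e.start))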
> For clarity, assume that the arbitrary unicode filename that Tahoe
> comes up with is "badly_encoded_filename_#1". This doesn't change
> anything in this story. In particular it doesn't change the fact
> that there might already be an entry in the directory which is
> named "badly_encoded_filename_#1" even though it was *not* a badly
> encoded filename, but a correctly encoded one.
Right. Somebody else might have independently invented the
convention.
> Wait, wait. What good would this do? The current plan is that if the
> filenames collide we increment the number at the end "#$NUMBER", if we
> are just naming them "badly_encoded_filename_#1", or that we append
> "~1" if we are naming them by mojibake. And the current plan is that
> the original bytes are saved in the metadata for future cyborg
> archaeologists. How would this complex unicode magic that I don't
> understand improve the current plan? Would it provide filenames that
> are more meaningful or useful to the users than the
> "badly_encoded_filename_#1" or the mojibake?
1. The "magic" Unicode filename will be what users of the local
default encoding are already seeing in Unix ls, up to undecodable
bytes.
2. The scheme I describe requires associating the alleged_encoding
with the file name (actually with individual characters, but in
practice it will work out the same).
3. With a "universal" registry, within a session file names that are
the same across directories encode to the same "magic" Unicode.
(It's not necessarily true that "magic" Unicodes that differ only
in private use characters are different. They might be the same
abstract characters but have been transcoded across different
alleged_encodings. To know if they're the same, you need to know
the true encoding used to write the filename bytes.)
4. With any scope of registry, the registry can be copied to the
target in some convenient place, and you no longer need access to
Tahoe to do archaeology.
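(To make the registry idea concrete, here is a rough Python sketch of
points 2-4 -- my own toy code, keyed only on the byte value and ignoring
the per-character alleged_encoding refinement mentioned in point 2:)

    # Toy registry: each distinct undecodable byte gets a private-use code
    # point, so the same byte maps to the same "gaiji" within the registry's
    # scope, and the registry (by_byte) can be copied to the target system.
    class GaijiRegistry:
        def __init__(self, first_code=0xE000):      # BMP private use area
            self.next_code = first_code
            self.by_byte = {}                        # byte value -> gaiji char

        def gaiji_for(self, byte_value):
            if byte_value not in self.by_byte:
                self.by_byte[byte_value] = chr(self.next_code)
                self.next_code += 1
            return self.by_byte[byte_value]

        def decode(self, raw, alleged_encoding):
            out, i = [], 0
            while i < len(raw):
                try:
                    out.append(raw[i:].decode(alleged_encoding))
                    break
                except UnicodeDecodeError as e:
                    out.append(raw[i:i + e.start].decode(alleged_encoding))
                    out.append(self.gaiji_for(raw[i + e.start]))
                    i += e.start + 1
            return ''.join(out)

    registry = GaijiRegistry()
    registry.decode(b'ok-\xff\xfe.txt', 'utf-8')     # 'ok-\ue000\ue001.txt'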
> > The registry of characters is somewhat unpleasant, but it does allow
> > you to detect filenames that are the same reliably.
>
> There is no server, so to implement such a registry we would probably
> have to include a copy of the registry inside each (encrypted,
> erasure-encoded) directory.
Ah, that's too bad. Then the registry would have to be directory
specific, I guess. Would it be possible to make that client-specific
or at least session-specific?
> > > Possible Requirement 4 (faithful bytes if not unicode, a.k.a.
> > > "round-tripping"):
> >
> > PEP 383 gives you this, but you must store the encoding used for each
> > such file name.
>
> Well, at this point this has become an anti-requirement because it
> causes the filename as displayed when examining the directory to be
> different from the filename that results when cp'ing the directory.
You're assuming "tahoe ls" and "tahoe cp". I doubt that users will
see it that way, especially since in the cases where any of this
discussion is relevant the users will already be seeing mojibake in
Tahoe.
> Also I don't see why PEP 383's implementation of this would be better
> than the previous iteration of the design in which this was
> accomplished by simply storing the original bytes and then writing
> them back out again on demand, or the design before that in which this
> was accomplished by mojibake'ing the bytes (by decoding them with
> windows-1252) and setting a flag indicating that this has been done.
Well, because it's a mojibake that corresponds to the system encoding
*without needing to carry that encoding along*, which users will
probably be familiar with. For example, I can now recognize the
prefix "Urgent:" in Japanese ISO-2022-JP which has been BASE64
MIME-word-encoded, because it often appears in Subject headers. :-)
Ie

    real name                 mojibake                 Tahoe
    system encoding                                    Unicode

    ABCD -------------------> WXYZ ------------------> 0W0X0Y0Z
display encoding:       system |                  unicode |
                               V                          V
display                     WXYZ                      WXYZ
> I think I understand now that PEP 383 is better for the case that you
> can't store extra metadata (such as our failed_decode flag or our
> original_bytes), but you can ensure that the encoding that will be
> used later matches the one that was used for decoding now. Neither of
> these two criteria apply to Tahoe,
I don't see that the second fails to apply. Simply add 'alleged_encoding' to
your metadata. This should be done anyway, since it will be useful to
the future crypto-archaeologists who may have access to witnesses who
can recognize the mojibake in that encoding.
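(Illustrative only -- this is not the real Tahoe directory format; the key
names just follow the ones used in this thread, and the bytes and encoding
are made up:)

    raw = b'\xff\xfe\xfd'                      # hypothetical name that fails to decode
    entry = {
        "name": "badly_encoded_filename_#1",   # valid Unicode primary key
        "metadata": {
            "failed_decode": True,
            "original_bytes": raw.hex(),       # 'fffefd'
            "alleged_encoding": "euc-jp",      # locale in effect at write time
        },
    }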
> and I suspect that neither of them apply to most uses other than
> the entirely local and non-persistent "for x in os.listdir():
> open(x)".
It's always possible to attach an 'alleged_encoding' attribute.
Of course that won't be useful to legacy code, which will (should,
anyway) barf on the illegal surrogates. However, my scheme for a
registry of gaiji (the Japanese for undecodable characters, literally
"outside characters") would make it possible for any
Unicode-compatible program to display them in a conforming way (ie,
displaying as U+FFFD REPLACEMENT CHARACTER, or using an
algorithmically generated font). Eg, REPLACEMENT CHARACTER would be
displayed as
+---+
|F F|
|F D|
+---+
compressed into an appropriate form factor to mix well with the
current normal font.
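(A minimal sketch of that display fallback -- the PUA range and the helper
name are mine, not Tahoe's:)

    # Show any private-use character we have no registry entry for as U+FFFD
    # REPLACEMENT CHARACTER, which is the Unicode-conformant fallback.
    def display_form(name, known_gaiji=frozenset()):
        return ''.join(
            '\ufffd' if 0xE000 <= ord(c) <= 0xF8FF and c not in known_gaiji
            else c
            for c in name)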
> > > But an even worse problem -- the show-stopper for me -- is that I
> > > don't want what Tahoe shows when you do "tahoe ls" or view it in a
> > > web browser to differ from what it writes out when you do
> > > "tahoe cp -r tahoe: newfiles/".
> >
> > But as a requirement, that's incoherent. What you are "seeing" is
> > Unicode, what it will write out is bytes.
>
> In the new plan, we write the unicode filename out using Python's
> unicode filesystem APIs, so Python will attempt to encode it into the
> appropriate filesystem encoding (raising UnicodeEncodeError if it
> won't fit).
Well, yes. But "won't fit" is precisely the case that we're worrying
about, no? The system that I propose for a gaiji registry will allow
you to simply save the registry somewhere convenient on the target
system, and the crypto-archaeologist won't need access to Tahoe.
You could do the same thing with the bytestrings, but (a) it's more
complex to implement than to simply take the PEP 383 Unicodes, and (b)
the users will not have an easy obvious way to get access to something
invertible to the original bytes. With PEP 383 (as augmented by my
registry or "raw"), you do (as long as you have 'alleged_encoding').
True, you need to be able to assume that you have control over
surrogates or the private use planes ... but in Tahoe, you do.
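(For concreteness, this is how the PEP 383 escaping behaves -- shown here
with Python 3's 'surrogateescape' handler, which is where the PEP
eventually landed:)

    raw = b'caf\xe9'                                 # latin-1 bytes, not valid UTF-8
    name = raw.decode('utf-8', 'surrogateescape')    # 'caf\udce9' -- lone surrogate
    assert name.encode('utf-8', 'surrogateescape') == raw   # round trip recovers bytes

    # Encoding *without* the handler is the "won't fit" case discussed above:
    # name.encode('utf-8') raises UnicodeEncodeError on the lone surrogate.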
> > That means that if multiple
> > locales are in use on both the backup and restore systems, and the
> > nominal system encodings are different, people whose personal default
> > locales are not the same as the system's will see what they expect on
> > the backup system (using system ls), mojibake on Tahoe (using tahoe
> > ls), and *different* mojibake on the restore system (system ls,
> > again).
>
> Let's see... Tahoe is a user-space program and lets Python determine
> what the appropriate "sys.getfilesystemencoding()" is based on what
> the user's locale was at Python startup. So I don't think what you
> wrote above is correct. I think that in the first transition, from
> source system to Tahoe, that either the name will be correctly
> transcoded
Not in the KOI8-lucky case, if the alleged_encoding is Shift JIS and
the target system's default encoding is EUC-JP. Then it is
successfully *but incorrectly* transcoded to Unicode, and you'll get
non-KOI8-R bytes in the target system's directory.
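(Concretely -- the particular encodings here are just an example of bytes
that happen to decode cleanly under the wrong codec:)

    raw = 'テスト'.encode('euc-jp')          # b'\xa5\xc6\xa5\xb9\xa5\xc8'
    wrong = raw.decode('shift_jis')          # '･ﾆ･ｹ･ﾈ' -- decodes without error
    assert wrong != 'テスト'                 # but it is mojibake
    assert wrong.encode('euc-jp') != raw     # and re-encoding yields different bytes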
> On the next transition, from Tahoe to system, Tahoe uses the Python
> unicode API, which will attempt to encode the unicode filename into
> the local filesystem encoding and raise UnicodeEncodeError if it
> can't.
So you do what? Drop the file on the floor?
> > > Requirement 5 (no loss of information): I don't want Tahoe to
> > > destroy information -- every transformation should be (in
> > > principle) reversible by some future computer-augmented
> > > archaeologist.
> ...
> > UTF-8b would be just as good for storing the original bytestring, as
> > long as you keep the original encoding. It's actually probably
> > preferable if PEP 383 can be assumed to be implemented in the
> > versions of Python you use.
>
> It isn't -- Tahoe doesn't run on Python 3.
I don't think the PEP is restricted to Python 3. Of course it won't
be implemented in the system Pythons, but since it will be implemented
as an error handler, you can take the code implementing the error
handler and include it in Tahoe. You'll have to specify it explicitly
throughout, of course.
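(Something like this, say -- Python 3 spells it 'surrogateescape' out of
the box, but a standalone handler carried along by Tahoe under its own
name would look roughly like this; the name 'tahoe-utf8b' is made up:)

    import codecs

    def tahoe_utf8b(exc):
        # PEP 383 style: undecodable byte 0xXY <-> lone surrogate U+DCXY.
        if isinstance(exc, UnicodeDecodeError):
            bad = exc.object[exc.start:exc.end]
            return ''.join(chr(0xDC00 + b) for b in bad), exc.end
        if isinstance(exc, UnicodeEncodeError):
            chunk = exc.object[exc.start:exc.end]
            return bytes(ord(c) - 0xDC00 for c in chunk), exc.end
        raise exc

    codecs.register_error('tahoe-utf8b', tahoe_utf8b)

    raw = b'\xb9.txt'
    name = raw.decode('utf-8', 'tahoe-utf8b')         # '\udcb9.txt'
    assert name.encode('utf-8', 'tahoe-utf8b') == raw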
> Also Tahoe is increasingly interoperating with tools written in
> completely different languages. It is much easier for me to tell all
> of those programmers (in my documentation) that in the filename
> slot is the (normal, valid, standard) unicode, and in the metadata
> slot there are the bytes than to tell them about utf-8b (which is
> not even implemented in their tools: JavaScript, JSON, C#, C, and
> Ruby). I imagine that it would be a deal-killer for many or most
> of them if I said they couldn't use Tahoe reliably without first
> implementing utf-8b for their toolsets.
If you say so. It's a nearly trivial transformation, quite standard
and well-documented.
> > NFD is probably better for fuzzy matching and display on legacy
> > terminals.
>
> I don't know anything about them, other than that Macintosh uses NFD
> and everything else uses NFC. Should I specify NFD? What are these
> "legacy terminals" of which you speak? Will NFD make it look better
> when I cat it to my vt102? (Just kidding -- I don't have one.)
I don't know whether you *should*. NFD allows an ASCII-only person to
match "àpropós" [sic, that's not the way the French spell it!] with
"a*pro*po*s" (Unix glob). A "legacy" terminal is one in which
high-bit-set characters are hard-coded to certain glyphs, and doesn't
understand UTF-8. Eg, most older xterms. Those terminals will
display the NFD form as "a__propo__s" (where _ stands for whatever the
terminal does with the relevant high-bit-set characters) which may (or
may not) be more readable than "_prop_s" depending on the relative
frequency of non-ASCII Latin to ASCII.
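(The matching trick, in Python terms:)

    import unicodedata

    # NFD splits "à" into "a" + COMBINING GRAVE ACCENT, so an ASCII-only
    # pattern can still hit the base letters.
    name = unicodedata.normalize('NFD', 'àpropós')
    base = ''.join(c for c in name if not unicodedata.combining(c))
    assert base == 'apropos'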
> Ah. What is the Japanese word for "word with some characters right
> and other characters mojibake!"? :-)
There isn't one. Shift JIS, EUC-JP, 7-bit JIS, and Unicode share only
the ASCII code points. If you get it wrong, you can't read the
Japanese at all.
> So, it is invertible only if you can assume that the same encoding
> will be used on the second leg of the trip, right? Which you can do
> by writing down what encoding was used on this leg of the trip and
> forcing it to use the same encoding on the other leg. Except that we
> can't force that to happen on Windows at all as far as I understand,
That's because the Windows API used forces it to be Unicode. Is that
a problem? Does Tahoe support Windows systems that don't support the
Unicode APIs properly (including "not at all")? If not, I think it's
a non-problem.
> which is a show-stopper right there. But even if we could, this would
> require us to write down a bit of information and transmit it to the
> other side and use it to do the encoding.
But you only have to do that in the case where the original decoding
*failed*. And in that case you want to do it anyway.
> And if we are going to do that, why don't we just transmit the
> original bytes?
Because bytes are user-unfriendly. Providing a user interface to
bytes requires the programmer to decode them, and the interpretation
becomes unclear. Keeping track of this seems harder to me than
implementing UTF-8b.
> the name. One of those options has the advantage of simplicity to the
> programmer ("There is the unicode, and there are the bytes."), and the
> other has the advantage of good compression. Both of them have the
> advantage that nobody involved has to understand and possibly
> implement a non-standard unicode hack.
> I'm trying not to be too pushy about this (heaven knows I've been
> completely wrong about things a dozen times in a row so far in this
> design process), but as far as I can understand it, PEP 383 can be
> used only when you can force the same encoding on both sides (the PEP
> says that encoding "only 'works' if the data get converted back to
> bytes with the python-escape error handler also"). That happens
> naturally when both sides are in the same Python process, so PEP 383
> naturally looks good in that context.
Or both are using the same defined protocol.
> However, if the filenames are going to be stored persistently or
> transmitted over a network, then it seems simpler, easier, and more
> portable to use some other method than PEP 383 to handle badly
> encoded names.
Assuming you don't plan to specify the protocol to deal with them, and
leave that up to the other implementers. If you do specify such a
protocol, I don't see why PEP 383 is going to be harder to implement
than any others. Remember, the reason we're having this discussion is
that this is not easy.
> > > them over a network. Indeed, using the results that way could lead
> > > to unpleasant surprises.
> >
> > No more than any other system for giving a canonical Unicode spelling
> > to the results of an OS call.
>
> I think PEP 383 yields more surprises than the alternative of decoding
> with error handler 'replace' and then including the original bytes
> along with the unicode.
You think? I think PEP 383 surprises programmers. They'll have to
deal with the illegal surrogates. The 'replace' method will surprise
*users*, and they'll be dependent on programmers who you claim will
balk at implementing UTF-8b. Do you *really* think they'll turn
around and implement a consistent robust interface to the bytes?
I don't.
> Any of these three seem to be less surprising and similarly
> functional to PEP 383.
Sure, they've got the needed data structures, but what they're all
missing is a protocol. You're leaving that up to the implementation.
Good luck! :-)
> I'm still being surprised by it after trying to understand it for many
> days now. For example, what happens if you decode a filename with PEP
> 383, store that filename somewhere,
That's not conforming to Unicode spec. Microsoft does that, but you
shouldn't.
For Python, this comes under "consenting adults." Tahoe should
definitely not consent.
> and then later try to write a file under that name on Windows? If
> it only 'works' if the data get converted back to bytes with the
> python-escape error handler, then can you use the python-escape
> error handler when trying to, say, create a new file on Windows?
I believe so; from the discussion of PEP 383 it seems Windows will
allow you to do that. However, you do risk collisions in that case.
(But that's always true, and is rooted in the fact that Windows isn't
fully Unicode conformant.)
The gaiji registry scheme avoids that.