[tahoe-dev] [Python-Dev] PEP 383 update: utf8b is now the error handler

Wed May 6 15:13:13 PDT 2009

On approximately 5/6/2009 1:58 PM, came the following characters from 
the keyboard of Zooko Wilcox-O'Hearn:
> Thank you for your interesting message about this perplexing issue.  I 
> haven't yet read all of your message, but just so you know I went and 
> added v+python at g.nevcal.com to the list of senders whose posts are 
> automatically approved to go to tahoe-dev.  Please feel free to Cc: 
> tahoe-dev at allmydata.org in the future, and if you would be willing to 
> resend your message (quoted below) to tahoe-dev I would appreciate it.
> 
> Regards,
> 
> Zooko

I'll go ahead and resend to your list.  I did read the other message 
about requirements that you mentioned in your other response: 
<http://allmydata.org/pipermail/tahoe-dev/2009-May/001714.html>

Per that message, I would say that my "Uncertainty 2)" applies.

I'll make a few more comments at the bottom, based on reading the 
message at the above link.

Because it sounds like an interesting project, I'm willing to read and 
comment on any emails on this topic that are Cc:'d to me, if and when my 
interest causes me to spend time that should probably be spent doing 
something else.  However, I don't have time to join the group and go 
looking for them.

> On May 6, 2009, at 14:17 PM, Glenn Linderman wrote:
> 
>> On approximately 5/6/2009 12:18 PM, came the following characters from 
>> the keyboard of Zooko Wilcox-O'Hearn:
>>> On May 6, 2009, at 10:54 AM, Antoine Pitrou wrote:
>>>> Zooko Wilcox-O'Hearn <zooko <at> zooko.com> writes:
>>>>>
>>>>> I'm not thinking of API compatibility as much as data compatibility 
>>>>> -- someone used Python 3.1 to write down some filenames, and now a 
>>>>> few years later they are trying to use the latest and greatest 
>>>>> Python release to read those filenames...
>>>>
>>>> Well, if the filenames are generated by Python (as opposed to read 
>>>> from an existing directory on disk), they should be regular unicode 
>>>> objects without any lone surrogates, so I don't see the 
>>>> compatibility problem.
>>> I meant that the application reads filenames from an existing 
>>> directory on disk, saves those filenames, and then later, using a 
>>> future version of Python, wants to read them and use them.
>>
>>
>> Regarding future versions of Python.  In the worst case, even if 
>> Python's default behavior changes, the transcoding done by PEP 383 can 
>> be done in other software too... it is a straightforward, fully 
>> specified, 1-to-1, reversible transcoding process, affecting and 
>> generating only invalid byte encodings on one side, and invalid 
>> Unicode sequences on the other.
>>
>> So if Python's default behavior should change, the transcoding 
>> implemented by PEP 383 could be easily reimplemented to enable a 
>> future version of a Python application to manipulate the transcoded, 
>> saved, filenames.
>>
>> By easily, I mean that I could code it in a couple hours, max.
>>
>>
>>> I'm not saying that I know this would be a problem.  I'm saying that 
>>> I personally can't tell whether it would be a problem or not, and the 
>>> extensive discussions so far have not convinced me that there is 
>>> anyone who both understands PEP 383 and considers this use case.
>>
>>
>> Does the above help?
>>
>>
>>> Many people who apparently understand encoding issues well have said 
>>> something to the effect that there is no problem, but those people 
>>> haven't yet managed to get through my thick skull how I would use PEP 
>>> 383 safely for this sort of use case -- the one where data generated 
>>> by os.listdir() travels forward in time or the one where that data 
>>> travels sideways to other systems, including Windows or other systems 
>>> that validate incoming unicode.
>>
>>
>> Regarding data traveling sideways, some comments:
>>
>> 1) PEP 383's effect could be recoded in other languages as easily as 
>> it is in Python (or the C in which Python is implemented).  So that 
>> could be a solution.
>>
>> 2) You mention "Windows" and "other systems that validate incoming 
>> unicode" in the same phrase, as if you think that "Windows" qualifies 
>> as one of the "other systems that validate incoming unicode", but it 
>> does not (at least not universally).
>>
>>
>>> That's why I am a bit uncomfortable about PEP 383 being quickly 
>>> implemented and deployed in Python 3.1.
>>
>>
>> Does the above help?
>>
>>
>>> By the way, much of the detailed discussion about what Tahoe requires 
>>> and how that may or may not benefit from PEP 383 has now moved to the 
>>> tahoe-dev mailing list: 
>>> http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev .
>>
>>
>> I have no background with Tahoe, nor particular interest, although it 
>> sounds like a useful project... so I won't be joining that list.  I 
>> have no idea if there is an installed base of existing Tahoe file 
>> systems; my suggestions below assume that there is not, and that you 
>> are presently inventing them.  Therefore, I provide no migration path; 
>> I could invent one, but it would take longer to describe.
>>
>> However, since I'm responding here, and have read what you have posted 
>> here, it seems like the following could be true.
>>
>> Assumptions from your emails:
>>
>> A) Tahoe wants to provide a UTF-8 file name system
>> B) Tahoe wants to interface to POSIX systems that use (and do not 
>> validate) byte interfaces.
>> C) Tahoe wants to interface to non-POSIX systems that use 16-bit file 
>> name interfaces, with no validation.
>> D) Tahoe wants to interface to non-POSIX systems that use 16-bit file 
>> name interfaces, with validation.
>>
>> Uncertainties: I'm not clear on what your goals are for Tahoe 
>> filenames.  There seem to be 2 possibilities:
>>
>> 1) you want to reject attempts to use non-validating Unicode, be it 
>> from a 16-bit interface, or a bytes interface.
>> 2) you don't want to reject non-validating Unicode, but you want to 
>> convert it to valid Unicode for (D) systems.
>>
>> 3) Orthogonally, you might want to store only Valid Unicode in the 
>> names, or you might not care, if you can meet the other goals.
>>
>> Truisms:
>>
>> If you want to support (D), and (2), then you must transform names at 
>> some point, using some scheme, because not all names supplied by (B) 
>> systems will be acceptable to (D) systems.  You can choose to do this 
>> transformation when a (B) system provides an invalid (per Unicode) 
>> name, or you can choose to do the transformation when a (D) system 
>> accesses a file with an invalid (per Unicode) name.
>>
>> If the (B) and (D) systems talk to each other outside of Tahoe, they 
>> will have to do similar transformations, or, if they both access the 
>> same Tahoe system, they will have to do the identical transformation, 
>> to be sure that they can access the same file.
>>
>> All transcoding schemes have the possibility of data puns between 
>> non-transcoded names and transcoded names.  In order to successfully 
>> and properly manipulate a name, you must know whether or not it has 
>> been transcoded, and how.
>>
>> PEP 383 limits its transcoding to names that are invalid (per 
>> Unicode).  Names that cannot be properly decoded to Unicode are 
>> decoded to invalid Unicode.  Names that are invalid Unicode are 
>> encoded to invalid byte sequences (per the encoding scheme specified).
>>
>> For PEP 383 and Python, transcoded names can be distinguished by 
>> checking for the existence of lone surrogates in the str form of the 
>> filename, or by attempting to do a strict decoding of the bytes form 
>> of the filename, depending on what you have (generally, the former).
>>
>> For PEP 383 and Python, the names will round trip from the POSIX bytes 
>> interfaces to the program, and back to POSIX bytes interfaces, as long 
>> as only Python wrappers of system functions are used, and the 
>> filesystem encoding is not changed between calls (or is restored).  
>> Passing them to 3rd party libraries or other systems requires extra 
>> work, if there is a desire to manipulate files with names that are not 
>> decodable to Unicode by the standard decoding algorithm for that 
>> encoding.
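
To make the round trip and the "could be reimplemented elsewhere" claim 
concrete, here is a minimal sketch.  It assumes UTF-8 as the filesystem 
encoding.  In released Python the PEP 383 handler is spelled 
"surrogateescape" (the "utf8b" in the subject line is the earlier 
working name).  The manual_* functions are hypothetical names of my 
own, a hand-written illustration rather than Python's actual 
implementation.

```python
def pep383_decode(raw: bytes, encoding: str = "utf-8") -> str:
    # Each undecodable byte 0xNN (0x80..0xFF) becomes lone surrogate U+DCNN.
    return raw.decode(encoding, errors="surrogateescape")


def pep383_encode(name: str, encoding: str = "utf-8") -> bytes:
    # Each U+DC80..U+DCFF is turned back into the single byte it came from.
    return name.encode(encoding, errors="surrogateescape")


def manual_decode_utf8(raw: bytes) -> str:
    """Hand-written equivalent for UTF-8 (illustration only)."""
    out = []
    i = 0
    while i < len(raw):
        # Try to read one well-formed UTF-8 character (1 to 4 bytes) at i.
        for n in (1, 2, 3, 4):
            try:
                out.append(raw[i:i + n].decode("utf-8"))
                i += n
                break
            except UnicodeDecodeError:
                continue
        else:
            # No valid character starts here: escape this one byte.
            out.append(chr(0xDC00 + raw[i]))
            i += 1
    return "".join(out)


def manual_encode_utf8(name: str) -> bytes:
    """Inverse of manual_decode_utf8 (illustration only)."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(cp - 0xDC00)      # escaped byte: restore it as-is
        else:
            out += ch.encode("utf-8")    # strict: other surrogates raise
    return bytes(out)


raw = b"caf\xe9.txt"                     # Latin-1 bytes, not valid UTF-8
name = pep383_decode(raw)                # 'caf\udce9.txt'
restored = pep383_encode(name)           # identical to raw
```

The same dozen lines of logic could be written in any language that 
lets you inspect bytes and code points, which is the point of comment 
1) above.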

Comments about your interfaces (quote from the linked message you sent):

> Then, there are at least five interfaces for connecting the WAPI up to
> other things:

1 & 2 sound like special purpose client programs.  Such programs can 
access APIs beyond a mapping of file-system APIs, and do name 
validations, transcodings, and any other necessary tasks that help.

For interface 3, Windows CIFS/SMB, you need to make sure to validate the 
incoming 16-bit codes for Unicode validity.  Windows doesn't.  Windows 
won't supply certain characters in file names that it considers illegal, 
which include at least  : \ ? * (I think there are a few more also, the 
list is documented).  You need to have a plan for what to do when a 
non-Windows system creates a filename that may be legal Unicode, but is 
not a legal Windows filename.
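
A minimal sketch of the two checks for interface 3 follows.  The 
function names are hypothetical.  The reserved-character set is the one 
I recall from Microsoft's file naming documentation (the punctuation 
below plus control characters 0 through 31), so verify it against that 
documentation before relying on it.  It is applied to a single path 
component, not a whole path.

```python
# Assumed reserved set for one Windows path component (verify before use).
WINDOWS_RESERVED = set('<>:"/\\|?*')


def utf16_units_are_valid(units) -> bool:
    """True if a sequence of 16-bit code units has no unpaired surrogate."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no high surrogate before it.
            return False
        i += 1
    return True


def windows_illegal_chars(component: str) -> list:
    """Characters in one path component that Windows would refuse."""
    bad = {ch for ch in component if ch in WINDOWS_RESERVED or ord(ch) < 32}
    return sorted(bad)
```

For example, utf16_units_are_valid([0xD83D, 0x0061]) is False because 
the high surrogate is not followed by a low one, and 
windows_illegal_chars("a:b?.txt") reports the colon and the question 
mark.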

I know nothing about interfaces 4, 5, or 6.

Comments about your requirements (quote from the linked message you sent):

> Okay, that's the setting, now the five possible requirements are:

Requirement 1 makes it sound like you want to always store a valid 
Unicode filename.  I think that is a good thing, overall.

Requirement 2 makes it sound like you want to decode bytes to Unicode 
using the current filesystem encoding, on POSIX systems.  Because you 
only want valid Unicode, it sounds like you would not benefit from PEP 
383, which produces invalid Unicode.  However, if you are running (in 
the future) on a POSIX system that uses PEP 383, you would either need 
to use the bytes interfaces, and do your own strict Unicode decoding, or 
use the str interfaces, but validate that the result contains no lone 
surrogates.  Either of these would let you determine whether a name 
falls into the faithful-Unicode-if-decodable case.
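
The two routes can be sketched as follows.  The function names are 
hypothetical, and the encoding is passed in explicitly here.  In a real 
program it would presumably come from sys.getfilesystemencoding().

```python
from typing import Optional


def faithful_from_bytes(raw: bytes, encoding: str) -> Optional[str]:
    """Bytes route: strict decode, None if the name is not decodable."""
    try:
        return raw.decode(encoding)      # errors="strict" is the default
    except UnicodeDecodeError:
        return None


def faithful_from_str(name: str) -> Optional[str]:
    """str route: reject any name carrying a surrogate code point."""
    if any(0xD800 <= ord(ch) <= 0xDFFF for ch in name):
        return None                      # PEP 383 escaped something here
    return name
```

The bytes route would be fed from os.listdir(b"."), the str route from 
os.listdir(".").  Both return the same answer for a given file as long 
as the encoding passed to the first matches the filesystem encoding 
used by the second.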

Requirement 3 requires some sort of transcoding; you could start from 
either the original bytes, or the PEP 383 invalid str.  If you produce a 
name that contains only valid Unicode, then it will match what could be 
a valid Unicode name that was produced in other ways (by the user typing 
it, for example).  If you have collisions, you will not know whether the 
two names were supposed to be the same, or were supposed to be 
different, except that you could keep track of the fact that one was 
generated by transcoding, and the other not.  But not all of your 
interfaces (particularly Windows CIFS/SMB) will be able to access that 
information, or use it in any meaningful manner.  So you have the 
benefit of having readable, valid Unicode names, but you have the cost 
of having data puns.  The only scheme I can think of for transcoding in 
this manner, is to have a reserved directory (that lends itself to path 
name puns, too, of course) for names that have been transcoded, such 
that /foo and /transcoded/foo are different, even though they otherwise 
look the same.  This would be cumbersome, and would require client 
programs using filesystem interfaces to either not see those names, or 
have to go looking for them in particular.  Of course, the whole issue 
could be avoided by people using only valid unicode names.
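
A rough sketch of the reserved-directory idea, to show that it can be 
made reversible.  Everything here is an assumption of mine for 
illustration: the directory name "transcoded", the percent-style 
escaping, and the function names.  None of it comes from Tahoe or from 
PEP 383.

```python
from typing import Tuple

RESERVED_DIR = "transcoded"   # hypothetical reserved directory name


def to_stored(raw: bytes) -> Tuple[str, str]:
    """Map a POSIX byte name to (directory, valid-Unicode name)."""
    try:
        return ("", raw.decode("utf-8"))          # valid: store unchanged
    except UnicodeDecodeError:
        pass
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append("%%%02X" % (cp - 0xDC00))  # undecodable byte -> %XX
        elif ch == "%":
            out.append("%25")                     # keep the escape unambiguous
        else:
            out.append(ch)
    return (RESERVED_DIR, "".join(out))


def from_stored(directory: str, name: str) -> bytes:
    """Recover the original bytes from what to_stored produced."""
    if directory != RESERVED_DIR:
        return name.encode("utf-8")
    out = bytearray()
    i = 0
    while i < len(name):
        if name[i] == "%":
            out.append(int(name[i + 1:i + 3], 16))  # raises on malformed input
            i += 3
        else:
            out += name[i].encode("utf-8")
            i += 1
    return bytes(out)
```

With this, the bytes b"caf\xe9" are stored as "caf%E9" under the 
reserved directory, while a user who really typed "caf%E9" gets the 
same name outside it.  The two only stay distinct because of the 
directory, and a real directory that happens to be called "transcoded" 
would still collide, which is the path name pun mentioned above.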

Requirements 4 & 5 can only be met for files initially created with 
invalid Unicode names (via extra metadata), or via a scheme like the 
reserved 
directory, and a reversible transformation.  Otherwise, CIFS/SMB access 
could make a copy, not copy the extra metadata (which cannot be 
available on that interface), and delete the original.  Copying the copy 
back to the original client wouldn't find the metadata to know what the 
original name was.

Whatever the scheme, configuration or command-line options could 
"tighten" the restrictions, so that only valid Unicode names would be 
acceptable.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking