[tahoe-dev] #534: "tahoe cp" command encoding issue

Fri Feb 27 12:16:39 PST 2009

On Feb 27, 2009, at 10:45 AM, Brian Warner wrote:

> [must be brief, typing on an iphone, I'll write more on Monday when  
> I've got a real keyboard]

... he said before writing a note as thorough and detailed as most  
programmers ever write.

> On the inbound side, if we can't decode the filename with the  
> user's preferred encoding (which can default to utf-8, or utf-16 on  
> windows, or something configured into python, etc)

Fortunately on Windows all filenames really are utf-16-encoded (or  
UCS-2, or whatever encoding it is that the filesystem specifies), so  
you'll never get a decode error, nor a silent misdecoding to random  
gibberish.  (Insert joke here about treating unix as a first class  
citizen even though it doesn't deserve it.)

> then we pretend to decode it with Latin-1, so that a human looking  
> at the mangled unicode name can hopefully guess what the proper  
> name should have been. We use the unicode result as the childname.  
> In all cases, we store the original bytestring in the metadata.

As I understand it from Shawn and Kevin, taking an arbitrary byte  
string and decoding it with latin-1 to produce a unicode object is  
lossless -- a subsequent encode of that unicode object with latin-1  
will always yield the same bytes.  Is that right?
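
A quick way to check the round-trip claim, written as a small sketch in modern Python 3 syntax (an assumption on my part; the variable names are made up for illustration). It pushes every possible byte value through a latin-1 decode and encode:

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# latin-1 maps byte N to code point U+00NN, so this decode cannot fail.
as_unicode = all_bytes.decode("latin-1")

# Encoding the result with latin-1 again should give back the same bytes.
round_tripped = as_unicode.encode("latin-1")

print(round_tripped == all_bytes)
```

Since any byte string is just a sequence of these 256 values, the same holds for arbitrary filenames.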

In that case, we don't need the separate base32-encoded bytestring,  
just the flag to say whether the child name element was the result of  
a successful decode using the encoding declared by the filesystem, or  
else the result of a "fallback" latin-1 decode.

This simpler approach *does* mean that we lose information whenever  
there is a file which isn't *actually* encoded in the declared  
encoding of the local filesystem, but which happens to decode when  
you try.  However, I'm not sure it is worth the complexity of  
preserving the bytes of that file's name (which after all nobody else  
can decode either except by guessing at encodings).  Also, note that  
almost certainly the local user examining that local filename with his  
local tools will see the gibberish that results from decoding that  
name with his local filesystem encoding, raising the question of what  
"actually" actually means in the previous sentence.

So I propose Strategy 2.d (but who's counting?):

Decode the filename with the declared encoding.  If that succeeds,  
then put that unicode string (utf-8 encoded) into the child name and  
set the flag "latin_1_fallback: False".  If that fails, then decode  
the filename with latin-1 (which can't fail), then put that unicode  
string (utf-8 encoded) into the child name and set the flag  
"latin_1_fallback: True".

Now old tahoe clients (or lazy new ones) will just get the child  
name bytes, utf-8 decode them to get a unicode string, and use it.   
It will either be right, or it will be the gibberish that you get  
from interpreting whatever-it-originally-was as latin-1.
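
The storing side of Strategy 2.d, and what a lazy client then sees, might look roughly like the sketch below. This is Python 3 syntax, and the function name, the metadata dict layout, and the sample filename are all hypothetical; none of this is existing tahoe code.

```python
def encode_child_name(raw_name, declared_encoding):
    """Hypothetical helper: return (child_name_bytes, metadata) per Strategy 2.d."""
    try:
        # First choice: the encoding the local filesystem declares.
        name = raw_name.decode(declared_encoding)
        fallback = False
    except UnicodeDecodeError:
        # Fallback: latin-1 accepts every byte, so this cannot fail.
        name = raw_name.decode("latin-1")
        fallback = True
    # The child name is always stored utf-8 encoded.
    return name.encode("utf-8"), {"latin_1_fallback": fallback}


# A name that is not valid utf-8 on a filesystem that declares utf-8.
child_name, metadata = encode_child_name(b"caf\xe9", "utf-8")

# An old or lazy client ignores the flag and just utf-8 decodes the child name.
lazy_view = child_name.decode("utf-8")
```

Here the lazy client happens to see a sensible name because the original bytes really were latin-1; for any other original encoding it would see the latin-1 misreading instead.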

New and diligent tahoe clients will check the "latin_1_fallback" flag  
first.  If it is False, they proceed as before, knowing that they're  
getting the right name.  If it is True, then they take the unicode  
object (which they got by utf-8-decoding the child name bytes), and  
they encode it with latin-1.  This gives them back the original bytes  
(right?).  Now they do whatever diligent tahoe clients do with the  
original bytes of a filename in an unknown encoding.
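
The diligent reader described above could be sketched like this (again Python 3 syntax, with a hypothetical function name and metadata layout). It returns the unicode name plus the original bytes when they can be recovered:

```python
def read_child_name(child_name_bytes, metadata):
    """Hypothetical helper: return (unicode_name, original_bytes_or_None)."""
    name = child_name_bytes.decode("utf-8")
    # Entries written by old clients have no flag; treat that as False.
    if metadata.get("latin_1_fallback", False):
        # The name came from a fallback latin-1 decode, so encoding it
        # with latin-1 gives back the original filename bytes.
        return name, name.encode("latin-1")
    # The declared encoding worked; the unicode name is the right one.
    return name, None


name, original = read_child_name(b"caf\xc3\xa9", {"latin_1_fallback": True})
```

What the client then does with `original` (guess an encoding, write the bytes back verbatim, ask the user) is left open, as in the text.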

This seems simpler to me than your proposal, but I'm not sure if I  
understood everything in your proposal, so I'm not sure if there is  
something that this proposal wouldn't do as well.

Please everyone who understands this let me know if this would work.

Regards,

Zooko