[tahoe-dev] #534: "tahoe cp" command encoding issue

Thu Feb 26 09:56:02 PST 2009

So I *had* been thinking that tahoe should do what Francois's patch  
currently does:

Strategy 1: decode the filename using the declared codec of the
filesystem; if that fails, raise an exception.

However Andrej's "crude user logic" (;-)) has shown the problem with  
that strategy.

I now think that we should do:

Strategy 2: decode the filename using the declared codec of the
filesystem; if that fails, just copy the bytes without decoding them.

And, we must mark down somewhere that this is a "just the bytes"  
filename instead of a utf-8 encoded filename.  I think the easiest  
place to mark this down might be to add a flag to the "metadata" dict  
associated with that name, something like "unknown_codec: True".
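
To make Strategy 2 concrete, here is a minimal sketch in Python 3. It is
not Francois's patch and not real tahoe code: the function name
decode_filename and the (name, metadata) return shape are made up for
illustration, and only the "unknown_codec" key comes from the proposal
above.

```python
def decode_filename(raw_name: bytes, fs_codec: str):
    """Hypothetical helper: return (name, metadata) for one directory entry."""
    try:
        # Normal case: the name decodes cleanly under the declared codec.
        return raw_name.decode(fs_codec), {}
    except UnicodeDecodeError:
        # Strategy 2: pass the bytes through untouched and flag the entry
        # so that readers know not to treat the name as utf-8.
        return raw_name, {"unknown_codec": True}


good = decode_filename("é.txt".encode("utf-8"), "utf-8")
bad = decode_filename(b"\xe9.txt", "utf-8")  # a lone Latin-1 byte, not valid utf-8
```

Here "good" carries a decoded name with empty metadata, while "bad"
carries the original bytes plus the flag.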

I no longer think that we should try to decode the filename with  
codecs other than the one suggested by the system.  If we pass the  
binary bytes through, then the user on the other side can attempt  
such guessing.  If tahoe guesses, it doesn't give the other side  
information that the other side couldn't have figured out for itself,  
and it risks destroying information (when tahoe guesses and gets an  
apparent success which was actually wrong).
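
A small Python 3 illustration of that risk, under made-up assumptions:
the filename and the candidate codec list are hypothetical, and
guess_decode is not a tahoe function. Latin-1 maps every byte to some
character, so a fallback to it always "succeeds", even when the bytes
were really written in another codec.

```python
def guess_decode(raw: bytes, candidates):
    """Hypothetical guesser: return (codec, text) for the first codec that decodes."""
    for codec in candidates:
        try:
            return codec, raw.decode(codec)
        except UnicodeDecodeError:
            continue
    return None, None


# A Cyrillic name as written by a KOI8-R system (example data only).
raw = "файл".encode("koi8-r")

# utf-8 fails on these bytes, so the guesser falls through to latin-1,
# which accepts anything and yields gibberish without any error.
codec, text = guess_decode(raw, ["utf-8", "latin-1"])
wrong_but_successful = (codec == "latin-1" and text != "файл")
```

Once that wrong text is stored as utf-8, the other side can no longer
tell that a guess was ever made.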

Note that this strategy could cause failures in older tahoe clients  
which are expecting utf-8 encoded names in the name field.  They  
could get a decode error.  Newer tahoe clients would know to check  
for the "unknown_codec" flag before decoding.  Hm -- that doesn't  
sound good.  I can think of three options:

Strategy 2.a.  ... if that fails, copy the bytes into the "name" slot  
and add a flag to the metadata saying that name isn't a normal utf-8  
encoded name (this is what I suggested in the previous paragraph)

Strategy 2.b.  ... if that fails, put some placeholder, like "?1",
"?2", "?3", etc. in the name slot, and put the bytes into the metadata
in a "name_bytes" field.  Old tahoe clients (or very simple ones) end
up getting the incrementing "?N" names; smarter tahoe clients check
for the "name_bytes" field first, and if there is anything there then
they use the name_bytes and do their best to represent them to the
user, and they don't use the "?N" placeholder at all.

Strategy 2.c.  If it fails, encode the bytes in some magical way that  
a later utf-8 decoding of them will get the same bytes back.  This  
might be the hack that Brian suggested that KDE uses to shovel  
undecodable strings into some unused corner of the unicode space -- I  
didn't really understand that idea.  This smells to me like the same  
sort of slop which created these problems in the first place (trying  
to shoehorn semantically incompatible things into the same bits  
without explicit flagging).  If we did this, then for example python  
code which called .decode() on that string would get back a unicode  
object which didn't actually contain unicode chars, but contains  
bytes in some unknown encoding.  Hopefully we don't need to do this
since some other strategy ought to do better.
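
For comparison, here is a rough Python 3 sketch of 2.b and 2.c. All
function names are hypothetical, and how "name_bytes" would be
serialized in the metadata is left open. For 2.c I use Python's
"surrogateescape" error handler, which is a related trick that maps
undecodable bytes to lone surrogate code points; I am not claiming it
is the hack Brian described or what KDE does.

```python
import itertools


def make_entry_2b(raw_name: bytes, fs_codec: str, counter):
    """Strategy 2.b: placeholder in the name slot, raw bytes in metadata."""
    try:
        return raw_name.decode(fs_codec), {}
    except UnicodeDecodeError:
        return "?%d" % next(counter), {"name_bytes": raw_name}


def display_name_2b(name: str, metadata: dict) -> str:
    """A smarter client prefers name_bytes and ignores the "?N" placeholder."""
    raw = metadata.get("name_bytes")
    if raw is None:
        return name
    # One possible "best effort": show U+FFFD for each undecodable byte.
    return raw.decode("utf-8", "replace")


def smuggle_2c(raw_name: bytes) -> str:
    """Strategy 2.c flavour: hide undecodable bytes inside the string itself."""
    return raw_name.decode("utf-8", "surrogateescape")


def unsmuggle_2c(name: str) -> bytes:
    """Recover the original bytes from a string made by smuggle_2c."""
    return name.encode("utf-8", "surrogateescape")


counter = itertools.count(1)
entry = make_entry_2b(b"\xe9.txt", "utf-8", counter)
shown = display_name_2b(*entry)
roundtrip = unsmuggle_2c(smuggle_2c(b"\xe9.txt"))
```

In 2.b the fact that the name is "just the bytes" is explicit in the
metadata.  In the 2.c sketch nothing is flagged: the returned string
looks like any other unicode object, which is the objection raised
above.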

Okay, folks, what do you think?  One of these Strategy 2 options, or  
yet a different Strategy?

By the way, Andrej, the reason that we were earlier proposing to do
Strategy 1, which Francois's patch implemented, and which rejects
your filename, is that Tahoe can't know whether that filename will
come out as gibberish in certain views, such as the
ls/nautilus/konqueror that you mentioned, or if you share the file
with a friend.  However, I guess in this case it is better to pass the
data through and let it be Someone Else's Problem.

Regards,

Zooko