[tahoe-dev] #534: "tahoe cp" command encoding issue
zooko
zooko at zooko.com
Thu Feb 26 09:56:02 PST 2009
So I *had* been thinking that tahoe should do what Francois's patch
currently does:
Strategy 1: decode the filename using the declared codec of the
filesystem; if that fails, raise an exception.
However Andrej's "crude user logic" (;-)) has shown the problem with
that strategy.
I now think that we should do:
Strategy 2: decode the filename using the declared codec of the
filesystem; if that fails, just copy the bytes without decoding them.
And, we must mark down somewhere that this is a "just the bytes"
filename instead of a utf-8 encoded filename. I think the easiest
place to mark this down might be to add a flag to the "metadata" dict
associated with that name, something like "unknown_codec: True".
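A minimal sketch of Strategy 2, for illustration only. The function name
decode_filename and the shape of the returned metadata dict are
hypothetical; only the "unknown_codec" key comes from the proposal above.

```python
def decode_filename(raw, fs_encoding):
    """Return (name, metadata) for one directory entry.

    Hypothetical helper: tries the filesystem's declared codec and,
    on failure, passes the raw bytes through with a flag.
    """
    try:
        # Normal case: the bytes are valid in the declared codec.
        return raw.decode(fs_encoding), {}
    except UnicodeDecodeError:
        # Strategy 2: keep the bytes untouched and record that they
        # are "just the bytes", not a decoded name.
        return raw, {"unknown_codec": True}
```

With this sketch, a latin-1 filename on a system that declares utf-8
comes back as bytes plus the flag rather than raising.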
I no longer think that we should try to decode the filename with
codecs other than the one suggested by the system. If we pass the
binary bytes through, then the user on the other side can attempt
such guessing. If tahoe guesses, it doesn't give the other side
information that the other side couldn't have figured out for itself,
and it risks destroying information (when tahoe guesses and gets an
apparent success which was actually wrong).
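To make the "apparent success which was actually wrong" case concrete,
here is a small sketch. The helper first_successful_decode and the list
of candidate codecs are made up for the example; they are not anything
tahoe does.

```python
def first_successful_decode(raw, candidates):
    """Hypothetical guesser: return (codec, text) for the first codec
    that decodes without error, or (None, None) if none does."""
    for codec in candidates:
        try:
            return codec, raw.decode(codec)
        except UnicodeDecodeError:
            continue
    return None, None


original = "été"
raw = original.encode("latin-1")  # the bytes actually on disk

# utf-8 fails, then cp1251 "succeeds" before latin-1 is ever tried.
codec, text = first_successful_decode(raw, ["utf-8", "cp1251", "latin-1"])
```

Here the guess decodes cleanly but yields Cyrillic letters instead of the
original name, and nothing in the result tells the reader it was a guess.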
Note that this strategy could cause failures in older tahoe clients
which are expecting utf-8 encoded names in the name field. They
could get a decode error. Newer tahoe clients would know to check
for the "unknown_codec" flag before decoding. Hm -- that doesn't
sound good. I can think of three options:
Strategy 2.a. ... if that fails, copy the bytes into the "name" slot
and add a flag to the metadata saying that name isn't a normal utf-8
encoded name (this is what I suggested in the previous paragraph)
Strategy 2.b. ... if that fails, put some placeholder, like "?1",
"?2", "?3", etc. in the name slot, and put the bytes into the metadata
in a "name_bytes" field. Old tahoe clients (or very simple ones) end
up getting the incrementing "?N" names; smarter tahoe clients check
for the "name_bytes" field first, and if there is anything there then
they use the name_bytes and do their best to represent them to the
user, and they don't use the "?N" placeholder at all.
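A rough sketch of how Strategy 2.b might look on both sides. The class
PlaceholderNamer and the function display_name are hypothetical names;
only the "?N" placeholders and the "name_bytes" field come from the
proposal. A real metadata dict would probably need the bytes in some
serializable form, which this sketch ignores.

```python
import itertools


class PlaceholderNamer:
    """Hypothetical writer side: hands out "?1", "?2", ... for
    filenames that do not decode with the declared codec."""

    def __init__(self):
        self._counter = itertools.count(1)

    def entry_for(self, raw, fs_encoding):
        try:
            return raw.decode(fs_encoding), {}
        except UnicodeDecodeError:
            placeholder = "?%d" % next(self._counter)
            return placeholder, {"name_bytes": raw}


def display_name(name, metadata):
    """Hypothetical reader side for a newer client: prefer name_bytes
    and show undecodable bytes as escapes instead of the placeholder."""
    raw = metadata.get("name_bytes")
    if raw is None:
        return name
    return raw.decode("ascii", "backslashreplace")
```

An old client that ignores the metadata would simply show "?1"; a newer
one would call something like display_name and show the escaped bytes.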
Strategy 2.c. If it fails, encode the bytes in some magical way that
a later utf-8 decoding of them will get the same bytes back. This
might be the hack that Brian suggested that KDE uses to shovel
undecodable strings into some unused corner of the unicode space -- I
didn't really understand that idea. This smells to me like the same
sort of slop which created these problems in the first place (trying
to shoehorn semantically incompatible things into the same bits
without explicit flagging). If we did this, then for example python
code which called .decode() on that string would get back a unicode
object which didn't actually contain unicode chars, but contains
bytes in some unknown encoding. Hopefully we don't need to do this
since some other strategy ought to do better.
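For readers who, like me, find 2.c hard to picture: the sketch below uses
Python 3's built-in "surrogateescape" error handler. It is not
necessarily the KDE scheme mentioned above, only a similar trick in the
same spirit, and the variable names are made up for the example.

```python
raw = b"\xe9t\xe9"  # latin-1 bytes, not valid utf-8

# Each undecodable byte is mapped to a lone surrogate code point.
smuggled = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler gets the original bytes back.
round_trip = smuggled.encode("utf-8", "surrogateescape")

# But the string is not real text: a strict encode refuses it.
try:
    smuggled.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

This shows both halves of the trade-off: the bytes survive the round
trip, but the intermediate string looks like unicode while really
carrying bytes in an unknown encoding, and code that is not in on the
trick fails on it.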
Okay, folks, what do you think? One of these Strategy 2 options, or
yet a different Strategy?
By the way, Andrej, the reason that we were earlier proposing to do
Strategy 1, which Francois's patch implemented, and which rejects
your filename is because Tahoe can't know whether that filename will
come out as gibberish in certain views, such as the ls/nautilus/
konqueror that you mentioned, or if you share the file with a
friend. However, I guess in this case it is better to pass the data
through and let it be Someone Else's Problem.
Regards,
Zooko