[tahoe-dev] #534: "tahoe cp" command encoding issue
zooko
zooko at zooko.com
Fri Feb 27 12:16:39 PST 2009
On Feb 27, 2009, at 10:45 AM, Brian Warner wrote:
> [must be brief, typing on an iphone, I'll write more on Monday when
> I've got a real keyboard]
... he said before writing a note as thorough and detailed as most
programmers ever write.
> On the inbound side, if we can't decode the filename with the
> user's preferred encoding (which can default to utf-8, or utf-16 on
> windows, or something configured into python, etc)
Fortunately on Windows all filenames really are utf-16-encoded (or
UCS-2, or whatever encoding it is that the filesystem specifies), so
you'll never get a decode error, nor a silent misdecoding to random
gibberish. (Insert joke here about treating unix as a first class
citizen even though it doesn't deserve it.)
> then we pretend to decode it with Latin-1, so that a human looking
> at the mangled unicode name can hopefully guess what the proper
> name should have been. We use the unicode result as the childname.
> In all cases, we store the original bytestring in the metadata.
As I understand it from Shawn and Kevin, taking an arbitrary byte
string and decoding it with latin-1 to produce a unicode object is
lossless -- a subsequent encode of that unicode object with latin-1
will always yield the same bytes. Is that right?
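A quick way to check the round-trip claim, as a minimal sketch in Python 3 syntax (the variable names are just for illustration). Python's "latin-1" codec maps each byte value 0x00-0xFF to the code point with the same number, so the test can cover every possible byte:

```python
# Build a byte string containing every possible byte value once.
all_bytes = bytes(range(256))

# Decoding with latin-1 never raises: byte N becomes code point N.
as_text = all_bytes.decode("latin-1")

# Encoding the result with latin-1 should give back the same bytes.
round_tripped = as_text.encode("latin-1")

print(round_tripped == all_bytes)
```

If that prints True, then the latin-1 decode loses nothing for any input byte string.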
In that case, we don't need the separate base32-encoded bytestring,
just the flag to say whether the child name element was the result of
a successful decode using the encoding declared by the filesystem, or
else the result of a "fallback" latin-1 decode.
This simpler approach *does* mean that we lose information whenever
there is a file which isn't *actually* encoded in the declared
encoding of the local filesystem, but which happens to decode when
you try. However, I'm not sure it is worth the complexity of
preserving the bytes of that file's name (which after all nobody else
can decode either except by guessing at encodings). Also, note that
almost certainly the local user examining that local filename with his
local tools will see the gibberish that results from decoding that
name with his local filesystem encoding, raising the question of what
"actually" actually means in the previous sentence.
So I propose Strategy 2.d (but who's counting?):
Decode the filename with the declared encoding. If that succeeds,
then put that unicode string (utf-8 encoded) into the child name and
set the flag "latin_1_fallback: False". If that fails, then decode
the filename with latin-1 (which can't fail), put that unicode
string (utf-8 encoded) into the child name, and set the flag
"latin_1_fallback: True".
Now old tahoe clients (or lazy new ones) will just get the child
name bytes, utf-8 decode them to get a unicode string, and use it.
It will either be right, or it will be the gibberish that you get
from interpreting whatever-it-originally-was as latin-1.
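The write side of Strategy 2.d, and what an old client then sees, could be sketched roughly like this. It is only an illustration in Python 3 syntax: the function name store_child_name and the plain dict used for metadata are made up for the example and are not existing tahoe code.

```python
def store_child_name(raw_name: bytes, declared_encoding: str):
    """Return (child_name_bytes, metadata) for one local filename.

    Hypothetical helper: the name and the metadata layout are assumptions.
    """
    try:
        # First choice: the encoding the local filesystem declares.
        name = raw_name.decode(declared_encoding)
        fallback = False
    except UnicodeDecodeError:
        # Fallback: latin-1 accepts any byte string.
        name = raw_name.decode("latin-1")
        fallback = True
    # The child name is always stored utf-8 encoded.
    return name.encode("utf-8"), {"latin_1_fallback": fallback}


# A name that really is utf-8 on a utf-8 filesystem.
good_child, good_meta = store_child_name(b"caf\xc3\xa9.txt", "utf-8")

# A name whose bytes are not valid utf-8 (here they happen to be latin-1).
bad_child, bad_meta = store_child_name(b"caf\xe9.txt", "utf-8")

# An old client ignores the flag and just utf-8 decodes the child name.
old_client_view = bad_child.decode("utf-8")
```

In this particular example the original bytes happened to be latin-1, so the fallback name comes out right. For bytes in some other encoding, the old client would see the latin-1 gibberish instead.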
New and diligent tahoe clients will check the "latin_1_fallback" flag
first. If it is False, they proceed as before, knowing that they're
getting the right name. If it is True, then they take the unicode
object (which they got by utf-8-decoding the child name bytes), and
they encode it with latin-1. This gives them back the original bytes
(right?). Now they do whatever diligent tahoe clients do with the
original bytes of a filename in an unknown encoding.
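The diligent client's read side might look something like the following sketch, again in Python 3 syntax. The function name read_child_name and the dict-shaped metadata are assumptions for illustration only.

```python
def read_child_name(child_name_bytes: bytes, metadata: dict):
    """Return (unicode_name, original_bytes_or_None).

    Hypothetical helper: a missing flag is treated as False, which is
    how directory entries written by old clients would look.
    """
    name = child_name_bytes.decode("utf-8")
    if metadata.get("latin_1_fallback", False):
        # The name came from the latin-1 fallback, so encoding it with
        # latin-1 recovers the original filename bytes.
        return name, name.encode("latin-1")
    # The declared encoding worked; the unicode name is trusted as-is.
    return name, None


name, original = read_child_name(
    b"caf\xc3\xa9.txt", {"latin_1_fallback": True}
)
```

Here original should come back as the bytes that were on the uploader's disk, and the client can then do whatever it wants with a name in an unknown encoding.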
This seems simpler to me than your proposal, but I'm not sure if I
understood everything in your proposal, so I'm not sure if there is
something that this proposal wouldn't do as well.
Please everyone who understands this let me know if this would work.
Regards,
Zooko