[tahoe-dev] Fwd: [Python-Dev] PEP 383 and GUI libraries
Zooko O'Whielacronx
zookog at gmail.com
Sat May 2 19:01:45 PDT 2009
Folks:
A contributor to the python-dev list who wishes to remain anonymous
wrote me this note about Tahoe's encoding strategy.
Regards,
Zooko
---------- Forwarded message ----------
From: anonymous
Date: Fri, May 1, 2009 at 10:54 PM
Subject: Re: [Python-Dev] PEP 383 and GUI libraries
To: Zooko O'Whielacronx <zookog at gmail.com>
> Tahoe is a backup and filesharing program, so you might, for example,
> execute "tahoe cp -r Motörhead tahoe:" to copy all the contents of
> your "Motörhead" directory to your Tahoe filesystem. Later you, or a
> friend, might execute "tahoe cp -r tahoe:Motörhead ." to copy
> everything from that directory within your Tahoe filesystem to your
> local filesystem. So in this case the flow of information is
> local_system_1 -> Tahoe -> local_system_2.
>
> Requirement 1 is that for each filename encountered which is a
> valid encoding in local_system_1, then the resulting (unicode) name is
> transmitted through the Tahoe filesystem and then written out into
> local_system_2 in the expected way (i.e. just by using the Python
> unicode APIs and passing the unicode object to them).
>
> Requirement 2 is that for each filename encountered which is not a
> valid encoding in local_system_1, then the original bytes are
> transmitted through the Tahoe filesystem and then,
So, if someone sets up their file system encoding badly and copies
file names with a bad encoding to Tahoe, I mess up my file system when
I copy them back? But I only get punished in that way when I'm using
Linux; Windows users just get an error?
What happens with your encoding when someone manages to encode a file
name that contains ".." and "/", yet whatever string checking you do
on file names doesn't see that because you check in Unicode? (Ditto
if those strings go into the databases.)
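One way to guard against that, sketched here under assumptions: validate each path component at the byte level, after encoding it the way it would be written to disk. The helper name `is_safe_component` and the use of the PEP 383 `surrogateescape` handler are illustrative choices, not Tahoe's actual code.

```python
def is_safe_component(name: str, fs_encoding: str = "utf-8") -> bool:
    """Return True if `name` is safe to use as a single path component.

    Hypothetical helper: it inspects the bytes that would reach the file
    system, not just the Unicode string.
    """
    try:
        # PEP 383: escaped surrogates (U+DC80..U+DCFF) become raw bytes again.
        raw = name.encode(fs_encoding, "surrogateescape")
    except UnicodeEncodeError:
        # Any other lone surrogate or unencodable character: reject outright.
        return False
    if raw in (b"", b".", b".."):
        return False
    # "/" separates components and NUL terminates C strings on POSIX.
    return b"/" not in raw and b"\x00" not in raw
```

The point of the design is that the same bytes are both checked and written, so a quoting layer cannot hide a separator from the check.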
> if the target
> system is a byte-oriented system such as Linux, the original bytes are
> written into the target filesystem. (If the target is not Linux then
> mojibake! but we don't have to go into that now.)
Linux file systems aren't "byte oriented"; they use whatever encoding
is configured in the environment and the mount options. In fact,
POSIX specifies that the only file names you can rely on being able to
write to a POSIX file system contain only A-Z, a-z, 0-9, ".", "-", and "_".
I'd get quite annoyed if I copy something from tahoe: and my nice,
consistent UTF-8 file system now contains bad encodings that other
Unicode software trips over. Yes, it can happen right now (cp, mv,
etc. don't check for historical reasons), but new software should be
better than that.
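To make the concern concrete, here is a minimal sketch of how undecodable bytes survive a round trip, assuming the PEP 383 `surrogateescape` error handler (Python 3.1 and later) is the mechanism in play. The function names are made up for illustration.

```python
def fs_decode(raw: bytes) -> str:
    """Decode as UTF-8, smuggling each undecodable byte as a lone surrogate."""
    return raw.decode("utf-8", "surrogateescape")


def fs_encode(name: str) -> bytes:
    """Inverse of fs_decode: escaped surrogates turn back into the raw bytes."""
    return name.encode("utf-8", "surrogateescape")


def has_escaped_bytes(name: str) -> bool:
    """True if the string carries bytes that were not valid UTF-8."""
    return any("\udc80" <= ch <= "\udcff" for ch in name)


# A Latin-1 encoded name, which is not valid UTF-8.
latin1_name = "Motörhead".encode("latin-1")
text = fs_decode(latin1_name)
```

Writing `fs_encode(text)` back to a UTF-8 file system reproduces the original invalid byte sequence, which is exactly the bad encoding described above; `has_escaped_bytes` is one way a tool could notice and refuse.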
What happens when the target file system enforces constraints? Linux,
Mac, and Windows all have file systems that enforce UTF-8 (i.e., you
cannot write non-compliant strings).
You also need to think about where the source of the problem is and
how it can be fixed. If someone copies a badly encoded file name to
tahoe: and you don't give an error there and then, nobody will ever
really know what the problem was. A short, badly encoded file name
might have been Hebrew or Greek or Chinese in some encoding. The only
person who knows is the person copying the file into Tahoe. The right
place to do something about it is when the problem occurs:
    $ tahoe cp -r Motörhead tahoe:
    ERROR: bad encodings in source path names
    copying files like this would make the repository inconsistent
    use -F to fix this interactively, -H to do something automatically,
    -e to select an encoding
    $
    -H tries UTF-8, then a bunch of encodings like iso8859-1,
       windows-1252, etc. The last one will always work and result in
       correctly encoded strings, but not necessarily readable ones.
    -F finds all the encodings in which the strings are valid and
       sensible, then asks the user.
    -e enc selects another encoding and still checks (windows-1252 always
       works, of course).
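The -H and -F behaviours could be sketched roughly as follows. These flags are only a proposal, so the function names and the candidate list are assumptions. One caveat for this sketch: Python's own `cp1252` codec rejects five undefined bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D), so `latin-1`, which maps every byte, is placed last as the fallback that cannot fail.

```python
# Assumed candidate order; latin-1 goes last because it accepts every byte.
DEFAULT_CANDIDATES = ("utf-8", "cp1252", "latin-1")


def auto_decode(raw: bytes, candidates=DEFAULT_CANDIDATES):
    """-H style: return (text, encoding) for the first encoding that decodes."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode %r" % (raw,))


def candidate_decodings(raw: bytes, encodings):
    """-F style: map each encoding that decodes strictly to its result.

    Judging which results are "sensible" is left to the user here.
    """
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            pass
    return results
```

For example, `auto_decode(b"Mot\xf6rhead")` falls through UTF-8 and settles on `cp1252`, while `candidate_decodings` would hand an interactive tool every plausible reading to show the user.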
Adding a new quoting mechanism and putting bad encodings into a
network file system is dangerous.
The best thing to do for Tahoe is likely to define valid UTF-8 as the
encoding for all Tahoe file names and to raise an error when people
try to stick anything else in there.
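A minimal sketch of such a check at the point of ingestion, assuming names are collected as bytes before anything is uploaded. `BadEncodingError`, `require_utf8`, and `tree_names` are hypothetical names, and the error text simply echoes the message suggested earlier.

```python
import os


class BadEncodingError(ValueError):
    """Raised when a path name is not valid UTF-8 (hypothetical name)."""


def is_valid_utf8(raw: bytes) -> bool:
    """Strict check: no escaping, no replacement characters."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def require_utf8(names):
    """Return the decoded names, or raise listing every offending name."""
    names = list(names)
    bad = [n for n in names if not is_valid_utf8(n)]
    if bad:
        raise BadEncodingError(
            "bad encodings in source path names: "
            + ", ".join(repr(n) for n in bad)
        )
    return [n.decode("utf-8") for n in names]


def tree_names(root: bytes):
    """Yield every directory and file name under `root` as raw bytes."""
    # Passing bytes to os.walk makes it yield bytes, so nothing is decoded.
    for _dirpath, dirnames, filenames in os.walk(root):
        yield from dirnames
        yield from filenames
```

Calling `require_utf8(tree_names(b"Mot\xc3\xb6rhead"))` before a copy would stop at the source, where the person who knows the intended encoding is still present.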
Anon
PS: I'm not replying to the list because PEP 383 is in Python now.
However, please consider at least not introducing this badness into
network file systems.