[tahoe-dev] [tahoe-lafs] #534: "tahoe cp" command encoding issue
tahoe-lafs
trac at allmydata.org
Tue Apr 28 11:00:54 PDT 2009
#534: "tahoe cp" command encoding issue
-----------------------------------+----------------------------------------
Reporter: francois | Owner: francois
Type: defect | Status: assigned
Priority: minor | Milestone: 1.5.0
Component: code-frontend-cli | Version: 1.2.0
Resolution: | Keywords: cp encoding unicode filename utf-8
Launchpad_bug: |
-----------------------------------+----------------------------------------
Comment(by zooko):
I wrote: """Using utf-8b to store bytes from a failed decoding instead of
iso-8859-1 means ... that we can omit the "failed_decode" flag, because it
makes no difference whether the filename was originally alleged to be in
koi8-r, but failed to decode using the koi8-r codec, and so was instead
decoded using utf-8b, or whether the filename was originally alleged to be
in ascii or utf-8, and was decoded using utf-8b. (Right? I think that's
right.)"""
Oh no, I am wrong about this because of the existence of byte-oriented
systems where the filesystem encoding is not {{{utf-8}}}. When outputting
a filename into such a system, you ought to check the "failed_decode" flag
and, if it is set, reconstitute the original bytes before proceeding to
emit the name using the byte-oriented API.
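To make the failure mode concrete, here is a minimal sketch in present-day Python 3, where the `surrogateescape` error handler (PEP 383) plays the role that utf-8b plays in this ticket. The names `FS_ENCODING`, `decode_name` and `encode_name`, and the sample bytes, are hypothetical and chosen only for illustration. It shows that, on a system whose filesystem encoding is not utf-8, a name that failed to decode can only be written back correctly if the flag is consulted.

```python
# Python 3 sketch. 'surrogateescape' stands in for the ticket's "utf-8b";
# FS_ENCODING and the sample names are made-up values for illustration.
FS_ENCODING = "euc_jp"  # hypothetical non-UTF-8 filesystem encoding


def decode_name(raw):
    """Return (text, failed_decode) for a raw filename given as bytes."""
    try:
        return raw.decode(FS_ENCODING), False
    except UnicodeDecodeError:
        # Stuff the undecodable bytes into the string as lone surrogates.
        return raw.decode("utf-8", "surrogateescape"), True


def encode_name(text, failed_decode):
    """Return the bytes to hand to a byte-oriented filesystem API."""
    if failed_decode:
        # Reconstitute the original bytes instead of treating text as chars.
        return text.encode("utf-8", "surrogateescape")
    return text.encode(FS_ENCODING)


good = "\u65e5\u672c".encode(FS_ENCODING)  # a name that decodes cleanly
bad = b"\xff\xfe"                          # not valid EUC-JP

# With the flag, both kinds of name round-trip to their original bytes.
for raw in (good, bad):
    text, failed = decode_name(raw)
    assert encode_name(text, failed) == raw

# Without the flag, the stuffed name cannot be encoded at all.
text, failed = decode_name(bad)
try:
    encode_name(text, False)
    ignoring_flag_works = True
except UnicodeEncodeError:
    ignoring_flag_works = False
```

When the filesystem encoding is utf-8 the two branches of `encode_name` coincide, which is why the flag looked redundant in the quoted paragraph.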
Here is some code which attempts to explain what I mean. It is untested,
and it assumes that a {{{utf-8b}}} codec and a {{{python-escape}}} error
handler have been registered separately, but writing it helped me think
this through a little more:
{{{
# A wrapper around the Python Standard Library's filename access functions to
# provide a uniform API for all platforms and to prevent lossy en/de-coding.
#
# Python 2 sketch. The 'utf-8b' codec and the 'python-escape' error handler
# are not part of the standard library; they are assumed to be registered
# elsewhere before these functions are called.

import os
import platform
import sys
import unicodedata

from twisted.python import usage
from allmydata.util.assertutil import precondition

class Fname:
    def __init__(self, name, failed_decode=False, alleged_encoding=None):
        self.name = name
        self.failed_decode = failed_decode
        self.alleged_encoding = alleged_encoding

# Note: platform.system() reports Solaris as 'SunOS'.
if platform.system() in ('Linux', 'SunOS'):
    # on byte-oriented filesystems, such as Linux and Solaris

    def unicode_to_fs(fn):
        """ Encode the unicode name of an Fname to bytes. """
        precondition(isinstance(fn, Fname), fn)
        precondition(isinstance(fn.name, unicode), fn.name)

        if fn.failed_decode:
            # This means that the unicode string in .name is not actually the
            # result of a successful decoding with a suggested codec, but is
            # instead the result of stuffing the bytes into a unicode by dint
            # of the utf-8b trick. This means that on a byte-oriented system,
            # you shouldn't treat the .name as a unicode string containing
            # chars, but instead you should get the original bytes back out
            # of it.
            return fn.name.encode('utf-8b', 'python-escape')
        else:
            fsencoding = sys.getfilesystemencoding()
            if fsencoding in (None, '', 'ascii', 'utf-8'):
                fsencoding = 'utf-8b'
            try:
                return fn.name.encode(fsencoding, 'python-escape')
            except UnicodeEncodeError:
                raise usage.UsageError(
                    "Filename '%s' cannot be encoded using the current "
                    "encoding of your filesystem (%s). Please configure your "
                    "locale correctly or rename this file."
                    % (fn.name, sys.getfilesystemencoding()))

    def fs_to_unicode(bytesfn):
        """ Decode bytes from the filesystem to an Fname. """
        precondition(isinstance(bytesfn, str), bytesfn)

        alleged_encoding = sys.getfilesystemencoding()
        if alleged_encoding in (None, '', 'ascii', 'utf-8'):
            alleged_encoding = 'utf-8b'
        try:
            unicodefn = bytesfn.decode(alleged_encoding, 'strict')
        except UnicodeDecodeError:
            # Decoding failed: stuff the bytes into a unicode object and set
            # the flag so that unicode_to_fs() can get them back out.
            unicodefn = bytesfn.decode('utf-8b', 'python-escape')
            return Fname(unicodefn, failed_decode=True)
        else:
            unicodefn = unicodedata.normalize('NFC', unicodefn)
            if alleged_encoding == 'utf-8b':
                return Fname(unicodefn)
            else:
                return Fname(unicodefn, alleged_encoding=alleged_encoding)

    def listdir(fn):
        assert isinstance(fn, Fname), fn
        assert isinstance(fn.name, unicode), fn.name

        bytesfn = unicode_to_fs(fn)
        res = os.listdir(bytesfn)
        return [fs_to_unicode(name) for name in res]

else:
    # on unicode-oriented filesystems, such as Mac and Windows

    def listdir(fn):
        assert isinstance(fn, Fname), fn
        assert isinstance(fn.name, unicode), fn.name

        return [Fname(n) for n in os.listdir(fn.name)]
}}}
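For comparison, Python 3 on POSIX systems applies a similar trick inside the standard library: `os.fsdecode` and `os.fsencode` use the filesystem encoding together with the `surrogateescape` error handler, so undecodable bytes survive a trip through `str`. The sketch below assumes a POSIX platform (Windows uses a different error handler), and `roundtrip` is a hypothetical helper, not part of Tahoe-LAFS.

```python
import os


def roundtrip(raw):
    """Decode raw filename bytes to str, then encode them back (POSIX)."""
    text = os.fsdecode(raw)  # filesystem encoding + 'surrogateescape'
    return text, os.fsencode(text)


# An ASCII name, a Latin-1 name and a run of arbitrary high bytes.
for raw in (b"plain.txt", b"caf\xe9.txt", b"\xff\xfe"):
    text, back = roundtrip(raw)
    assert isinstance(text, str)
    assert back == raw
```

Unlike the `Fname` wrapper above, this approach carries no separate flag: the lone surrogates in the string are themselves the marker that some bytes failed to decode.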
Also attached to this ticket...
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/534#comment:59>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid