[tahoe-dev] [tahoe-lafs] #534: "tahoe cp" command encoding issue

Tue Apr 28 11:00:54 PDT 2009

#534: "tahoe cp" command encoding issue
-----------------------------------+----------------------------------------
     Reporter:  francois           |       Owner:  francois                          
         Type:  defect             |      Status:  assigned                          
     Priority:  minor              |   Milestone:  1.5.0                             
    Component:  code-frontend-cli  |     Version:  1.2.0                             
   Resolution:                     |    Keywords:  cp encoding unicode filename utf-8
Launchpad_bug:                     |  
-----------------------------------+----------------------------------------

Comment(by zooko):

 I wrote: """Using utf-8b to store bytes from a failed decoding instead of
 iso-8859-1 means ... that we can omit the "failed_decode" flag, because it
 makes no difference whether the filename was originally alleged to be in
 koi8-r, but failed to decode using the koi8-r codec, and so was instead
 decoded using utf-8b, or whether the filename was originally alleged to be
 in ascii or utf-8, and was decoded using utf-8b. (Right? I think that's
 right.)"""

 Oh no, I am wrong about this because of the existence of byte-oriented
 systems where the filesystem encoding is not {{{utf-8}}}.  When outputting
 a filename into such a system, you ought to check the "failed_decode" flag
 and, if it is set, reconstitute the original bytes before proceeding to
 emit the name using the byte-oriented API.

 Here is some code which attempts to explain what I mean.  It doesn't
 actually run -- for example it is missing its {{{import}}} statements --
 but writing it helped me think this through a little more:

 {{{
 # A wrapper around the Python Standard Library's filename access functions
 to
 # provide a uniform API for all platforms and to prevent lossy en/de-
 coding.

 class Fname:
     def __init__(self, name, failed_decode=False, alleged_encoding=None):
         self.name = name
         self.failed_decode = failed_decode
         self.alleged_encoding = alleged_encoding

 if platform.system() in ('Linux', 'Solaris'):
     # on byte-oriented filesystems, such as Linux and Solaris

     def unicode_to_fs(fn):
         """ Encode an unicode object to bytes. """
         precondition(isinstance(fn, Fname), fn)
         precondition(isinstance(fn.name, unicode), fn.name)

         if fn.failed_decode:
             # This means that the unicode string in .name is not actually
 the
             # result of a successful decoding with a suggested codec, but
 is
             # instead the result of stuffing the bytes into a unicode by
 dint
             # of the utf-8b trick.  This means that on a byte-oriented
 system,
             # you shouldn't treat the .name as a unicode string containing
             # chars, but instead you should get the original bytes back
 out of
             # it.
             return fn.name.encode('utf-8b', 'python-replace')
         else:
             fsencoding = sys.getfilesystemencoding()
             if fsencoding in (None, '', 'ascii', 'utf-8'):
                 fsencoding = 'utf-8b'
             try:
                 return fn.name.encode(encoding, 'python-escape')
             except UnicodeEncodeError:
                 raise usage.UsageError("Filename '%s' cannot be encoded
 using  \
 the current encoding of your filesystem (%s). Please configure your locale
 \
 correctly or rename this file." % (s, sys.getfilesystemencoding()))

     def fs_to_unicode(bytesfn):
         """ Decode bytes from the filesystem to a unicode object. """
         precondition(isinstance(bytesfn, str), str)

         alleged_encoding = sys.getfilesystemencoding()
         if alleged_encoding in (None, '', 'ascii', 'utf-8'):
             alleged_encoding = 'utf-8b'

         try:
             unicodefn = bytesfn.decode(alleged_encoding, 'strict')
         except UnicodeEncodeError:
             unicodefn = bytesfn.decode('utf-8b', 'python-escape')
             return Fname(unicodefn)
         else:
             unicodefn = unicodedata.normalize('NFC', unicodefn)
             if alleged_encoding == 'utf-8b':
                 return Fname(unicodefn)
             else:
                 return Fname(unicodefn, alleged_encoding)

     def listdir(fn):
         assert isinstance(fn, Fname), fn
         assert isinstance(fn.name, unicode), fn.name
         bytesfn = unicode_to_fs(fn.name)
         res = os.listdir(bytesfn)
         return([fs_to_unicode(fn) for fn in res])

 else:
     # on unicode-oriented filesystems, such as Mac and Windows
     def listdir(fn):
         assert isinstance(fn, Fname), fn
         assert isinstance(fn.name, unicode), fn.name
         return [Fname(n) for n in os.listdir(fn.name)]
 }}}

 Also attached to this ticket...

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/534#comment:59>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid