[tahoe-dev] [tahoe-lafs] #534: "tahoe cp" command encoding issue

tahoe-lafs trac at allmydata.org
Sun May 3 08:14:28 PDT 2009


#534: "tahoe cp" command encoding issue
-----------------------------------+----------------------------------------
     Reporter:  francois           |       Owner:  francois                          
         Type:  defect             |      Status:  assigned                          
     Priority:  minor              |   Milestone:  1.5.0                             
    Component:  code-frontend-cli  |     Version:  1.2.0                             
   Resolution:                     |    Keywords:  cp encoding unicode filename utf-8
Launchpad_bug:                     |  
-----------------------------------+----------------------------------------

Comment(by zooko):

 Okay, my current design is at the end of this message, including
 rationale:

 http://allmydata.org/pipermail/tahoe-dev/2009-May/001670.html

 Here is the summary:

 To copy an entry from a local filesystem into Tahoe:

 1. On Windows or Mac, read the filename with the unicode APIs.
 Normalize the string with filename = unicodedata.normalize('NFC',
 filename). Leave the "original_bytes" key and the "failed_decode" flag
 out of the metadata.

 2. On Linux or Solaris, read the filename with the byte-string APIs and
 store the result in the "original_bytes" part of the metadata. Call
 sys.getfilesystemencoding() to get an alleged_encoding. Then call
 bytes.decode(alleged_encoding, 'strict') to try to get a unicode
 object.

 2.a. If this decoding succeeds then normalize the unicode filename
 with filename = unicodedata.normalize('NFC', filename), store the
 resulting filename and leave the "failed_decode" flag out of the
 metadata.

 2.b. If this decoding fails, then decode it again with
 bytes.decode('latin-1', 'strict'), which cannot fail because latin-1
 maps every byte to a code point. Do not normalize it. Store the
 resulting unicode object into the "filename" part and set the
 "failed_decode" flag to True. This is mojibake!

 3. (handling collisions) In either case 2.a or 2.b, the resulting
 unicode string may already be present in the directory. If so, check
 the failed_decode flags on the current entry and the new entry. If
 they are both set or both unset then the new entry overwrites the old
 entry -- they had the same name. If the failed_decode flags differ
 then this is a case of collision -- the old entry and the new entry
 had (as far as we are concerned) different names that accidentally
 generated the same unicode. Alter the new entry's name, for example by
 appending "~1" and then trying again and incrementing the number until
 it doesn't match any extant entry.
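
 Steps 2 and 3 above could be sketched roughly as follows. This is only
 an illustration, not the actual patch: it is written for Python 3
 (where bytes and str stand in for the str and unicode types discussed
 here), the function names decode_local_name and add_entry are
 hypothetical, and the directory is modelled as a plain dict mapping
 each name to its metadata. It also assumes that "trying again" in step
 3 means only looking for the first unused "~N" suffix.

```python
import sys
import unicodedata


def decode_local_name(raw, alleged_encoding=None):
    """Step 2: turn a raw byte filename into (filename, metadata).

    Hypothetical helper; `raw` is the byte string read from the local
    filesystem.
    """
    if alleged_encoding is None:
        alleged_encoding = sys.getfilesystemencoding()
    metadata = {"original_bytes": raw}
    try:
        name = raw.decode(alleged_encoding, "strict")
    except UnicodeDecodeError:
        # Step 2.b: fall back to latin-1, do not normalize, set the flag.
        metadata["failed_decode"] = True
        return raw.decode("latin-1", "strict"), metadata
    # Step 2.a: normalize, and leave "failed_decode" out of the metadata.
    return unicodedata.normalize("NFC", name), metadata


def add_entry(directory, name, metadata):
    """Step 3: insert into `directory` (a dict, name -> metadata).

    Returns the name the entry was actually stored under.
    """
    new_flag = metadata.get("failed_decode", False)
    if name in directory:
        old_flag = directory[name].get("failed_decode", False)
        if old_flag != new_flag:
            # Collision: different names produced the same unicode.
            # Append "~1", "~2", ... until the name is unused.
            n = 1
            while "%s~%d" % (name, n) in directory:
                n += 1
            name = "%s~%d" % (name, n)
        # Otherwise the flags agree, so the new entry overwrites the old.
    directory[name] = metadata
    return name


# Example: the same visible name arrives once as valid UTF-8 and once
# as latin-1 bytes that fail to decode as UTF-8.
directory = {}
good_name, good_meta = decode_local_name(b"caf\xc3\xa9", "utf-8")
bad_name, bad_meta = decode_local_name(b"caf\xe9", "utf-8")
stored_good = add_entry(directory, good_name, good_meta)
stored_bad = add_entry(directory, bad_name, bad_meta)
```

 In this example both byte strings yield the same unicode name, but
 only the second carries the failed_decode flag, so the second entry is
 stored under the name with "~1" appended.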

 To copy an entry from Tahoe into a local filesystem:

 Always use the Python unicode API. The original_bytes field and the
 failed_decode field in the metadata are not consulted.

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/534#comment:66>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid

