[tahoe-dev] [tahoe-lafs] #534: "tahoe cp" command encoding issue
tahoe-lafs
trac at allmydata.org
Sun May 3 08:14:28 PDT 2009
#534: "tahoe cp" command encoding issue
-----------------------------------+----------------------------------------
Reporter: francois | Owner: francois
Type: defect | Status: assigned
Priority: minor | Milestone: 1.5.0
Component: code-frontend-cli | Version: 1.2.0
Resolution: | Keywords: cp encoding unicode filename utf-8
Launchpad_bug: |
-----------------------------------+----------------------------------------
Comment(by zooko):
Okay, my current design is at the end of this message, including
rationale:
http://allmydata.org/pipermail/tahoe-dev/2009-May/001670.html
Here is the summary:
To copy an entry from a local filesystem into Tahoe:
1. On Windows or Mac read the filename with the unicode APIs.
Normalize the string with filename = unicodedata.normalize('NFC',
filename). Leave the "original_bytes" key and the "failed_decode" flag
out of the metadata.
2. On Linux or Solaris read the filename with the string APIs, and
store the result in the "original_bytes" part of the metadata. Call
sys.getfilesystemencoding() to get an alleged_encoding. Then, call
bytes.decode(alleged_encoding, 'strict') to try to get a unicode
object.
2.a. If this decoding succeeds then normalize the unicode filename
with filename = unicodedata.normalize('NFC', filename), store the
resulting filename and leave the "failed_decode" flag out of the
metadata.
2.b. If this decoding fails, then we decode it again with
bytes.decode('latin-1', 'strict'). Do not normalize it. Store the
resulting unicode object into the "filename" part, set the
"failed_decode" flag to True. This is mojibake!
3. (handling collisions) In either case 2.a or 2.b the resulting
unicode string may already be present in the directory. If so, check
the failed_decode flags on the current entry and the new entry. If
they are both set or both unset then the new entry overwrites the old
entry -- they had the same name. If the failed_decode flags differ
then this is a case of collision -- the old entry and the new entry
had (as far as we are concerned) different names that accidentally
generated the same unicode. Alter the new entry's name, for example by
appending "~1" and then trying again and incrementing the number until
it doesn't match any extant entry.
To copy an entry from Tahoe into a local filesystem:
Always use the Python unicode API. The original_bytes field and the
failed_decode field in the metadata are not consulted.
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/534#comment:66>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid
More information about the tahoe-dev
mailing list