#1076 closed defect (fixed)

Unicode normalization needs to be applied to filenames in more cases

Reported by: davidsarah Owned by: zooko
Priority: major Milestone: 1.7.0
Component: code-dirnodes Version: 1.7β
Keywords: unicode sftp ftpd wui cli usability forward-compatibility reviewed Cc:
Launchpad Bug:

Description (last modified by davidsarah)

Currently, the CLI normalizes filenames to NFC when listing the contents of a local directory in listdir_unicode (used by tahoe cp and tahoe backup), but that is the only point at which filenames are normalized.

So, unnormalized filenames can get into Tahoe directories via CLI arguments, SFTP, FTP, and the WUI.

This is a forward-compatibility issue because, if we have any non-NFC filenames stored in Tahoe directories that we need to maintain compatibility with, then we would have to normalize when reading filenames out of Tahoe directories and not just when putting filenames into them.

Attachments (3)

nfc-normalization.dpatch (66.7 KB) - added by davidsarah at 2010-06-16T03:29:12Z.
Provisional patch to NFC-normalize filenames going in and out of Tahoe directories.
nfc-normalization-2.dpatch (86.6 KB) - added by davidsarah at 2010-06-17T04:31:09Z.
Patch bundle for normalization changes including tests. Also work around a bug in locale.getpreferredencoding, and add support for Unicode 'exclude' patterns in 'tahoe backup'.
nfc-normalization-3.dpatch (96.9 KB) - added by davidsarah at 2010-06-18T02:17:53Z.
Patch bundle for normalization changes including tests (and a new test for normalization of names coming out of a directory). Also work around a bug in locale.getpreferredencoding. Fixes a hold in the previous patch where childnames in directories created by modemaker.create_mutable/immutable_directory would not be normalized. Does not include the 'tahoe backup' change.

Download all attachments as: .zip

Change History (16)

comment:1 Changed at 2010-06-10T00:02:18Z by davidsarah

  • Description modified (diff)

comment:2 Changed at 2010-06-10T00:04:57Z by davidsarah

  • Component changed from code-frontend to code-dirnodes

It should probably be the dirnode implementation that enforces this, so that we are not having to normalize in multiple frontends.

comment:3 Changed at 2010-06-10T19:14:59Z by zooko

  • Milestone changed from undecided to 1.7.0

Since this is a potentially significant forward-compatibility issue and potentially significant bug, we're going to fix it for 1.7.0-final.

Changed at 2010-06-16T03:29:12Z by davidsarah

Provisional patch to NFC-normalize filenames going in and out of Tahoe directories.

comment:4 follow-up: Changed at 2010-06-16T03:44:31Z by davidsarah

nfc-normalization.dpatch also normalizes names to NFC when unpacking them from directories. This isn't absolutely necessary, but if a name contains characters that are unassigned in the version of Unicode used by the client that wrote the directory, then they might not be normalized wrt a later version of Unicode.

The patch does not remove the normalization from listdir_unicode in source:src/allmydata/util/stringutils.py. We should not be normalizing at that point, because:

  • if the local filesystem API is normalizing (even using NFD rather than NFC, as in the case of Mac OS X), then we don't need to.
  • if the local filesystem API is not normalizing, then users of listdir_unicode such as tahoe cp and tahoe backup may get a 'file not found' error when they try to read the file by its normalized name.

The patch does not have any tests.

comment:5 in reply to: ↑ 4 ; follow-up: Changed at 2010-06-16T04:06:07Z by zooko

Replying to davidsarah:

The patch does not remove the normalization from listdir_unicode in source:src/allmydata/util/stringutils.py. We should not be normalizing at that point, because:

  • if the local filesystem API is normalizing (even using NFD rather than NFC, as in the case of Mac OS X), then we don't need to.

Don't we need to re-normalize to NFC before putting that name into a Tahoe-LAFS directory?

Oh, but that will happen at the other call site -- at the Tahoe-LAFS directory insertion point. Right?

comment:6 follow-ups: Changed at 2010-06-16T04:23:36Z by zooko

Review:

Otherwise, this looks like a good patch! Thank you!

comment:7 in reply to: ↑ 6 Changed at 2010-06-16T04:24:35Z by zooko

Replying to zooko:

  • Is it even possible to have a name containing characters that are unassigned in the current version of unicode? Perhaps you could explain further in this comment or point to a larger explanation in some doc text file or at least refer to some external 3rd party document that describes this situation? http://tahoe-lafs.org/trac/tahoe-lafs/attachment/ticket/1076/nfc-normalization.dpatch#L158
  • In any case, I look forward to seeing how you are going to unit-test this case! I guess the test needs to stick the utf-8 encoding of a non-normalized string into a directory so that it will exercise this case.

Oh, this method of testing also suggests a reason why we need the code: because releases of Tahoe-LAFS < v1.7 might put non-normalized names into directories.

comment:8 in reply to: ↑ 6 ; follow-up: Changed at 2010-06-16T05:05:26Z by davidsarah

Replying to zooko:

Review:

  • Please write down into some doc why we chose NFC for our normalization scheme.

Will do (probably in webapi.txt).

It's quite difficult to avoid all possible cases of double-normalization without breaking abstraction. (You would have to add another method that assumed its argument was already normalized, and ensure that assumption was always met.)

  • Is it even possible to have a name containing characters that are unassigned in the current version of unicode?

Yes, we don't check for unassigned characters.

Perhaps you could explain further in this comment or point to a larger explanation in some doc text file or at least refer to some external 3rd party document that describes this situation? http://tahoe-lafs.org/trac/tahoe-lafs/attachment/ticket/1076/nfc-normalization.dpatch#L158

<http://unicode.org/policies/stability_policy.html>, see the note in section 'Normalization Stability'.

  • In any case, I look forward to seeing how you are going to unit-test this case! I guess the test needs to stick the utf-8 encoding of a non-normalized string into a directory so that it will exercise this case.

Right. We can also check that we handle other misencoded directory contents that way (which is a test that was left undone in 1.6.0 and .1).

comment:9 in reply to: ↑ 5 Changed at 2010-06-16T05:06:43Z by davidsarah

Replying to zooko:

Replying to davidsarah:

The patch does not remove the normalization from listdir_unicode in source:src/allmydata/util/stringutils.py. We should not be normalizing at that point, because:

  • if the local filesystem API is normalizing (even using NFD rather than NFC, as in the case of Mac OS X), then we don't need to.

Don't we need to re-normalize to NFC before putting that name into a Tahoe-LAFS directory?

Oh, but that will happen at the other call site -- at the Tahoe-LAFS directory insertion point. Right?

Right.

comment:10 in reply to: ↑ 8 Changed at 2010-06-16T16:47:18Z by zooko

Replying to davidsarah:

Perhaps you could explain further in this comment or point to a larger explanation in some doc text file or at least refer to some external 3rd party document that describes this situation? http://tahoe-lafs.org/trac/tahoe-lafs/attachment/ticket/1076/nfc-normalization.dpatch#L158

<http://unicode.org/policies/stability_policy.html>, see the note in section 'Normalization Stability'.

Please put this reference into the comments or docs somewhere. Thanks!

Changed at 2010-06-17T04:31:09Z by davidsarah

Patch bundle for normalization changes including tests. Also work around a bug in locale.getpreferredencoding, and add support for Unicode 'exclude' patterns in 'tahoe backup'.

comment:11 Changed at 2010-06-17T05:14:31Z by zooko

  • Owner set to zooko
  • Status changed from new to assigned

Too tired. Will review tomorrow morning on the bus to work. I hope there are or will be tests for these new things: "work around a bug in locale.getpreferredencoding" and "add support for Unicode 'exclude' patterns in 'tahoe backup'"...

Changed at 2010-06-18T02:17:53Z by davidsarah

Patch bundle for normalization changes including tests (and a new test for normalization of names coming out of a directory). Also work around a bug in locale.getpreferredencoding. Fixes a hold in the previous patch where childnames in directories created by modemaker.create_mutable/immutable_directory would not be normalized. Does not include the 'tahoe backup' change.

comment:12 Changed at 2010-06-18T05:14:33Z by zooko

  • Keywords reviewed added
  • Resolution set to fixed
  • Status changed from assigned to closed
Note: See TracTickets for help on using tickets.