#1072 closed enhancement (fixed)

rename stringutils.py to encodingutil.py and/or move contents into fileutil.py

Reported by: zooko Owned by: zooko
Priority: minor Milestone: 1.7.1
Component: code Version: 1.6.1
Keywords: unicode cleanup easy reviewed Cc:
Launchpad Bug:

Description

See also #47 (use pyutil as a separate package and contribute src/allmydata/util/* into pyutil)

Attachments (2)

rename-stringutils-drop-listdir_unicode-and-open_unicode.dpatch (32.1 KB) - added by davidsarah at 2010-07-12T00:51:52Z.
Rename stringutils to encodingutil, and drop listdir_unicode and open_unicode (since the Python stdlib functions work fine with Unicode paths). Also move some utility functions to fileutil.
rename-stringutils-drop-open_unicode.dpatch (30.7 KB) - added by davidsarah at 2010-07-13T04:44:46Z.
Rename stringutils to encodingutil, and drop open_unicode (since the Python 'open' function works fine with Unicode paths).

Download all attachments as: .zip

Change History (21)

comment:1 Changed at 2010-06-21T03:15:07Z by davidsarah

  • Milestone changed from undecided to 1.7.1
  • Owner changed from somebody to davidsarah
  • Status changed from new to assigned

comment:2 Changed at 2010-06-21T03:15:32Z by davidsarah

  • Keywords easy added

Changed at 2010-07-12T00:51:52Z by davidsarah

Rename stringutils to encodingutil, and drop listdir_unicode and open_unicode (since the Python stdlib functions work fine with Unicode paths). Also move some utility functions to fileutil.

comment:3 follow-up: Changed at 2010-07-12T00:54:43Z by davidsarah

  • Keywords review-needed added

The patch bundle is dependent on one of the patches for #1051, because that added a reference to stringutils that needed to be renamed.

comment:4 Changed at 2010-07-12T00:54:57Z by davidsarah

  • Owner changed from davidsarah to somebody
  • Status changed from assigned to new

comment:5 Changed at 2010-07-12T00:55:40Z by davidsarah

  • Summary changed from rename stringutils.py to unicodeutil.py and/or move contents into fileutil.py to rename stringutils.py to encodingutil.py and/or move contents into fileutil.py

comment:6 follow-up: Changed at 2010-07-12T04:38:26Z by zooko

I would have done the renaming with darcs replace. That way if there is a different patch that adds a use of stringutils then when it is combined with this patch it will automatically be changed ("commuted") to use encodingutil instead.

I'm pretty skeptical of the part about dropping listdir_unicode(). Did you confirm that the builtin os.listdir() passes the unit tests that François wrote for listdir_unicode()? If os.listdir() does pass those tests then I think this shows a hole in the tests. :-)

os.listdir(someunicodeobj) is specified to return plain str containing the bytes of a filename if the filename doesn't decode correctly with the sys.getfilesystemencoding(). That's probably not what we want, and in any case it is definitely not what listdir_unicode() does.

My summary of the behavior of os.listdir() is at the end of this letter:

http://tahoe-lafs.org/pipermail/tahoe-dev/2009-March/001379.html

(Note that in Python 3 os.listdir() is changed to behave in a way that is, in my humble opinion, even worse... But nevermind Python 3 for now.)

Here is my latest and greatest idea about how Tahoe-LAFS could handle ill-encoded filenames in a byte-oriented filesystem (i.e. in Unix not Mac OS X):

http://tahoe-lafs.org/pipermail/tahoe-dev/2009-May/001670.html

It is worth considering the five possible Requirements in that message. With our current unicode support as of Tahoe-LAFS v1.7.0 we have achieved Requirement 1 (unicode), Requirement 2 (faithful if unicode). We have not achieved Requirement 3 (no file left behind), Requirement 4 (faithful bytes if not unicide), or Requirement 5 (no loss of information).

Nowadays I am pretty skeptical of the value of Requirement 4.

comment:7 Changed at 2010-07-12T04:40:01Z by zooko

  • Keywords review-needed removed
  • Owner changed from somebody to davidsarah

P.S. Of course I don't really think we should try to get any more of those Requirements satisfied in v1.7.1! Even if we could do it in time, our users don't expect shiny new improvements in their point releases, just bugfixes. :-)

comment:8 Changed at 2010-07-12T04:43:53Z by zooko

Oh sorry, the mailing list message that I linked to in comment:6 as my latest and greatest idea is not actually my latest and greatest. After I wrote that message I subsequently realized that a good behavior would be that if you load an ill-encoded filename into Tahoe-LAFS then its representation looks identical to or similar to the representation of that file when you view it with Nautilus, GNU ls, or whatever other tools would have the same problem with ill-encoded filenames. I think this should be added as Requirement 6 (familiar gibberish): "If you copy an ill-encoded filename into Tahoe-LAFS, its filename looks identical to or similar to what you see when you view it with other tools (e.g. Nautilus, GNU ls, etc.)".

comment:9 in reply to: ↑ 6 ; follow-up: Changed at 2010-07-13T04:38:25Z by davidsarah

Replying to zooko:

I would have done the renaming with darcs replace.

Sorry, don't trust darcs replace. I prefer to do replaces manually.

That way if there is a different patch that adds a use of stringutils then when it is combined with this patch it will automatically be changed ("commuted") to use encodingutil instead.

s/automatically be changed/scarily be mangled/g :-)

I'm pretty skeptical of the part about dropping listdir_unicode(). Did you confirm that the builtin os.listdir() passes the unit tests that François wrote for listdir_unicode()?

Those are tests of how listdir_unicode() is implemented in terms of os.listdir, rather than its functional behaviour.

os.listdir(someunicodeobj) is specified to return plain str containing the bytes of a filename if the filename doesn't decode correctly with the sys.getfilesystemencoding(). That's probably not what we want, and in any case it is definitely not what listdir_unicode() does.

Ah, I hadn't realized it did that. You're right, we can't drop it in that case. I will revert those changes.

Discussion of ill-encoded filenames more generally should go in ticket #731.

Last edited at 2010-07-13T04:45:55Z by davidsarah (previous) (diff)

Changed at 2010-07-13T04:44:46Z by davidsarah

Rename stringutils to encodingutil, and drop open_unicode (since the Python 'open' function works fine with Unicode paths).

comment:10 Changed at 2010-07-13T04:45:20Z by davidsarah

  • Keywords review-needed added
  • Owner changed from davidsarah to zooko

comment:11 Changed at 2010-07-13T05:50:27Z by zooko

  • Status changed from new to assigned

comment:12 in reply to: ↑ 9 Changed at 2010-07-14T07:16:14Z by zooko

Replying to davidsarah:

Did you confirm that the builtin os.listdir() passes the unit tests that François wrote for listdir_unicode()?

Those are tests of how listdir_unicode() is implemented in terms of os.listdir, rather than its functional behaviour.

I don't understand. For example this test: test_listdir_unicode(). Wouldn't it have noticed that the listdir function was failing to raise error on an undecodable entry (when the mock_getfilesystemencoding was set to 'ascii')? Wouldn't that have shown that your patch was breaking something?

comment:13 Changed at 2010-07-14T07:29:21Z by zooko

  • Keywords review-needed removed

comment:14 Changed at 2010-07-14T07:29:29Z by zooko

  • Keywords reviewed added

comment:15 in reply to: ↑ 3 Changed at 2010-07-14T14:13:36Z by zooko

Replying to davidsarah:

The patch bundle is dependent on one of the patches for #1051, because that added a reference to stringutils that needed to be renamed.

Interestingly, if you used darcs replace then this wouldn't depend on that one. I'm not sure whether that would be better or worse. :-)

comment:16 Changed at 2010-07-17T06:05:57Z by davidsarah

  • Milestone changed from 1.7.1 to 1.8β

comment:17 Changed at 2010-07-17T22:40:02Z by davidsarah

  • Milestone changed from 1.8β to 1.7.1
  • Resolution set to fixed
  • Status changed from assigned to closed

Applied in 11077ea74de4d59a. (Was that intentional?)

comment:18 Changed at 2010-07-18T02:00:04Z by davidsarah

It wasn't intentional, but we decided to commit this for 1.7.1 anyway.

The version that was applied was the older rename-stringutils-drop-listdir_unicode-and-open_unicode.dpatch. a8161c915a30e18c updates this to the equivalent of the rename-stringutils-drop-open_unicode.dpatch that zooko had reviewed.

comment:19 Changed at 2010-07-18T03:50:56Z by davidsarah

This caused test failures on some platforms, which were fixed in bdb10553eb4a461c.

Note: See TracTickets for help on using tickets.