#731 new defect

what to do with filenames that are illegal on some systems

Reported by: zooko
Owned by:
Priority: major
Milestone: eventually
Component: code-dirnodes
Version: 1.4.1
Keywords: forward-compatibility i18n unicode names
Cc:
Launchpad Bug:

Description (last modified by zooko)

If someone copies a file from system A into Tahoe-LAFS and then later someone tries to copy that file from Tahoe-LAFS into system B, then a problem could arise if the filename from system A is illegal on system B. This can happen in a few ways:

  1. The filename could be illegal on Windows (http://msdn.microsoft.com/en-us/library/aa365247.aspx ), and system B could be Windows and system A non-Windows.
  1. The filename could be illegal on Mac (http://developer.apple.com/technotes/tn/tn1150table.html ).
  1. The filename could case-collide with another filename in the same directory, and system B could be a case-insensitive filesystem. (Note that Tahoe's current naïve approach will result in a randomly-chosen one of the files overwriting the other if the target system is Windows or Macintosh.)
  1. If we allowed undecodable bytestring filenames from a POSIX system A, either by storing bytestring (non-unicode) filenames or by some escaping mechanism such as utf8b, then a non-POSIX system B would not be able to accept that name (or at least we should not write that name into that system). Likewise, some POSIX users have a policy that only correctly encoded unicode filenames should be stored in their filesystem, so for them we should not write that name even though we could do so by using the POSIX byte-oriented APIs.
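
A rough sketch of how cases 1 and 3 above could be detected, in Python. The function names (windows_problems, case_collisions) are made up for illustration and are not Tahoe-LAFS APIs. The forbidden-character set and reserved names are a simplified reading of the Windows naming rules linked above, and str.casefold() is only an approximation of what a real case-insensitive filesystem does.

```python
import re

# Assumption: a simplified subset of the Windows naming rules.
_WINDOWS_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
_WINDOWS_RESERVED = {"CON", "PRN", "AUX", "NUL"} | {
    f"{dev}{n}" for dev in ("COM", "LPT") for n in range(1, 10)
}


def windows_problems(name):
    """Return a list of reasons why `name` is not a legal Windows name."""
    problems = []
    if _WINDOWS_FORBIDDEN.search(name):
        problems.append("forbidden character")
    # Reserved device names are reserved with or without an extension.
    if name.split(".")[0].upper() in _WINDOWS_RESERVED:
        problems.append("reserved device name")
    if name.endswith((" ", ".")):
        problems.append("trailing space or dot")
    return problems


def case_collisions(names):
    """Group the names of one directory that differ only by case."""
    groups = {}
    for name in names:
        groups.setdefault(name.casefold(), []).append(name)
    return [group for group in groups.values() if len(group) > 1]
```

With something like this, a copy onto a case-insensitive target could stop and report each colliding group instead of letting one file silently overwrite the other.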

Here are someone else's notes about these sorts of issues:

http://www.portfoliofaq.com/pfaq/FAQ00352.htm

See also David A. Wheeler's excellent article arguing that we should start being pickier about filenames in POSIX systems:

http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html

There are various ways Tahoe can deal with this. It can do something about it on the Tahoe -> system B leg of the trip, such as stopping with an error or offering to rename the offending files. It could also do something about it on the system A -> Tahoe leg of the trip.

I think in the short term it might be better if Tahoe rejected non-portable filenames in the system A -> Tahoe leg of the trip, because we don't yet know how we want to handle them. By rejecting them, we avoid the current random-overwrite issue and we don't constrain future versions of Tahoe-LAFS as much in terms of what sorts of filenames it has to support. (There might already be some problematic filenames stored in Tahoe and we might want to extend Tahoe to deal with these better in the future, but if Tahoe-v1.5 starts rejecting new ones then the problem will probably be less widespread and less severe in the future.)

On the other hand, rejecting them would be a UI/API regression, so we would probably want to add a --force-nonportable-filenames option to make it behave like Tahoe-v1.4 currently does.

Help!?

Change History (21)

comment:1 Changed at 2009-06-11T02:35:41Z by zooko

  • Keywords backwards-compatibility added

This is a "backwards-compatibility" issue. Doing the easy and lazy thing now could make things harder for future versions of Tahoe. Adding the "backwards-compatibility" Keyword and leaving this ticket in the "1.5.0" Milestone. Help!?

comment:2 Changed at 2009-06-11T19:53:01Z by zooko

  • Keywords forward-compatibility added; backwards-compatibility removed

I meant "forward-compatibility": pipermail/tahoe-dev/2009-June/001968.html

Last edited at 2014-08-12T16:19:30Z by zooko

comment:3 follow-up: Changed at 2009-06-14T15:57:53Z by bewst

A few notes:

  • My first reaction was to say you had the right idea in rejecting nonportable names, but then I thought about how it might affect me. Although rejecting nonportable names on the way in is "safe" from a design evolution point of view, it probably won't make customers happy when their backup fails partway through because some file has a name tahoe didn't like. It'll also be a problem for some people if files that used to save just fine start producing error messages.
  • You might want to decide what "portable" means before trying to solve this problem. For example, are you planning to support VMS? That changes what it means to be a legal filename. One ambitious definition could be: works wherever Python works.
  • Many people have had to solve this sort of problem before you; this is one of those areas where you can benefit from their research, e.g. http://www.boost.org/doc/libs/1_39_0/libs/filesystem/doc/portability_guide.htm#recommendations.
  • FWIW, last I heard, Samba had given up on solving this problem correctly, though that may have changed.

It seems to me that tahoe probably has enough flexibility to store any filename, and many people will only be using it to store and retrieve files to/from the same system, so it should "just work" for that use case. In the other cases, it would probably be a good idea to provide a hook in the Python API for handling filenames that can't be represented, and when using the CLI, etc., there should be at least two options: translate the name via some encoding, with a warning, and cause a hard error.
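
A minimal sketch of what such a hook might look like, with the two policies mentioned above. All names here (local_name, fail_hard, translate_with_warning, UnrepresentableFilename) are hypothetical and do not exist in Tahoe-LAFS, and the backslash-escape translation is just one possible choice of encoding.

```python
import warnings


class UnrepresentableFilename(Exception):
    """Raised when a name cannot be written to the target system."""


def fail_hard(name, encoding):
    # Policy 1: cause a hard error.
    raise UnrepresentableFilename(f"{name!r} cannot be encoded as {encoding}")


def translate_with_warning(name, encoding):
    # Policy 2: replace unencodable characters with backslash escapes
    # and warn the user about the substitution.
    translated = name.encode(encoding, errors="backslashreplace").decode(encoding)
    warnings.warn(f"{name!r} stored as {translated!r}")
    return translated


def local_name(name, encoding, on_unrepresentable=fail_hard):
    """Return `name` unchanged if `encoding` can represent it,
    otherwise defer to the caller-supplied hook."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return on_unrepresentable(name, encoding)
    return name
```

A CLI could expose the choice of hook as an option, while Python callers pass their own function.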

My 2c.

comment:4 in reply to: ↑ 3 ; follow-up: Changed at 2009-06-14T16:49:38Z by swillden

Replying to bewst:

It seems to me that tahoe probably has enough flexibility to store any filename, and many people will only be using it to store and retrieve files to/from the same system, so it should "just work" for that use case.

This is my thought as well, at least for backup use cases. Tahoe in general has a broader usage model, and so solutions appropriate for backup may not be adequate for those other use cases, but for backups, I think the top priority is ensuring that backups succeed reliably and don't lose any data -- including file name data.

That's why the approach I've chosen for GridBackup (which, BTW, is finally starting to write to a grid, Yay!) is to make sure that:

  1. ALL names can be backed up, regardless of whether or not they make any sense on any filesystem in existence.
  1. When restoring to a system that uses the same encoding as the backup source, all names are restored byte-for-byte identically to what was read from the file system during backup.
  1. When restoring to a system that uses a different encoding, I try to transcode the names but just error out if it doesn't work. Eventually my plan is to give the user a list of paths that broke and let them decide what to name each of them, with some suggestions based on attempts to decode the name with all Python-supported codecs.

During a restore, there's room for human intervention to address naming problems, but during backup, I just want to get the data. I'm taking a similar approach to other metadata. Extended attributes, ACLs, resource forks, even POSIX permissions -- there are destination systems to which none of these things will make sense, but that's okay. The backup will grab everything and we can deal with how to make use of the data, if possible, during restore.
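
The three rules above could be sketched roughly as follows. This is not GridBackup's actual code; the record layout and the function names are assumptions made for illustration, and the candidate list is a short stand-in for "all Python-supported codecs".

```python
import codecs


def backup_name(raw_name, source_encoding):
    """Record the exact bytes plus the encoding they were read under."""
    return {"bytes": raw_name, "encoding": source_encoding}


def restore_name(record, target_encoding):
    """Return the bytes to write on the target system.

    Raises UnicodeError if the name cannot be transcoded."""
    source = codecs.lookup(record["encoding"]).name
    target = codecs.lookup(target_encoding).name
    if source == target:
        # Same encoding: restore byte-for-byte, even if ill-encoded.
        return record["bytes"]
    # Different encoding: transcode strictly and let errors propagate.
    return record["bytes"].decode(source).encode(target)


def decoding_suggestions(raw_name,
                         candidates=("utf-8", "latin-1", "cp1252", "shift_jis")):
    """Decode with each candidate codec, keeping the ones that succeed."""
    suggestions = {}
    for codec in candidates:
        try:
            suggestions[codec] = raw_name.decode(codec)
        except UnicodeDecodeError:
            pass
    return suggestions
```

The point of keeping the raw bytes is that backup never fails on a name; only restore_name() can raise, and that is the step where a human can be asked to pick from decoding_suggestions().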

comment:5 in reply to: ↑ 4 Changed at 2009-06-15T09:39:12Z by bewst

Replying to swillden:

Replying to bewst:

It seems to me that tahoe probably has enough flexibility to store any filename, and many people will only be using it to store and retrieve files to/from the same system, so it should "just work" for that use case.

This is my thought as well, at least for backup use cases.

It's what I want for all the use cases I can think of, and especially so while GridBackup isn't ready for primetime.

comment:6 Changed at 2009-06-30T12:38:02Z by zooko

  • Milestone changed from 1.5.0 to 1.6.0

comment:7 Changed at 2009-11-23T02:24:22Z by davidsarah

  • Keywords i18n added

comment:8 Changed at 2009-12-03T18:11:12Z by zooko

  • Keywords unicode added

comment:9 Changed at 2010-01-26T15:44:07Z by zooko

  • Milestone changed from 1.6.0 to eventually

comment:10 Changed at 2010-01-27T06:01:13Z by zooko

  • Milestone changed from eventually to 1.7.0

comment:11 Changed at 2010-05-05T05:47:01Z by zooko

  • Milestone changed from 1.7.0 to eventually

I'm not going to do anything about this for v1.7.0. I still think the current behavior is problematic (there are normal, not-uncommon use cases where some files are unexpectedly overwritten and others where download/restore fails). But I don't have time to work on it for v1.7.0.

comment:12 Changed at 2010-05-18T21:20:50Z by zooko

I almost hesitate to mention this, because I'm not at all sure that it is a good idea, but with regard to problem 4 from the initial description, we could just try to autodetect the real encoding (if any) using this package I just discovered: http://chardet.feedparser.org/ . It is probably an even worse idea for filenames than for other strings, because filenames can be short and non-linguistic (e.g. "f954b.c" is a reasonable filename for an English speaker to use but not a reasonable string to find in the English prose of a newspaper or web page).

comment:13 Changed at 2010-06-21T03:13:34Z by davidsarah

  • Keywords names added

comment:14 Changed at 2010-07-14T06:30:26Z by zooko

(copying some comments that I wrote over on #1072...)

It is worth considering the five possible Requirements in this message. With our current unicode support as of Tahoe-LAFS v1.7.0 we have achieved Requirement 1 (unicode) and Requirement 2 (faithful if unicode). We have not achieved Requirement 3 (no file left behind), Requirement 4 (faithful bytes if not unicode), or Requirement 5 (no loss of information).

Nowadays I am pretty skeptical of the value of Requirement 4.

After I wrote that message I subsequently realized that a good behavior would be that if you load an ill-encoded filename into Tahoe-LAFS then its representation looks identical to or similar to the representation of that file when you view it with Nautilus, GNU ls, or whatever other tools would have the same problem with ill-encoded filenames. I think this should be added as Requirement 6 (familiar gibberish): "If you copy an ill-encoded filename into Tahoe-LAFS, its filename looks identical to or similar to what you see when you view it with other tools (e.g. Nautilus, GNU ls, etc.)".
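
One way Requirement 6 might be approximated in Python, as a sketch only. The two function names are made up, and exactly what Nautilus or GNU ls display for ill-encoded names is not checked here; the assumption is that a replacement character or a byte escape is close enough to count as "similar".

```python
def display_name(raw_name, encoding="utf-8"):
    """Decode for display, substituting U+FFFD for undecodable bytes."""
    return raw_name.decode(encoding, errors="replace")


def escaped_name(raw_name, encoding="utf-8"):
    """Decode for display, showing undecodable bytes as \\xNN escapes."""
    return raw_name.decode(encoding, errors="backslashreplace")
```

Note that display_name() is lossy (different ill-encoded names can map to the same string), so on its own it would not satisfy Requirement 5.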

Version 0, edited at 2010-07-14T06:30:26Z by zooko

comment:15 Changed at 2011-07-21T18:26:12Z by zooko

  • Description modified

comment:16 Changed at 2011-10-27T16:46:49Z by zooko

Here are some more notes from someone else about these sorts of surprises: http://www.ericsink.com/entries/quirky.html

comment:17 follow-up: Changed at 2012-01-09T17:54:13Z by zooko

stringprep (RFC 3454) seems like a useful standard:

http://www.ietf.org/rfc/rfc3454.txt

And it is implemented in the Python standard library:

http://docs.python.org/library/stringprep.html
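
For reference, a small sketch of what the standard library module offers: it exposes the RFC 3454 tables as lookup functions rather than a complete profile. The helper stringprep_flags and its labels are made up for illustration; only the stringprep.in_table_* calls come from the module.

```python
import stringprep


def stringprep_flags(name):
    """Report which RFC 3454 tables flag any character of `name`."""
    flags = set()
    for ch in name:
        if stringprep.in_table_c12(ch):
            flags.add("non-ASCII space (C.1.2)")
        if stringprep.in_table_c21(ch) or stringprep.in_table_c22(ch):
            flags.add("control character (C.2)")
        if stringprep.in_table_b1(ch):
            flags.add("commonly mapped to nothing (B.1)")
    return flags
```

This only reports which tables match; it does not apply any of the stringprep mappings.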

Here are monotone's rules about filename handling:

http://www.monotone.ca/docs/Internationalization.html

comment:18 in reply to: ↑ 17 Changed at 2012-01-09T20:04:22Z by davidsarah

Replying to zooko:

stringprep (RFC 3454) seems like a useful standard:

http://www.ietf.org/rfc/rfc3454.txt

stringprep is one of the worst ideas ever to come out of an IETF Working Group.

Unicode is a semantic character encoding standard; that is, it makes a valiant attempt to unify or disunify characters based on distinctions in meaning and usage, as opposed to visual appearance. A simple example of this is that Latin 'p' looks identical to Cyrillic 'р', but they are completely different letters that don't even sound the same. Some people might consider that to be a problem, but actually it's just a fact about human scripts.

The International Domain Names Working Group got a bee in their bonnet about it being a problem that some characters are "confusingly" similar. Now, given that some commonly used characters are semantically distinct but look identical in related fonts, you might think it to be a quixotic task to somehow deal with the tens of thousands of characters that only look similar to some other character, but that didn't stop the WG arguing about it interminably, and coming up with stringprep in order to placate the people on one side of the argument -- even though stringprep doesn't really solve that issue at all.

There are indeed some characters, I call them "junk characters", that we don't want to use. The polite term for junk characters is "compatibility characters", most of which are "compatibility composites" as defined in section 2.3 of the Unicode Standard. These characters are only in Unicode because some national body insisted on round-tripping between Unicode and their misdesigned legacy standard (which could have been done in other ways that would have been more technically elegant than assigning many ad-hoc character variants, but that's water under the bridge).

The right place to implement "don't use junk characters" is in input methods. That is, if a user can never type a junk character, then it's much less likely that its existence will cause a problem. More specifically, if a user can only type non-junk characters in some normalization form (preferably NFC), then name lookups based on exact matching, as needed for filenames and other identifiers, are more likely to work.

The wrong thing to do is what stringprep tries to do, which is to map junk characters to somebody's idea of the nearest non-junk characters. This just causes unintended name collisions and breakage, and doesn't get any closer to solving the unsolvable issue of confusable characters.
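
To make the distinction concrete, here is a sketch that only detects (and never maps) the things discussed above, using Python's unicodedata module. The helper names are hypothetical, and treating "has a tagged compatibility decomposition" as the test for a junk character is a simplifying assumption.

```python
import unicodedata

LATIN_P = "p"
CYRILLIC_ER = "\u0440"  # looks like Latin 'p' but is a different letter


def is_nfc(name):
    """True if `name` is already in Normalization Form C."""
    return unicodedata.normalize("NFC", name) == name


def compatibility_characters(name):
    """Return the characters of `name` whose decomposition mapping is a
    compatibility one (tagged, e.g. '<compat> 0066 0069')."""
    return [ch for ch in name
            if unicodedata.decomposition(ch).startswith("<")]
```

A caller could use these to warn about or reject a name without silently rewriting it, which avoids the unintended collisions described above. LATIN_P and CYRILLIC_ER compare unequal, and nothing here tries to change that.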

comment:19 Changed at 2012-01-09T21:41:11Z by gdt

Before we dig into this hard, what is special about tahoe, compared to the other 12 distributed filesystems out there, and what problem do we have that they don't, and why do their approaches not map?

comment:21 Changed at 2016-03-01T15:11:28Z by zooko

See also #1840
