Opened at 2009-06-09T21:19:25Z
Last modified at 2016-03-01T15:11:28Z
#731 new defect
what to do with filenames that are illegal on some systems — at Version 15
Reported by: | zooko | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | eventually |
Component: | code-dirnodes | Version: | 1.4.1 |
Keywords: | forward-compatibility i18n unicode names | Cc: | |
Launchpad Bug: |
Description (last modified by zooko)
If someone copies a file from system A into Tahoe-LAFS and then later someone tries to copy that file from Tahoe-LAFS into system B, then a problem could arise if the filename from system A is illegal on system B. This can happen in a few ways:
- The filename could be illegal on Windows (http://msdn.microsoft.com/en-us/library/aa365247.aspx ), and system B could be Windows and system A non-Windows.
- The filename could be illegal on Mac (http://developer.apple.com/technotes/tn/tn1150table.html ).
- The filename could case-collide with another filename in the same directory, and system B could be a case-insensitive filesystem. (Note that Tahoe's current naïve approach will result in a randomly-chosen one of the files overwriting the other if the target system is Windows or Macintosh.)
- If we allowed undecodable bytestring filenames from a POSIX system A, either by storing bytestring (non-unicode) filenames or by some escaping mechanism such as utf8b, then a non-POSIX system B would not be able to accept that name (or at least we should not write that name into that system). Likewise, some POSIX users have a policy that only correctly encoded unicode filenames should be stored in their filesystem, so for them we should not write that name even though the POSIX byte-oriented APIs would let us.
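As a rough illustration of the first and third cases above, here is a minimal Python sketch of a portability check. The function names are hypothetical and are not part of Tahoe-LAFS; the Windows rules are a partial reading of the MSDN page linked above, and case folding via `str.casefold()` only approximates what a real case-insensitive filesystem does.

```python
import unicodedata

# Characters Windows forbids in file names, plus reserved device names.
# Partial list based on the MSDN naming page; not exhaustive.
WINDOWS_FORBIDDEN = set('<>:"/\\|?*')
WINDOWS_RESERVED = {"CON", "PRN", "AUX", "NUL"} | {
    f"{dev}{n}" for dev in ("COM", "LPT") for n in range(1, 10)
}


def windows_problems(name):
    """Return a list of reasons why `name` would be rejected on Windows."""
    problems = []
    if any(c in WINDOWS_FORBIDDEN or ord(c) < 32 for c in name):
        problems.append("forbidden character")
    # Device names are reserved even when followed by an extension.
    if name.split(".")[0].upper() in WINDOWS_RESERVED:
        problems.append("reserved device name")
    if name.endswith((" ", ".")):
        problems.append("trailing space or period")
    return problems


def case_collisions(names):
    """Group names that would collide on a case-insensitive filesystem."""
    groups = {}
    for name in names:
        # Assumption: NFC normalization plus casefold() stands in for the
        # folding tables that actual filesystems use.
        key = unicodedata.normalize("NFC", name).casefold()
        groups.setdefault(key, []).append(name)
    return [group for group in groups.values() if len(group) > 1]
```

A check like this could run on either leg of the trip; where it should run is the open question of this ticket.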
Here are someone else's notes about these sorts of issues:
http://www.portfoliofaq.com/pfaq/FAQ00352.htm
See also David A. Wheeler's excellent article arguing that we should start being pickier about filenames in POSIX systems:
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
There are various ways Tahoe can deal with this. It can do something about it on the Tahoe -> system B leg of the trip, such as stopping with an error or offering to rename the offending files. It could also do something about it on the system A -> Tahoe leg of the trip.
I think in the short term it might be better if Tahoe rejected non-portable filenames in the system A -> Tahoe leg of the trip, because we don't yet know how we want to handle them. By rejecting them, we avoid the current random-overwrite issue and we don't constrain future versions of Tahoe-LAFS as much in terms of what sorts of filenames it has to support. (There might already be some problematic filenames stored in Tahoe and we might want to extend Tahoe to deal with these better in the future, but if Tahoe-v1.5 starts rejecting new ones then the problem will probably be less widespread and less severe in the future.)
On the other hand, rejecting them would be a UI/API regression, so we would probably want to add a --force-nonportable-filenames option to make it behave like Tahoe-v1.4 currently does.
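The reject-by-default policy with an override might look roughly like the sketch below. Everything here is hypothetical: `admit_name`, `is_portable`, and `NonPortableFilenameError` are illustrative names, the portability rule is a placeholder, and the `force_nonportable` parameter merely stands in for the proposed --force-nonportable-filenames option.

```python
# Placeholder rule (an assumption): reject Windows-forbidden and control
# characters. What "portable" should really mean is undecided.
FORBIDDEN = set('<>:"/\\|?*')


class NonPortableFilenameError(ValueError):
    """Hypothetical error for names rejected on the system A -> Tahoe leg."""


def is_portable(name):
    return bool(name) and not any(c in FORBIDDEN or ord(c) < 32 for c in name)


def admit_name(existing_names, name, force_nonportable=False):
    """Return `name` if it may be added to a directory, else raise.

    With force_nonportable=True every name is accepted, which mimics the
    permissive behavior described for Tahoe-v1.4.
    """
    if force_nonportable:
        return name
    if not is_portable(name):
        raise NonPortableFilenameError(f"non-portable filename: {name!r}")
    for other in existing_names:
        # A different spelling that folds to the same key would overwrite
        # or be overwritten on a case-insensitive target.
        if other != name and other.casefold() == name.casefold():
            raise NonPortableFilenameError(
                f"{name!r} case-collides with {other!r}"
            )
    return name
```

Rejecting on the way in avoids the random-overwrite case at the cost of failing uploads that used to succeed, which is the regression the override is meant to soften.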
Help!?
Change History (15)
comment:1 Changed at 2009-06-11T02:35:41Z by zooko
- Keywords backwards-compatibility added
comment:2 Changed at 2009-06-11T19:53:01Z by zooko
- Keywords forward-compatibility added; backwards-compatibility removed
I meant "forward-compatibility": pipermail/tahoe-dev/2009-June/001968.html
comment:3 follow-up: ↓ 4 Changed at 2009-06-14T15:57:53Z by bewst
A few notes:
- My first reaction was to say you had the right idea in rejecting nonportable names, but then I thought about how it might affect me. Although rejecting nonportable names on the way in is "safe" from a design evolution point of view, it probably won't make customers happy when their backup fails partway through because some file has a name tahoe didn't like. It'll also be a problem for some people if files that used to save just fine start producing error messages.
- You might want to decide what "portable" means before trying to solve this problem. For example, are you planning to support VMS? That changes what it means to be a legal filename. One ambitious definition could be: works wherever Python works.
- Many people have had to solve this sort of problem before you; this is one of those areas where you can benefit from their research, e.g. http://www.boost.org/doc/libs/1_39_0/libs/filesystem/doc/portability_guide.htm#recommendations.
- FWIW, last I heard, Samba had given up on solving this problem correctly, though that may have changed.
It seems to me that tahoe probably has enough flexibility to store any filename, and many people will only be using it to store and retrieve files to/from the same system, so it should "just work" for that use case. In the other cases, it would probably be a good idea to provide a hook in the Python API for handling filenames that can't be represented, and when using the CLI, etc., there should be at least two options: translate the name via some encoding, with a warning, and cause a hard error.
My 2c.
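The hook suggested in the comment above could be sketched as follows. This is an illustration only: `local_name`, `fail_handler`, `translate_handler`, and `UnrepresentableNameError` are hypothetical names, not an existing Tahoe-LAFS API, and the escaping scheme (XML character references) is just one arbitrary choice of translation.

```python
import warnings


class UnrepresentableNameError(Exception):
    """Hypothetical error: the name cannot be written on the target system."""


def fail_handler(name, encoding):
    """Policy 1: cause a hard error."""
    raise UnrepresentableNameError(
        f"{name!r} cannot be represented in {encoding}"
    )


def translate_handler(name, encoding):
    """Policy 2: translate the name via some encoding, with a warning."""
    # Assumption: escape unencodable characters as XML character references.
    translated = name.encode(encoding, "xmlcharrefreplace").decode(encoding)
    warnings.warn(f"renamed {name!r} to {translated!r}")
    return translated


def local_name(name, encoding, on_unrepresentable=fail_handler):
    """Return the name to use locally, deferring to the hook on failure."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return on_unrepresentable(name, encoding)
    return name
```

A CLI could map its two options onto the two handlers, while Python callers could pass their own callable.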
comment:4 in reply to: ↑ 3 ; follow-up: ↓ 5 Changed at 2009-06-14T16:49:38Z by swillden
Replying to bewst:
It seems to me that tahoe probably has enough flexibility to store any filename, and many people will only be using it to store and retrieve files to/from the same system, so it should "just work" for that use case.
This is my thought as well, at least for backup use cases. Tahoe in general has a broader usage model, and so solutions appropriate for backup may not be adequate for those other use cases, but for backups, I think the top priority is ensuring that backups succeed reliably and don't lose any data -- including file name data.
That's why the approach I've chosen for GridBackup (which, BTW, is finally starting to write to a grid, Yay!) is to make sure that:
- ALL names can be backed up, regardless of whether or not they make any sense on any filesystem in existence.
- When restoring to a system that uses the same encoding as the backup source, all names are restored byte-for-byte identically to what was read from the file system during backup.
- When restoring to a system that uses a different encoding, I try to transcode the names but just error out if it doesn't work. Eventually my plan is to give the user a list of paths that broke and let them decide what to name each of them, with some suggestions based on attempts to decode the name with all Python-supported codecs.
During a restore, there's room for human intervention to address naming problems, but during backup, I just want to get the data. I'm taking a similar approach to other metadata. Extended attributes, ACLs, resource forks, even POSIX permissions -- there are destination systems to which none of these things will make sense, but that's okay. The backup will grab everything and we can deal with how to make use of the data, if possible, during restore.
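The restore rules listed above can be sketched as below. This is not GridBackup code; the function names are hypothetical, names are assumed to have been stored as raw bytes at backup time, and the short codec list stands in for "all Python-supported codecs".

```python
import codecs


def restore_names(raw_names, source_encoding, target_encoding):
    """Map backed-up byte names to names for the restore target.

    Returns (restored, broken): a dict of raw -> target bytes, and the list
    of raw names that could not be transcoded.
    """
    # Normalize spellings such as "UTF8" and "utf-8" before comparing.
    same = (codecs.lookup(source_encoding).name
            == codecs.lookup(target_encoding).name)
    restored, broken = {}, []
    for raw in raw_names:
        if same:
            restored[raw] = raw  # byte-for-byte identical
            continue
        try:
            restored[raw] = raw.decode(source_encoding).encode(target_encoding)
        except UnicodeError:
            broken.append(raw)  # left for the user to rename
    return restored, broken


def suggest_decodings(raw, candidates=("utf-8", "latin-1", "cp1252", "shift_jis")):
    """Offer possible readings of a broken name, one per codec that accepts it."""
    suggestions = {}
    for codec in candidates:
        try:
            suggestions[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            pass
    return suggestions
```

Nothing is discarded at backup time; failures surface only at restore, where a person can choose among the suggestions.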
comment:5 in reply to: ↑ 4 Changed at 2009-06-15T09:39:12Z by bewst
Replying to swillden:
Replying to bewst:
It seems to me that tahoe probably has enough flexibility to store any filename, and many people will only be using it to store and retrieve files to/from the same system, so it should "just work" for that use case.
This is my thought as well, at least for backup use cases.
It's what I want for all the use cases I can think of, and especially so while GridBackup isn't ready for primetime.
comment:6 Changed at 2009-06-30T12:38:02Z by zooko
- Milestone changed from 1.5.0 to 1.6.0
comment:7 Changed at 2009-11-23T02:24:22Z by davidsarah
- Keywords i18n added
comment:8 Changed at 2009-12-03T18:11:12Z by zooko
- Keywords unicode added
comment:9 Changed at 2010-01-26T15:44:07Z by zooko
- Milestone changed from 1.6.0 to eventually
comment:10 Changed at 2010-01-27T06:01:13Z by zooko
- Milestone changed from eventually to 1.7.0
comment:11 Changed at 2010-05-05T05:47:01Z by zooko
- Milestone changed from 1.7.0 to eventually
I'm not going to do anything about this for v1.7.0. I still think the current behavior is problematic (there are normal, not-uncommon use cases where some files are unexpectedly overwritten and others where download/restore fails). But I don't have time to work on it for v1.7.0.
comment:12 Changed at 2010-05-18T21:20:50Z by zooko
I almost hesitate to mention this, because I'm not at all sure that it is a good idea, but with regard to problem 4 from the initial comment, we could just try to autodetect the real encoding (if any) using this package I just discovered: http://chardet.feedparser.org/ . It is probably an even worse idea for filenames than for other strings, since filenames can be short and non-linguistic (e.g. "f954b.c" is a reasonable filename for an English speaker to use but not a reasonable string to find in English prose in a newspaper or web page).
comment:13 Changed at 2010-06-21T03:13:34Z by davidsarah
- Keywords names added
comment:14 Changed at 2010-07-14T06:30:26Z by zooko
(copying some comments that I wrote over on #1072...)
It is worth considering the five possible Requirements in this message. With our current unicode support as of Tahoe-LAFS v1.7.0 we have achieved Requirement 1 (unicode) and Requirement 2 (faithful if unicode). We have not achieved Requirement 3 (no file left behind), Requirement 4 (faithful bytes if not unicode), or Requirement 5 (no loss of information).
Nowadays I am pretty skeptical of the value of Requirement 4.
After I wrote that message I subsequently realized that a good behavior would be that if you load an ill-encoded filename into Tahoe-LAFS then its representation looks identical to or similar to the representation of that file when you view it with Nautilus, GNU ls, or whatever other tools would have the same problem with ill-encoded filenames. I think this should be added as Requirement 6 (familiar gibberish): "If you copy an ill-encoded filename into Tahoe-LAFS, its filename looks identical to or similar to what you see when you view it with other tools (e.g. Nautilus, GNU ls, etc.)".
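One possible way to get both a lossless representation and a "familiar gibberish" display is sketched below, using Python's `surrogateescape` error handler (PEP 383, the same idea as the utf8b escaping mentioned in the description). The function names are hypothetical, this is not how Tahoe-LAFS v1.7.0 behaves, and showing U+FFFD for each bad byte is only an assumption about what looks familiar; other tools may render ill-encoded names differently.

```python
def to_tahoe_name(raw, encoding="utf-8"):
    """Decode a filename; each undecodable byte becomes a lone surrogate."""
    return raw.decode(encoding, "surrogateescape")


def display_name(name):
    """Render a name for display, showing escaped bytes as U+FFFD."""
    return "".join(
        "\ufffd" if 0xDC80 <= ord(c) <= 0xDCFF else c for c in name
    )


def to_local_bytes(name, encoding="utf-8"):
    """Recover the original bytes when writing back to a POSIX system."""
    return name.encode(encoding, "surrogateescape")
```

For the ill-encoded name `b"caf\xe9.txt"`, the stored string carries the lone surrogate U+DCE9, the display form shows a replacement character in its place, and `to_local_bytes` returns the original bytes.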
comment:15 Changed at 2011-07-21T18:26:12Z by zooko
- Description modified (diff)
This is a "backwards-compatibility" issue. Doing the easy and lazy thing now could make things harder for future versions of Tahoe. Adding the "backwards-compatibility" Keyword and leaving this ticket in the "1.5.0" Milestone. Help!?