[tahoe-dev] [tahoe-lafs] #629: 'tahoe backup' doesn't tolerate 8-bit filenames
Shawn Willden
shawn-tahoe at willden.org
Sun May 24 09:03:40 PDT 2009
On Saturday 23 May 2009 03:14:18 pm tahoe-lafs wrote:
> Why not take sqlite's advice and store only unicode strings in the db?
> This would be consistent with the direction that we're going on the
> {{{tahoe cp}}} issues: basically we're not going to support people who
> want to have ill-encoded filenames or a wrongly configured locale and who
> want their filenames to make round trips unchanged. I don't think that
> class of user is big enough (is it even non-null?) or their demands are
> reasonable enough that we should keep trying to satisfy them as well as
> satisfying people who have well-encoded filenames.
Not entirely related, but I'm (finally!) getting back to work on my GridBackup
app, and I think I've decided to simplify my approach to filename encoding
issues.
I've decided that since the common case is that files will be restored to a
system with the same encoding they were backed up from, it doesn't make sense
to put a great deal of effort into converting everything into proper Unicode
to support cross-platform restores.
So, my intent is to store pathnames byte-for-byte identically to how they're
stored in the file system. Since I need to JSON-encode them, though, I do
need to convert them to Unicode. I don't need the conversion to be correct,
though, just reliable. There's also the issue of Windows to deal with.
For systems that may contain invalid or mixed encodings (i.e. not Windows),
I'm reading raw strings and decoding them with latin1 to generate Unicode
(regardless of what the file system encoding is). The restore process can
encode the Unicode with latin1 to recover the raw strings and write them.
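
A minimal sketch of the round trip described above, in Python 3. The function names and the sample byte string are hypothetical; the only assumption is that raw names arrive as bytes (for example from os.listdir() called with a bytes argument). It works because latin1 maps each byte 0x00-0xFF to the code point with the same number, so decoding never fails and encoding recovers the bytes exactly.

```python
import json


def raw_to_text(raw_name: bytes) -> str:
    # latin1 assigns every byte value a code point, so this cannot raise,
    # even if the bytes are invalid in the real file system encoding.
    return raw_name.decode("latin1")


def text_to_raw(text: str) -> bytes:
    # Inverse of raw_to_text: each code point U+0000..U+00FF becomes one byte.
    return text.encode("latin1")


# A name mixing valid UTF-8 (the first accented character) with stray bytes.
raw = b"caf\xc3\xa9/\xff\xfe-broken.txt"

stored = json.dumps({"path": raw_to_text(raw)})      # what the backup writes
restored = text_to_raw(json.loads(stored)["path"])   # what the restore reads

assert restored == raw
```

The decoded text is not meaningful as Unicode (a UTF-8 name shows up as mojibake), which is the "reliable, not correct" trade-off.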
For systems that cannot contain invalid encodings (i.e. Windows), I'll get
Unicode from the FS and won't have to do anything to it before putting it in
JSON.
To distinguish between the two, I'll store a "decoded-with" field which will
be either latin1 or None. This will indicate what the restore process needs
to do to convert the Unicode to an appropriate raw or Unicode string,
assuming it's being restored to a file system using the same encoding as the
source.
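
As an illustration only, the restore side of the "decoded-with" field could look like the sketch below. The function name is hypothetical, and it assumes the field holds either a Python codec name such as "latin1" or None, as described above.

```python
from typing import Optional, Union


def name_for_restore(stored: str, decoded_with: Optional[str]) -> Union[bytes, str]:
    """Turn a name read from JSON back into what the file system expects."""
    if decoded_with is None:
        # The source FS handed out Unicode (e.g. Windows): use it unchanged.
        return stored
    # The source FS handed out raw bytes: undo the decode done at backup time.
    return stored.encode(decoded_with)


assert name_for_restore("\xff\xfe.txt", "latin1") == b"\xff\xfe.txt"
assert name_for_restore("report.txt", None) == "report.txt"
```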
To tell whether or not the destination file system uses the same encoding as
the source, I'll also store a "file system encoding" field. This will also
allow cross-platform restore to work. When restoring to a platform with a
different encoding than the source file system, the restore program can,
potentially, jump through all of the hoops that have been discussed ad
nauseam. It will have all of the data. For now I'm not going to bother with
handling the complicated cases, though. If that turns out to be important
later, I'll worry about it then.
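
One possible way to make the "same encoding?" check, sketched in Python 3. The function name is hypothetical. It assumes the stored "file system encoding" value is a codec name Python recognizes, and it compares canonical codec names so that spellings like "UTF8" and "utf-8" count as equal.

```python
import codecs
import sys
from typing import Optional


def same_fs_encoding(source_encoding: str,
                     dest_encoding: Optional[str] = None) -> bool:
    """Report whether the source and destination encodings are the same codec."""
    if dest_encoding is None:
        # Default to the encoding Python uses for file names on this system.
        dest_encoding = sys.getfilesystemencoding()
    # codecs.lookup() normalizes aliases to one canonical name.
    return codecs.lookup(source_encoding).name == codecs.lookup(dest_encoding).name


assert same_fs_encoding("UTF8", "utf-8")
assert not same_fs_encoding("utf-8", "latin1")
```

When this returns True the stored names can be restored verbatim; anything else is the complicated case that is deliberately left for later.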
Both the "pathname decoded-with" and "file system encoding" fields will be
stored per-backup-session, but can be overridden per-file (just for
future-proofing).
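
A sketch of how per-session defaults with per-file overrides might be resolved. The JSON layout, the key spellings and the function name are all assumptions made for illustration, not the actual GridBackup format.

```python
import json

# Hypothetical key names for the two fields discussed above.
FIELDS = ("decoded-with", "file system encoding")


def effective_fields(session: dict, file_entry: dict) -> dict:
    """Use the per-file value when present, else the session-wide default."""
    # dict.get() with a default keeps an explicit per-file None (JSON null),
    # which matters because None is a legal "decoded-with" value.
    return {key: file_entry.get(key, session[key]) for key in FIELDS}


session = json.loads(json.dumps({
    "decoded-with": "latin1",
    "file system encoding": "utf-8",
    "files": [
        {"path": "docs/report.txt"},
        {"path": "old/notes.txt", "file system encoding": "latin1"},
    ],
}))

plain, overridden = session["files"]
assert effective_fields(session, plain)["file system encoding"] == "utf-8"
assert effective_fields(session, overridden)["file system encoding"] == "latin1"
```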
Shawn.