[tahoe-dev] [tahoe-lafs] #629: 'tahoe backup' doesn't tolerate 8-bit filenames

Sun May 24 09:03:40 PDT 2009

On Saturday 23 May 2009 03:14:18 pm tahoe-lafs wrote:
>  Why not take sqlite's advice and store only unicode strings in the db?
>  This would be consistent with the direction that we're going on the
>  {{{tahoe cp}}} issues: basically we're not going to support people who
>  want to have ill-encoded filenames or a wrongly configured locale and who
>  want their filenames to make round trips unchanged.  I don't think that
>  class of user is big enough (is it even non-null?) or their demands are
>  reasonable enough that we should keep trying to satisfy them as well as
>  satisfying people who have well-encoded filenames.

Not entirely related, but I'm (finally!) getting back to work on my GridBackup 
app, and I think I've decided to simplify my approach to filename encoding 
issues.

I've decided that since the common case is that files will be restored to a 
system with the same encoding they were backed up from, it doesn't make sense 
to put a great deal of effort into converting everything into proper Unicode 
to support cross-platform restores.

So, my intent is to store pathnames byte-for-byte identically to how they're 
stored in the file system.  Since I need to JSON-encode them, though, I do 
need to convert them to Unicode.  I don't need the conversion to be correct, 
though, just reliable.  There's also the issue of Windows to deal with.

For systems that may contain invalid or mixed encodings (i.e. not Windows), 
I'm reading raw strings and decoding them with latin1 to generate Unicode 
(regardless of what the file system encoding is).  The restore process can 
encode the Unicode with latin1 to recover the raw strings and write them.
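
A minimal sketch of that round trip, assuming Python 3 (where raw file system names are bytes objects). The helper names path_to_json and path_from_json are hypothetical, not taken from GridBackup. It works because latin1 maps every byte value 0x00-0xFF to the code point with the same number, so the decode cannot fail and the encode recovers the original bytes exactly:

```python
import json

def path_to_json(raw_path: bytes) -> str:
    """Hypothetical helper: turn a raw pathname into a JSON string literal."""
    # latin1 maps each byte to the code point of the same value, so this
    # never raises, whatever encoding (or garbage) the bytes contain.
    return json.dumps(raw_path.decode("latin-1"))

def path_from_json(encoded: str) -> bytes:
    """Hypothetical helper: recover the raw pathname bytes on restore."""
    return json.loads(encoded).encode("latin-1")

# A name mixing valid UTF-8 (c3 a9) with bytes that are invalid in UTF-8.
raw = b"caf\xc3\xa9-\xff\xfe.txt"
restored = path_from_json(path_to_json(raw))
```

The resulting Unicode string is not a correct rendering of the name (it will look like mojibake for UTF-8 input), but it is reversible, which is all the backup needs.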

For systems that cannot contain invalid encodings (i.e. Windows), I'll get 
Unicode from the FS and won't have to do anything to it before putting it in 
JSON.

To distinguish between the two, I'll store a "decoded-with" field which will 
be either latin1 or None.  This will indicate what the restore process needs 
to do to convert the Unicode to an appropriate raw or Unicode string, 
assuming it's being restored to a file system using the same encoding as the 
source.
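
As a rough illustration of how a restore might act on that field (Python 3 assumed; restore_name is a hypothetical name, and this only covers the same-encoding case described above):

```python
def restore_name(stored, decoded_with):
    """Hypothetical helper: convert a stored Unicode name back for writing.

    decoded_with is the value of the "decoded-with" field:
    "latin1" for names read as raw bytes, None for names that were
    already Unicode when read from the file system (e.g. Windows).
    """
    if decoded_with is None:
        # The name was genuine Unicode at backup time; use it as-is.
        return stored
    # Reverse the backup-time decode to get the original raw bytes.
    return stored.encode(decoded_with)

raw_again = restore_name("caf\u00c3\u00a9", "latin1")
unicode_again = restore_name("caf\u00e9", None)
```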

To tell whether or not the destination file system uses the same encoding as 
the source, I'll also store a "file system encoding" field.  This will also 
allow cross-platform restore to work.  When restoring to a platform with a 
different encoding than the source file system, the restore program can, 
potentially, jump through all of the hoops that have been discussed ad 
nauseam.  It will have all of the data.  For now I'm not going to bother with 
handling the complicated cases, though.  If that turns out to be important 
later, I'll worry about it then.

Both the "decoded-with" and "file system encoding" fields will be stored 
per-backup-session, but can be overridden per-file (just for 
future-proofing).
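
To make the override rule concrete, here is one possible lookup, sketched in Python 3. The dictionary layout, the helper names effective_field and same_encoding, and the use of the field names as literal JSON keys are all assumptions for illustration; the actual record format isn't settled here.

```python
import codecs

def effective_field(session, entry, field):
    """Per-file value if the file entry carries the field, else the
    per-session value.  A per-file None (JSON null) counts as present."""
    return entry.get(field, session.get(field))

def same_encoding(session, entry, local_encoding):
    """True if the source file system encoding matches the local one."""
    source = effective_field(session, entry, "file system encoding")
    # codecs.lookup normalizes aliases such as "UTF-8" and "utf8".
    return codecs.lookup(source).name == codecs.lookup(local_encoding).name

# Hypothetical records: session-wide defaults plus one overriding file.
session = {"decoded-with": "latin1", "file system encoding": "UTF-8"}
plain_entry = {"path": "notes.txt"}
override_entry = {"path": "report.txt", "decoded-with": None}
```

In a real restore, local_encoding would presumably come from something like sys.getfilesystemencoding(); when same_encoding returns False, the restore program is in the cross-platform case that is being deferred for now.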

	Shawn.