[tahoe-dev] [tahoe-lafs] #629: 'tahoe backup' doesn't tolerate 8-bit filenames

Shawn Willden shawn-tahoe at willden.org
Sun May 24 13:08:17 PDT 2009


On Sunday 24 May 2009 11:03:56 am Zooko Wilcox-O'Hearn wrote:
> It sounds to me like your design will store enough information to
> enable any possible future improvement, but that the first version
> will give mojibake results if you backup from a linux system (even
> one with all filenames correctly encoded using the declared locale)
> and then restore on a mac system.  Is that your intent?

No, even the first version will transcode between systems with different 
encodings (though in your example no transcoding is needed, since most Linux 
systems use UTF-8 and so does OS X).

The key difference between this approach and most of what has been discussed 
is that it defers all effort to properly decode and convert names to the 
point of retrieval.

Backups ALWAYS succeed, because they don't try to do anything other than 
preserve the data.
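
One way the backup side could work is sketched below. This is an assumption, 
not something this message specifies: the raw filename bytes are decoded with 
latin-1, which maps every byte value to a code point and so cannot fail, and 
the codec used is recorded in "decoded-with". The function name backup_entry 
and the "name" key are hypothetical; "decoded-with" and "platform-codec" are 
the fields named in this thread.

```python
import json


def backup_entry(raw_name: bytes, platform_codec: str) -> str:
    """Serialize a raw filename without ever rejecting it (hypothetical helper)."""
    # latin-1 maps each byte 0x00-0xFF to exactly one code point, so this
    # decode cannot raise, whatever bytes the filesystem handed us.
    transport = raw_name.decode("latin-1")
    return json.dumps({
        "name": transport,                 # hypothetical key for the transport Unicode
        "decoded-with": "latin-1",         # codec needed to get the raw bytes back
        "platform-codec": platform_codec,  # what the source platform claimed to use
    })
```

Because the decode is lossless, entry["name"].encode(entry["decoded-with"]) 
gives back the original bytes even for names that are not valid in the 
declared platform codec.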

Restores will usually succeed, and when they don't we can then figure out how 
to make them succeed.

The restore algorithm looks like:

1. Decode JSON and retrieve transport Unicode.
2. Apply "decoded-with" encoder to recover source platform raw string.
3. Apply "platform-codec" decoder to (hopefully) obtain correct Unicode.
4. Apply destination platform encoding to get destination platform raw string.

For names from Windows source systems, "decoded-with" will be None, so steps 2 
and 3 will be skipped.  If the source and destination platforms use the same 
codec, steps 3 and 4 could be skipped.
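
The four steps and the two shortcuts can be sketched roughly as follows. This 
is a minimal illustration, not the actual implementation: the function name 
restore_name and the "name" key are hypothetical, and it assumes the result 
wanted on the destination is a byte string.

```python
import codecs
import json


def restore_name(entry_json: str, dest_codec: str) -> bytes:
    """Turn a stored entry back into a filename for the destination platform."""
    # Step 1: decode JSON and retrieve the transport Unicode.
    entry = json.loads(entry_json)
    name = entry["name"]  # hypothetical key holding the transport Unicode
    decoded_with = entry.get("decoded-with")
    platform_codec = entry.get("platform-codec")

    if decoded_with is not None:
        # Step 2: recover the source platform's raw bytes.
        raw = name.encode(decoded_with)
        # Same codec on both sides: the raw bytes can be used as they are,
        # so steps 3 and 4 are skipped.
        if codecs.lookup(platform_codec).name == codecs.lookup(dest_codec).name:
            return raw
        # Step 3: decode with the source platform codec. This raises
        # UnicodeDecodeError if the name was not valid in that codec.
        name = raw.decode(platform_codec)

    # Step 4: encode for the destination platform. For Windows sources
    # ("decoded-with" is None) this is the only step after step 1.
    return name.encode(dest_codec)
```

For example, an entry whose transport name is "café", with "decoded-with" and 
"platform-codec" both "latin-1", restores on a UTF-8 destination as 
b"caf\xc3\xa9".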

For names that are invalid, the decoding in step 3 may fail, which is an 
error.  Ultimately we can expend much cleverness trying to address those 
errors.  If that turns out to be important, fine, the data needed will be 
available.

	Shawn.

