[tahoe-dev] #534: "tahoe cp" command encoding issue

Fri Feb 27 09:57:08 PST 2009

Oh, and: if we had to fall back to Latin-1 on the inbound side  
(because of a UnicodeDecodeError), we set a separate metadata flag. On  
the outbound side, if this flag is set, we use the original filename  
bytestring. If not, we attempt to encode the unicode childname into  
the local scheme.
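
To make the two directions concrete, here is a rough Python sketch of what I have in mind. It is not existing tahoe code: the function names (inbound, outbound) and the metadata keys (original_filename_b32, latin1_fallback) are made up for illustration, and it uses the stdlib base32 alphabet purely as an assumption.

```python
import base64
import json


def inbound(raw_name: bytes, encoding: str = "utf-8"):
    """Turn a local filename bytestring into (childname, metadata)."""
    try:
        childname = raw_name.decode(encoding)
        fallback = False
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        childname = raw_name.decode("latin-1")
        fallback = True
    metadata = {
        # Hypothetical key names. JSON cannot carry raw bytes, so the
        # original bytestring is base32-encoded into ASCII text.
        "original_filename_b32": base64.b32encode(raw_name).decode("ascii"),
        "latin1_fallback": fallback,
    }
    return childname, metadata


def outbound(childname: str, metadata: dict, encoding: str = "utf-8",
             use_original: bool = False) -> bytes:
    """Pick the local filename bytestring to write on the way out."""
    if use_original or metadata.get("latin1_fallback"):
        # The childname is only a best guess here: use the stored bytes.
        return base64.b32decode(metadata["original_filename_b32"])
    return childname.encode(encoding)


# A Latin-1 name that is not valid UTF-8 takes the fallback path.
raw = b"caf\xe9.txt"
name, meta = inbound(raw)
wire = json.dumps(meta)  # the metadata survives a JSON round trip
restored = outbound(name, json.loads(wire))
```

In this sketch a name that decodes cleanly is re-encoded into whatever the local scheme is on the way out, so it may legitimately come back as different bytes on a different platform.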

For filesystems with names that are mostly encodable, this will give
us proper unicode names for most files (which can be useful to others,
on different platforms with different encodings), while still allowing
the original uploader to get a faithful roundtrip. If they have files
which misdecode (i.e. fn.decode("utf-8") doesn't raise an exception but
also doesn't produce the result that you want), they can use the
--use-original flag to force a faithful roundtrip.
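
As an illustration of a misdecode, here is a small Python sketch. The filename is an invented example: bytes that were written as Latin-1 but also happen to be valid UTF-8, so the decode succeeds silently with the wrong text. The base32 step stands in for the stored metadata value and is an assumption, not existing tahoe code.

```python
import base64

# What the uploader meant: two Latin-1 characters (U+00C3, U+00A9).
intended = "\u00c3\u00a9.txt"
raw = intended.encode("latin-1")      # b'\xc3\xa9.txt' on disk

# Decoding as UTF-8 raises nothing, but yields one character (U+00E9).
decoded = raw.decode("utf-8")

# The stored original bytes still give back exactly what was on disk.
stored = base64.b32encode(raw).decode("ascii")
restored = base64.b32decode(stored)
```

Here decoded differs from intended even though no exception was raised, which is the case where forcing the original bytestring is the only way to be sure of what comes back.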

  -Brian

On Feb 27, 2009, at 9:45 AM, Brian Warner <warner at lothar.com> wrote:

> [must be brief, typing on an iphone, I'll write more on Monday when
> I've got a real keyboard]
>
> One limitation to keep in mind is that JSON cannot represent arbitrary
> binary data without application-visible encoding, and that both the
> webapi GET $dircap?t=json and the dirnode-format metadata dict use
> JSON. So any "store the original bytes and let the reader sort it out"
> approach must e.g. base32-encode those bytes on the way in and base32-
> decode them on the way out, in the CLI tool on the user side of the
> HTTP connection.
>
> How about this: we treat the child name (which has more users right
> now, in terms of lines of code which think they know how to interpret
> it) as being the "share with others" name: always unicode, but not
> always a faithful roundtrippable representation of the original. Then,
> for files which were copied from a local disk (like with "tahoe cp" or
> "tahoe backup", as opposed to a WUI operation), let's add a metadata
> field that is defined to hold the base32-encoded representation of the
> original uninterpreted filename bytestring, and treat this metadata
> field as the "note to myself" value, used to restore from a backup but
> not meant for other users.
>
> On the inbound side, if we can't decode the filename with the user's
> preferred encoding (which can default to utf-8, or utf-16 on Windows,
> or something configured into Python, etc), then we pretend to decode
> it with Latin-1, so that a human looking at the mangled unicode name
> can hopefully guess what the proper name should have been. We use the
> unicode result as the childname. In all cases, we store the original
> bytestring in the metadata.
>
> Then, on the outbound side, we add a --use-original-binary-filename
> option, which tells "tahoe cp" to ignore the unicode name and just use
> the bytestring from the metadata. Normally, we have it encode the
> unicode childname into the preferred charset (again with some
> defaults) and ignore the metadata.
>
> Thoughts?
>  -Brian