[tahoe-dev] Unicode issues review

Brian Warner warner-tahoe at allmydata.com
Tue Feb 17 15:34:33 PST 2009


On Tue, 17 Feb 2009 12:52:56 -0700
Shawn Willden <shawn-tahoe at willden.org> wrote:

> Since you control the dirnode format, wouldn't it be easier just to
> add a "this isn't Unicode" flag, rather than translating to a
> reserved range?  If the flag is set, the name is opaque binary data.
> Otherwise, it's UTF-8.
> 
> Hmm.  That suggests another option for Zooko's list:  Provide
> per-file name encoding in the dirnode format.  Set it to whatever the
> FS says it should be set to.

Hrm. Well, we could rev the dirnode format (introducing a compatibility
break: older tahoe clients would be unable to read those directories). We've
been planning to do this anyways, when we move to ECDSA-based dirnodes (to
add traversal caps, and to remove the now-redundant HMAC, and to let the
tables be parsed faster). Such a break would let us add a childname-encoding
field for each child.

Or, we could add a childname-encoding field into the metadata in the current
dirnode format, which would be more backwards-compatible.

The larger problem remains, though: even if we change Tahoe's internal format
to use (encoding, encoded-name-bytestring) instead of (utf8-name-bytestring),
how should the webapi share this with the outside world? The primary
machine-oriented webapi (which zooko refers to as the "WAPI", to distinguish
it from the human/browser-oriented "WUI") uses JSON to publish the directory
contents, and JSON only supports unicode strings, so any bytestrings would
have to be encoded down to ASCII (like, base64 or something), and the clients
would have to expect an (encoding, encoded-name-base64string) where they
currently get (unicode-name). Eww. This would expose the compatibility break
to webapi clients.
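
As a rough illustration of what that would mean for WAPI clients, here is a
minimal sketch in Python 3. The function names and the JSON keys ("encoding",
"name_base64") are hypothetical; nothing here is part of the actual Tahoe
webapi.

```python
import base64
import json


def child_to_json(encoding, name_bytes):
    """Serialize a hypothetical (encoding, encoded-name) pair as JSON.

    JSON strings are Unicode text, so the raw bytestring is carried as
    base64, which is plain ASCII.
    """
    return json.dumps({
        "encoding": encoding,  # e.g. "latin-1"; None becomes JSON null
        "name_base64": base64.b64encode(name_bytes).decode("ascii"),
    })


def child_from_json(text):
    """Recover the (encoding, name_bytes) pair on the client side."""
    entry = json.loads(text)
    return entry["encoding"], base64.b64decode(entry["name_base64"])
```

Every client would have to perform the base64 step and then decide what to do
with the encoding label, instead of simply reading a unicode name.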

There would also be a number of internal changes; we'd probably want to
define an AnyEncodingString class, which would behave somewhat like a unicode
object, but would internally contain an encoding-name and a bytestring.
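
A minimal sketch of such a class, in Python 3. Only the class name comes from
the paragraph above; the attribute names, the to_unicode() method, and the use
of None for an unknown encoding are assumptions made for illustration.

```python
class AnyEncodingString:
    """A bytestring paired with the name of the encoding it is in."""

    def __init__(self, data, encoding=None):
        self.data = bytes(data)
        self.encoding = encoding  # None means "encoding unknown"

    def to_unicode(self):
        """Decode to text, which is only possible if the encoding is known."""
        if self.encoding is None:
            raise ValueError("encoding unknown; cannot decode")
        return self.data.decode(self.encoding)

    def __eq__(self, other):
        if not isinstance(other, AnyEncodingString):
            return NotImplemented
        return (self.encoding, self.data) == (other.encoding, other.data)

    def __hash__(self):
        return hash((self.encoding, self.data))

    def __repr__(self):
        return "AnyEncodingString(%r, %r)" % (self.data, self.encoding)
```

Note that this sketch compares the raw (encoding, bytes) pair, so the same
name stored under two different encodings would not compare equal. A real
implementation would have to decide whether that is acceptable.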

The idea of defining tahoe dirnodes as using Unicode was to accommodate
everything. It's a pity that the problems seem to lie in 0x80-0xff, rather
than in some more exotic code plane... like a runner jumping out of the
starting blocks to find that their shoelaces are tied together.

I suppose that 99% of local file names *are* representable in unicode
somehow, but the real problem is that the node (on the near side of
os.listdir) doesn't know what encoding to use, and has no clear way to
pick one.

Ah, which means that storing childname-encoding in the dirnode doesn't
actually help, because the real problem is that we don't know what that
encoding is. If we knew what value to store, we could have simply converted
the childname into unicode and then into UTF-8. Unless we permitted an "I
don't know what encoding this bytestring is" value: that would perhaps tell
the output side to simply feed the unknown-encoding bytestring to open() and
hope that the downloading user is using the same conventions as the uploading
user was.
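
The "unknown encoding" idea could be sketched roughly as follows, in Python 3.
The helper names are hypothetical, and the rule used on the upload side (treat
the name as UTF-8 if it decodes strictly, otherwise mark it unknown) is an
assumption chosen for illustration, not something the node is known to do.

```python
import os

UNKNOWN = None  # hypothetical marker for "I don't know the encoding"


def tag_name(name_bytes):
    """Upload side: label a raw filename as UTF-8 or as unknown."""
    try:
        name_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return UNKNOWN, name_bytes
    return "utf-8", name_bytes


def local_path(target_dir, encoding, name_bytes):
    """Download side: build the path that would be handed to open().

    For an UNKNOWN name the bytes are passed through untouched, in the hope
    that the downloading user's filesystem follows the same conventions as
    the uploader's did.
    """
    if encoding is UNKNOWN:
        return os.path.join(os.fsencode(target_dir), name_bytes)
    return os.path.join(target_dir, name_bytes.decode(encoding))
```

On the upload side, os.listdir() returns bytes names when it is given a bytes
path, so a caller could run tag_name() over os.listdir(b".").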

Sigh. As the t-shirt says, "I (empty square box) Unicode".

 -Brian

