[tahoe-dev] String encoding in tahoe
Brian Warner
warner-tahoe at allmydata.com
Fri Jan 2 16:28:21 PST 2009
On Tue, 23 Dec 2008 16:09:25 -0600
"Dan McNair" <glucnac at gmail.com> wrote:
> Curious: does Tahoe support arbitrary binary strings as filenames in the
> backend, or only accept certain encodings? HTTP certainly supports arbitrary
> byte sequences, ugly though it may be. I don't recall anything from my scan
> of the DIR2 documentation that would cause problems with filenames in
> arbitrary encoding(s).
Tahoe's directories are specified to have child names which are unicode
strings. Internally it encodes those unicode strings into UTF-8 before
serializing them into the mutable file contents, but that should be opaque to
clients.
As a result of this specification, Tahoe cannot accept arbitrary 0x80-0xff
bytes in filenames. When the user is trying to take a non-unicode bytestring
(say, from their local disk filesystem) and use it in a Tahoe directory,
we'll have problems.
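To make the failure mode concrete, here is a minimal sketch in Python 3 (where str is the unicode type). The helper name is_valid_utf8_name is hypothetical and not part of Tahoe; it only shows that a unicode name round-trips through UTF-8, while the same name taken from a Latin-1 filesystem does not decode as UTF-8.

```python
# Hypothetical helper for illustration only; not part of Tahoe's API.
def is_valid_utf8_name(raw: bytes) -> bool:
    """Return True if the bytestring decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A unicode child name survives the UTF-8 round trip used for storage.
name = "r\u00e9sum\u00e9.txt"
stored = name.encode("utf-8")
assert stored.decode("utf-8") == name

# The same name as written by a Latin-1 filesystem contains a bare 0xE9
# byte, which is not valid UTF-8 on its own.
latin1_bytes = name.encode("latin-1")
print(is_valid_utf8_name(stored))        # valid UTF-8
print(is_valid_utf8_name(latin1_bytes))  # not valid UTF-8
```

Without knowing the original encoding there is no way to tell which unicode name the second bytestring was meant to be.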
A thread on the python-dev mailing list covered this sort of thing about a
month ago, in the context of how Python 3.0 ought to handle the program's
external boundaries (sys.argv, os.environ, os.listdir, etc). I
think it was Glyph who pointed out that some systems (KDE?) actually convert
high-bit non-ASCII bytes into a special reserved range of unicode, so that
they can at least reverse the transformation and restore the original
(non-unicode who-knows-what-encoding) filename later on. Tahoe could
conceivably do the same.
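As a rough illustration of that reversible-mapping idea, Python 3 ships a "surrogateescape" error handler that maps each undecodable byte 0x80-0xFF to a lone surrogate in U+DC80-U+DCFF and restores it on encoding. This is only a sketch of the technique using Python's own handler; it is not something Tahoe does, and whether it matches what other systems do is an assumption.

```python
# Sketch of a reversible bytes -> unicode mapping using Python 3's
# built-in "surrogateescape" error handler. Not Tahoe's behavior.
raw = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8

# Undecodable bytes become lone surrogates (0xE9 -> U+DCE9).
escaped = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = escaped.encode("utf-8", "surrogateescape")
assert restored == raw

# The escaped string is not strictly encodable, so it cannot be stored
# as plain UTF-8 without the handler.
try:
    escaped.encode("utf-8")
    strictly_encodable = True
except UnicodeEncodeError:
    strictly_encodable = False
```

The catch is visible in the last step: the escaped name is not well-formed unicode, so anything that serializes names as strict UTF-8 would have to know about the convention.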
Tahoe's internal dirnode interfaces (add_child, list, rename, delete, etc)
are all defined in terms of unicode objects (and throw an exception if you
give them a bytestring instead of a unicode instance). We should push this
requirement out as far as we can, which is basically the boundary of the
program (sys.argv, or the webapi's HTTP URL / form body). If the OS has some
way to define what encoding is being used for the filename-ish pieces of
sys.argv (maybe sys.getfilesystemencoding() or something?), then we can use
that; otherwise the intent of the current CLI code is to assume UTF-8. The
webapi is intended to require UTF-8 in the URL, and to use the "_charset"
convention in form bodies (and default to UTF-8 if not provided).
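A minimal sketch of that boundary conversion, in Python 3, might look like the following. Both function names (boundary_to_unicode, require_unicode) are hypothetical and are not Tahoe's actual code; the sketch assumes sys.getfilesystemencoding() is the right source for the local encoding and falls back to UTF-8 when it reports nothing.

```python
import sys


# Hypothetical helper: decode a filename-ish bytestring at the edge of
# the program, so everything inside only ever sees unicode.
def boundary_to_unicode(raw, encoding=None):
    if isinstance(raw, str):
        return raw  # already unicode, nothing to do
    # Prefer an explicit encoding, then the OS-reported one, then UTF-8.
    enc = encoding or sys.getfilesystemencoding() or "utf-8"
    # Strict decoding: bad input raises UnicodeDecodeError here, at the
    # boundary, rather than deep inside the directory code.
    return raw.decode(enc)


# Hypothetical stand-in for the check an internal interface could make.
def require_unicode(name):
    if not isinstance(name, str):
        raise TypeError("child names must be unicode, got %r" % type(name))
    return name


child = require_unicode(boundary_to_unicode(b"caf\xc3\xa9", "utf-8"))
```

The point of the split is that decoding policy lives in one place, and internal code can simply refuse bytestrings.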
cheers,
-Brian