[tahoe-dev] Unicode issues review

Jan-Benedict Glaw jbglaw at lug-owl.de
Mon Feb 23 03:09:06 PST 2009


On Wed, 2009-02-18 11:40:43 +0100, Francois Deppierraz <francois at ctrlaltdel.ch> wrote:
> 
> unknown encoding -> Unicode -> UTF-8 -> Unicode -> unknown encoding
> 
> I'm googling a bit to find out how other projects have implemented that.

I thought about this a bit longer, trying to recall the more
interesting cases where filename encoding has been an issue for me in
the past.

  -1-	Using DOS (FAT) or Windows, I ran into restrictions either in
	what could be represented in the FS (DOS: some chars reserved,
	names generally interpreted with a locally configured codepage)
	or in the operating system imposing artificial limits to keep
	compatibility (Windows+NTFS (UTF-16): certain chars are not
	allowed, to keep old DOS applications happy).

  -2-	Samba+NFS with mixed Windows and Linux clients.  Initially,
	the Samba server wasn't really configured wrt. filename
	encoding, so (Windows) clients saved files with CP850 (Western
	Europe, containing German umlauts) encoding. Later on, we
	switched to UTF-8 throughout for the local store, which
	"invalidated" the old filenames, because they were broken in
	the sense of not being valid UTF-8.

  -3-	Shared NFS used from different machines/users using/preferring
	different encodings. This was eventually cleaned up by using
	UTF-8 throughout.
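
To illustrate case -2-, here is a minimal sketch in present-day
Python 3 (not taken from the Tahoe code). It shows why a name written
in CP850 stops being usable once everything else assumes UTF-8, and
how a repair with a known source charset looks. The example filename
and the helper is_valid_utf8() are made up for illustration.

```python
# Hypothetical filename with German umlauts ("Gruesse.txt" with u-umlaut and sharp s),
# written with escapes so the snippet does not depend on the source file encoding.
name = "Gr\u00fc\u00dfe.txt"

# What a CP850-configured client would have written to disk.
cp850_bytes = name.encode("cp850")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8 (illustrative helper)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Repair is only possible when the original charset is known.
repaired = cp850_bytes.decode("cp850").encode("utf-8")

print(is_valid_utf8(cp850_bytes))
print(is_valid_utf8(repaired))
```

The CP850 bytes for the umlauts are lone high bytes, which cannot
start a UTF-8 sequence, so the strict decode fails on the old name and
succeeds on the repaired one.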


To sum up, the solution was always to convert the filenames
to UTF-8 (or UTF-16 in the NTFS case) for storage. With this in mind,
I'd implement exactly this:

  * Store a file to Tahoe:
	* If an iconv call converting the filename from UTF-8 to UTF-8
	  (with //TRANSLIT not set) succeeds, I'd accept the filename and
	  store it (internally) as UTF-8.
	* If that doesn't work, refuse the filename and *force* the
	  user to supply a from-charset name so it can be converted to
	  UTF-8. In any case, always allow a from-charset name to be
	  supplied.

  * Restore a file from Tahoe:
	* Just try to use the UTF-8 encoded filename in the local
	  filesystem. Fail loudly if we get an error upon open().
	* Always allow some switch to choose a to-charset name and
	  shift the internal buffer through iconv.
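
The store and restore rules above can be sketched as follows. This is
only an illustration under assumptions: it uses Python 3's built-in
strict codecs as a stand-in for iconv without //TRANSLIT, and the names
FilenameEncodingError, filename_to_stored() and stored_to_filename()
are hypothetical, not part of Tahoe.

```python
from typing import Optional


class FilenameEncodingError(ValueError):
    """Raised when a filename cannot be converted without loss (hypothetical)."""


def filename_to_stored(raw: bytes, from_charset: Optional[str] = None) -> bytes:
    """Store path: return the UTF-8 form of raw, or refuse it.

    Without from_charset the name must already be valid UTF-8. Strict
    decoding plays the role of iconv without //TRANSLIT here.
    """
    charset = from_charset or "utf-8"
    try:
        text = raw.decode(charset, errors="strict")
    except UnicodeDecodeError as exc:
        raise FilenameEncodingError(
            f"name is not valid {charset}; please supply a from-charset"
        ) from exc
    return text.encode("utf-8")


def stored_to_filename(stored: bytes, to_charset: Optional[str] = None) -> bytes:
    """Restore path: return the bytes to hand to the local filesystem.

    By default the stored UTF-8 name is used as-is; with to_charset it is
    converted, failing loudly if the target charset cannot represent it.
    """
    text = stored.decode("utf-8", errors="strict")
    charset = to_charset or "utf-8"
    try:
        return text.encode(charset, errors="strict")
    except UnicodeEncodeError as exc:
        raise FilenameEncodingError(
            f"name cannot be represented in {charset}"
        ) from exc
```

For example, filename_to_stored(raw) refuses a CP850 name containing
umlauts, while filename_to_stored(raw, "cp850") accepts it and returns
UTF-8. An error from open() on restore is left to the caller, as
suggested above.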


Besides clients using local file access, there's also the web
interface. But I guess this is quite a simple thing, because UTF-8
should basically always work, as long as '<', '>', '"' and '&' are
quoted.
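
A minimal sketch of that quoting, assuming Python 3 and its standard
html.escape() function. The function render_filename_cell() and the
<td> markup are made up for illustration and say nothing about how the
Tahoe web interface is actually written.

```python
import html


def render_filename_cell(name: str) -> bytes:
    """Return a UTF-8 encoded HTML table cell for a filename (illustrative)."""
    # html.escape() replaces '&', '<' and '>'; with quote=True it also
    # replaces '"' and "'". Non-ASCII characters pass through unchanged
    # and are simply encoded as UTF-8.
    return ("<td>" + html.escape(name, quote=True) + "</td>").encode("utf-8")
```

The page then only has to declare UTF-8 as its charset for the
browser to display the names correctly.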

MfG, JBG

-- 
      Jan-Benedict Glaw      jbglaw at lug-owl.de              +49-172-7608481
Signature of:                 Friends are relatives you make for yourself.
the second  :

