[tahoe-dev] Unicode issues review
Jan-Benedict Glaw
jbglaw at lug-owl.de
Mon Feb 23 03:09:06 PST 2009
On Wed, 2009-02-18 11:40:43 +0100, Francois Deppierraz <francois at ctrlaltdel.ch> wrote:
>
> unknown encoding -> Unicode -> UTF-8 -> Unicode -> unknown encoding
>
> I'm googling a bit to find out how other projects have implemented that.
I thought a bit longer about the topic, trying to find the more
interesting examples where filename encoding was an issue in the past
for me.
-1- Using DOS (FAT) or Windows, I had restrictions either with
what could be represented in the FS (DOS, some chars reserved,
generally interpreted with a locally configured codepage) or
with the Operating System imposing artificial limits to keep
compatibility (Windows+NTFS (UTF-16), not allowing certain
chars to keep old DOS applications happy.)
-2- Samba+NFS with mixed Windows and Linux clients. Initially,
the Samba server wasn't really configured wrt. filename
encoding, so (Windows) clients saved files with CP850 (Western
Europe, containing German umlauts) encoding. Later on, we
switched to UTF-8 throughout for the local store, which
"invalidated" the filenames, because they were broken in the
sense of not being valid UTF-8.
-3- Shared NFS used from different machines/users using/preferring
different encodings. This was eventually cleaned up by using
UTF-8 throughout.
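
The breakage described in -2- is easy to reproduce. The following is only
a minimal Python sketch: the filename and the helper is_valid_utf8() are
made up for illustration and are not taken from Samba or Tahoe.

```python
# Hypothetical filename ("Gruesse.txt" written with u-umlaut and sharp s),
# as a Windows client would have stored it on the unconfigured Samba share.
name = "Gr\u00fc\u00dfe.txt"
cp850_bytes = name.encode("cp850")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8 (illustrative helper)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The CP850 bytes for the umlauts are not a valid UTF-8 sequence ...
legacy_ok = is_valid_utf8(cp850_bytes)
# ... while the same name encoded as UTF-8 obviously is.
utf8_ok = is_valid_utf8(name.encode("utf-8"))
# Repairing an old name means decoding with the *original* charset first.
repaired = cp850_bytes.decode("cp850").encode("utf-8")
```

The repair only works if the original charset is known, which is why the
old filenames could not simply be fixed automatically.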
To sum up, every time the solution was to convert the filenames to
UTF-8 (or UTF-16 in the NTFS case) for storage. With this in mind,
I'd implement exactly this:
* Store a file to Tahoe:
  * If an iconv call converting the filename from UTF-8 to UTF-8
    (with //TRANSLIT not set) succeeds, I'd accept the filename
    and store it (internally) as UTF-8.
  * If that doesn't work, refuse the filename and *force* the
    user to supply a from-charset name, so that it can be
    converted to UTF-8. In addition, always allow a from-charset
    name to be supplied.
* Restore a file from Tahoe:
  * Just try to use the UTF-8 encoded filename in the local
    filesystem. Fail loudly if we get an error upon open().
  * Always offer a switch to choose a to-charset name and
    pass the internal buffer through iconv.
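
The store/restore rules above could be sketched roughly as follows. This
is not Tahoe code: the function names and the from_charset/to_charset
parameters are hypothetical, and Python's strict codecs stand in for
iconv without //TRANSLIT (invalid input is an error, nothing gets
substituted).

```python
from typing import Optional


def filename_to_utf8(raw: bytes, from_charset: Optional[str] = None) -> bytes:
    """Store side: return the filename as UTF-8 bytes or raise ValueError."""
    charset = from_charset or "utf-8"
    try:
        # Strict decoding: no transliteration, no replacement characters.
        text = raw.decode(charset, errors="strict")
    except UnicodeDecodeError as exc:
        if from_charset is None:
            # Refuse, and force the user to name the source charset.
            raise ValueError(
                "filename is not valid UTF-8; supply a from-charset"
            ) from exc
        raise ValueError(f"filename is not valid {from_charset}") from exc
    return text.encode("utf-8")


def filename_from_utf8(stored: bytes, to_charset: Optional[str] = None) -> bytes:
    """Restore side: return the bytes to hand to the local filesystem."""
    if to_charset is None:
        # Default: use the stored UTF-8 name as-is; open() may still fail.
        return stored
    # Raises UnicodeEncodeError if the name has no representation
    # in to_charset, i.e. it fails loudly instead of guessing.
    return stored.decode("utf-8").encode(to_charset, errors="strict")
```

Note that a single-byte charset such as CP850 accepts any byte sequence,
so a wrongly chosen from-charset cannot be detected this way; the user
has to know what the files were written with.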
Besides clients using local file access, there's also the web
interface. But I guess this is quite a simple thing, because UTF-8
should basically always work, as long as '<', '>', '"' and '&' are
quoted.
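
For the web side, the quoting could look like the sketch below. It
assumes the page itself is served as UTF-8; filename_to_html() is a
made-up name, not an existing Tahoe function, and html.escape() is the
Python standard library call.

```python
from html import escape


def filename_to_html(stored: bytes) -> str:
    """Turn a stored UTF-8 filename into text that is safe inside HTML."""
    text = stored.decode("utf-8")
    # quote=True also escapes '"' (and "'"), so the result can be used
    # inside attribute values as well as in element content.
    return escape(text, quote=True)
```

Non-ASCII characters pass through unchanged; only the markup characters
are replaced by entities.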
MfG, JBG
--
Jan-Benedict Glaw jbglaw at lug-owl.de +49-172-7608481
Signature of the second: Friends are relatives you make for yourself.