[tahoe-dev] Unicode issues review

Jan-Benedict Glaw jbglaw at lug-owl.de
Tue Feb 17 12:55:59 PST 2009


On Tue, 2009-02-17 11:09:09 -0700, zooko <zooko at zooko.com> wrote:
> On Feb 17, 2009, at 11:03 AM, Jan-Benedict Glaw wrote:
> > This happens.
> ...
> > So technically, "encoding" is a per-file property on some  
> > filesystems (those that don't care about a filename's contents, as  
> > long as it doesn't contain the directory delimiter (typically '/'  
> > or '\\') or the '\0' (end of string)).
> 
> Ugh.  This would be fine if the filesystem stored and provided the  
> information about what encoding was used for each name, but I'm  
> betting they don't do that.  :-)

They don't.  It's up to the user's locale setting to indicate the
filename's encoding, or (as the final say) the user may explicitly
supply that information.
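
A rough sketch of that lookup order, in Python 3 (not Tahoe's actual
code; the helper name and the user_override parameter are made up for
illustration):

```python
import locale


def guess_filename_encoding(user_override=None):
    """Pick the encoding to assume for filenames (hypothetical helper).

    An explicit choice by the user always wins; otherwise fall back to
    whatever the current locale reports as its preferred encoding.
    """
    if user_override:
        return user_override
    # False = do not call setlocale(); just read the current settings.
    return locale.getpreferredencoding(False)
```

Note that this only tells you what the user claims; nothing checks
that any given filename on disk is actually valid in that encoding.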

> So, what should Tahoe do?
> 
> 1.  Always treat filenames as opaque blobs.  This means Tahoe is  
> losing information that some filesystems (e.g. NTFS) provide, and  
> making it harder for users on the other side of Tahoe to  
> unambiguously decode those filenames.

What will we lose?

> 2.  If the filesystem guarantees a specific encoding, use that one,  
> else treat the filename as an opaque blob.
> 
> 3.  If the filesystem guarantees a specific encoding, use that one,  
> else if it provides a "default" encoding, then try to decode with  
> that one, and if decoding fails then reject the filename and ask the  
> user to fix it up.
> 
> 3.b.  ... and if decoding fails then treat the filename as an opaque  
> blob.
> 
> 3.c.  ... and if decoding fails then try to decode it with a few  
> dozen of our favorite encodings in descending order of popularity ...
> 
> 4.  Any other options?

I'd go for plain 3: use the known FS encoding (some filesystems, by
the way, store filenames in one encoding and present another inside
the VFS; the VFS one is the relevant encoding here), otherwise use the
user's locale, and if that doesn't work, bail out.
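
To make plain 3 concrete, here is a minimal Python 3 sketch.  It is
only an illustration under my own assumptions: decode_filename, its
parameters and the UndecodableFilename exception are invented names,
and how one learns the filesystem's guaranteed encoding is left out.

```python
import locale


class UndecodableFilename(ValueError):
    """Raised when a filename cannot be decoded (hypothetical name)."""


def decode_filename(raw, fs_encoding=None, locale_encoding=None):
    """Decode filename bytes strictly, rejecting anything that fails.

    fs_encoding: encoding guaranteed by the filesystem/VFS, or None if
                 the filesystem makes no promise.
    locale_encoding: the user's locale encoding; looked up if omitted.
    """
    encoding = (fs_encoding
                or locale_encoding
                or locale.getpreferredencoding(False))
    try:
        # bytes.decode() is strict by default: bad input raises.
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        raise UndecodableFilename(
            "cannot decode %r as %s; please rename the file or "
            "specify its encoding" % (raw, encoding)) from exc
```

So decode_filename(b"caf\xe9", locale_encoding="latin-1") gives a
proper string, while the same bytes with fs_encoding="utf-8" raise
UndecodableFilename instead of being guessed at.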

Internally, I'd probably try to always represent filenames as UTF-8.
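
Something along these lines, as a Python 3 sketch (to_internal_utf8
is a made-up name, and the source encoding is assumed to be already
known from the filesystem or the locale):

```python
def to_internal_utf8(raw, source_encoding):
    """Re-encode filename bytes as UTF-8 for internal use.

    Decoding is strict, so bytes that are not valid in
    source_encoding raise UnicodeDecodeError instead of being
    silently mangled.
    """
    return raw.decode(source_encoding).encode("utf-8")
```

A Latin-1 name such as b"caf\xe9" then becomes b"caf\xc3\xa9"
internally, and a name that is already valid UTF-8 passes through
unchanged.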

MfG, JBG

-- 
      Jan-Benedict Glaw      jbglaw at lug-owl.de              +49-172-7608481
Signature of:                http://catb.org/~esr/faqs/smart-questions.html
the second  :