[tahoe-dev] #534: "tahoe cp" command encoding issue

Fri Feb 27 20:51:19 PST 2009

>>>>> "zooko" == zooko  <zooko at zooko.com> writes:

    zooko> 2.d: Whatever that filename would have been if it had
    zooko> actually been encoded in latin-1 in the first place.  (I.e.,
    zooko> some sort of gibberish, if it wasn't actually latin-1.)

    zooko> 2.d&1/2: The same as 2.d, but prepended with the U+FFFC
    zooko> char

    zooko> 2.e: Whichever characters of that filename *are* legitimate
    zooko> for the filesystem's default codec, interspersed with U+FFFD
    zooko> "replacement characters" for any characters that aren't
    zooko> legitimate for the default codec.

... and save a latin1-encoded equivalent to that of 2.d in metadata.

    zooko> I tend to think that the first of those three options is the
    zooko> best, but I would defer to any established "best practices"
    zooko> among unicode gurus.  Remember that we're only talking about
    zooko> backwards-compatibility here -- the behavior of old tahoe
    zooko> clients who don't know how to do anything but treat the
    zooko> "child name" as a unicode string.  Also lazy tahoe clients
    zooko> who don't bother to check for this condition and get the
    zooko> original bytes and do "Whatever it is that diligent clients
    zooko> are supposed to do with a bunch of bytes in some unknown
    zooko> encoding.".

With 2.d and 2.e, both old and new tahoe clients will be able to handle
the filename as well as they can, with the same results. With 2.d&1/2,
new tahoe clients will behave correctly, but older ones will end up with
an extra character prepended to every filename that failed decoding on
upload. In terms of size, the real winner is 2.d&1/2, but to me it feels
like a return to the days when special flags were encoded in strings and
you had to use diagrams like the ones for bitmapped registers in
processor documentation :). 2.d always puts a "latin_fallback: bool"
flag in the metadata, while 2.e adds a latin1-encoded filename string to
the metadata only when needed.
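
To make the comparison concrete, here is a rough sketch of what each
proposal might store for a filename that fails decoding. It is only an
illustration under my own assumptions: the function names, the dict
layout and the "latin1_name" key are hypothetical (only "latin_fallback"
is a name used above), it assumes the filesystem codec is UTF-8, and it
is written for Python 3.

```python
OBJECT_REPLACEMENT = "\ufffc"  # U+FFFC, the marker proposed in 2.d&1/2


def entry_2d(raw: bytes, fs_codec: str = "utf-8") -> dict:
    """2.d: fall back to latin-1 and always record a boolean flag."""
    try:
        return {"name": raw.decode(fs_codec),
                "metadata": {"latin_fallback": False}}
    except UnicodeDecodeError:
        return {"name": raw.decode("latin-1"),
                "metadata": {"latin_fallback": True}}


def entry_2d_half(raw: bytes, fs_codec: str = "utf-8") -> dict:
    """2.d&1/2: same fallback, but the flag lives inside the name itself."""
    try:
        return {"name": raw.decode(fs_codec), "metadata": {}}
    except UnicodeDecodeError:
        return {"name": OBJECT_REPLACEMENT + raw.decode("latin-1"),
                "metadata": {}}


def entry_2e(raw: bytes, fs_codec: str = "utf-8") -> dict:
    """2.e: U+FFFD in the name, latin-1 string in metadata only if needed."""
    try:
        return {"name": raw.decode(fs_codec), "metadata": {}}
    except UnicodeDecodeError:
        return {"name": raw.decode(fs_codec, errors="replace"),
                # "latin1_name" is a hypothetical key, not an agreed one
                "metadata": {"latin1_name": raw.decode("latin-1")}}
```

Only 2.d&1/2 keeps the metadata empty in the failing case, which is
where its size advantage comes from.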

Codecs in Python offer three basic kinds of behavior when decoding or
encoding: raise an exception, ignore erroneous characters and strip them
during conversion, or replace erroneous characters with a placeholder.
That said, in all three proposals the latin1 decoding and encoding is
just a JSON-serializable way to be able to get the original bytestring
back on download.
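
A small sketch of those three behaviors and of the latin1 round trip,
written for Python 3 and assuming a filename that is really latin-1 but
gets decoded as UTF-8 (the sample bytes and the "name" key are made up
for the example):

```python
import json

raw = b"caf\xe9.txt"  # latin-1 bytes; \xe9 is not valid UTF-8 here

# 1. strict (the default): raise an exception
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 2. ignore: silently drop the bytes that cannot be decoded
ignored = raw.decode("utf-8", errors="ignore")

# 3. replace: substitute U+FFFD for the bytes that cannot be decoded
replaced = raw.decode("utf-8", errors="replace")

# latin-1 maps every byte 0..255 to the code point with the same number,
# so decoding never fails and the string survives a trip through JSON.
as_latin1 = raw.decode("latin-1")
serialized = json.dumps({"name": as_latin1})
recovered = json.loads(serialized)["name"].encode("latin-1")
```

Here `recovered` is byte-for-byte equal to `raw`, which is the only
property the proposals need from latin1.
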
Beyond this, tahoe comes with a webapi, and because of this and the
web/JS hype, I think that interoperability is a strength and that there
will certainly be more or less generic clients acting as "consumers" of
tahoe grids.
It's impossible to imagine all the situations in advance, and I
understand Zooko's tendency to focus on what is known: old and new tahoe
clients. But from a "meaning" perspective, in the case of 2.d it is
wrong to publish in the child name characters whose meaning is unknown,
and which are therefore wrongly mapped to unicode entities, hoping that
the client will know how to handle the situation. I think it is much
better to make the client aware that some characters failed the mapping,
while still preserving the option of a more advanced conversion strategy
using the latin1-decoded string in metadata (2.e).
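
As a sketch of what that could mean for a consumer of a 2.e-style entry
(Python 3; the entry layout and the "latin1_name" key are hypothetical,
chosen only for this example): a generic client can tell from the name
alone that something failed to map, and a more diligent one can get the
original bytes back.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, used by 2.e for unmappable bytes


def had_decoding_failure(name: str) -> bool:
    """A generic client only needs the child name to spot the problem."""
    return REPLACEMENT_CHAR in name


def original_bytes(entry: dict):
    """Return the uploaded bytes if the entry recorded them, else None."""
    latin1_name = entry.get("metadata", {}).get("latin1_name")
    if latin1_name is None:
        return None
    return latin1_name.encode("latin-1")


# Hypothetical entry for a file uploaded as b"caf\xe9.txt"
entry = {"name": "caf\ufffd.txt",
         "metadata": {"latin1_name": "caf\xe9.txt"}}

if had_decoding_failure(entry["name"]):
    raw = original_bytes(entry)  # b"caf\xe9.txt"; re-decode as desired
```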

It's time to go to bed!

Alberto