[tahoe-dev] [tahoe-lafs] #1051: capabilities from the future could have non-ascii characters

Wed Jul 14 23:29:57 UTC 2010

#1051: capabilities from the future could have non-ascii characters
-----------------------------+----------------------------------------------
     Reporter:  zooko        |       Owner:  warner                                             
         Type:  enhancement  |      Status:  new                                                
     Priority:  major        |   Milestone:  1.7.1                                              
    Component:  code         |     Version:  1.6.1                                              
   Resolution:               |    Keywords:  forward-compatibility newcaps newurls review-needed
Launchpad Bug:               |  
-----------------------------+----------------------------------------------

Comment (by warner):

 David-Sarah's analysis in comment:14 is mostly in line with my thinking.

 I object less to "filecaps are the UTF-8 encoding of some unicode string"
 than to "filecaps are unicode strings". This would let us say that filecaps
 are bytestrings, with the constraint that {{{filecap.decode("utf-8")}}}
 must not throw an exception, and perhaps the additional constraint that
 {{{filecap.decode("utf-8").encode("utf-8")==filecap}}}. If we went this
 way, we should say that the UTF-8-encoded form is the primary one (i.e., if
 you want to compare two filecaps, use {{{filecap1==filecap2}}}, not
 {{{filecap1.decode("utf-8")==filecap2.decode("utf-8")}}}).
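
 A minimal sketch of those two constraints, written for Python 3 (where
 bytestrings are {{{bytes}}}). The function names and the sample cap are
 made up for illustration and are not part of tahoe:

```python
def is_canonical_utf8_cap(filecap: bytes) -> bool:
    """True if filecap decodes as UTF-8 and re-encodes to the same bytes."""
    try:
        text = filecap.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Second constraint: the decode/encode round trip must be exact.
    return text.encode("utf-8") == filecap


def caps_equal(filecap1: bytes, filecap2: bytes) -> bool:
    """Compare the primary (byte) form, never the decoded form."""
    return filecap1 == filecap2


ok = is_canonical_utf8_cap(b"URI:CHK:abc")   # hypothetical cap string
bad = is_canonical_utf8_cap(b"\xff\xfe")     # not valid UTF-8
```

 With Python 3's strict decoder the round-trip check should be redundant,
 since overlong forms and encoded surrogates are already rejected at decode
 time; it is kept here only to mirror the constraint as stated.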

 That still feels weird, though: UTF-8 is an encoding of something else, and
 in general you want to be comparing the primary form, not some encoding
 thereof. And filecaps *must* be unambiguous. If you wanted to visually
 compare two ASCII filecaps, you could do it easily (in fact the base32
 alphabet takes out the o/0 1/i/I/l/L homoglyphs). While I don't expect
 people to do this much, the fact that two unicode strings simply cannot be
 safely compared this way has got to be a bad sign.
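
 To illustrate the comparison problem, here is a small Python 3 sketch with
 two strings that normally render identically but differ as code points and
 as UTF-8 bytes. The sample word is arbitrary:

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (U+0301)

# The two strings look alike on screen but are different sequences.
same_text = composed == decomposed
same_bytes = composed.encode("utf-8") == decomposed.encode("utf-8")

# Only after normalizing both sides do they compare equal.
same_after_nfc = (unicodedata.normalize("NFC", composed)
                  == unicodedata.normalize("NFC", decomposed))
```

 Here {{{same_text}}} and {{{same_bytes}}} are both False while
 {{{same_after_nfc}}} is True, so a visual comparison and a byte comparison
 can disagree.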

 If we really must accept more than just ASCII, then I'd prefer to accept
 completely arbitrary bytestrings. The biggest problem with doing this is
 the t=json WAPI: if I'd taken this issue at all seriously when I built the
 webapi, I would have defined the t=json format to emit base64-encoded
 filecaps or something similar. (Actually, at that point I did not yet
 realize that JSON could not handle arbitrary binary data. If I had, I might
 have skipped JSON altogether and used protocol buffers or netstrings or
 something.)

 But one option would be to have the t=json response leave out any filecap
 that cannot be expressed in printable ASCII (i.e., run a regexp against it
 before populating the child-info dictionary, and replace it with an
 "unknown cap" marker if that fails). I can't remember if we covered this
 one during the earlier caps-from-the-future discussion.
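
 A rough Python 3 sketch of that filtering step. The regexp, the marker
 value, and the function name are all assumptions for illustration; tahoe's
 actual t=json code is not shown here:

```python
import re

# Assumption: "printable ASCII" means bytes 0x21..0x7E (no spaces, no controls).
PRINTABLE_ASCII = re.compile(rb"[\x21-\x7e]+")
UNKNOWN_CAP_MARKER = "unknown-cap"  # hypothetical placeholder value


def cap_for_json(filecap: bytes) -> str:
    """Return the cap as text if it is printable ASCII, else a marker."""
    if PRINTABLE_ASCII.fullmatch(filecap):
        return filecap.decode("ascii")
    return UNKNOWN_CAP_MARKER


child_info = {
    "rw_uri": cap_for_json(b"URI:CHK:abc"),        # passes through unchanged
    "ro_uri": cap_for_json(b"URI:\xe4\xb8\xad"),   # replaced by the marker
}
```

 The check runs on the bytestring before anything is decoded, so no unicode
 conversion of an unrecognized cap ever takes place.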

 If we go with "filecaps are UTF-8 encoding of a unicode string", then the
 t=json API doesn't give enough information to clients to compare the real
 filecaps: all they can get is {{{filecap.decode("utf-8")}}}. In addition,
 at some point inside the webapi, we'd have to convert the filecaps into
 unicode before adding them to the JSON response. I'm really nervous about
 the information-losing behavior of unicode conversions, and security
 problems that can result.

 > Note that if we do use Unicode in caps in future, we should limit the
 > character set to characters for which normalization is not an issue.
 > (There are big blocks of Han characters with no equivalences, for
 > example.)

 Ugh... how can we make this safe? That is, when somebody pastes in a cap,
 how do we verify that it isn't using any characters in this set? Is this
 set even constant? When we're all speaking Lojban or Ilaksh or Marain or
 something in the future, won't there be new codepoints which the old code
 can't recognize as being non-normalizable?
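
 One way such a check could look in Python 3, as a sketch only: accept a
 string only if every normalization form leaves it unchanged. The function
 name is invented, and the answer depends on the Unicode database shipped
 with the interpreter, which is the versioning worry raised above:

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def is_normalization_stable(text: str) -> bool:
    """True if no normalization form would change the string."""
    return all(unicodedata.normalize(form, text) == text for form in FORMS)


# The verdict is only as current as this database version.
UNICODE_DB_VERSION = unicodedata.unidata_version

han_ok = is_normalization_stable("\u4e2d")      # a Han character
accent_ok = is_normalization_stable("\u00e9")   # decomposes under NFD
```

 As far as I know, code points that are unassigned in the interpreter's
 database pass through normalization untouched, so old code would report
 them as stable even if a later Unicode version gives them equivalences.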

 > A related but separate issue is how to plan for expansion to the
 > V2/V3/etc syntaxes.

 While parts of this may belong in other tickets, I think it remains
 relevant for this one. Your desire to plan for new things in our V1
 filecaps might actually be a desire to define and implement those V2/V3
 syntaxes (and improve the webapi to accept them, etc). So it may be better
 to leave the V1 syntax definition alone, leave certain Tahoe interfaces
 intolerant to the potential new forms, and declare that we'll replace those
 interfaces with V2+-tolerant ones before we start using those forms.

 == Re: behaviour-utf8-future-caps.dpatch ==

 Why the s/name/namex/g? Did you maybe mean to say
 "{{{name = unicode(namex)}}}" to highlight the transition from "unicode or
 bytestring" to "really unicode", and then leave the other instances of
 "name" alone?

 The {{{writecap = to_str(propdict.get("rw_uri"))}}} line performs the
 unicode-to-UTF8 conversion. This means that webapi users calling
 {{{t=mkdir-with-children}}} or {{{t=mkdir-immutable}}} are giving us
 unicode, not UTF-8 bytestrings (i.e. tahoe gets
 {{{callerwritecap.decode("utf-8").encode("utf-8")}}}, because the JSON
 library is doing a decode before tahoe proper sees the data). Worse yet,
 the decode and the encode are being done by different pieces of code (I'd
 hope that the JSON library uses python's {{{.decode}}} logic, but who
 knows?). That's the best way to implement the unicode-caps design, but it
 also makes it clear that this is not an exact transformation.
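
 The decode-then-encode path can be sketched in Python 3 as follows.
 {{{to_utf8_bytes}}} is a hypothetical stand-in for the helper in the patch,
 and the JSON bodies are invented:

```python
import json


def to_utf8_bytes(value):
    """Hypothetical stand-in for the patch's helper: text -> UTF-8 bytes."""
    if value is None or isinstance(value, bytes):
        return value
    return value.encode("utf-8")


# The JSON library decodes first, so the webapi code only ever sees text.
propdict = json.loads('{"rw_uri": "URI:\\u4e2d"}')
writecap = to_utf8_bytes(propdict.get("rw_uri"))

# Some strings that the JSON decoder accepts cannot be encoded back at all.
bad = json.loads('{"rw_uri": "\\ud800"}')   # a lone surrogate
try:
    to_utf8_bytes(bad["rw_uri"])
    lone_surrogate_encodes = True
except UnicodeEncodeError:
    lone_surrogate_encodes = False
```

 The second case shows the two halves disagreeing: the decoder run by the
 JSON library produces a value that the encoder on the tahoe side rejects.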

 I didn't review it earlier, but nodemaker.create_from_cap(name=) is weird.
 I'd be concerned about unicode creeping into an exception instance and then
 causing bytestring-only logging to break (such as when it is written to
 twistd.log). I'm not sure what a good solution is: I see how it's a bit
 easier to pass "extraneous" information down into a function that might
 raise an exception (and stuff it into the exception message down there),
 rather than e.g. catch the exception higher up (where knowing name= is a
 bit more natural) and somehow glue the name into the already-constructed
 exception object.
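
 For comparison, the "catch it higher up" alternative might look roughly
 like this in Python 3. Every name here is invented for the sketch, and the
 cap check is a placeholder, not tahoe's parsing logic:

```python
class UnknownCapError(Exception):
    """Hypothetical error for a cap string that is not recognized."""


def create_from_cap(cap: bytes):
    # Placeholder check; knows nothing about the child's name.
    if not cap.startswith(b"URI:"):
        raise UnknownCapError("unrecognized cap")
    return ("node", cap)


def create_child(name: str, cap: bytes):
    try:
        return create_from_cap(cap)
    except UnknownCapError as e:
        # Escape the name so the message stays pure ASCII for the log.
        safe_name = name.encode("ascii", "backslashreplace").decode("ascii")
        raise UnknownCapError("%s (child name: %s)" % (e, safe_name)) from e


try:
    create_child("caf\u00e9", b"bogus")
    message = None
except UnknownCapError as err:
    message = str(err)
```

 This keeps name= out of the lower-level function, at the cost of building a
 second exception object where the name is known.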

 == Re: test-utf8-future-caps.dpatch ==

 Hrm, could you reduce the instances of "failUnlessReallyEqual" to things
 that just test caps? Seeing it on things like
 {{{(c.getServiceNamed("storage").reserved_space, 0)}}} makes the patch
 awfully big. Hm, and if there were some clever way to make it the same
 length as "failUnlessEqual", that would reduce the noise even further (if
 you do this, which I don't think you should, note that
 len(assertTypeEqual)==len(failUnlessEqual)).

 I don't think using {{{failUnlessReallyEqual}}} in test_dirnode.py on
 things like {{{set(metadata.keys())}}} does everything you want it to: it
 will assert that both sides are of type Set, but it won't assert that the
 members of those sets are both of type string.
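
 A small Python 3 sketch of the gap. {{{really_equal}}} is an assumed
 analogue of the test helper (equal values and same top-level type), not its
 real implementation. Because Python 3 never treats text and bytes as equal,
 int versus float stands in for the unicode-versus-bytestring case:

```python
def really_equal(a, b):
    """Assumed analogue of the test helper: equal values, same outer type."""
    return a == b and type(a) is type(b)


def really_equal_sets(a, b):
    """Additionally require that the members' types match (crude check)."""
    return (really_equal(a, b)
            and sorted(type(x).__name__ for x in a)
            == sorted(type(x).__name__ for x in b))


ints = {1, 2}
floats = {1.0, 2.0}

outer_only = really_equal(ints, floats)          # both are sets, values equal
with_members = really_equal_sets(ints, floats)   # member types differ
```

 Here {{{outer_only}}} is True and {{{with_members}}} is False: the
 container-level type check passes even though the members differ in type.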

 In test_dirnode.py, I would call the new variables
 "future_unicode_write_uri", rather than "future_nonascii_write_uri", to
 make it clear that this is one possible direction (and that there are
 others).

 == Conclusions ==

 behaviour-utf8-future-caps.dpatch: yes, this patch is pretty harmless, I
 don't mind it going in.

 test-utf8-future-caps.dpatch: I see no problems with the patch per se, but
 I think the examples it uses set a bad precedent, by causing anyone reading
 the test to believe that tahoe's future caps will be unicode, which I think
 is a bad idea.

 I don't object to these two patches going in, but I will continue to object
 to the idea that the filecaps accepted by our existing interfaces (and
 stored in existing dirnodes) should be defined as unicode-encoded-to-UTF8.
 I think the best approaches are, in order of preference:

  1. continue to restrict filecaps to printable ASCII
  2. define filecaps as arbitrary bytestrings and replace the t=json WAPI
     interface which is unable to tolerate such a wide range

 I don't want to define filecaps to be unicode. Unicode exists to represent
 strings of written human languages. Filecaps are records/structs of
 cryptovalues. We have more tools to manipulate printable/copypastable
 strings than to manipulate abstract records of cryptovalues, so expressing
 filecaps as strings is convenient, but we should pick the encoding to serve
 tahoe's needs, rather than trying to make any conceivable
 written-human-language string meaningful as a tahoe filecap.

 That said, for users who have a solid unicode-friendly set of tools and
 want to tweet their filecaps, I don't object to an encoding scheme that
 somehow takes a filecap and expresses it as a string of unicode characters
 (this would be a "V4", in my V1/V2/V3 scheme from comment:11). But the
 tahoe interfaces that accept this need to be clearly marked, and I think
 the current t=json is not one of them.

-- 
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1051#comment:19>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized file storage grid