[tahoe-dev] [tahoe-lafs] #1051: capabilities from the future could have non-ascii characters
tahoe-lafs
trac at tahoe-lafs.org
Wed Jul 14 23:29:57 UTC 2010
#1051: capabilities from the future could have non-ascii characters
-----------------------------+----------------------------------------------
Reporter: zooko | Owner: warner
Type: enhancement | Status: new
Priority: major | Milestone: 1.7.1
Component: code | Version: 1.6.1
Resolution: | Keywords: forward-compatibility newcaps newurls review-needed
Launchpad Bug: |
-----------------------------+----------------------------------------------
Comment (by warner):
David-Sarah's analysis in comment:14 is mostly in line with my thinking. I object less to "filecaps are UTF-8 encoding of some unicode string" than to "filecaps are unicode strings". This would let us say that filecaps are bytestrings, but with a constraint that {{{filecap.decode("utf-8")}}} must not throw an exception, and perhaps the additional constraint that {{{filecap.decode("utf-8").encode("utf-8")==filecap}}}. If we went this way, we should say that the UTF-8-encoded form is the primary one (i.e., if you want to compare two filecaps, use {{{filecap1==filecap2}}}, not {{{filecap1.decode("utf-8")==filecap2.decode("utf-8")}}}).
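The two constraints above can be sketched in Python 3 as follows. This is only an illustration under stated assumptions: the function names are hypothetical and nothing like this exists in the patches under review. Note that Python 3's strict UTF-8 decoder already rejects overlong forms and encoded surrogates, so the round-trip check is a belt-and-braces test there.

```python
def is_wellformed_filecap(filecap: bytes) -> bool:
    """Hypothetical check: valid UTF-8 that re-encodes to the same bytes."""
    try:
        text = filecap.decode("utf-8")  # strict decoding; raises on bad input
    except UnicodeDecodeError:
        return False
    # Second constraint: decode followed by encode must reproduce the input.
    return text.encode("utf-8") == filecap


def same_filecap(a: bytes, b: bytes) -> bool:
    """Compare the primary (bytestring) form, never the decoded form."""
    return a == b
```

Keeping the comparison on bytes means two caps are equal only if they are byte-for-byte identical, whatever the decoded text looks like.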
That still feels weird, though: UTF-8 is an encoding of something else, and in general you want to be comparing the primary form, not some encoding thereof. And filecaps *must* be unambiguous. If you wanted to visually compare two ASCII filecaps, you could do it easily (in fact the base32 takes out the o/0 1/i/I/l/L homoglyphs). While I don't expect people to do this much, the fact that two unicode strings simply cannot be safely compared this way has got to be a bad sign.
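A small Python 3 illustration of why visual comparison of unicode strings is unsafe. The example strings are made up for this sketch and are not real caps.

```python
import unicodedata

# Two spellings of "café" that normally render identically on screen.
precomposed = "caf\u00e9"   # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by combining acute accent U+0301

# They are different strings and different UTF-8 bytestrings...
same_string = precomposed == decomposed
same_bytes = precomposed.encode("utf-8") == decomposed.encode("utf-8")

# ...and only compare equal after normalization.
same_after_nfc = (unicodedata.normalize("NFC", precomposed)
                  == unicodedata.normalize("NFC", decomposed))

# Homoglyphs are worse: Latin 'a' and Cyrillic 'а' look alike, and no
# normalization form maps one to the other.
latin_a = "a"
cyrillic_a = "\u0430"
homoglyphs_unified = unicodedata.normalize("NFKC", cyrillic_a) == latin_a
```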
If we really must accept more than just ASCII, then I'd prefer to accept completely arbitrary bytestrings. The biggest problem with doing this is the t=json WAPI: if I'd taken this issue at all seriously when I built the webapi, I would have defined the t=json format to emit base64-encoded filecaps or something similar. (Actually, at that point I did not yet realize that JSON could not handle arbitrary binary data... if I had, I might have skipped JSON altogether and used protocol buffers or netstrings or something.)
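For illustration, a base64-in-JSON encoding of the kind described above might look like this in Python 3. The {{{rw_uri_b64}}} key and both function names are hypothetical; the real t=json format does not work this way.

```python
import base64
import json


def child_info_to_json(filecap: bytes) -> str:
    """Hypothetical: carry an arbitrary bytestring cap inside JSON."""
    encoded = base64.b64encode(filecap).decode("ascii")
    return json.dumps({"rw_uri_b64": encoded})


def filecap_from_json(doc: str) -> bytes:
    """Recover the exact original bytes from the JSON document."""
    return base64.b64decode(json.loads(doc)["rw_uri_b64"])
```

Because base64 output is plain ASCII, the cap survives the JSON layer byte-for-byte, at the cost of no longer being readable in the response.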
But one option would be to have the t=json response leave out any filecap that cannot be expressed in printable ASCII (i.e., run a regexp against it before populating the child-info dictionary, and replace it with an "unknown cap" marker if that fails). I can't remember if we covered this one during the earlier caps-from-the-future discussion.
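A minimal Python 3 sketch of that filter. The marker string, the function name, and the exact definition of "printable ASCII" (here bytes 0x21 through 0x7e, i.e. excluding space) are assumptions made for the example.

```python
import re

# Assumption: "printable ASCII" means one or more bytes in 0x21..0x7e.
PRINTABLE_ASCII = re.compile(rb"[\x21-\x7e]+")
UNKNOWN_CAP = "unknown-cap"  # hypothetical marker value


def cap_for_json(filecap: bytes) -> str:
    """Return the cap as text if it is printable ASCII, else a marker."""
    if PRINTABLE_ASCII.fullmatch(filecap):
        return filecap.decode("ascii")
    return UNKNOWN_CAP
```

A client that sees the marker knows a child exists but cannot obtain or compare its cap through this interface.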
If we go with "filecaps are UTF-8 encoding of a unicode string", then the t=json API doesn't give enough information to clients to compare the real filecaps: all they can get is {{{filecap.decode("utf-8")}}}. In addition, at some point inside the webapi, we'd have to convert the filecaps into unicode before adding them to the JSON response. I'm really nervous about the information-losing behavior of unicode conversions, and the security problems that can result.
> Note that if we do use Unicode in caps in future, we should limit the character set to characters for which normalization is not an issue. (There are big blocks of Han characters with no equivalences, for example.)
Ugh.. how can we make this safe? That is, when somebody pastes in a cap, how do we verify that it isn't using any characters in this set? Is this set even constant? When we're all speaking Lojban or Ilaksh or Marain or something in the future, won't there be new codepoints which the old code can't recognize as being non-normalizable?
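One way such a check could be written in Python 3 is sketched below; the function name is hypothetical. It also shows the limitation raised above: the answer depends on the Unicode database bundled with the interpreter, so code points assigned after that version cannot be classified correctly by old code.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def is_normalization_stable(text: str) -> bool:
    """True if no normalization form changes the string (per this database)."""
    return all(unicodedata.normalize(form, text) == text for form in FORMS)


# The verdict is only as current as the Unicode data this interpreter ships.
KNOWN_UNICODE_VERSION = unicodedata.unidata_version
```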
> A related but separate issue is how to plan for expansion to the V2/V3/etc syntaxes.
While parts of this may belong in other tickets, I think it remains relevant for this one. Your desire to plan for new things in our V1 filecaps might actually be a desire to define and implement those V2/V3 syntaxes (and improve the webapi to accept them, etc). So it may be better to leave the V1 syntax definition alone, leave certain Tahoe interfaces intolerant to the potential new forms, and declare that we'll replace those interfaces with V2+-tolerant ones before we start using those forms.
== Re: behaviour-utf8-future-caps.dpatch ==
Why the s/name/namex/g ? Did you maybe mean to say "{{{name = unicode(namex)}}}" to highlight the transition from "unicode or bytestring" to "really unicode", and then leave the other instances of "name" alone?
The {{{writecap = to_str(propdict.get("rw_uri"))}}} line performs the unicode-to-UTF8 conversion. This means that webapi users calling {{{t=mkdir-with-children}}} or {{{t=mkdir-immutable}}} are giving us unicode, not UTF-8 bytestrings (i.e. tahoe gets {{{callerwritecap.decode("utf-8").encode("utf-8")}}}, because the JSON library is doing a decode before tahoe proper sees the data). Worse yet, the decode and the encode are being done by different pieces of code (I'd hope that the JSON library uses python's {{{.decode}}} logic, but who knows?). That's the best way to implement the unicode-caps design, but it also makes it clear that this is not an exact transformation.
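The decode-then-encode path can be sketched with Python 3's standard {{{json}}} module. The function name and the sample caps are made up, and the real webapi code uses its own helpers; the point is only that what the caller sent and what the node ends up with are connected by two separate conversions.

```python
import json


def writecap_from_json(body: str) -> bytes:
    """Hypothetical stand-in for the webapi path: JSON decode, then UTF-8 encode."""
    propdict = json.loads(body)                 # JSON library yields text
    return propdict["rw_uri"].encode("utf-8")   # separate code re-encodes it


# Two different wire forms of the same text...
literal = '{"rw_uri": "URI:caf\u00e9"}'    # 'é' sent as a literal character
escaped = '{"rw_uri": "URI:caf\\u00e9"}'   # 'é' sent as a JSON \u escape

# ...and a JSON string that decodes fine but has no UTF-8 encoding at all.
lone_surrogate = '{"rw_uri": "URI:\\ud800"}'
```

Both {{{literal}}} and {{{escaped}}} collapse to the same bytestring, while {{{lone_surrogate}}} is accepted by {{{json.loads}}} and then raises {{{UnicodeEncodeError}}} in the encode step.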
I didn't review it earlier, but nodemaker.create_from_cap(name=) is weird. I'd be concerned about unicode creeping into an exception instance and then causing bytestring-only logging to break (such as when it is written to twistd.log). I'm not sure what a good solution is: I see how it's a bit easier to pass "extraneous" information down into a function that might raise an exception (and stuff it into the exception message down there), rather than e.g. catch the exception higher up (where knowing name= is a bit more natural) and somehow glue the name into the already-constructed exception object.
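A rough Python 3 sketch of the "catch it higher up" alternative. Everything here is hypothetical: the exception class, the simplified {{{create_from_cap}}} stand-in, and the use of {{{ascii()}}} to keep the message ASCII-only for bytestring-oriented logs.

```python
class UnknownCapError(Exception):
    """Hypothetical error raised for an unrecognized cap."""


def create_from_cap(cap: bytes) -> bytes:
    # Simplified stand-in: knows nothing about the child's name.
    if not cap.startswith(b"URI:"):
        raise UnknownCapError("unrecognized cap")
    return cap


def create_child(name: str, cap: bytes) -> bytes:
    """The caller knows the name, so it adds the name to the error here."""
    try:
        return create_from_cap(cap)
    except UnknownCapError as e:
        # ascii() escapes non-ASCII characters, so the message stays ASCII.
        raise UnknownCapError("%s (child name: %s)" % (e, ascii(name))) from e
```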
== Re: test-utf8-future-caps.dpatch ==
Hrm, could you reduce the instances of "failUnlessReallyEqual" to things that just test caps? Seeing it on things like {{{(c.getServiceNamed("storage").reserved_space, 0)}}} makes the patch awfully big. Hm, and if there were some clever way to make it the same length as "failUnlessEqual", that would reduce the noise even further (if you do this, which I don't think you should, note that len(assertTypeEqual)==len(failUnlessEqual)).
I don't think using {{{failUnlessReallyEqual}}} in test_dirnode.py on things like {{{set(metadata.keys())}}} does everything you want it to: it will assert that both sides are of type Set, but it won't assert that the members of those sets are both of type string.
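A sketch of a member-aware comparison in Python 3, with a hypothetical helper name. In Python 3, str and bytes never compare equal, so the example uses {{{1}}} and {{{1.0}}} as a stand-in for values that are equal but of different types (the role played by {{{"a"}}} and {{{u"a"}}} in Python 2).

```python
def really_equal_sets(a, b) -> bool:
    """True only if container types match and members match in type and value."""
    if type(a) is not type(b):
        return False
    # Pair each member with its type so equal-but-differently-typed
    # members (e.g. 1 and 1.0) no longer collapse together.
    return {(type(x), x) for x in a} == {(type(x), x) for x in b}
```

For example, {{{{1} == {1.0}}}} is true under plain set equality, but {{{really_equal_sets({1}, {1.0})}}} is false.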
In test_dirnode.py, I would call the new variables "future_unicode_write_uri", rather than "future_nonascii_write_uri", to make it clear that this is one possible direction (and that there are others).
== Conclusions ==
behaviour-utf8-future-caps.dpatch: yes, this patch is pretty harmless, I don't mind it going in.
test-utf8-future-caps.dpatch: I see no problems with the patch per se, but I think the examples it uses set a bad precedent, by causing anyone reading the test to believe that tahoe's future caps will be unicode, which I think is a bad idea.
I don't object to these two patches going in, but I will continue to object to the idea that the filecaps accepted by our existing interfaces (and stored in existing dirnodes) should be defined as unicode-encoded-to-UTF8. I think the best approaches are, in order of preference:
1. continue to restrict filecaps to printable ASCII
2. define filecaps as arbitrary bytestrings and replace the t=json WAPI interface which is unable to tolerate such a wide range
I don't want to define filecaps to be unicode. Unicode exists to represent strings of written human languages. Filecaps are records/structs of cryptovalues. We have more tools to manipulate printable/copypastable strings than to manipulate abstract records of cryptovalues, so expressing filecaps as strings is convenient, but we should pick the encoding to serve tahoe's needs, rather than trying to make any conceivable written-human-language string meaningful as a tahoe filecap.
That said, for users who have a solid unicode-friendly set of tools and want to tweet their filecaps, I don't object to an encoding scheme that somehow takes a filecap and expresses it as a string of unicode characters (this would be a "V4", in my V1/V2/V3 scheme from comment:11). But the tahoe interfaces that accept this need to be clearly marked, and I think the current t=json is not one of them.
--
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1051#comment:19>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized file storage grid