[tahoe-dev] [rest-discuss] character encodings and binary data and URLs

Daniel Yokomizo daniel.yokomizo at gmail.com
Thu May 15 07:17:22 PDT 2008


On Thu, May 15, 2008 at 10:02 AM, zooko <zooko at zooko.com> wrote:
> Dear people of rest-discuss:
>
> What's the right way to specify character encoding in URLs, POST
> forms, and JSON-encoded data?
>
> I work on an open source secure, decentralized filesystem -- the
> "Tahoe" Least-Authority Filesystem [1] -- with a RESTful API [2].
>
> We need to decide, when the user specifies a filename, either in the
> URL or in a POST form, or when the server returns a filename to
> the user in an HTTP response, what character encoding to use.
>
> Our current rules are like this:
>
> 1.  Filenames in URLs are always utf-8 encoded.  So after splitting
> on "/" to get individual segments, we utf-8 decode each segment
> before doing anything else with it.

This is a reference (together with RFC1738 and RFC3986) I use to solve
URL encoding issues:
http://www.blooberry.com/indexdot/html/topics/urlencoding.htm

Assuming UTF-8 is somewhat tricky, though: you depend on clients
actually sending UTF-8. It is the default for URI schemes that follow
RFC3986, but many common URI schemes predate that RFC. AFAIK assuming
UTF-8 doesn't hurt too much.
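
As a rough illustration of rule 1, here is a minimal Python sketch. It
assumes the server has the raw, still percent-encoded path available;
the function name decode_path_segments is hypothetical and not part of
Tahoe's API. It splits on "/" first and only then percent-decodes and
UTF-8 decodes each segment, so an encoded "%2F" stays inside its
segment, and it rejects segments that are not valid UTF-8 instead of
guessing another encoding.

```python
from urllib.parse import unquote_to_bytes


def decode_path_segments(raw_path):
    """Split a raw URL path on "/" and decode each segment as UTF-8.

    Hypothetical helper for illustration only.
    """
    segments = []
    for raw in raw_path.split("/"):
        if not raw:
            # Skip the empty pieces produced by leading or doubled slashes.
            continue
        # Undo percent-encoding to get the bytes the client actually sent.
        data = unquote_to_bytes(raw)
        try:
            segments.append(data.decode("utf-8"))
        except UnicodeDecodeError:
            # The client did not send UTF-8; refuse rather than guess.
            raise ValueError("path segment is not valid UTF-8: %r" % (raw,))
    return segments


if __name__ == "__main__":
    print(decode_path_segments("/uri/caf%C3%A9/a%2Fb"))
```

A Latin-1 encoded segment such as "caf%E9" raises ValueError here,
which the server could map to a 400 response.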

> 2.  POST forms have a _charset field which specifies the encoding of
> all the values in the form, and if not present it is assumed to be
> utf-8.

The Content-Type header can carry a charset parameter (see RFC 2616,
section 14.17), so the server can inspect it to determine the charset:

Content-Type: text/html; charset=ISO-8859-4

To work around browser deficiencies, you may want to add a hidden form
field that overrides the header (e.g.
X-HTTP-Header-Override-Content-Type) and treat it as a convention on
the server, while still honoring the real header for non-browser
clients.
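
A minimal Python sketch of that convention, under a few assumptions:
the request headers and form fields are already available as plain
dicts of strings, the hidden field carries a full Content-Type value,
and utf-8 is the fallback. The helper names charset_of and
request_charset are hypothetical. The parsing itself is delegated to
the standard library's email.message.Message, which lowercases the
charset it returns.

```python
from email.message import Message

# Hidden form field name suggested above; using it is only a convention.
OVERRIDE_FIELD = "X-HTTP-Header-Override-Content-Type"


def charset_of(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, lowercased."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def request_charset(headers, form_fields, default="utf-8"):
    """Pick the charset: hidden-field override first, then the header."""
    override = form_fields.get(OVERRIDE_FIELD)
    if override:
        return charset_of(override, default)
    header = headers.get("Content-Type")
    if header:
        return charset_of(header, default)
    return default


if __name__ == "__main__":
    print(charset_of("text/html; charset=ISO-8859-4"))  # iso-8859-4
```

The override field's own name and value are plain ASCII, so the server
can read them before it knows how to decode the other fields.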

> 3.  Responses are encoded in JSON, so the filenames are in unicode.

Responses should also use Content-Type with a charset parameter.
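
A small Python sketch of such a response, assuming the body is
serialized as UTF-8. The function name json_response and the
{"filenames": [...]} shape are hypothetical, not Tahoe's actual
response format. Whether clients pay attention to a charset parameter
on application/json is also an assumption; stating it explicitly
should at least do no harm.

```python
import json


def json_response(filenames):
    """Build (headers, body) for a JSON response with an explicit charset.

    Hypothetical helper for illustration only.
    """
    # ensure_ascii=False keeps non-ASCII names as real characters, so the
    # body must be encoded with the charset declared in the header.
    text = json.dumps({"filenames": filenames}, ensure_ascii=False)
    body = text.encode("utf-8")
    headers = {
        "Content-Type": "application/json; charset=utf-8",
        "Content-Length": str(len(body)),
    }
    return headers, body


if __name__ == "__main__":
    headers, body = json_response(["caf\u00e9.txt"])
    print(headers["Content-Type"])
```

Leaving ensure_ascii at its default of True would instead escape every
non-ASCII character as \uXXXX, giving a pure-ASCII body that decodes
the same way whatever charset the client assumes.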

> It seems unfortunate to constrain users of our system to unicode
> filenames, and further to constrain them to utf-8 encoding, but I
> can't think of another alternative which doesn't leave things
> underspecified.
>
> Are these types of rules for encoding fairly standard in the REST world?
>
> Thanks!
>
> Regards,
>
> Zooko
>
> [1] http://allmydata.org
> [2] http://allmydata.org/trac/tahoe/browser/docs/webapi.txt

Best regards,
Daniel Yokomizo.

