#992 new enhancement

Store Content-Type as part of directory entries

Reported by: jsgf Owned by: somebody
Priority: major Milestone: undecided
Component: code Version: 1.6.0
Keywords: metadata integrity Cc:
Launchpad Bug:

Description

Some apps, particularly those using the webapi, will want to associate proper content types with files.

Issue #947 proposes a complex scheme which ties metadata with the actual file object in some way.

I propose in this issue a simpler scheme:

  1. In the directory entry, have "content-type" and "content-encoding" entries in the normal metadata hash.
  2. On HTTP PUT, create a directory entry inheriting the content-type and -encoding from the PUT request (if present). If one or both are not present, then leave the entries absent.
  3. On GET, use the content-type and -encoding from the metadata if present. Otherwise use the current scheme (guess the content type from the filename, defaulting to text/plain; no encoding).
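The PUT and GET steps above can be sketched roughly as follows. This is only an illustration of the proposal, not Tahoe's actual webapi code: the function names `metadata_from_put` and `headers_for_get` are hypothetical, request headers are assumed to arrive as a dict with lower-case keys, and the standard library's `mimetypes` module stands in for "the current scheme" of guessing from the filename.

```python
import mimetypes

# Hypothetical metadata keys, taken from the proposal text.
METADATA_KEYS = ("content-type", "content-encoding")


def metadata_from_put(headers):
    """Build directory-entry metadata from PUT request headers.

    Assumes `headers` is a dict with lower-case keys. Keys that are
    missing from the request are left absent from the metadata.
    """
    return {key: headers[key] for key in METADATA_KEYS if key in headers}


def headers_for_get(filename, metadata):
    """Return (content_type, content_encoding) for a GET response.

    Stored metadata wins if present. Otherwise the content type is
    guessed from the filename, defaulting to text/plain, and no
    encoding is reported.
    """
    content_type = metadata.get("content-type")
    if content_type is None:
        guessed, _ = mimetypes.guess_type(filename)
        content_type = guessed or "text/plain"
    return content_type, metadata.get("content-encoding")
```

An entry created by an older client has no such metadata, so `headers_for_get` falls through to the filename guess, which matches the compatibility behaviour described in comment:1.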

To fill this out, the command-line tools would need some option to do nothing (current behaviour), explicitly set the type and encoding, or guess based on extension, magic-number sniffing, etc.

The content-type would be full content-type syntax with type/subtype and parameters.

Change History (9)

comment:1 Changed at 2010-03-12T05:12:58Z by jsgf

Oh, I was going to comment on backwards and forwards compatibility:

If a newer client sees old files without this metadata, then it will behave just as an older client would. If the metadata is present, it will be returned with a GET/HEAD request, exactly as normal.

If an older client reads entries with the metadata, it will ignore the metadata and behave as if it weren't there (i.e., guessing its own MIME type). It will be no worse off than it is now. It will create new entries without the metadata.

The main problem is that a mixture of old and new clients will see different metadata for the same files. This seems unavoidable.

comment:2 follow-up: Changed at 2010-03-12T06:41:50Z by davidsarah

  • Component changed from unknown to code
  • Keywords metadata integrity added
  • Owner changed from nobody to somebody
  • Summary changed from Store content-type and encoding as part of directory entries to Store Content-Type as part of directory entries
  • Type changed from defect to enhancement

#994 discusses Content-Encoding in more detail -- it is not sufficient to just store the Content-Encoding as metadata; the frontends also have to be able to decompress a compressed file in some cases.

Also, I don't think that storing Content-Encoding in edge metadata can work. The edge metadata isn't known when referring to a file directly by its cap, rather than via a directory, so the interpretation of the file as a sequence of bytes (never mind as a MIME object) would be ambiguous.

Changing this ticket to just be about Content-Type (and perhaps other edge metadata that Tahoe would not need to understand).

comment:3 in reply to: ↑ 2 Changed at 2010-03-12T17:44:40Z by jsgf

Replying to davidsarah:

Also, I don't think that storing Content-Encoding in edge metadata can work. The edge metadata isn't known when referring to a file directly by its cap, rather than via a directory, so the interpretation of the file as a sequence of bytes (never mind as a MIME object) would be ambiguous.

I see this as a feature, to an extent, as it allows the same bucket of bits to be presented in multiple ways depending on the path used to reference it. For example, you may want to refer to the same file as: "bigfile.txt.gz" with a content-type of application/gzip and no encoding, or as "bigfile.txt" with a content-type of text/plain and an encoding of gzip.
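As a rough picture of what I mean, here is the same cap referenced by two directory entries with different edge metadata. The cap string and the `children` layout are made up for illustration and are not Tahoe's real dirnode format.

```python
# Placeholder cap; a real CHK cap would be much longer.
FILE_CAP = "URI:CHK:example-placeholder"

# Hypothetical directory contents: name -> (cap, edge metadata).
children = {
    "bigfile.txt.gz": (FILE_CAP, {"content-type": "application/gzip"}),
    "bigfile.txt": (
        FILE_CAP,
        {"content-type": "text/plain", "content-encoding": "gzip"},
    ),
}


def presentation(name):
    """Return (cap, content_type, content_encoding) for one edge."""
    cap, metadata = children[name]
    return cap, metadata.get("content-type"), metadata.get("content-encoding")
```

Both names resolve to the same bytes, but the path chosen decides how they are described to the HTTP client.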

I agree that it is a pain that there's no way to have a raw cap of a file with associated metadata, but I think that's the topic of #947. I have a half-formed idea about using a DIR cap of a vestigial directory containing a single nameless entry (but with metadata) pointing to the final file. But I haven't really thought it through.

comment:4 Changed at 2012-05-17T17:30:49Z by zooko

In a conversation on Google+ it occurred to me that if we had that metadata in the directory, then the URLs to children that are served up by a directory, which currently contain a suggested filename, could also contain a suggested content-type, e.g. https://lafsgateway.zooko.com/file/URI%3ACHK%3Ags6rtdc74o4jxuv2frmni45tyu%3Apaqq4o4tquin7cqnigoogvuzsidikiss5yidsxskfurwcsl6g6ua%3A1%3A1%3A8294637/@@named=/Weinbergerinternet.mp3.wav.opus?and-by-the-way-mister-http-server-please-set-content-type=audio/ogg

comment:5 Changed at 2012-05-17T21:29:24Z by nejucomo

It is important for security for the web gateway to validate the syntax of the header in order to prevent response splitting attacks. Response splitting is an injection attack where the input spliced into a header field contains '\r\n' then possibly more headers, then possibly a complete response body.

This would allow a malicious directory (or file-cap-associated metadata) to impersonate the web gateway.

And of course for user-friendliness and defense in depth it would be nice if all clients and server-side metadata storage used the same validation parser. (i.e., "tahoe put --content-type 'barf\0\r\nWhee!' myfile" would say something about an invalid content type before attempting any network I/O.)

comment:6 follow-up: Changed at 2012-05-17T22:01:17Z by davidsarah

Content-Type syntax is defined in http://tools.ietf.org/html/rfc2045#section-5.1. It's a bit overcomplicated so I suggest just restricting to printable characters (ASCII 0x20..0x7E), and possibly imposing a maximum length. That should be sufficient to prevent splitting attacks (and buffer overflow attacks against carelessly written parsers).
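A minimal sketch of that restriction, under some assumptions: the function name `is_safe_content_type` and the 255-character limit are placeholders chosen for illustration, not an agreed value, and this check deliberately does not parse the RFC 2045 grammar.

```python
# Hypothetical limit; the ticket only says "possibly imposing a maximum length".
MAX_CONTENT_TYPE_LENGTH = 255


def is_safe_content_type(value, max_length=MAX_CONTENT_TYPE_LENGTH):
    """Accept only non-empty strings of printable ASCII (0x20..0x7E).

    Rejecting CR, LF, NUL and other control characters is what blocks
    response splitting; the length cap guards careless parsers.
    """
    if not isinstance(value, str):
        return False
    if not value or len(value) > max_length:
        return False
    return all(0x20 <= ord(ch) <= 0x7E for ch in value)
```

With this check, `"text/plain; charset=utf-8"` passes, while the `'barf\0\r\nWhee!'` example from comment:5 is rejected because it contains NUL, CR and LF.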

comment:7 in reply to: ↑ 6 Changed at 2012-05-17T22:07:16Z by davidsarah

Replying to davidsarah:

Content-Type syntax is defined in http://tools.ietf.org/html/rfc2045#section-5.1.

... except that the grammar there does not allow spaces, and in practice all implementations do allow spaces, at least after ';'.

Version 0, edited at 2012-05-17T22:07:16Z by davidsarah (next)

comment:8 Changed at 2012-05-17T22:07:41Z by nejucomo

Notice that GET /file/...@@named=... is similar to directory-associated edge metadata. A URL is an edge, even when it is not stored in a directory.

So in some sense, adding more features like @@named is similar to adding edge-associated metadata in a directory.

If we add some metadata in the @@-style and different conventions in the dirnode metadata, the interface will grow more complex and confusing over time. For that reason, I'm a fan of jsgf's "anonymous singleton directory" idea because it could replace the ad-hoc @@ requests with a single standard for directory metadata. (Maybe there are still webapi / ui headaches around this approach, though.)

comment:9 Changed at 2012-05-17T22:09:17Z by nejucomo

Maybe a better approach than a "singleton directory" is to just ensure that every kind of dirnode edge metadata is exposed to the /file/...@@ interface in a uniform way.
