[tahoe-dev] Bringing Tahoe ideas to HTTP

Tue Sep 1 12:35:31 PDT 2009

I still say making names be secrets is a losing strategy. Especially
if they are actually URLs in actual web pages. Having sensitive
information in URLs is a bug.

Although I agree CAs are fatally flawed (but that TLS *the protocol*
is fine), I think having server auth + connection integrity +
connection confidentiality is "good enough". In any case, web
applications tend to have bugs far worse than the arguable
non-e2e-purity of the transport. I'm an e2e fan, but I'm happy with
TLS. The big problem is the means by which the browser recognizes
server identity and document integrity. Solving those problems (as
hard as they are) is likely to be easier than fundamentally changing
the most deployed protocol on the internet.

I take that back. The big problem is that browsers don't tell users
that HTTP service is completely unsafe. (Compare with OTR's nice "Not
Private" notice. Even better: the notice is a button you can click on
to see ways to address the problem. Lemme know when a browser does
that!)

More on document integrity: I've heard of a proposal called HTTPA
("authenticated") that is like your hash tag idea, except the hash is
stored as an attribute in an <a ...> tag: <a href="whatevs.html"
hash="DEADBEEF">Whatevs</a>. Needless to say, MD5 is useless, and so
is bothering to check these hashes unless you received them securely
in the first place (the containing page was retrieved via HTTPS, for
example).

I thought Merkle trees were a cause of bad alacrity?
http://allmydata.org/trac/tahoe/ticket/670 Even your solution is
orders of magnitude slower than is acceptable for the entire web page load
process. Granted, that's for huge files, larger than HTTP is normally
used for.

On Thu, Aug 27, 2009 at 2:02 PM, Brian Warner <warner at lothar.com> wrote:
>
> At lunch yesterday, Nathan mentioned that he is interested in seeing how
> Tahoe's ideas and techniques could trickle outwards and influence the
> design of other security systems. And I was complaining about how the
> Firefox upgrade process doesn't provide the integrity checks that I want
> (it turns out they rely upon the CA infrastructure and SSL alone, no
> end-to-end checking; the updates and releases are GPG-signed, but
> firefox doesn't check that, only humans might). And PyPI has this nice
> habit of appending "#md5=XYZ.." to the URLs of the release tarballs that
> they publish, which is (I think) automatically used by tools like
> easy_install to guard against corrupted downloads (and which I always
> use, as a human, to do the same). And Nathan mentioned a class of web
> attacks in which a page, loaded over SSL, imports something (JS, CSS,
> JPG) via a regular http: URL, and becomes vulnerable to third-parties
> who can take over the page by controlling what arrives over
> unauthenticated HTTP.
>
> So, setting aside the reliability-via-distributedness properties for a
> moment, what could we bring from Tahoe into regular HTTP and regular
> webservers that could improve the state of security on the web?
>
> == Integrity ==
>
> To start with integrity-checking, we could imagine a firefox plugin that
> validated a PyPI-style #md5= annotation on everything it loads. The rule
> would be that no action would be taken on the downloaded content until
> the hash was verified, and that a hash failure would be treated like a
> 404. Or maybe a slightly different error code, to indicate that the
> correct resource is unavailable and that it's a server-side problem, but
> it's because you got the wrong version of the document, rather than the
> document being missing altogether.
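
A minimal sketch of the check such a plugin might run before handing
content to the page. The function name, the set of accepted algorithms,
and the choice to pass unannotated URLs through are all assumptions for
illustration, not part of the proposal.

```python
import hashlib
import hmac
from urllib.parse import urldefrag

# Assumption: only these fragment keys are treated as hash annotations.
ALLOWED_ALGORITHMS = {"md5", "sha256"}


def check_annotated_download(url: str, body: bytes) -> bool:
    """Return True if body matches the #algo=HEX annotation on url.

    URLs without a recognized annotation are accepted unchanged, which
    mirrors how an unenhanced browser would behave.
    """
    _, fragment = urldefrag(url)
    algo, sep, expected = fragment.partition("=")
    if not sep or algo not in ALLOWED_ALGORITHMS:
        return True
    actual = hashlib.new(algo, body).hexdigest()
    # Constant-time comparison; hex digests are compared case-insensitively.
    return hmac.compare_digest(actual, expected.lower())
```

A caller would treat a False result as a load failure (the 404-like
error described above) and discard the body.
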
>
> This would work just fine for a flat hash: the original file remains
> untouched, only the referencing URLs change to get the new hash
> annotation. Non-enhanced browsers are unaffected: the #-prefixed
> fragment identifier is never sent to the server, and the <a name=> tag
> is fairly rare these days (and would still mostly work). Container files
> (the HTML which references the hashed documents) could be updated to
> benefit at leisure. Automation (see below) could be used to update the
> URLs in the containers whenever the referenced objects were modified.
>
> To improve alacrity on larger files, Tahoe uses a Merkle tree over
> segments of the file. This tree has to be stored somewhere (Tahoe stores
> it along with the shares, but it would be more convenient for a web site
> to not modify the source files). We could use an annotation like
> "#hashtree=ROOTXYZ;http://otherplace" to reference an external hash tree
> (with root hash XYZ). The plugin would start pulling from the source
> file and the hash tree at the same time, and not deliver any source data
> until it had been validated. The hashtree object would need to start
> with the segment size and filesize, so the tree could be computed
> properly. For very large files, you could read those parameters and then
> pull down (via a Range: header) just the parts of the Merkle tree that
> were necessary. In this case, the automation would need to create the
> hash tree file and put it in a known place each time the source file
> changes, and then update the references.
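
To make the segment-by-segment validation concrete, here is a
simplified sketch of a Merkle tree over file segments. It uses plain
SHA-256 with "leaf:"/"node:" prefixes and pads odd levels by repeating
the last hash; those choices and all function names are assumptions,
and this is not Tahoe's actual hash-tree format.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def split_segments(data: bytes, segment_size: int) -> list:
    """Cut the file into fixed-size segments (the last may be short)."""
    return [data[i:i + segment_size]
            for i in range(0, len(data), segment_size)] or [b""]


def build_levels(segments: list) -> list:
    """Return every level of the tree, leaves first, root level last."""
    level = [h(b"leaf:" + s) for s in segments]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            # Pad odd levels by repeating the last hash.
            level = level + [level[-1]]
            levels[-1] = level
        level = [h(b"node:" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def proof_for(levels: list, index: int) -> list:
    """Collect the sibling hashes needed to validate one segment."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])
        index //= 2
    return path


def verify_segment(segment: bytes, index: int, path: list, root: bytes) -> bool:
    """Recompute the root from one segment plus its sibling path."""
    node = h(b"leaf:" + segment)
    for sibling in path:
        if index % 2:
            node = h(b"node:" + sibling + node)
        else:
            node = h(b"node:" + node + sibling)
        index //= 2
    return node == root
```

The client only needs the root (from the URL) and the sibling path for
each segment, so it can release segment 0 to the caller before the rest
of the file has arrived.
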
>
> (note that "ROOTXYZ" provides the "identification" properties of this
> annotation, and "http://otherplace" provides the "location" properties,
> where identification means the ability to recognize the correct document
> if someone gives it to you, and location means the ability to retrieve a
> possibly-correct document. URIs provide identification; URLs are
> supposed to provide both.)
>
> We could compress this by establishing an (overridable) convention that
> http://example.com/foo.mp3 always has a hashtree at
> http://example.com/foo.mp3.hashtree, resulting in a URL that looked like
> "http://example.com/foo.mp3#hashtree=ROOTXYZ". If you needed to store it
> elsewhere, you could use "#hashtree=ROOTXYZ;WHERE", and define WHERE to
> be a relative URL (with a default value of NAME.hashtree).
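
A sketch of how a client might split such a URL into its
identification and location parts, applying the NAME.hashtree default.
The function name and return shape are assumptions.

```python
from urllib.parse import urldefrag, urljoin


def parse_hashtree_url(url: str):
    """Return (source_url, root_hash, hashtree_url), or None if the URL
    carries no #hashtree= annotation."""
    base, fragment = urldefrag(url)
    prefix = "hashtree="
    if not fragment.startswith(prefix):
        return None
    root, _, where = fragment[len(prefix):].partition(";")
    if not where:
        # Default convention: NAME.hashtree next to the source file.
        where = base.rsplit("/", 1)[-1] + ".hashtree"
    # WHERE may be relative or absolute; urljoin handles both.
    return base, root, urljoin(base, where)
```

For "http://example.com/foo.mp3#hashtree=ROOTXYZ" this yields the
source URL, "ROOTXYZ", and "http://example.com/foo.mp3.hashtree".
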
>
> == Mutable Integrity ==
>
> Zooko and I have both run HTML presentations out of a Tahoe grid (which
> makes for a great demo), and the first thing you learn there is that
> immutability, while a great property in some cases, is a hassle for
> authoring. You need mutability somewhere, and the more places you have
> it, the fewer URLs you have to update every time you change something.
> In technical terms, you frequently want to cut down the diameter of the
> immutable domains of the object DAG, by splitting those domains with
> mutable boundary nodes. In practical terms, it means you might want to
> publish *everything* via a mutable file. At the very least, if your web
> site has any internal cycles in it, you'll need a mutable node to break
> the cycle.
>
> Again, this requires data beyond the contents of the source file. We
> could use a "#sigkey=XYZ" annotation with a base62'ed ECDSA pubkey (this
> would provide the "identification" property of the constant pubkey), but
> we'd still need to know where to get the actual signature (the
> "location" property of the variable signature). We could do
> "#sigkey=XYZ;sigurl=http://otherplace". Or we could establish a
> convention of keeping the signature files next to the source files with
> "#sigkey=XYZ;sigsuffix=.sig" (and then http://example.com/main.css would
> have its signature stored in http://example.com/main.css.sig). Or,
> compress the convention further and have "sigkey=" imply
> "sigsuffix=.sig" unless overridden.
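
The three variants above (explicit sigurl, explicit sigsuffix, implied
".sig") could be resolved roughly like this. The function name is an
assumption, and the sketch assumes the encoded key contains neither ";"
nor "=".

```python
from urllib.parse import urldefrag, urljoin


def signature_location(url: str):
    """Return (pubkey_text, signature_url), or None without #sigkey=."""
    base, fragment = urldefrag(url)
    # Split "sigkey=XYZ;sigurl=..." into a dict on the first "=" of each part.
    params = dict(part.partition("=")[::2] for part in fragment.split(";"))
    if "sigkey" not in params:
        return None
    if "sigurl" in params:
        sig_url = urljoin(base, params["sigurl"])
    else:
        # "sigkey=" alone implies "sigsuffix=.sig".
        sig_url = base + params.get("sigsuffix", ".sig")
    return params["sigkey"], sig_url
```

Only the location of the signature is resolved here; fetching it and
checking it against the pubkey would need a real signature library.
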
>
> This would involve two GETs, but they'd be done in parallel, and the
> original files would remain untouched (thus unaware browsers would be
> unaffected, obliviously content in their insecurity). The immutable
> "#hashtree=" would also involve two parallel GETs, but presumably it'd
> only be used for large files, in which the overhead would be less
> noticeable. Whereas the mutable "#sigkey=" would be used for even small
> files, so you might notice the overhead more.
>
> The .sig file would probably contain a copy of the pubkey too, for local
> verification purposes. If we used a signature scheme that didn't give us
> short-enough pubkeys, the .sig file would contain the whole pubkey, and
> the #sigkey=XYZ suffix would contain its hash.
>
> == Encryption ==
>
> Now, how could we provide fine-grained confidentiality? We all know how
> broken the SSL+CA model is. Tahoe uses per-object encryption keys that
> are tightly bound to the object identifiers, providing obj-cap
> properties (like fine-grained delegation) and also honoring the
> end-to-end argument.
>
> Obviously, this step requires abandoning the unmodified browser. Goodbye
> unmodified browser! Now, the plugin-enhanced browsers that are left can
> recognize a new URL scheme. Let's call it "x-yzzy:" for now (I don't
> want to use "tahoe:" for this purpose, since I still want that for
> *distributed* secure files). These URLs will look like
> "x-yzzy://example.com/READKEY.UEBHASH", and behave just like Tahoe
> immutable readcaps for 1-of-1 encoded files except they reference the
> single host where you can get the sole share (instead of permuting an
> out-of-band serverlist to find a set of likely places for k shares). The
> READKEY would be hashed to form a storage-index, then the plugin would
> fetch http://example.com/STORAGEINDEX (base64-encoded), which would
> contain an encrypted+hashed version of the plaintext. The hash
> information would include both a flat hash and a Merkle tree, covered by
> a UEB just like in Tahoe (except we could drop the block hash tree since
> k=1).
>
> For mutable files, the URL would be "x-yzzy://example.com/MUTREADKEY",
> which would be even shorter (2*kappa instead of (1+2)*kappa, if I'm
> remembering the necessary length of the hash correctly). Again,
> MUTREADKEY is hashed to form a storage-index, the corresponding
> ciphertext+hashes+signature file is fetched, the hashes checked, the
> signature checked, the data decrypted, and delivered to the caller.
>
> Web servers would be completely unaffected: they'd just have directories
> full of base64-encoded (or base62, or a modified base64 without "/", or
> whatever) filenames, which they serve to anyone who cares. All GETs
> would use unencrypted http, since this protocol would provide both
> integrity and confidentiality.
>
> Oh, and the rule would be that the storage-index would be treated as a
> URL relative to the http equivalent of the original x-yzzy URL. So
> "x-yzzy://example.com/subdir/READKEY.UEBHASH" would get an encrypted
> blob from "http://example.com/subdir/STORAGEINDEX".
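
A sketch of that mapping from an x-yzzy: URL to the plain http URL of
the encrypted blob. The storage-index derivation here (SHA-256 over a
made-up tag plus the key text, truncated to 16 bytes, URL-safe base64)
is a stand-in chosen for illustration, not Tahoe's real derivation, and
the function name is hypothetical.

```python
import base64
import hashlib
from urllib.parse import urlsplit, urlunsplit


def storage_url(cap_url: str) -> str:
    """Map x-yzzy://HOST/DIR/READKEY[.UEBHASH] to http://HOST/DIR/STORAGEINDEX."""
    parts = urlsplit(cap_url)
    if parts.scheme != "x-yzzy":
        raise ValueError("not an x-yzzy URL: %r" % cap_url)
    directory, _, cap = parts.path.rpartition("/")
    # Immutable caps are READKEY.UEBHASH; mutable caps are just MUTREADKEY.
    readkey = cap.split(".", 1)[0]
    # Assumed derivation: the server sees only this hash, never the key.
    digest = hashlib.sha256(b"storage-index:" + readkey.encode("ascii")).digest()[:16]
    storage_index = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return urlunsplit(("http", parts.netloc, directory + "/" + storage_index, "", ""))
```

The URL-safe alphabet avoids "/" in filenames, which is one of the
encoding options mentioned above.
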
>
> == Tools ==
>
> You'd start with a hashing tool: given a file, emit the "#hash=XYZ"
> suffix that should be tacked on to the URL. Or, given a URL prefix and
> a webroot-relative filename, emit the whole URL.
>
> Then you'd move on to the Merkle tree generation tool. Given FILENAME,
> it writes the hash tree data to FILENAME.hashtree, and emits the
> "#hashtree=XYZ" suffix that you need to attach to the URL.
>
> The mutable-file tool would maintain an out-of-webroot file mapping
> pubkey to privkey. It would create a new keypair when run on a file that
> did not already have a .sig file, or would extract the old pubkey from
> an existing .sig file and look up the corresponding signing key. It
> would emit the #sigkey=XYZ suffix, and update or create the .sig file
> (next to the original data file) with the new signature.
>
> The encryption+immutable tool would take a file (from your source
> directory, which of course would *not* be under the webroot), produce
> the encrypted+hashed tahoe-like single-share output data, store it in
> the webroot under the storage-index name, and emit the URL.
>
> The encryption+mutable tool would do the same, taking the existing key
> from an adjoining .key file (or creating a new one), putting the
> signed+hashed+encrypted data in the webroot, and emitting the URL.
>
> == Automation ==
>
> Now, what's a good way to update all the container files? I.e., when you
> change your CSS and it gets a new hash, how should you update the .html
> file that references it? I've been using Git a lot recently, and it gave
> me an idea:
>
>  * store your website in Git or Mercurial (you *do* manage your website
>   in a revision control system, right? and the system you picked *does*
>   give you cryptographically-strong file-version identifiers, right?)
>
>  * use regular relative URLs in the .html files that you check in; web
>   authors remain unaware of the integrity-checking suffixes that get
>   added later
>
>  * now build a tool that rewrites the HTML (and other containers, JS and
>   perhaps CSS) to replace the relative URLs with URL#hash=XYZ . The
>   tool runs at checkout time, when you deploy a new revision to the
>   webserver, or takes a git checkout (with all repository metadata) as
>   input and produces the webroot directories as output.
>
>  * The tool will build a table that says "bar.css has hash=XYZ" for
>   everything that gets checked out.
>
>  * take advantage of git's hash-of-data content-tracking properties to
>   cache the table that maps object to #hash=XYZ values: instead of "the
>   current version of bar.css has hash=XYZ", remember "version ABC of
>   bar.css will always have hash=XYZ".
>
>  * build a table that says "version ABC of foo.html references bar.css
>   and baz.js", to capture the object graph. Invert the table ("bar.css
>   is referenced by version ABC of foo.html, among others"). Now you can
>   quickly tell what files need rewriting when bar.css is modified. New
>   versions of foo.html get rescanned, added to the who-references-whom
>   table, then processed (hashed) and added to the whats-your-hash
>   table, then anyone who references it gets updated.
>
>  * keep careful track of containers (objects which reference other
>   objects). If bar.css imports booze.css, then while the original
>   contents of bar.css might not change, the annotated version (which
>   includes "booze.css#hash=XYZ") will change whenever booze.css
>   changes. The tables must reflect this, so that the updating scheme
>   will catch everything.
>
>  * the last step should be a sanity check, walking through all the
>   output files, and comparing the #hash=XYZ values therein with the
>   actual hashes of the other output files.
>
>  * the generated tables can be used to alert you to immutable-reference
>   cycles, which are a no-no, and require mutability somewhere to break
>   the circle and turn the graph back into a strict DAG.
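
The core of the rewriting steps above can be sketched in a few lines:
build the who-references-whom table, process referenced files before
their containers, and fail on cycles. This is an in-memory toy under
several assumptions: it only looks at href= and src= attributes (not
CSS imports), uses SHA-256, needs Python 3.9+ for graphlib, and skips
the git-backed caching entirely. All names are hypothetical.

```python
import hashlib
import re
from graphlib import CycleError, TopologicalSorter

# Relative references only: no scheme (":") and no existing fragment ("#").
REF = re.compile(r'(href|src)="([^"#:]+)"')


def annotate_site(files: dict) -> dict:
    """Map {name: text} to {name: text with #hash= added to local refs}.

    Raises graphlib.CycleError if the reference graph is not a DAG.
    """
    refs = {
        name: {m.group(2) for m in REF.finditer(text) if m.group(2) in files}
        for name, text in files.items()
    }
    hashes = {}
    output = {}
    # Referenced files come first, so a container is hashed only after
    # its own references have been annotated.
    for name in TopologicalSorter(refs).static_order():

        def add_hash(match):
            attr, target = match.groups()
            if target not in hashes:
                return match.group(0)
            return '%s="%s#hash=%s"' % (attr, target, hashes[target])

        output[name] = REF.sub(add_hash, files[name])
        hashes[name] = hashlib.sha256(output[name].encode("utf-8")).hexdigest()
    return output
```

Note that each container's hash is taken over its annotated output, not
its checked-in source, which is exactly why a change to booze.css
ripples up through bar.css to every page that references it.
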
>
> Then, when you introduce mutability, you somehow mark the filenames that
> you want to be delivered as mutable (breaking cycles and reducing
> reference-updating effort, in exchange for possibly slowing down client
> fetch times). Then this rewriting tool will treat those files
> differently at checkout, creating (or updating) mutable objects for
> them. Other files which reference the mutable ones don't need to be
> updated when they change.
>
> When you introduce encryption, the same tool is used, except it dumps
> encrypted+hashed+(sometimes-)signed storage-index-named files into the
> output directory, instead of preserving the original filenames. The
> sanity-check would need to be given the readcaps (instead of working on
> the ciphertext, obviously), but would proceed the same way.
>
> The entire process could be automated to run each time you pushed a
> change to the production branch. Authors would be unaware of the process
> (except they'd get fewer complaints about http-used-in-https
> vulnerabilities). Web servers would be unaware of the process (they're
> just serving up weirdly-named files). End users (well, at least those
> who'd installed the plugin) would be mostly unaware of the process
> (they'd just see weird URLs in their status bar, but they're starting to
> get used to that anyways). If you stick with integrity (and not
> encryption), then end users with normal browsers are mostly unaware
> (they see the #hash=XYZ suffixes, if their status bar is wide enough).
>
> I've no idea how hard it would be to write this sort of plugin. But I'm
> pretty sure it's feasible, as would be the site-building tools. If
> firefox had this built-in, and web authors used it, what sorts of
> vulnerabilities would go away? What sorts of new applications could we
> build that would take advantage of this kind of security?
>
> thoughts?
>  -Brian