[tahoe-dev] Bringing Tahoe ideas to HTTP
Brian Warner
warner at lothar.com
Thu Aug 27 23:43:01 PDT 2009
Michael Walsh wrote:
> - Adding an HTTP header with this data, but that requires something
> like a server module or output script. It also doesn't ugly up the
> URL (but then again, we have URL shortener services for manual typing).
Ah, but see, that loses the security. If the URL doesn't contain the
root hash, then you're depending upon somebody else for your
authentication, and then it's not end-to-end anymore. URL shortener
services are great for the location properties, but lousy for the
identification properties. Not only are you relying upon your DNS, and
your network, and every other network between you and the server, and
the server that's providing you with the data... now you're also dependent
upon the operator of the shortener service, and their database, and
anyone who's managed to break into their database, etc.
I guess there are three new things to add:
* a secure identifier in the URL (sketched just below)
* an upstream request, to say what additional integrity information you
want
* a downstream response, to provide that additional integrity
information
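For the first item, here's a minimal sketch in Python of what a client
might do; the "#roothash=" fragment syntax is invented for illustration
and is not a real Tahoe cap format:

    from urllib.parse import urlparse

    def parse_secure_url(url):
        # e.g. "http://example.com/photo.jpg#roothash=sha256:ab12..."
        parsed = urlparse(url)
        pairs = dict(kv.split("=", 1)
                     for kv in parsed.fragment.split("&") if "=" in kv)
        # the location part is untrusted; the root hash is what the
        # client will verify the downloaded bytes against
        location = parsed._replace(fragment="").geturl()
        return location, pairs.get("roothash")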
The stuff I proposed used extra HTTP requests for those last two.
But, if you use the HTTP request headers to ask for the extra integrity
metadata, you could use the HTTP response headers to convey it. The only
place where this would make sense is fetching the merkle tree or the
signature. (If the file is small and you only check the flat hash,
then there's nothing else to fetch; if the file is encrypted, then you
can define its data layout to be whatever you like and just include
the integrity information in it directly.)
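That exchange might look like the following sketch; the header names
("X-Integrity-Want", "X-Integrity-Nodes") are invented for
illustration, not a deployed protocol:

    import http.client

    conn = http.client.HTTPConnection("example.com")
    # upstream: ask for the extra integrity metadata via a request header
    conn.request("GET", "/photo.jpg",
                 headers={"X-Integrity-Want": "merkle-chain"})
    resp = conn.getresponse()
    body = resp.read()
    # downstream: the server conveys the hash nodes in a response header
    # (None here would mean the server didn't understand the request)
    nodes = resp.getheader("X-Integrity-Nodes")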
Oh, and that would make the partial-range request a lot simpler: the
client does a GET with a "Range: bytes=123-456" header, and the server
looks at the associated merkle tree, figures out which hash chain the
client will need to validate those bytes, and returns a header that
includes all of those hash tree nodes (plus the segment size and
filesize). And it returns enough file data to cover those segments
(i.e. the Content-Range: response would be for a larger range than the
Range: request, basically rounded out to segment boundaries). The
client would hash the segments it receives, build and verify the
merkle chain, then compare the root of the chain against the roothash
in the URL and make sure they match.
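Here's a minimal sketch of both halves, assuming SHA-256 and a wire
format in which each sibling hash arrives tagged with its left/right
position (both are assumptions; real Tahoe uses a different layout):

    import hashlib

    def round_range_to_segments(first, last, segsize):
        # server side: widen the client's Range to whole segments
        start = (first // segsize) * segsize
        end = (last // segsize + 1) * segsize - 1
        return start, end

    def verify_segment(segment, siblings, expected_root):
        # client side: siblings is a leaf-to-root list of
        # (sibling_hash, sibling_is_left) pairs from the response header
        node = hashlib.sha256(segment).digest()
        for sib, sib_is_left in siblings:
            pair = sib + node if sib_is_left else node + sib
            node = hashlib.sha256(pair).digest()
        return node == expected_root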
The response headers might get a bit large:
log2(filesize/segsize) * hashsize bytes. And you have to fetch at
least a full segment. But that's predictable and fairly well bounded,
so maybe it isn't that big of a deal.
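For example (illustrative numbers, not a Tahoe default):

    import math

    # 1 GiB file, 128 KiB segments, 32-byte SHA-256 hashes
    filesize, segsize, hashsize = 2**30, 128 * 2**10, 32
    overhead = int(math.log2(filesize / segsize)) * hashsize
    print(overhead)  # 13 tree levels * 32 bytes = 416 bytes of hashes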
Doing this with HTTP headers instead of a separate GET would avoid a
roundtrip, since the server (which now takes an active role in the
process) can decide these things (like which hash tree nodes are needed)
on behalf of the client. Instead of the client pulling one file for the
data, then pulling part of another file to get the segment size (so it
can figure out which segments it wants), then pulling a different part
of that file to get the hash nodes... the client just does a single
GET with a Range: header, and the server figures out the rest.
As you said, it requires a server module. But compared to a client-side
plugin, that's positively easy :). It'd probably be a good idea to think
about a scheme that would take advantage of a server which had
additional capabilities like this. Maybe the client could try the header
thing first, and if the response didn't indicate that the server is able
to provide that data, go and fetch the .hashtree file.
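That fallback might look like this sketch (same invented
"X-Integrity-*" headers as above; the ".hashtree" naming convention is
likewise just for illustration):

    import http.client

    def fetch_with_integrity(host, path):
        conn = http.client.HTTPConnection(host)
        conn.request("GET", path,
                     headers={"X-Integrity-Want": "merkle-chain"})
        resp = conn.getresponse()
        body = resp.read()
        nodes = resp.getheader("X-Integrity-Nodes")
        if nodes is None:
            # old server: fetch the hash tree as a separate file that
            # lives alongside the data
            conn.request("GET", path + ".hashtree")
            nodes = conn.getresponse().read()
        return body, nodes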
> My thoughts turn purely to verifying the integrity of files and all
> webpage resources in a transparent and backward-compatible way. Who
> hasn't encountered unstable connections where images get corrupted
> and CSS files don't fully load? Solving that problem would make me
> very happy!
Yeah. We've been having an interesting thread on tahoe-dev recently
about the backwards-compatibility question: what sorts of failure
modes are best to aim for when you're faced with an old server, or
whatever.
You can't make this stuff totally transparent and backwards compatible
(in the sense that existing webpages and users would start benefiting
from it without doing any work)... I think that's part of the
brokenness of the SSL+CA model, where they wanted the only change to
be starting the URL with "https" instead of "http". But you can
certainly create a framework
that lets people get better control over what they're loading. Now, if
only distributed programs could be written in languages with those sorts
of properties...
cheers,
-Brian