[tahoe-dev] storage-club URLs
Ravi Pinjala
ravi at p-static.net
Tue Feb 22 19:32:07 PST 2011
On Tue, Feb 22, 2011 at 5:10 PM, Brian Warner <warner at lothar.com> wrote:
> On 2/22/11 3:38 PM, Greg Troxel wrote:
>>
>> I understand your point about how grids might be organized, but I
>> don't follow "moving to". The pubgrid is anomalous, and there are
>> volunteergrid, volunteergrid2, plus numerous unadvertised grids. So it
>> seems like we're already there.
>
> Yeah, I'm talking about making "membership in a grid" more
> distinct. We currently have no grid IDs, just the introducer.furl, and
> we really want to get rid of that. When there is no introducer to
> control things (i.e. its functionality is distributed out among all
> members of the grid), we need some other definition of membership, which
> means things like:
>
> 1: which servers I, as a client, should trust with my shares
> 2: which clients I, as a server, should accept shares from
> 3: which gateways I, as a downloader, should get shares from
>
>
>> tahoe invite user@example.com
>>
>> which packages up the grid params and sends an OpenPGP signed and
>> encrypted mail with the data, then that sounds cool; now you have to
>> do that by hand.
>
> Exactly. The "invitation" idea is just that, except hopefully "tahoe
> invite" will return a single short "invitation code", and the recipient
> will paste it in with "tahoe accept-invite $CODE", and then the two
> nodes will find each other and exchange keys and whatnot and eventually
> know about each other and everyone transitively connected to them. (this
> depends briefly upon having a pre-shared broadcast channel, maybe
> through a tahoe-lafs.org coordination server).
>
>> Here I was initially very skeptical, as I have never really understood
>> the tahoe community's conflation of storage and publishing. Perhaps
>> that's because
>>
>> * I view local storage as essentially free, and reliable and
>> survivable storage as hard.
>>
>> * I associate the "publishing in tahoe" approach with using a
>> particular pubgrid gateway, so it isn't any more reliable than a
>> traditional server
>
> Yeah, that's the thing I want to fix. The big obvious problem with
> publishing URLs that start with http://pubgrid.tahoe-lafs.org/ is that
> they depend upon that one webapi host, in addition to a quorum of
> storage servers, and DNS. Those URLs are *less* reliable than an
> ordinary apache server with a local file in /var/www .
>
> OTOH, a tahoe filecap (if you have the software to use it) is much more
> reliable than that URL: you don't need DNS, you run your own gateway,
> and you only need a quorum of storage servers to be reachable.
>
> The URLs I'm proposing *could* be more reliable than
> http://pubgrid.tahoe-lafs.org/ URLs. On the plus side, there could be
> multiple gateways. On the minus side, there's the DNS dispatcher.
>
>> * I have the impression the publish-in-tahoe activities to date are
>> as much a tahoe marketing activity as they are genuinely useful.
>
> Yeah. The dream of an "unhosted wiki" or "cloudapp" depends upon a
> protocol that lets you transparently fail over to alternate servers,
> which ordinary browsers can already speak. I think round-robin DNS
> records are the closest we currently have. Maybe someone will figure out
> how to run HTTP over IPv6 anycast addresses or something.
>
>
> Anyways, one of my unstated goals is to make it easier for individuals
> to publish data on the web. Today, if you have something you want to
> share with the world (or even just some friends), how do you do it?
> Personally, I'd probably scp it up to www.lothar.com, and hand out the
> resulting URL, but I'm lucky/crazy/stupid enough to pay a nontrivial
> amount of money each month to rent a box in colo with 24/7 connectivity.
> Most people would give it to Facebook and ask them for a URL to give to
> people. Or they'd look at the type of the thing being shared and put the
> images on flickr, or the spreadsheets on Google Docs, or use one of a
> few dozen datatype-specific hosting providers, all of which start by
> requiring an account setup process, then show you ads, crawl your
> content, or find other ways to recoup the costs of hosting your stuff.
> And all of them are susceptible to control by somebody you've never met.
>
> One of the costs of hosting that data is the disk space, but as you
> pointed out, local disk is cheap. Another cost is outbound bandwidth,
> but everybody connected to the internet has a little bit. Another cost
> is having a server available at the same time your potential downloader
> wants to see your file, which can be addressed with either a dedicated
> machine (maybe a small guru/pogo/shiva-plug), or a collection of
> machines that hand off responsibility over the course of the day.
>
> And the biggest cost, particularly for small sites, is organizational:
> the tools to manage the server, providing the easy-to-use upload form,
> the search forms, the how-to-delete-my-files forms, the edit-my-pictures
> forms. You have to pay this cost before you get a single file online, so
> for an individual who just wants to share pictures of their grandkids,
> it dwarfs all the others. My grandparents would never build a linux box
> and rent colo space to show me their pictures. But they might install an
> easy-to-use Tahoe Storage Club(tm) program.
>
> And, as I've mentioned before, I think we can address the limited
> bandwidth/uptime of home machines by letting people augment their grids
> with rented professional storage. But for small usage, that's hardly a
> requirement. And I think there are a lot of people who would like to be
> able to host data all by themselves, or collectively with a couple of
> friends, and retain control over it (i.e. minimize external
> dependencies). A bunch of the solutions depend upon a DNS dispatcher,
> but the cost to run one is pretty low, so I think we can fund it with
> donations and not wind up with a users-are-product parasitic ad-based
> service like almost every dot.com out there.
>
> In summary, I think I'm looking to provide for the low end of the
> publishing spectrum, which is currently expensive enough that most
> people are forced to use a "free" service (ironically enough).
>
>> After reading the rest of your message, I think what you're proposing
>> makes more sense, and it needs situations with one of two properties:
>>
>> government suppression of publication
>>
>> massive datasets with infrequent access, such that putting them on a
>> regular server is infeasible.
>>
>> I think the first use case is more compelling, and then the entire
>> system has to be designed against that threat model.
>>
>> I wonder if first addressing distributed introducers is in order; it
>> would seem that some DHT-type scheme for within-grid discovery might
>> also work for the publication scenario, and the tahoe-lafs.org dyndns
>> server becomes a single point of failure -- your scheme would not have
>> worked in Egypt.
>
> Yeah, I think the attack-tolerance depends upon how much software your
> clients are willing to install. If you had IP connectivity but not DNS,
> then a big DHT could be used to build an overlay network that gets you
> to all the storage servers, and then you can do normal tahoe downloads
> over that. A vanilla browser won't be able to take advantage of that,
> but if we could make a Firefox plugin that knows how to speak Tahoe and
> this DHT, then maybe we could survive stuff like DNS takedowns.
>
> I suspect that massive datasets are going to need serious servers.
>
>> There's another thorny problem lurking, which is that storage
>> accounting isn't sufficient. I've been talking to people as I head
>> towards a private grid, and one person was concerned about network
>> usage. Someone here recently expressed concern about a 60G/month usage
>> cap, and this is an obvious issue when shopping for a VPS provider. So
>> one really needs accounting for total data transfers.
>
> Yeah, good point. Part of the collection-of-gateways scheme needs to
> keep track of how much bandwidth is used by people downloading different
> files (indexed by the publisher), so the participants can monitor and
> control how their machines are being used. Bandwidth is another commons
> that needs to be managed, just like storage space. I'll add that to my
> notes.
>
>> I wonder if your proposal has somewhat the same properties as a
>> regular website as a tor hidden service. Instead, the server is
>> distributed and it's not immediately obvious who put the bits there.
>
> Hm, yeah. Like Tor hidden services, access depends upon the
> contributions of many people (the various gateways you end up
> traversing). Tor is running in "one grid to rule them all"-style, with
> a number of gateway-reliability monitoring tools to try and apply some
> mechanical and/or social pressure to keep the grid working well. I
> should read up on what their tools are like and how well they're coping
> with the tragedy-of-the-commons effect that large groups and anonymity
> usually cause.
>
> thanks!
> -Brian
There is a lot about this that I like. :D A few questions:
- When you talk about the tahoe-lafs.org dyndns service, you really
mean a federated system where anybody can host a grid on their own
domain, right? :) I feel like that's probably what you mean, but it's
not actually clear from the text.
- In that same vein, it'd be really cool if grids could use multiple
independent dyndns services, running on multiple domains. (Ideally,
they'd be domains from multiple countries, so that the grid isn't
dependent on any one country.)
- When you talk about clients recognizing the "tahoe-ness" of URLs,
what did you have in mind? Pattern-matching on the URL would be pretty
easy to implement, but fragile; the best solution I could think of
would be a Tahoe-specific TXT record on the domain, but that raises
the bar a bit for running a dyndns dispatcher. Using tahoe:// as the
URL scheme would be pretty cool, but I suppose it wouldn't work for
unmodified HTTP clients.
- Having a shared SSL key makes me nervous, if the key is going to be
encoded into publicly-visible URLs. It seems like it would be very
difficult to change the key if it leaked. Maybe a multilevel scheme
could work, where there's a secret key held by a few trusted grid
members, and individual gateways have unique private keys that are
signed by the secret grid key.
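
To make the pattern-matching option from the third question concrete,
here is a minimal Python sketch of what a client-side check might look
like. It assumes gateways serve objects under a /uri/<cap> path and that
caps begin with URI:CHK:, URI:SSK:, or URI:DIR2: (optionally with -RO).
The function name and the prefix list are illustrative assumptions, not
anything Tahoe defines for this purpose.

```python
import re
from urllib.parse import urlsplit, unquote

# Assumed cap prefixes, for illustration only; real caps come in more kinds.
CAP_RE = re.compile(r"^URI:(CHK|SSK|DIR2)(-RO)?:")


def looks_like_tahoe_url(url):
    """Guess whether an HTTP(S) URL points at a Tahoe gateway object."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    # Caps are often percent-encoded in URLs, so decode before matching.
    path = unquote(parts.path)
    if not path.startswith("/uri/"):
        return False
    # Only the first path segment after /uri/ is expected to be the cap.
    first_segment = path[len("/uri/"):].split("/", 1)[0]
    return CAP_RE.match(first_segment) is not None
```

A gateway mounted under any other path prefix (say, behind a reverse
proxy at /tahoe/uri/...) would slip past this check, which is the sort
of fragility I was worried about.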
--Ravi