[tahoe-dev] [tahoe-lafs] #999: amazon s3 backend

Tue Mar 23 21:52:20 PDT 2010

#999: amazon s3 backend
--------------------------+-------------------------------------------------
 Reporter:  zooko         |           Owner:           
     Type:  enhancement   |          Status:  new      
 Priority:  major         |       Milestone:  undecided
Component:  code-storage  |         Version:  1.6.0    
 Keywords:  gsoc          |   Launchpad_bug:           
--------------------------+-------------------------------------------------

Comment(by kevan):

 (this is an email I sent to zooko a while ago with my thoughts on how this
 should be implemented:)

 First, I'll summarize, to make sure that I understand what you had in
 mind. Please correct me if you disagree with any of this.

 The "redundant array of inexpensive clouds" idea means extending the
 current storage server in tahoe-lafs to support storage backends that
 aren't what we have now (writing shares to the local filesystem). Well,
 actually, the redundant array of inexpensive clouds idea means doing
 that, then implementing plugins for popular existing cloud storage
 services -- Amazon S3 and Rackspace are two that you've mentioned, but
 there are probably others (if we end up going through with this, I'll
 probably email tahoe-dev so I can get an idea of what else is out
 there/what else people want to see supported, in addition to my own
 research).

 The benefit (or at least the benefit that seems clear to me from your
 explanation -- perhaps there are others that are more obvious if you run
 a big tahoe-lafs installation like allmydata.com, or if you're more
 familiar with tahoe-lafs than I am) is decoupling the ability of a
 tahoe-lafs node to store files from its physical filesystem. So if, say,
 allmydata.com were to start running tahoe-lafs nodes using S3 as a
 backend, and their grid was filled, they could create more space on the
 grid by paying for more S3 storage, rather than upgrading physical servers
 or adding new servers (I've never used S3, but I would bet that it is
 easier to pay for more S3 storage than to upgrade servers). Or, if you
 wanted to create a grid without purchasing a bunch of servers, you could
 run a bunch of nodes on one machine (I was thinking vmware images, but
 then I started wondering whether it was even necessary to have that
 level of separation between tahoe-lafs nodes -- is it? but that's not
 really on topic), each mapping to a different S3 bucket or buckets.

 Am I missing anything (aside from more examples)?

 It seems like -- at least for S3 -- you could already sort of do this.
 There are projects like s3fs, which provides a FUSE interface to an S3
 bucket. (I originally thought s3fs was stale, since the last file I could
 find for it was more than a year old, but that was wrong -- I just hadn't
 found the Google Code project, which is at
 http://code.google.com/p/s3fs/.) Using
 that, you could mount your S3 bucket somewhere in the filesystem of your
 server, then kajigger the basedir of the tahoe-lafs node so that it
 rests in that area of the filesystem, or otherwise configure the
 tahoe-lafs node to save files there. This requires more work than what
 we'd eventually want with "redundant array of inexpensive clouds", of
 course, and (depending on how well FUSE or other S3 interfaces work) may
 only work on tahoe-lafs nodes running some flavor of Unix, but if an
 operator got it working, it seems like they'd have most of the benefit
 outlined above without any further work on my/our part.

 (not that I mind working on this, of course, but I figured it would be
 worthwhile to mention that)

 In any case, I think implementing this would come down to two basic parts.

 The first part would be adapting the existing codebase to use multiple
 backends.

 We already have one backend -- the filesystem backend -- which I think
 should be a plugin in the same sense that the others will be plugins
 (i.e., other code in tahoe-lafs can interact with a filesystem plugin
 without caring very much about how or where it is storing its files --
 otherwise it doesn't seem very extensible). If you accept this, then
 we'd need to figure out what a backend plugin should look like. Maybe we
 can make each plugin implement RIStorageServer, and leave it at that.
 Then we might not need to do very much work on the existing server to
 make it work with the rest of the (new) system. However, it's possible
 that there is backend-independent logic in the current server
 implementation that we wouldn't want to duplicate in every other backend
 implementation. To address this, we could instead make a sort of
 backend-agnostic storage server that implements RIStorageServer, then
 make another interface for backends to implement, say IStorageProvider.
 The skeletal RIStorageServer implementation would instantiate its
 IStorageProvider based on what the user configured, and use it to
 write/read data, get statistics, and so on. Then IStorageProvider would
 be a fairly simple filesystem-ish API.
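
 To make the second design a bit more concrete, here is a rough sketch of
 how the split might look. Everything in it is an assumption for
 illustration: the method names (write_share, read_share, list_shares),
 the StorageServer class, and the in-memory provider standing in for a
 cloud backend are all made up, and it uses Python's abc module only to
 stay self-contained, not because that is how tahoe-lafs declares
 interfaces.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class IStorageProvider(ABC):
    """Hypothetical filesystem-ish API that every backend would implement."""

    @abstractmethod
    def write_share(self, storage_index: str, shnum: int, data: bytes) -> None: ...

    @abstractmethod
    def read_share(self, storage_index: str, shnum: int) -> bytes: ...

    @abstractmethod
    def list_shares(self, storage_index: str) -> list: ...


class FilesystemProvider(IStorageProvider):
    """Stores each share as a file at basedir/<storage_index>/<shnum>."""

    def __init__(self, basedir):
        self.basedir = basedir

    def _path(self, storage_index, shnum):
        return os.path.join(self.basedir, storage_index, str(shnum))

    def write_share(self, storage_index, shnum, data):
        path = self._path(storage_index, shnum)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)

    def read_share(self, storage_index, shnum):
        with open(self._path(storage_index, shnum), "rb") as f:
            return f.read()

    def list_shares(self, storage_index):
        sharedir = os.path.join(self.basedir, storage_index)
        if not os.path.isdir(sharedir):
            return []
        return sorted(int(name) for name in os.listdir(sharedir))


class MemoryProvider(IStorageProvider):
    """Stand-in for a cloud backend (assumption): keeps shares in a dict."""

    def __init__(self):
        self._shares = {}

    def write_share(self, storage_index, shnum, data):
        self._shares[(storage_index, shnum)] = bytes(data)

    def read_share(self, storage_index, shnum):
        return self._shares[(storage_index, shnum)]

    def list_shares(self, storage_index):
        return sorted(n for (si, n) in self._shares if si == storage_index)


class StorageServer:
    """Backend-agnostic server: shared logic lives here, I/O is delegated."""

    def __init__(self, provider):
        self.provider = provider
        self.bytes_written = 0

    def store(self, storage_index, shnum, data):
        self.provider.write_share(storage_index, shnum, data)
        self.bytes_written += len(data)  # bookkeeping shared by all backends

    def fetch(self, storage_index, shnum):
        return self.provider.read_share(storage_index, shnum)

    def get_stats(self):
        return {"bytes_written": self.bytes_written}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        server = StorageServer(FilesystemProvider(tmp))
        server.store("si1", 0, b"share data")
        print(server.fetch("si1", 0), server.get_stats())
```

 The point of the split is that StorageServer never touches the disk or
 the network itself, so swapping FilesystemProvider for another provider
 doesn't require duplicating the bookkeeping.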

 The other part of preparation would be figuring out how to map user
 configuration choices to what actually happens when a node is started.
 Also, we'd want to figure out whether (and if so, how) we need to do
 anything special with the credentials that users might need to log in to
 their storage backend. I'll have a better idea of how I'd implement this
 once I look
 at the way it works for other things that users configure.
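
 As a rough idea of what mapping configuration to a backend could look
 like, here is a small sketch. The option names (backend, s3.bucket,
 s3.credentials_file, filesystem.storage_dir), the two placeholder
 classes, and the registry are all hypothetical; none of this is existing
 tahoe-lafs configuration. It assumes the credentials live in a separate
 file that the config only points to, which is one possible answer to the
 credentials question rather than a decision.

```python
import configparser


class FilesystemBackend:
    """Placeholder for the existing local-filesystem backend."""

    def __init__(self, storage_dir="storage"):
        self.storage_dir = storage_dir


class S3Backend:
    """Placeholder for a cloud backend; it does not talk to S3."""

    def __init__(self, bucket, credentials_file):
        self.bucket = bucket
        # Only the path is kept here; the secret itself stays out of the config.
        self.credentials_file = credentials_file


# Hypothetical registry mapping a configured name to a backend class.
BACKENDS = {"filesystem": FilesystemBackend, "s3": S3Backend}


def backend_from_config(text):
    """Build a backend from the (hypothetical) options in a [storage] section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["storage"] if parser.has_section("storage") else {}

    # Default to the filesystem backend so existing configs keep working.
    name = section.get("backend", "filesystem")
    if name not in BACKENDS:
        raise ValueError("unknown storage backend: %r" % (name,))

    # Only options prefixed with "<name>." are passed to the backend, so
    # unrelated options in the same section are ignored.
    prefix = name + "."
    options = {
        key[len(prefix):]: value
        for key, value in section.items()
        if key.startswith(prefix)
    }
    return BACKENDS[name](**options)


if __name__ == "__main__":
    example = """
[storage]
enabled = true
backend = s3
s3.bucket = my-shares
s3.credentials_file = private/s3secret
"""
    backend = backend_from_config(example)
    print(type(backend).__name__, backend.bucket, backend.credentials_file)
```

 Defaulting to the filesystem backend when no backend option is present
 means a node with an old configuration would start exactly as it does
 today.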

 Naturally, all of this would require a decent amount of documentation
 and testing, too.

 (I'm open to other ideas, of course -- these are just what came to my
 mind)

 Once we have all of this worked out, the rest of this project would be
 identifying what other backends we'd want in tahoe-lafs, then
 documenting, implementing, and testing those. We already have Amazon S3
 and Rackspace as targets -- users of tahoe-lafs will probably have their
 own suggestions, and more backends will come up with more research.

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/999#comment:2>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid