Opened at 2010-03-16T16:03:05Z
Last modified at 2019-09-08T22:55:09Z
#999 closed enhancement
support multiple storage backends, including amazon s3 — at Version 5
Reported by: | zooko | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | eventually |
Component: | code-storage | Version: | n/a |
Keywords: | s3-backend storage | Cc: | wilcoxjg@…, mk.fraggod@…, amontero@… |
Launchpad Bug: |
Description (last modified by davidsarah)
The focus of this ticket is (now) adapting the existing codebase to use multiple backends, rather than supporting any particular backend.
We already have one backend -- the filesystem backend -- which I think should be a plugin in the same sense that the others will be plugins (i.e.: other code in tahoe-lafs can interact with a filesystem plugin without caring very much about how or where it is storing its files -- otherwise it doesn't seem very extensible). If you accept this, then we'd need to figure out what a backend plugin should look like.
There is backend-independent logic in the current server implementation that we wouldn't want to duplicate in every other backend implementation. To address this, we could start by refactoring the existing code that reads or writes shares on disk, to use a local backend implementation supporting an IStorageProvider interface (probably a fairly simplistic filesystem-ish API).
(This involves changing the code in src/allmydata/storage/server.py that reads from local disk in its _iter_share_files() method, and also changing storage/shares.py, storage/immutable.py, and storage/mutable.py that write shares to local disk.)
At this point all the existing tests should still pass, since we haven't actually changed the behaviour.
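The refactoring step above could be sketched roughly as follows. This is only an illustration under assumptions, not the project's actual design: the class names `StorageProvider` and `LocalDiskProvider`, the method names, and the on-disk layout are all hypothetical, and `abc` is used here in place of whatever interface mechanism the real IStorageProvider would use. The sketch is also written for current Python 3.

```python
import abc
import os


class StorageProvider(abc.ABC):
    """Hypothetical stand-in for the proposed IStorageProvider interface."""

    @abc.abstractmethod
    def write_share(self, storage_index, shnum, data):
        """Store the bytes of share number `shnum` under `storage_index`."""

    @abc.abstractmethod
    def read_share(self, storage_index, shnum):
        """Return the bytes previously stored for this share."""

    @abc.abstractmethod
    def list_shares(self, storage_index):
        """Return the sorted share numbers held for `storage_index`."""


class LocalDiskProvider(StorageProvider):
    """Filesystem backend: one directory per storage index (assumed layout)."""

    def __init__(self, basedir):
        self.basedir = basedir

    def _path(self, storage_index, shnum):
        return os.path.join(self.basedir, storage_index, str(shnum))

    def write_share(self, storage_index, shnum, data):
        path = self._path(storage_index, shnum)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)

    def read_share(self, storage_index, shnum):
        with open(self._path(storage_index, shnum), "rb") as f:
            return f.read()

    def list_shares(self, storage_index):
        sharedir = os.path.join(self.basedir, storage_index)
        if not os.path.isdir(sharedir):
            return []
        return sorted(int(n) for n in os.listdir(sharedir) if n.isdigit())
```

With something of this shape in place, the server code would call the provider instead of touching the disk directly, and other backends would only need to supply the same three methods.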
Then we have to add the ability to configure new storage providers. This involves figuring out how to map user configuration choices to what actually happens when a node is started, and how the credentials needed to log into a particular storage backend should be specified. The skeletal RIStorageServer would instantiate its IStorageProvider based on what the user configured, and use it to write/read data, get statistics, and so on.
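One possible way to map configuration choices to a provider is a small registry keyed by backend name. The sketch below is an assumption-laden illustration: the `[storage]` section, the `backend`, `basedir`, `bucket` and `credentials_file` option names, and the registry itself are all hypothetical, and the factories return plain tuples in place of real provider objects. Credentials are referenced by file path rather than stored inline, which is just one of several options.

```python
import configparser

# Hypothetical registry: backend name -> factory taking the config section.
BACKENDS = {}


def register_backend(name):
    """Decorator that records a provider factory under `name`."""
    def decorator(factory):
        BACKENDS[name] = factory
        return factory
    return decorator


@register_backend("disk")
def make_disk(options):
    # Placeholder tuple; a real factory would build a provider object.
    return ("disk", options.get("basedir", "storage"))


@register_backend("s3")
def make_s3(options):
    # The secret stays in a separate file; only its path is configured here.
    return ("s3", options["bucket"], options["credentials_file"])


def provider_from_config(text):
    """Parse config text and instantiate the configured backend."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    options = parser["storage"]
    name = options.get("backend", "disk")  # default to the existing backend
    if name not in BACKENDS:
        raise ValueError("unknown storage backend: %r" % (name,))
    return BACKENDS[name](options)
```

A node configured with `backend = s3` would then get the S3 factory at startup, while a node with no `backend` option would keep using local disk.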
Naturally, all of this would require a decent amount of documentation and testing, too.
Once we have all of this worked out, the rest of this project (probably to be handled in other tickets) would be identifying what other backends we'd want in tahoe-lafs, then documenting, implementing, and testing them. We already have Amazon S3 and Rackspace as targets -- users of tahoe-lafs will probably have their own suggestions, and more backends will come up with more research.
Change History (5)
comment:1 Changed at 2010-03-16T16:03:35Z by zooko
comment:2 Changed at 2010-03-24T04:52:20Z by kevan
(this is an email I sent to zooko a while ago with my thoughts on how this should be implemented:)
First, I'll summarize, to make sure that I understand what you had in mind. Please correct me if you disagree with any of this.
The "redundant array of inexpensive clouds" idea means extending the current storage server in tahoe-lafs to support storage backends that aren't what we have now (writing shares to the local filesystem). Well actually, the redundant array of inexpensive clouds idea means doing that, then implementing plugins for popular existing cloud storage services -- Amazon S3 and Rackspace are two that you've mentioned, but there are probably others (if we end up going through with this, I'll probably email tahoe-dev so I can get an idea of what else is out there/what else people want to see supported, in addition to my own research).
The benefit (or at least the benefit that seems clear to me from your explanation -- perhaps there are others that are more obvious if you run a big tahoe-lafs installation like allmydata.com, or if you're more familiar with tahoe-lafs than I am) is decoupling the ability of a tahoe-lafs node to store files from its physical filesystem. So if, say, allmydata.com were to start running tahoe-lafs nodes using S3 as a backend, and their grid was filled, they could create more space on the grid by buying more S3 buckets, rather than upgrading physical servers or adding new servers (I've never used S3, but I would bet that it is easier to buy more S3 buckets than to upgrade servers). Or, if you wanted to create a grid without purchasing a bunch of servers, you could run a bunch of nodes on one machine (I was thinking vmware images, but then I started wondering whether it was even necessary to have that level of separation between tahoe-lafs nodes -- is it? but that's not really on topic), each mapping to a different S3 bucket or buckets.
Am I missing anything (aside from more examples)?
It seems like -- at least for S3 -- you could already sort of do this. There are projects like s3fs, which provide a FUSE interface to an S3 bucket (though the last file for it is more than a year old; it seems like there should be other projects like that, though) (edit: this is actually wrong -- I just hadn't found the Google Code project, which is at http://code.google.com/p/s3fs/). Using that, you could mount your S3 bucket somewhere in the filesystem of your server, then kajigger the basedir of the tahoe-lafs node so that it rests in that area of the filesystem, or otherwise configure the tahoe-lafs node to save files there. This requires more manual setup than what we'd eventually want with "redundant array of inexpensive clouds", of course, and (depending on how well FUSE or other S3 interfaces play) may only work on tahoe-lafs nodes running one Unix or another, but if an operator got it working, it seems like they'd have most of the benefit outlined above without any further work on my/our part.
(not that I mind working on this, of course, but I figured it would be worthwhile to mention that)
In any case, I think implementing this would come down to two basic parts.
The first part would be adapting the existing codebase to use multiple backends.
We already have one backend -- the filesystem backend -- which I think should be a plugin in the same sense that the others will be plugins (i.e.: other code in tahoe-lafs can interact with a filesystem plugin without caring very much about how or where it is storing its files -- otherwise it doesn't seem very extensible). If you accept this, then we'd need to figure out what a backend plugin should look like. Maybe we can make each plugin implement RIStorageServer, and leave it at that. Then we might not need to do very much work on the existing server to make it work with the rest of the (new) system. However, it's possible that there is backend-independent logic in the current server implementation that we wouldn't want to duplicate in every other backend implementation. To address this, we could instead make a sort of backend-agnostic storage server that implements RIStorageServer, then make another interface for backends to implement, say IStorageProvider. The skeletal RIStorageServer would instantiate its IStorageProvider based on what the user configured, and use it to write/read data, get statistics, and so on. Then IStorageProvider would be a fairly simplistic filesystem-ish API.
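To make the second option concrete, here is a rough sketch of a backend-agnostic server that keeps shared logic (space accounting and statistics, in this example) in one place and delegates only the raw storage to a provider. Everything here is hypothetical: the class names, the method names, and the choice of space accounting as the "backend-independent logic" are assumptions made for illustration, and the in-memory provider exists only to show that the server does not care where the bytes go.

```python
class InMemoryProvider:
    """Toy backend (hypothetical): keeps shares in a dict."""

    def __init__(self):
        self._shares = {}

    def write_share(self, storage_index, shnum, data):
        self._shares[(storage_index, shnum)] = data

    def read_share(self, storage_index, shnum):
        return self._shares[(storage_index, shnum)]


class StorageServer:
    """Backend-agnostic server: accounting lives here, storage in the provider."""

    def __init__(self, provider, capacity):
        self.provider = provider
        self.capacity = capacity
        self._sizes = {}  # (storage_index, shnum) -> stored size in bytes

    def _used(self):
        return sum(self._sizes.values())

    def write_share(self, storage_index, shnum, data):
        key = (storage_index, shnum)
        # Overwriting a share frees its old size before the new one counts.
        new_used = self._used() - self._sizes.get(key, 0) + len(data)
        if new_used > self.capacity:
            raise IOError("not enough space for share")
        self.provider.write_share(storage_index, shnum, data)
        self._sizes[key] = len(data)

    def read_share(self, storage_index, shnum):
        return self.provider.read_share(storage_index, shnum)

    def stats(self):
        used = self._used()
        return {"used": used, "free": self.capacity - used}
```

Swapping `InMemoryProvider` for a disk or S3 provider with the same two methods would leave `StorageServer` unchanged, which is the point of splitting the two interfaces.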
The other part of preparation would be figuring out how to map user configuration choices to what actually happens when a node is started. Also, we'd want to figure out how (if?) we need to do anything special with the credentials that users might need to log in to their storage backend. I'll have a better idea of how I'd implement this once I look at the way it works for other things that users configure.
Naturally, all of this would require a decent amount of documentation and testing, too.
(I'm open to other ideas, of course -- these are just what came to my mind)
Once we have all of this worked out, the rest of this project would be identifying what other backends we'd want in tahoe-lafs, then documenting, implementing, and testing those. We already have Amazon S3 and Rackspace as targets -- users of tahoe-lafs will probably have their own suggestions, and more backends will come up with more research.
comment:3 Changed at 2010-03-31T16:48:51Z by davidsarah
- Description modified (diff)
- Keywords backend s3 added
- Summary changed from amazon s3 backend to support multiple storage backends, including amazon s3
Generalizing this to include support for multiple backends (since I don't think we want to do it in a way that would only support S3 and local disk).
comment:5 Changed at 2010-03-31T17:17:57Z by davidsarah
- Description modified (diff)
Update description to reflect kevan's suggested approach.
See the RAIC diagram.