[tahoe-dev] [tahoe-lafs] #999: amazon s3 backend
tahoe-lafs
trac at allmydata.org
Tue Mar 23 21:52:20 PDT 2010
#999: amazon s3 backend
--------------------------+-------------------------------------------------
Reporter: zooko | Owner:
Type: enhancement | Status: new
Priority: major | Milestone: undecided
Component: code-storage | Version: 1.6.0
Keywords: gsoc | Launchpad_bug:
--------------------------+-------------------------------------------------
Comment(by kevan):
(this is an email I sent to zooko a while ago with my thoughts on how this
should be implemented:)
First, I'll summarize, to make sure that I understand what you had in
mind. Please correct me if you disagree with any of this.
The "redundant array of inexpensive clouds" idea means extending the
current storage server in tahoe-lafs to support storage backends that
aren't what we have now (writing shares to the local filesystem). Well
actually, the redundant array of inexpensive clouds idea means doing
that, then implementing plugins for popular existing cloud storage
services -- Amazon S3 and Rackspace are two that you've mentioned, but
there are probably others (if we end up going through with this, I'll
probably email tahoe-dev so I can get an idea of what else is out
there/what else people want to see supported, in addition to my own
research).
The benefit (or at least the benefit that seems clear to me from your
explanation -- perhaps there are others that are more obvious if you run
a big tahoe-lafs installation like allmydata.com, or if you're more
familiar with tahoe-lafs than I am) is decoupling the ability of a
tahoe-lafs node to store files from its physical filesystem. So if, say,
allmydata.com were to start running tahoe-lafs nodes using S3 as a
backend, and their grid was filled, they could create more space on the
grid by buying more S3 buckets, rather than upgrading physical servers
or adding new servers (I've never used S3, but I would bet that it is
easier to buy more S3 buckets than to upgrade servers). Or, if you
wanted to create a grid without purchasing a bunch of servers, you could
run a bunch of nodes on one machine (I was thinking VMware images, but
then I started wondering whether it was even necessary to have that
level of separation between tahoe-lafs nodes -- is it? but that's not
really on topic), each mapping to a different S3 bucket or buckets.
Am I missing anything (aside from more examples)?
It seems like -- at least for S3 -- you could already sort of do this.
There are projects like s3fs, which provide a FUSE interface to an
S3 bucket (though the last file released for it is more than a year old;
it seems like there should be other projects like that, though) (edit: this
is actually wrong -- I just hadn't found the Google Code project, which is
at http://code.google.com/p/s3fs/). Using
that, you could mount your S3 bucket somewhere in the filesystem of your
server, then kajigger the basedir of the tahoe-lafs node so that it
rests in that area of the filesystem, or otherwise configure the
tahoe-lafs node to save files there. This requires more work than what
we'd eventually want with "redundant array of inexpensive clouds", of
course, and (depending on how well FUSE or other S3 interfaces play) may
only work on tahoe-lafs nodes running one Unix or another, but if an
operator got it working, it seems like they'd have most of the benefit
outlined above without any further work on my/our part.
(not that I mind working on this, of course, but I figured it would be
worthwhile to mention that)
In any case, I think implementing this would come down to two basic parts.
The first part would be adapting the existing codebase to use multiple
backends.
We already have one backend -- the filesystem backend -- which I think
should be a plugin in the same sense that the others will be plugins
(i.e., other code in tahoe-lafs can interact with a filesystem plugin
without caring very much about how or where it is storing its files --
otherwise it doesn't seem very extensible). If you accept this, then
we'd need to figure out what a backend plugin should look like. Maybe we
can make each plugin implement RIStorageServer, and leave it at that.
Then we might not need to do very much work on the existing server to
make it work with the rest of the (new) system. However, it's possible
that there is backend-independent logic in the current server
implementation that we wouldn't want to duplicate in every other backend
implementation. To address this, we could instead make a sort of
backend-agnostic storage server that implements RIStorageServer, then
make another interface for backends to implement, say IStorageProvider.
The skeletal storage server (the one implementing RIStorageServer) would
instantiate its IStorageProvider based on what the user configured, and
use it to write/read data, get statistics, and so on. Then
IStorageProvider would be a fairly simple filesystem-ish API.
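To make the second option a bit more concrete, here is a rough sketch of
what the split might look like. Everything in it is an assumption on my
part: the method names on IStorageProvider (put, get, list_shares, stats)
are made up for illustration, StorageServer is only a stand-in for the
class that would implement RIStorageServer, and I've used Python's abc
module to keep the example self-contained rather than whatever interface
machinery tahoe-lafs actually uses.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Dict, List


class IStorageProvider(ABC):
    """Hypothetical backend interface: a small filesystem-ish API."""

    @abstractmethod
    def put(self, storage_index: str, shnum: int, data: bytes) -> None:
        """Store one share."""

    @abstractmethod
    def get(self, storage_index: str, shnum: int) -> bytes:
        """Return the contents of one share."""

    @abstractmethod
    def list_shares(self, storage_index: str) -> List[int]:
        """Return the share numbers held for a storage index."""

    @abstractmethod
    def stats(self) -> Dict[str, int]:
        """Return backend statistics."""


class FilesystemProvider(IStorageProvider):
    """The existing behaviour (shares on local disk), recast as a plugin."""

    def __init__(self, root: str) -> None:
        self.root = root

    def _path(self, storage_index: str, shnum: int) -> str:
        return os.path.join(self.root, storage_index, str(shnum))

    def put(self, storage_index: str, shnum: int, data: bytes) -> None:
        os.makedirs(os.path.join(self.root, storage_index), exist_ok=True)
        with open(self._path(storage_index, shnum), "wb") as f:
            f.write(data)

    def get(self, storage_index: str, shnum: int) -> bytes:
        with open(self._path(storage_index, shnum), "rb") as f:
            return f.read()

    def list_shares(self, storage_index: str) -> List[int]:
        directory = os.path.join(self.root, storage_index)
        if not os.path.isdir(directory):
            return []
        return sorted(int(name) for name in os.listdir(directory))

    def stats(self) -> Dict[str, int]:
        total = 0
        for dirpath, _dirs, files in os.walk(self.root):
            for name in files:
                total += os.path.getsize(os.path.join(dirpath, name))
        return {"bytes_stored": total}


class StorageServer:
    """Backend-agnostic server; stand-in for the RIStorageServer implementer.

    Logic that does not depend on where shares live would go here, so it
    is not duplicated in every backend.
    """

    def __init__(self, provider: IStorageProvider) -> None:
        self.provider = provider

    def write_share(self, storage_index: str, shnum: int, data: bytes) -> None:
        self.provider.put(storage_index, shnum, data)

    def read_share(self, storage_index: str, shnum: int) -> bytes:
        return self.provider.get(storage_index, shnum)


if __name__ == "__main__":
    server = StorageServer(FilesystemProvider(tempfile.mkdtemp()))
    server.write_share("exampleindex", 0, b"share data")
    print(server.read_share("exampleindex", 0))
```

An S3 or Rackspace backend would then be another IStorageProvider
subclass, and the server class would not need to change.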
The other part of preparation would be figuring out how to map user
configuration choices to what actually happens when a node is started.
Also, we'd want to figure out whether (and how) we need to do anything special
with the credentials that users might need to log in to their storage
backend. I'll have a better idea of how I'd implement this once I look
at the way it works for other things that users configure.
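As a placeholder until I've read that code, the mapping from
configuration to backend could be as simple as a lookup table. This is
only a sketch under assumptions: the "backend" option, the "bucket" and
"access_key_file" options, and the two backend classes are all invented
for illustration and are not existing tahoe-lafs settings. Pointing at a
separate key file rather than putting credentials in the main config is
just one possible answer to the credentials question.

```python
import configparser


class FilesystemBackend:
    """Hypothetical stand-in for the local-disk backend."""

    def __init__(self, basedir="storage"):
        self.basedir = basedir


class S3Backend:
    """Hypothetical stand-in for an S3 backend; it does not talk to S3."""

    def __init__(self, bucket, access_key_file):
        self.bucket = bucket
        # Path to a file holding the credentials (an assumed convention).
        self.access_key_file = access_key_file


# Registry mapping the configured backend name to the class to instantiate.
BACKENDS = {
    "filesystem": FilesystemBackend,
    "s3": S3Backend,
}


def backend_from_config(text):
    """Build a backend from config text; default to the filesystem backend."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    options = dict(parser["storage"]) if parser.has_section("storage") else {}
    name = options.pop("backend", "filesystem")
    try:
        factory = BACKENDS[name]
    except KeyError:
        raise ValueError("unknown storage backend: %r" % (name,))
    # Remaining options in the section become constructor arguments.
    return factory(**options)
```

With something like this, a node whose configuration does not mention a
backend keeps behaving as it does today, and an unknown backend name
fails at startup instead of at the first write.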
Naturally, all of this would require a decent amount of documentation
and testing, too.
(I'm open to other ideas, of course -- these are just what came to my
mind)
Once we have all of this worked out, the rest of this project would be
identifying what other backends we'd want in tahoe-lafs, then
documenting, implementing, and testing those. We already have Amazon S3
and Rackspace as targets -- users of tahoe-lafs will probably have their
own suggestions, and more backends will come up with more research.
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/999#comment:2>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid