[tahoe-dev] Tahoe-LAFS and Ceph: Some remarks
Alex Elsayed
eternaleye at gmail.com
Sun Mar 3 18:40:39 UTC 2013
This post contains two earlier messages: one I sent to Zooko a while ago and
then failed to post to the list afterwards, because Gmane was still pointing
at the old allmydata list address, and one I posted as a reply to a G+ post
Zooko made a while back. Posting both to the list as he requested.
First message: CRUSH and Tahoe-LAFS
---cut---
Hey, I was reading through the Tahoe-LAFS FAQ, and while looking into the
one about controlling replication based on topology (the wiki page, tickets,
etc), I noticed that there didn't seem to be any mention of CRUSH, which the
Ceph cluster filesystem (or rather, its distributed object store RADOS) uses
for this. Figured it might be worthwhile to toss you a link in case you
hadn't seen it: http://ceph.com/papers/weil-crush-sc06.pdf
Ceph/RADOS is a solution to a different problem than Tahoe-LAFS, but CRUSH
is interesting for the cases listed at the top of the wiki page because, as
long as the client has a copy of the crushmap, computing where something
goes (or comes from) is a purely local operation.
Since the crushmap is user-specified and placement is generated from it,
users can describe their topology and policies and the data is simply laid
out accordingly.
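To make the "purely local" point concrete, here's a toy Python sketch of
rendezvous-style placement over a flat device list. It is emphatically not
CRUSH - real CRUSH walks a weighted hierarchy (rows, racks, hosts...)
described by the crushmap - but it shows the key property: every node
holding the same map computes the same placement without asking anyone.

    # Toy illustration only, not CRUSH. The device list stands in for a
    # crushmap; placement is a pure function of (object name, map).
    import hashlib

    DEVICE_MAP = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]

    def place(obj_name, replicas=3, devices=DEVICE_MAP):
        """Deterministically pick `replicas` devices for obj_name."""
        def score(dev):
            digest = hashlib.sha1((obj_name + "/" + dev).encode()).hexdigest()
            return int(digest, 16)
        # Highest-random-weight (rendezvous) hashing: anyone with the same
        # device list gets the same answer, with no lookups or coordination.
        return sorted(devices, key=score, reverse=True)[:replicas]

    print(place("some-object"))  # identical output on every client

Adding or removing a device in a scheme like this only moves the objects
that hashed to it, which is the same flavour of placement stability CRUSH
is after.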
---cut---
Second message: Overall design similarities between Ceph and Tahoe-LAFS
In reply to https://plus.google.com/108313527900507320366/posts/ZrgdgLhV3NG
May ramble a bit.
---cut---
Apologies for commenting on such an ancient post, but I figured I'd drop
some info about Ceph here. Yes, I'm the same guy who sent the email about
CRUSH - I happened to come across this via the post to freedombox-devel on
not being a filesystem.
Anyway, Ceph does have something roughly analogous to introducers, called
Monitors or MON nodes. They also handle the Paxos consistency stuff, IIRC.
Ceph manages clients connecting to the cluster by letting the client pick a
monitor, any monitor, at which point it bootstraps to fuller knowledge of
the cluster. One thing the monitor does is tell the client about the other
monitors, so even if the one the client used to connect dies, nothing bad
happens (unless there aren't enough left to keep Paxos happy, that is).
Monitors are actually pretty close to what Ticket #68 seems to be hoping
for, aside from being a separate node type instead of running on every node.
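Roughly, the bootstrap pattern looks like the sketch below. This is plain
Python, not the actual Ceph wire protocol, and the addresses, the LIVE set,
and fetch_monmap() are invented placeholders purely to illustrate "ask any
seed monitor, get the full map, survive the death of your first pick".

    # Sketch of the monitor-bootstrap idea, not the real Ceph protocol.
    # The addresses and the fake monitor map below are placeholders.
    import random

    SEED_MONITORS = ["10.0.0.1:6789", "10.0.0.2:6789", "10.0.0.3:6789"]

    # Pretend cluster state: which monitors answer, and the map they return.
    LIVE = {"10.0.0.2:6789", "10.0.0.3:6789"}
    MONMAP = {"epoch": 7, "monitors": SEED_MONITORS}

    def fetch_monmap(addr):
        """Stand-in for asking one monitor for the current monitor map."""
        if addr not in LIVE:
            raise ConnectionError("monitor down")
        return MONMAP

    def bootstrap(seeds=SEED_MONITORS):
        # Any one reachable monitor is enough: it returns the full monitor
        # list, so the client keeps working even if its first pick is dead,
        # as long as enough monitors survive to keep Paxos quorum.
        for addr in random.sample(seeds, len(seeds)):
            try:
                return fetch_monmap(addr)
            except ConnectionError:
                continue
        raise RuntimeError("no monitor reachable")

    print(bootstrap())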
I think you might find a lot of Ceph's design interesting - especially from
the perspective of scaling Tahoe. For one, thinking of it purely as a
filesystem actually misses a lot of its capabilities. The part of Ceph
that's really fascinating is the underlying object store, RADOS.
It's surprisingly close to Tahoe, as a matter of fact - placement of objects
can be computed on any node via a function, so the client can know where
stuff is going without talking to some sort of central server or DHT. (They
have an optimization where the client writes to a single OSD and lets that
OSD handle distributing the replicas, but that's an implementation choice,
not a core design element.)
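For a flavour of what that looks like from the client side, a write through
the librados Python bindings goes roughly like this - the conffile path,
pool name, and object name are placeholders, and the API details are from
memory, so treat it as a sketch rather than gospel:

    # Rough librados example (python-rados bindings assumed installed).
    # The pool 'data', conffile path, and object name are placeholders.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()                    # bootstraps via the monitors

    ioctx = cluster.open_ioctx('data')   # an I/O context for one pool
    # The client maps the object name to OSDs itself (via the CRUSH map it
    # got from the monitors) and talks to those OSDs directly -- there is
    # no lookup against a central metadata service.
    ioctx.write_full('greeting', b'hello from a rados client')
    print(ioctx.read('greeting'))

    ioctx.close()
    cluster.shutdown()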
The Ceph MDS nodes aren't part of RADOS - their big role is putting POSIX
semantics on top of the object store, and doing some fancy caching/load
balancing of (POSIX-y) metadata for performance.
RADOS itself is an object storage cluster that replicates data among a
configurable arrangement of nodes, has (in effect) clustered introducers,
and is often accessed through a gateway that makes the protocol look like
S3 or Swift. Aside from encryption, that means some of the things on
Tahoe's wiki and proposed-enhancements list look kinda familiar at times...
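As a concrete example of the gateway side, talking to radosgw with ordinary
S3 tooling looks something like the following. This assumes the classic
boto library, and the host, credentials, and bucket name are all
placeholders:

    # Talking to the RADOS gateway through its S3-compatible API.
    # Host, credentials, and bucket name below are placeholders.
    import boto
    import boto.s3.connection

    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY_PLACEHOLDER',
        aws_secret_access_key='SECRET_KEY_PLACEHOLDER',
        host='radosgw.example.com',
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    bucket = conn.create_bucket('demo-bucket')
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('stored as RADOS objects behind the gateway')
    print([k.name for k in bucket.list()])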
I'd recommend checking out the 2013 Linux.conf.au talks on Ceph - the one on
OpenStack goes over some of the other non-POSIX-fs ways they're using the
underlying object store, like thin-provisioned network block devices.
---cut---