[tahoe-dev] Tahoe-LAFS and Ceph: Some remarks
Alex Elsayed
eternaleye at gmail.com
Sun Mar 3 18:40:39 UTC 2013
This post contains two earlier messages: one I sent to Zooko a while ago and
then failed to post to the list afterwards, because Gmane was still pointing
at the old allmydata list address, and one I posted as a reply to a G+ post
Zooko made a while back. Posting both to the list as he requested.
First message: CRUSH and Tahoe-LAFS
---cut---
Hey, I was reading through the Tahoe-LAFS FAQ, and while looking into the
one about controlling replication based on topology (the wiki page, tickets,
etc), I noticed that there didn't seem to be any mention of CRUSH, which the
Ceph cluster filesystem (or rather, its distributed object store RADOS) uses
for this. Figured it might be worthwhile to toss you a link in case you
hadn't seen it: http://ceph.com/papers/weil-crush-sc06.pdf
Ceph/RADOS is a solution to a different problem than Tahoe-LAFS, but CRUSH
is interesting for the cases listed at the top of the wiki page because, as
long as the client has a copy of the crushmap, computing where something
goes (or comes from) is a purely local operation.
Since the crushmap is user-specified and placement is generated from it,
users can describe their topology and policies and the data is simply laid
out accordingly.
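To make the "purely local" point concrete, here's a toy Python sketch of
rendezvous-style placement over a flat device list. It is emphatically not
CRUSH - real CRUSH walks a weighted hierarchy (rows, racks, hosts...)
described by the crushmap - but it shows the key property: every node
holding the same map computes the same placement without asking anyone.

    # Toy illustration only, not CRUSH. The device list stands in for a
    # crushmap; placement is a pure function of (object name, map).
    import hashlib

    DEVICE_MAP = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]

    def place(obj_name, replicas=3, devices=DEVICE_MAP):
        """Deterministically pick `replicas` devices for obj_name."""
        def score(dev):
            digest = hashlib.sha1((obj_name + "/" + dev).encode()).hexdigest()
            return int(digest, 16)
        # Highest-random-weight (rendezvous) hashing: anyone with the same
        # device list gets the same answer, with no lookups or coordination.
        return sorted(devices, key=score, reverse=True)[:replicas]

    print(place("some-object"))  # identical output on every client

Adding or removing a device in a scheme like this only moves the objects
that hashed to it, which is the same flavour of placement stability CRUSH
is after.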
---cut---
Second message: Overall design similarities between Ceph and Tahoe-LAFS
In reply to https://plus.google.com/108313527900507320366/posts/ZrgdgLhV3NG
May ramble a bit.
---cut---
Apologies for commenting on such an ancient post, but I figured I'd drop
some info about Ceph here. Yes, I'm the same guy who sent the email about
CRUSH - I happened to come across this via the post to freedombox-devel on
not being a filesystem.
Anyway, Ceph does have something roughly analogous to introducers, called
Monitors or MON nodes. They also handle the Paxos consistency stuff, IIRC.
Ceph manages clients connecting to the cluster by letting the client pick a
monitor, any monitor, at which point it bootstraps to fuller knowledge of
the cluster. One thing the monitor does is tell the client about the other
monitors, so even if the one the client used to connect dies, nothing bad
happens (unless there aren't enough left to keep Paxos happy, that is).
Monitors are actually pretty close to what Ticket #68 seems to be hoping
for, aside from being a separate node type instead of running on every node.
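Roughly, the bootstrap pattern looks like the sketch below. This is plain
Python, not the actual Ceph wire protocol, and the addresses, the LIVE set,
and fetch_monmap() are invented placeholders purely to illustrate "ask any
seed monitor, get the full map, survive the death of your first pick".

    # Sketch of the monitor-bootstrap idea, not the real Ceph protocol.
    # The addresses and the fake monitor map below are placeholders.
    import random

    SEED_MONITORS = ["10.0.0.1:6789", "10.0.0.2:6789", "10.0.0.3:6789"]

    # Pretend cluster state: which monitors answer, and the map they return.
    LIVE = {"10.0.0.2:6789", "10.0.0.3:6789"}
    MONMAP = {"epoch": 7, "monitors": SEED_MONITORS}

    def fetch_monmap(addr):
        """Stand-in for asking one monitor for the current monitor map."""
        if addr not in LIVE:
            raise ConnectionError("monitor down")
        return MONMAP

    def bootstrap(seeds=SEED_MONITORS):
        # Any one reachable monitor is enough: it returns the full monitor
        # list, so the client keeps working even if its first pick is dead,
        # as long as enough monitors survive to keep Paxos quorum.
        for addr in random.sample(seeds, len(seeds)):
            try:
                return fetch_monmap(addr)
            except ConnectionError:
                continue
        raise RuntimeError("no monitor reachable")

    print(bootstrap())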
I think you might find a lot of Ceph's design interesting - especially from
the perspective of scaling Tahoe. For one, thinking of it purely as a
filesystem actually misses a lot of its capabilities. The part of Ceph
that's really fascinating is the underlying object store, RADOS.
It's surprisingly close to Tahoe, as a matter of fact - placement of objects
can be computed on any node via a function, so the client can know where
stuff is going without talking to some sort of central server or DHT. (They
have an optimization where the client writes to a single OSD and lets that
OSD handle distributing the replicas, but that's an implementation choice,
not a core design element.)
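For a flavour of what that looks like from the client side, a write through
the librados Python bindings goes roughly like this - the conffile path,
pool name, and object name are placeholders, and the API details are from
memory, so treat it as a sketch rather than gospel:

    # Rough librados example (python-rados bindings assumed installed).
    # The pool 'data', conffile path, and object name are placeholders.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()                    # bootstraps via the monitors

    ioctx = cluster.open_ioctx('data')   # an I/O context for one pool
    # The client maps the object name to OSDs itself (via the CRUSH map it
    # got from the monitors) and talks to those OSDs directly -- there is
    # no lookup against a central metadata service.
    ioctx.write_full('greeting', b'hello from a rados client')
    print(ioctx.read('greeting'))

    ioctx.close()
    cluster.shutdown()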
The Ceph MDS nodes aren't part of RADOS - their big role is putting POSIX
semantics on top of the object store, and doing some fancy caching/load
balancing of (POSIX-y) metadata for performance.
RADOS itself is an object storage cluster that replicates data among a
configurable arrangement of nodes, has (in effect) clustered introducers,
and is often accessed through a gateway that makes the protocol look like
S3 or Swift. Aside from encryption, that means some of the things on
Tahoe's wiki and proposed-enhancements list look kinda familiar at times...
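As a concrete example of the gateway side, talking to radosgw with ordinary
S3 tooling looks something like the following. This assumes the classic
boto library, and the host, credentials, and bucket name are all
placeholders:

    # Talking to the RADOS gateway through its S3-compatible API.
    # Host, credentials, and bucket name below are placeholders.
    import boto
    import boto.s3.connection

    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY_PLACEHOLDER',
        aws_secret_access_key='SECRET_KEY_PLACEHOLDER',
        host='radosgw.example.com',
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    bucket = conn.create_bucket('demo-bucket')
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('stored as RADOS objects behind the gateway')
    print([k.name for k in bucket.list()])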
I'd recommend checking out the 2013 Linux.conf.au talks on Ceph - the one on
OpenStack goes over some of the other non-POSIX-fs ways they're using the
underlying object store, like thin-provisioned network block devices.
---cut---