[tahoe-dev] a few random thoughts about scalability -- hardware
zooko
zooko at zooko.com
Fri Dec 21 12:15:40 PST 2007
One way to scale up a Tahoe grid is to have cheap commodity PC
servers with four SATA drives each. Each drive could be 1 TB.
Looking at a few components from my favorite component seller and
system builder, kc-computers.com, suggests that you could buy a
server like that for something on the order of $2000. We think you
can run a Tahoe node with a small fraction of an Athlon64's CPU
cycles and in no more than 128 MiB of RAM, so one of these servers
with 512 MiB of RAM and an Athlon64 should easily run four Tahoe
storage server nodes.
This approach is good because the components are individually cheap
commodity parts, but the resulting server is a little unbalanced as
a Tahoe storage server: its CPU is far more powerful than you need
to run four Tahoe nodes (one for each spindle), while its total disk
space of 4 TB is small. With 4x redundancy you would have to run
1000 of those servers to have a 1 PB effective Tahoe grid.
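The server count above can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes decimal units (1 PB = 1000 TB), and the function name is made up for illustration.

```python
import math

def servers_needed(effective_tb, expansion, raw_tb_per_server):
    """Servers required to hold effective_tb of data once it has been
    expanded by the erasure-coding factor `expansion`."""
    raw_tb = effective_tb * expansion
    return math.ceil(raw_tb / raw_tb_per_server)

# 1 PB effective, 4x redundancy, four 1 TB drives per server
print(servers_needed(1000, 4, 4))
```

Doubling the drives per server to eight halves the count to 500.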
So perhaps it would be better to get a 2U server with more SATA
drives. For example, I just priced a Tyan Transport TA26 with eight
1 TB SATA drives for $4206. This is twice the price and twice the
height for twice the number of 1 TB hard drives, so it isn't
obviously a win except that there are half as many motherboards,
CPUs, and RAM sets that you have to replace as they fail.
Interestingly, if you have more than four drives per server then you
could opt to do RAID-6 (or just RAID-5). With 8 drives, using RAID-6
increases the cost by a factor of 4/3, because only six of the eight
drives hold data, but it greatly reduces the occurrence of Tahoe
storage node failure. The failure of a storage node due to hard disk
failure more or less disappears, and the only remaining failures have
to do with operator error, programmer error, some sort of operating
system or motherboard error that corrupts data, some kind of
electrical/heat/kinetic event that destroys three or more of a
server's drives at once, etc. This sounds great, but on the other
hand, this increases the initial hardware costs from $2 M to $2.66 M
and increases the number of servers from 500 to 667.
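To make the 4/3 factor explicit, here is a small sketch of the RAID-6 arithmetic. It assumes RAID-6 gives up exactly two drives' worth of capacity to parity; the helper name is hypothetical.

```python
import math
from fractions import Fraction

def raid6_overhead(drives):
    """Raw-to-usable capacity ratio for a RAID-6 array, which spends
    two drives' worth of space on parity."""
    if drives < 4:
        raise ValueError("RAID-6 needs at least four drives")
    return Fraction(drives, drives - 2)

overhead = raid6_overhead(8)             # 8/6, i.e. 4/3
servers = math.ceil(500 * overhead)      # up from 500 eight-drive servers
cost_millions = float(2 * overhead)      # up from about $2 M
print(overhead, servers, round(cost_millions, 2))
```

The cost comes out at 8/3 of a million dollars, which is the $2.66 M quoted above when truncated.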
I really don't know anything about the costs of space, power, heat
management, and paying an army of sysadmins to run around replacing
failed parts of hundreds or thousands of servers.
Another option, if those considerations outweigh sheer
price-per-terabyte, is the Sun "Thumper" X4500 [1]. This offers up to
48 disks in one server, where the price per terabyte is about $1300
(contrasted with about $500 for the 1U and 2U commodity servers, or
about $667 for the 2U servers with RAID-6). So the downside is the
price: it costs about $5.2 M to buy enough Thumpers to store 1 PB
effective (4 PB raw, which Tahoe's 4x erasure-coding expansion turns
into 1 PB of stored files), instead of about $2 M to buy enough 1U or
2U servers. The advantage is density: it takes only 84 Thumpers
total, compared to 500 or 1000 commodity servers.
Again, I don't know enough about operational costs to evaluate the
advantage of this kind of density.
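For a side-by-side view, the options can be tabulated as below. This is a rough sketch using the approximate per-terabyte prices quoted above. It assumes decimal units and a Thumper populated with 48 drives of 1 TB each, which is inferred from the count of 84 rather than stated. The names are illustrative.

```python
import math

RAW_TB_NEEDED = 4000  # 1 PB effective at 4x expansion, decimal units

# name: (approx. dollars per usable TB, usable TB per server)
options = {
    "1U, 4 drives": (500, 4),
    "2U, 8 drives": (500, 8),
    "2U, 8 drives, RAID-6": (667, 6),
    "Thumper X4500": (1300, 48),  # assumption: 48 x 1 TB drives
}

def summarize(price_per_tb, tb_per_server, raw_tb=RAW_TB_NEEDED):
    """Return (server count, total hardware cost in dollars)."""
    servers = math.ceil(raw_tb / tb_per_server)
    cost = price_per_tb * raw_tb
    return servers, cost

for name, (price, tb) in options.items():
    servers, cost = summarize(price, tb)
    print(f"{name:22s} {servers:5d} servers  ${cost / 1e6:.2f} M")
```

This covers hardware purchase price only; it says nothing about space, power, cooling, or staffing.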
One interesting detail from Tahoe's perspective is that in any of
these approaches (but especially in the Thumper approach) the peer
selection algorithm ought to be extended to avoid storing multiple
shares of the same file on the same server.
The peer selection algorithm currently avoids storing multiple shares
of the same file on the same Tahoe storage node (unless absolutely
necessary -- if there aren't any other nodes available and with free
space). Suppose you had three colos, and each colo had 28 Thumpers
in it. The peer selection should be configured not to store more
than one share of the same file in the same colo unless necessary.
(It would be necessary whenever n > 3, but our peer selection
algorithm also ensures an even distribution over all targets when
there are more shares than targets.) Likewise, it should not store
more than one share of the same file on the same Tahoe storage node
or on the same Thumper. If k = 3 and n = 12, then each colo holds
four shares, so any single share failure can be repaired while using
only in-colo bandwidth, and to destroy a file would require the
failure of all three colos at once, or of 10 of the Thumpers holding
its shares at once, or of 10 of the Tahoe storage nodes holding its
shares at once.
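The three-colo example can be played through with a toy model. To be clear, this is not Tahoe's peer selection code: it is a minimal sketch that assumes a simple round-robin over colos and then over the Thumpers within each colo, and all names in it are invented for illustration.

```python
from collections import Counter

def place_shares(n, colos):
    """Spread n shares round-robin over colos, using a different
    Thumper for each share within a colo while any are unused.
    colos maps a colo name to its list of Thumper names."""
    used = Counter()
    placement = []
    colo_names = sorted(colos)
    for share in range(n):
        colo = colo_names[share % len(colo_names)]
        thumpers = colos[colo]
        placement.append((colo, thumpers[used[colo] % len(thumpers)]))
        used[colo] += 1
    return placement

def survives(placement, k, failed_colos=(), failed_thumpers=()):
    """True if at least k shares remain on hardware that has not failed."""
    alive = sum(
        1
        for colo, thumper in placement
        if colo not in failed_colos and (colo, thumper) not in failed_thumpers
    )
    return alive >= k

colos = {f"colo{c}": [f"thumper{t}" for t in range(28)] for c in range(3)}
placement = place_shares(12, colos)
per_colo = Counter(colo for colo, _ in placement)
print(per_colo)
print(survives(placement, 3, failed_colos={"colo0", "colo1"}))
print(survives(placement, 3, failed_thumpers=set(placement[:10])))
```

With k = 3 and n = 12 each colo ends up with four shares on four different Thumpers, the file survives the loss of any two colos, and it is lost once 10 of the 12 share-holding Thumpers fail.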
Regards,
Zooko
[1] http://www.sun.com/servers/x64/x4500/