[tahoe-dev] a few random thoughts about scalability -- hardware

Fri Dec 21 12:15:40 PST 2007

One way to scale up a Tahoe grid is to have cheap commodity PC  
servers with four SATA drives each.  Each drive could be 1 TB.   
Looking at a few components from my favorite component seller and  
system builder, kc-computers.com, suggests that you could buy a  
server like that for something on the order of $2000.  We think you  
can run a Tahoe node with a small fraction of an Athlon64's CPU  
cycles and in no more than 128 MiB of RAM, so one of these servers  
with 512 MiB of RAM and an Athlon64 should easily run four Tahoe  
storage server nodes.

This approach is good because the components are individually cheap
and are commodity.  However, the resulting server is a little bit
unbalanced as a Tahoe storage server: its CPU is far more powerful
than you need to run four Tahoe nodes (one for each spindle), while
its total disk space of 4 TB is small.  With 4x redundancy you would
have to run 1000 of those servers to have a 1 PB effective Tahoe
grid.
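
The server count above can be checked with a small sketch.  The
function name and the choice of decimal units (1 PB = 1000 TB) are
assumptions made for illustration; the $2000 price is the rough
estimate quoted earlier.

```python
import math

def servers_needed(effective_tb, expansion, raw_tb_per_server):
    """Servers required to hold effective_tb of user data after expansion."""
    raw_tb = effective_tb * expansion
    return math.ceil(raw_tb / raw_tb_per_server)

EFFECTIVE_TB = 1000      # 1 PB effective, decimal units (assumption)
EXPANSION = 4            # 4x redundancy
TB_PER_SERVER = 4        # four 1 TB SATA drives
PRICE_PER_SERVER = 2000  # rough estimate, USD

count = servers_needed(EFFECTIVE_TB, EXPANSION, TB_PER_SERVER)
total_cost = count * PRICE_PER_SERVER
print(count, total_cost)
```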

So perhaps it would be better to get a 2U server with more SATA  
drives.  For example, I just priced a Tyan Transport TA26 with eight  
1 TB SATA drives for $4206.  This is twice the price and twice the  
height for twice the number of 1 TB hard drives, so it isn't  
obviously a win except that there are half as many motherboards,  
CPUs, and RAM sets that you have to replace as they fail.

Interestingly, if you have more than four drives per server then you
could opt to do RAID-6 (or just RAID-5).  With 8 drives, using RAID-6
multiplies the cost by 4/3 (only six of the eight drives hold data),
but greatly reduces the occurrence of Tahoe storage node failure.  The
failure of a storage node due to hard disk failure more or less
disappears, and the only remaining failures have to do with operator
error, programmer error, some sort of operating system or motherboard
error that corrupts data, some kind of electrical/heat/kinetic event
that destroys three or more of a server's drives at once, etc.  This
sounds great, but on the other hand, for a 1 PB effective grid it
increases the initial hardware costs from $2 M to $2.66 M and
increases the number of servers from 500 to 667.
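
A rough sketch of where the 4/3 factor comes from, assuming RAID-6
gives up two drives' worth of capacity to parity.  The helper name is
hypothetical, and the baseline of 500 servers and $2 M is the 2U
configuration without RAID.

```python
import math
from fractions import Fraction

def raid6_overhead(drives):
    """Raw-to-usable capacity ratio when two drives' worth go to parity."""
    if drives <= 2:
        raise ValueError("need more than two drives")
    return Fraction(drives, drives - 2)

BASE_SERVERS = 500         # 2U servers without RAID
BASE_COST = 2_000_000      # USD, approximate

factor = raid6_overhead(8)                      # 8/6 reduces to 4/3
raid6_servers = math.ceil(BASE_SERVERS * factor)
raid6_cost = float(BASE_COST * factor)
print(factor, raid6_servers, round(raid6_cost))
```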

I really don't know anything about the costs of space, power, heat  
management, and paying an army of sysadmins to run around replacing  
failed parts of hundreds or thousands of servers.

Another option, if those considerations outweigh sheer
price-per-terabyte, is the Sun "Thumper" X4500 [1].  This offers up to
48 disks in one server, where the price per terabyte is about $1300
(contrasted with about $500 for the 1U and 2U commodity servers, or
about $667 for the 2U servers with RAID-6).  So the downside is the
price -- it costs about $5.2 M to buy enough Thumpers to store 1 PB
effective (4 PB raw -- before Tahoe erasure coding), instead of about
$2 M to buy enough 1U or 2U servers.  The advantage is density -- it
takes only 84 Thumpers total, compared to 500 or 1000 commodity
servers.
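
To put the options side by side, here is a minimal sketch that
recomputes server counts and hardware cost from the approximate
per-terabyte prices.  The 48 TB per Thumper figure is an assumption
inferred from the count of 84; the labels and function name are
illustrative only.

```python
import math

RAW_TB = 4000  # 1 PB effective at 4x expansion, decimal TB

# (raw TB per server, approximate USD per raw TB).
# 48 TB per Thumper is an assumption, not a quoted spec.
OPTIONS = {
    "1U, 4 drives": (4, 500),
    "2U, 8 drives": (8, 500),
    "Thumper, 48 drives": (48, 1300),
}

def summarize(raw_tb, tb_per_server, usd_per_tb):
    """Return (server count, total hardware cost in USD)."""
    return math.ceil(raw_tb / tb_per_server), raw_tb * usd_per_tb

summary = {name: summarize(RAW_TB, *spec) for name, spec in OPTIONS.items()}
for name, (servers, cost) in summary.items():
    print(f"{name:20s} {servers:5d} servers  ${cost / 1e6:.1f} M")
```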

Again, I don't know enough about operational costs to evaluate the  
advantage of this kind of density.

One interesting detail from Tahoe's perspective is that in any of  
these approaches (but especially in the Thumper approach) the peer  
selection algorithm ought to be extended to avoid storing multiple  
shares of the same file on the same server.

The peer selection algorithm currently avoids storing multiple shares
of the same file on the same Tahoe storage node (unless absolutely
necessary -- if there aren't any other nodes available with free
space).  Suppose you had three colos, and each colo had 28 Thumpers
in it.  Peer selection should be configured not to store more than
one share of the same file in the same colo unless necessary.  (It
would be necessary whenever n > 3, but our peer selection algorithm
also ensures an even distribution over all targets when there are
more shares than targets.)  Likewise, it should not store more than
one share of the same file on the same Tahoe storage node, or on the
same Thumper.  If k = 3 and n = 12, then each colo holds four shares,
so any single share failure can be repaired while using only in-colo
bandwidth, and to destroy a file would require the failure of all
three colos at once, or any set of 10 Thumpers at once, or some sets
of 10 Tahoe storage nodes at once.
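
The k = 3, n = 12 example can be sketched as follows.  This is not
Tahoe's peer selection code; place_shares and survives are
hypothetical helpers that simply spread shares round-robin over colos
and then over Thumpers, which is one way to get the even distribution
described above.

```python
from collections import Counter

K, N = 3, 12             # any K of the N shares reconstruct the file
COLOS = 3
THUMPERS_PER_COLO = 28

def place_shares(n, colos, thumpers_per_colo):
    """Assign each share to a (colo, thumper) pair, colos first."""
    return [(i % colos, (i // colos) % thumpers_per_colo) for i in range(n)]

def survives(placement, lost_colos, k):
    """True if at least k shares remain outside the lost colos."""
    remaining = sum(1 for colo, _ in placement if colo not in lost_colos)
    return remaining >= k

placement = place_shares(N, COLOS, THUMPERS_PER_COLO)
per_colo = Counter(colo for colo, _ in placement)

# After losing one share, a colo still holds K shares, so the lost
# share can be rebuilt from in-colo shares alone.
in_colo_repair = all(count - 1 >= K for count in per_colo.values())

# The file is lost only when fewer than K shares remain.
shares_lost_to_destroy = N - K + 1

print(dict(per_colo), in_colo_repair, shares_lost_to_destroy)
print(survives(placement, {0, 1}, K), survives(placement, {0, 1, 2}, K))
```

Under this placement every share lands on a distinct Thumper, and
losing any two colos still leaves four shares, one more than k.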

Regards,

Zooko

[1] http://www.sun.com/servers/x64/x4500/