[tahoe-dev] Grid Design Feedback
Nathan Eisenberg
nathan at atlasnetworks.us
Mon Jun 27 09:44:44 PDT 2011
> Another solution is to add more nodes but to increase N and K. At the
> extreme, if you keep N set to the number of nodes in the system (and
> could get all files to be updated to have N shares), then the allmydata
> problem couldn't happen because all files in the system would live or
> die together. You can't actually do that, but if you keep increasing N
> and K you can ensure that for files added later it would take a very
> large number of simultaneous failures to make files unavailable.
> Allmydata used N=10,K=3, so any time 7 servers (out of hundreds!) were
> down it was likely to knock some files out. If those were dircaps, the
> problem was particularly nasty.
>
> However, if you just increase N and K you'll have a problem that the
> dircaps -- the most important files, from a reliability perspective --
> will still have their original values. As you update the directories,
> I believe that the shares will shift around the nodes, but they won't
> get any more shares... unless you use immutable directories which are
> obviously created with each update. Can that be done via the sftp
> interface?
Not sure if it can be done. Probably not?
> Just my opinion, but I think this approach ignores the strengths of
> Tahoe. Ignoring the RAID-1 for the moment, and supposing 98%
> reliability of the servers (probably conservative), that gives you a
> 0.04% probability of file loss at the cost of a 100% expansion factor.
> If, instead, you were to set K=5, N=10 (or perhaps K=4,N=8, to avoid
> write failures when a couple of machines are down), for the same
> expansion factor you get orders of magnitude lower probability of file
> loss. And I'd also consider skipping the RAID-1 and instead running
> two Tahoe servers on each machine, one per disk.
RAID-1 - I figured this might be an objection. This is an enterprise grid, and disks are cheap (and unreliable). Maybe I'm stuck in regular-filesystem land, but my instinct is to build reliability in from the block-level up. I may do some tinkering with a copy of the grid without RAID, but the corporate pressure is to build it with RAID, so I think we'll probably stick with it.
I'll be open to running more than one Tahoe server per box once share distribution control is available, and I can tell Tahoe that two shares of one file should never land on the same physical server.
Expansion factor - again, I may be stuck in regular-filesystem-land, but simple replication (for files) -feels- better to me.
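To make the quoted comparison concrete, here's a small sketch of the loss-probability math. It assumes independent server failures with 98% availability (the figure from the quote); a file with K-of-N erasure coding is lost only when fewer than K shares remain reachable.

```python
from math import comb

def loss_probability(k, n, r=0.98):
    """Probability that fewer than k of n shares survive, assuming
    each server is independently up with probability r."""
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k))

# Simple 2x replication (K=1, N=2) vs erasure coding at the same
# 100% expansion factor (K=4, N=8):
print(f"replication  K=1, N=2: {loss_probability(1, 2):.2e}")
print(f"erasure code K=4, N=8: {loss_probability(4, 8):.2e}")
```

With these assumptions, replication loses a file with probability 0.02^2 = 4e-4 (the 0.04% in the quote), while K=4, N=8 comes out around 1.7e-7 for the same storage cost, which is where the "orders of magnitude" claim comes from. Real grids have correlated failures, so treat these as best-case numbers.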
One thing I have been thinking about in relation to the allmydata issue. Since directories are so important, why not handle them differently than files? For example, would it really be too expensive to store all the directory shares on all the storage nodes, in a K=1, N=$gridsize$ manner? It just seems that this is basic filesystem metadata that should be MORE resilient than the files themselves. Dircaps are tiny, so who -really- cares if they're stored more diversely than files?
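For what it's worth, the same back-of-the-envelope math makes the K=1, N=gridsize idea look attractive. Under the same (admittedly idealized) assumption of independent failures at 98% availability, a fully replicated directory is lost only when every node in the grid is down at once:

```python
def dircap_loss_probability(grid_size, r=0.98):
    """With K=1 and one share per node, a directory is lost only if
    every node is down simultaneously (assuming independent failures)."""
    return (1 - r) ** grid_size

# A modest 10-node grid already puts full-replication dircap loss
# far below any realistic file-loss probability:
print(f"{dircap_loss_probability(10):.1e}")
```

Storage cost scales as grid_size times the (small) dircap share size, so the "who cares, they're tiny" argument holds; the real costs would be in write latency and repair traffic, since every directory update has to touch every node.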
Nathan