<div dir="ltr"><div><div><div><div><div><div>Hi, I've started working on a project that's similar to
Tahoe-LAFS, in that it's a distributed cluster of machines hosting a
bunch of files.<br><br></div>Erasure coding is important, but I've been
having trouble learning enough about the different types of erasure coding to
be confident that I can pick the best one for my needs. I was
hoping you could point me to some links or papers, or otherwise help out.<br>
<br></div><div>Ideally (in order of importance):<br><br></div><div>+ machines can participate in many clusters simultaneously (and many clusters can exist simultaneously)<br></div><div>+ 100s - 1000s of machines per cluster<br>
+
if a machine is corrupted or otherwise lost, its portion of the file
can be replaced quickly and without using much network bandwidth<br>+ low redundancy<br><br>Less important, but still important:<br><br></div><div>+ cluster size can expand or shrink dynamically<br></div><div>+ some machines only need to be online some of the time<br>
<br></div><div>I've looked at Reed-Solomon coding, which seems useful but not ideal (replacing a lost node is too expensive).<br></div><div>I've also looked at Raptor codes, which seem promising, but I don't understand them and there seem to be patent issues.<br>
<br></div><div>In general, I've had a hard time finding resources for learning about erasure codes, though persistence is slowly turning up useful ones.<br></div></div></div></div></div></div>
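<div><br>To make the Reed-Solomon concern above concrete, here is a rough sketch of the repair cost I have in mind. It assumes a plain k-of-n code with naive repair, where rebuilding one lost share means fetching k surviving shares and decoding the whole file. The function name and the example numbers are purely illustrative and not taken from any particular library.<br><br></div>

```python
def naive_repair_cost(file_size, n, k):
    """Rough cost model for a k-of-n erasure code with naive repair.

    Assumption (hypothetical model, not a real library API): the file is
    split into n shares of file_size / k bytes each, any k of which
    recover the file, and a lost share is rebuilt by downloading k
    surviving shares and re-encoding.
    """
    share_size = file_size / k          # bytes stored per machine
    downloaded = k * share_size         # bytes fetched to rebuild one share
    return {
        "share_size": share_size,
        "repair_download": downloaded,
        "repair_amplification": downloaded / share_size,  # equals k
        "storage_overhead": n / k,      # total bytes stored / file size
    }


if __name__ == "__main__":
    # Illustrative numbers only: a 1 GB file encoded 10-of-14.
    for name, value in naive_repair_cost(1_000_000_000, n=14, k=10).items():
        print(name, value)
```

<div>In this model a 10-of-14 encoding keeps redundancy low (1.4x), but replacing one machine's share means downloading ten times the size of that share, which is the trade-off I'd like to avoid.<br></div>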