[tahoe-dev] erasure coding makes files more fragile, not less

Tue Mar 27 19:06:37 UTC 2012

Folks:

I've heard many stories of people losing their files from a Tahoe-LAFS
grid even though they had erasure coding parameters that provide
massive fault tolerance such as 3-of-10 or 4-of-8. In fact, I think
approximately 90% of all files that have ever been stored on a
Tahoe-LAFS grid have died. (That's excluding all of the files of all
of the customers of allmydata.com, which went out of business.)

I've been musing on this, and I just read an excellent blog rant by
Jay Kreps, the original author of Voldemort. I came up with this
provocative slogan (I know Brian loves my provocative slogans):
"erasure coding makes files more fragile, not less".

The idea behind that is that erasure coding lulls people into a false
sense of security. If K=N=1, or even if K=1 and N=2 (which is the same
fault tolerance as RAID-1), then people understand that they need to
constantly monitor and repair problems as they arise. But if K=3 and
N=10, then the beautiful combinatorial math tells you that your file
has lots of "9's" of reliability. The beautiful combinatorial math
lies! That's because it assumes that each server has some fixed and
independent chance of surviving, which is always false. ("90%" is
always a good number to use for that fixed and independent chance.
Plug "90%" into the beautiful combinatorial math with K=3 and N=10
and you'll get more "9's" than you can shake a stick at!)
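
The "beautiful combinatorial math" here is a binomial tail. A minimal
sketch, assuming (as the math does) that each of the N servers survives
independently with the same probability; the function name
file_loss_probability is made up for this illustration and is not part
of Tahoe-LAFS:

```python
from math import comb


def file_loss_probability(k, n, p_alive):
    """Chance that fewer than k of n shares survive.

    Assumes every server survives independently with probability
    p_alive, which is exactly the assumption being questioned here.
    """
    p_dead = 1.0 - p_alive
    return sum(
        comb(n, alive) * p_alive**alive * p_dead ** (n - alive)
        for alive in range(k)
    )


if __name__ == "__main__":
    # Compare no redundancy, simple mirroring, and 3-of-10 at 90%.
    for k, n in [(1, 1), (1, 2), (3, 10)]:
        loss = file_loss_probability(k, n, 0.9)
        print(f"{k}-of-{n}: predicted chance of loss = {loss:.3g}")
```

At 90% per server, the 3-of-10 case comes out at a predicted loss of
about 3.7e-7, i.e. six "9's" of survival, provided you believe the
independence assumption.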

Here's the excellent blog rant:

http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability

"""
Where is the flaw in the reasoning?
…
The problem is the assumption that failures are independent.
…
Surely no belief could possibly be more counter to our own experience
or just common sense than believing that there is no correlation
between failures of machines in a cluster.
…
The actual reliability of your system depends largely on how bug free
it is, how good you are at monitoring it, and how well you have
protected against the myriad issues and problems it has. This isn’t
any different from traditional systems, except that the new software
is far less mature.
"""

Now let's apply this idea to my empirical observations about the
longevity of files stored in Tahoe-LAFS. If almost all of the files
that have ever been stored on Tahoe-LAFS have died, this implies one
of two things:

1. The "reliability" of the storage servers must have been below K/N.
I.e. if a file was stored with 3-of-10 encoding, and each storage
server had a 75% chance of dying, then the file would be *more* likely
to die due to the erasure coding, rather than less likely to die,
because a 75% chance of dying, a.k.a. a 25% chance of staying alive,
is below the 30% of shares required to recover the file: on average
only 2.5 of the 10 shares survive, and 3 are needed.

or

2. The behavior of storage servers must not have been *independent*.
I.e. if enough of the servers failed *at once*, then the file died,
even if the chance of any individual server failing was low enough
for the erasure coding to tolerate.
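
Both explanations can be put into numbers. The sketch below is only an
illustration. For item 1 it uses the plain binomial model and holds K/N
at 0.3 while N grows. For item 2 it adds a made-up "common-mode" event
(a bug, an operator error, a provider going away) that takes out every
server at once with probability p_common. The function names and the 1%
figure are assumptions chosen for the example, not measurements from
any grid.

```python
from math import comb


def loss_independent(k, n, p_alive):
    """Chance that fewer than k of n shares survive, assuming independence."""
    p_dead = 1.0 - p_alive
    return sum(
        comb(n, alive) * p_alive**alive * p_dead ** (n - alive)
        for alive in range(k)
    )


def loss_with_common_mode(k, n, p_alive, p_common):
    """Same model plus one correlated event that kills all n servers.

    With probability p_common everything fails together; otherwise the
    servers fail independently as before.
    """
    return p_common + (1.0 - p_common) * loss_independent(k, n, p_alive)


if __name__ == "__main__":
    # Item 1: keep K/N at 0.3 and grow N, on both sides of the threshold.
    for p_alive in (0.25, 0.35):
        for k, n in [(3, 10), (30, 100), (300, 1000)]:
            loss = loss_independent(k, n, p_alive)
            print(f"p_alive={p_alive} {k}-of-{n}: loss = {loss:.3g}")

    # Item 2: reliable servers, but a 1% chance they all go at once.
    print("independent :", loss_independent(3, 10, 0.9))
    print("common-mode :", loss_with_common_mode(3, 10, 0.9, 0.01))
```

In this model the K/N threshold is gradual at N=10 and sharp only for
large N: with p_alive below 0.3 the predicted loss climbs toward 1 as N
grows, and with p_alive above 0.3 it falls toward 0. The common-mode
term, on the other hand, dominates immediately: the predicted loss goes
from about 3.7e-7 to just over 1%, however many "9's" the independent
model promised.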

My conclusion: if you care about the longevity of your files, forget
about erasure coding and concentrate on monitoring. (Go ahead and use
3-of-10 because everyone does, and it adds a reasonably low level of
storage overhead.)
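
What "concentrate on monitoring" might mean for a single file can be
sketched as a simple policy: count how many distinct shares are still
reachable and react long before that count approaches K. Everything
here is hypothetical, including the names classify_file and
repair_margin; it is not a description of Tahoe-LAFS's own checker.

```python
def classify_file(shares_found, k, n, repair_margin=2):
    """Classify one file by the number of distinct shares still reachable.

    repair_margin is an illustrative knob: repair as soon as the file is
    within that many shares of becoming unrecoverable.
    """
    if shares_found < k:
        return "unrecoverable"
    if shares_found < k + repair_margin:
        return "repair now"
    if shares_found < n:
        return "degraded"
    return "healthy"


if __name__ == "__main__":
    for found in (10, 7, 4, 2):
        print(f"3-of-10 with {found} shares: {classify_file(found, 3, 10)}")
```

The point of the margin is that a policy like this only helps if it
runs often enough to catch a file at "degraded" or "repair now", before
a correlated failure takes it straight to "unrecoverable".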

Not coincidentally, Least Authority Enterprises (our startup company)
has been spending most of our engineering effort on monitoring,
measurements, and fault detection for the last couple of months. Our
service is still not functional enough to advertise it as non-alpha.
This monitoring and operations engineering is a lot of work!

Regards,

Zooko

P.S. But if you want to help us alpha-test our service, by all means
let us know! :-)