[volunteergrid2-l] disk drive failure statistics

Eugen Leitl eugen at leitl.org
Sat Apr 7 08:43:57 UTC 2012


On Fri, Apr 06, 2012 at 08:59:27PM -0600, erpo41 at gmail.com wrote:

> In February of 2007, Google published a paper titled "Failure Trends
> in a Large Disk Drive Population"

That data set is five years old, which is ancient for hard drives.

> (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf)
> analyzing the impact of various environmental factors and SMART
> readings on disk drive failure rates. This paragraph really caught my
> attention:

...

> I don't know about anyone else, but I want that data so I can choose
> the most reliable hard drives from the most reliable manufacturers.
> Furthermore, I want that data to be made public so hard drive
> manufacturers will face real pressure to improve reliability.

This is not as relevant as you might think. Hard drives have been
a high-competition, low-margin business over the last decade, so there
has been considerable manufacturer consolidation. The Thailand floods
exacerbated the situation, so manufacturer diversity is at a new
low (e.g. WD and Hitachi are now one manufacturer).

Moreover, the same model line from one manufacturer can come from
different fabs, so the only reliable reference is the manufacturer
number -- and even then you cross your fingers and hope you didn't
get a bad lot, which you'll only find out about a few years later,
when all the drives you purchased in a particular period start
dropping like flies. So historic data help you little here, as you
can no longer purchase the drives that have been shown to be
particularly reliable.
 
> I've thought about several schemes for collecting this data from PCs
> across the world, but that effort is complicated by the fact that most
> desktops are not on and connected to the Internet 24/7. If a PC is off
> when its disk fails, or if it's not connected to the Internet, it
> won't be able to report the failure ever.
> 
> I think you see where I'm going with this. Tahoe-LAFS/VG2 may be the
> ideal way to collect this type of data. So, two questions:

The sample size is far too small. I probably have more running spindles
than the entire VG2, and we're a very small shop as these things go.
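
To put a rough number on the sample-size problem, here is a minimal
sketch that computes an approximate 95% Wilson score interval for an
observed yearly failure proportion. The 4% failure rate and the drive
counts are made-up illustrative values, not figures from the Google
paper or from VG2, and wilson_interval is just a helper name chosen
for this example.

```python
import math


def wilson_interval(failures, drives, z=1.96):
    """Approximate Wilson score interval for a failure proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if drives <= 0:
        raise ValueError("drives must be positive")
    p = failures / drives
    denom = 1 + z * z / drives
    centre = (p + z * z / (2 * drives)) / denom
    half = z * math.sqrt(
        p * (1 - p) / drives + z * z / (4 * drives * drives)
    ) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Hypothetical populations, all with the same observed 4% failure rate.
    for drives in (50, 500, 50000):
        failures = round(0.04 * drives)
        lo, hi = wilson_interval(failures, drives)
        print(f"{drives:>6} drives: {lo:.3f} .. {hi:.3f}")
```

With a few dozen drives the interval is too wide to rank one model
above another; it only tightens once the population reaches many
thousands of drives, and that is before splitting the sample by
model, lot and age.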
 
> 1. Is there any reason why someone would object to having the tahoe
> client/server collect disk failure statistics and report them to a
> central server? Should this feature be opt-in or opt-out?
> 
> 2. Does anyone see any potential for error in this scheme?

