[volunteergrid2-l] failed nodes, repair and availability

Wed Feb 15 22:55:04 UTC 2012

Hello Steve,

Steve Dodson <steve.dodson at gmail.com> wrote:

> Shawn - You suggest running the repair process periodically - would
> you mind expounding further on this?  i.e. what command and how
> often?  

You need to run

$ tahoe deep-check --repair ALIAS:

where ALIAS is an alias for a directory that links
to all the data you don't want to lose. Such a directory
is called a root cap; it is in some sense the
"root directory" for _your_ data.

If I understand correctly, it is recommended to run
this once a month.

Also, you need to refresh leases. Each new object gets
a fresh lease. The recommended server settings are
such that object fragments whose leases are
older than one year are deleted from the
servers.

Refreshing leases is done with 

$ tahoe deep-check --add-lease ALIAS:

and it is recommended to run it once a week.

The detailed explanation is here:

https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/garbage-collection.rst
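
If you want to automate the two jobs, something like the following
Python sketch could be called from cron or a similar scheduler. It only
assembles the two command lines shown above. The alias "tahoe:" and the
function names are my own placeholders, not anything Tahoe prescribes,
and run_deep_check() assumes the tahoe executable is on your PATH.

```python
import subprocess


def deep_check_command(alias, repair=False, add_lease=False):
    """Build the argument list for one `tahoe deep-check` run."""
    cmd = ["tahoe", "deep-check"]
    if repair:
        cmd.append("--repair")
    if add_lease:
        cmd.append("--add-lease")
    cmd.append(alias)
    return cmd


def run_deep_check(alias, **options):
    """Run the command and return its exit code (needs `tahoe` on PATH)."""
    return subprocess.run(deep_check_command(alias, **options)).returncode


# "tahoe:" is a placeholder alias; use the alias of your own root cap.
WEEKLY = deep_check_command("tahoe:", add_lease=True)   # refresh leases
MONTHLY = deep_check_command("tahoe:", repair=True)     # check and repair
```

Calling run_deep_check("tahoe:", add_lease=True) once a week and
run_deep_check("tahoe:", repair=True) once a month would match the
intervals mentioned above.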

One more thing. Shawn wrote:

> >
> > I wouldn't bother backing up the Tahoe server at all.  The whole
> > point of Tahoe-LAFS is that the grid is distributed and replicated
> > for safety, so loss of any one node doesn't result in the loss of
> > any data.  It's expected that nodes will occasionally die, and
> > that's okay; everyone should be running the repair process
> > periodically, and that will repopulate any missing shares.  So if
> > your node dies, just stand a new, empty, one up in its place.

It is clear to me that one defective server alone will not bring down
the grid. But it certainly has an influence on availability.
If a hard disk dies and I replace it and restore $BASEDIR/storage,
the server will be back in service within about two days. That
is still an availability of about 99.5 % (363 out of 365 days).
If one does not restore the data and everyone instead runs the repair
described above within one month, the worst-case availability is
about 92 % (335 out of 365 days, depending on how you count), and one
would miss the 95 % goal.
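
The arithmetic behind these two cases can be checked with a few lines
of Python. This is only a rough sketch: it assumes a 365-day year, one
outage per year, and 2 or 30 days of downtime respectively, and the
function names are made up for illustration.

```python
def availability_percent(days_down, days_per_year=365):
    """Share of the year, in percent, during which the server is up."""
    return 100.0 * (days_per_year - days_down) / days_per_year


def allowed_downtime_days(goal_percent, days_per_year=365):
    """Days of downtime per year that still meet the availability goal."""
    return days_per_year * (100 - goal_percent) / 100


restore = availability_percent(2)       # disk replaced, storage restored
repair_only = availability_percent(30)  # empty node, repair within a month

print(f"restore:     {restore:.1f} %")       # restore:     99.5 %
print(f"repair only: {repair_only:.1f} %")   # repair only: 91.8 %
print(allowed_downtime_days(95))             # 18.25
```

So under these assumptions a 95 % goal leaves room for roughly 18 days
of downtime per year, which a two-day restore meets easily and a
month-long wait for repair does not.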

I do not have the detailed mathematics at hand, but I think this
can make quite a difference to the overall reliability.

Johannes

(who today almost managed to crash Windows XP by issuing
"query replace" within a Visual Studio plugin :-P )