[volunteergrid2-l] failed nodes, repair and availability

Wed Feb 15 22:57:01 UTC 2012

Also, you can run

   tahoe deep-check --repair --add-lease ALIAS:

to do both at once.  And if you use the default alias (tahoe:), you can
omit the alias.
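
A minimal sketch of how the combined command could be wrapped for periodic
runs. The names TAHOE_ALIAS, DRY_RUN and maintain_grid are illustrative
assumptions and not part of Tahoe-LAFS; only the tahoe command line itself
comes from this thread. The script defaults to a dry run that just prints
the command.

```shell
#!/bin/sh
# Hypothetical wrapper: check, repair and renew leases for one alias.
# TAHOE_ALIAS and DRY_RUN are made-up names for this sketch.
TAHOE_ALIAS="${TAHOE_ALIAS:-tahoe:}"

maintain_grid() {
    # $1 is the alias (with trailing colon) of the root directory.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run (the default here): only print what would be executed.
        echo "tahoe deep-check --repair --add-lease $1"
    else
        tahoe deep-check --repair --add-lease "$1"
    fi
}

maintain_grid "$TAHOE_ALIAS"
```

Setting DRY_RUN=0 makes the script actually run the command. It could then
be started from cron or any other scheduler, for example once a week to
match the lease-renewal interval mentioned below.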

On Wed, Feb 15, 2012 at 3:55 PM, Johannes Nix <Johannes.Nix at gmx.net> wrote:

> Hello Steve,
>
> Steve Dodson <steve.dodson at gmail.com> wrote:
>
> > Shawn - You suggest running the repair process periodically - would
> > you mind expounding further on this?  i.e. what command and how
> > often?
>
> You need to run
>
> $ tahoe deep-check --repair ALIAS:
>
> where ALIAS is an alias of a directory which links
> to all data that you don't want to lose. The cap of such a
> directory is called a root cap; it is in some sense the
> "root directory" for _your_ data.
>
> If I understand correctly, it is recommended to run
> this once a month.
>
> Also, you need to refresh leases. Each new object will
> have a fresh lease. The recommended settings for
> the servers are such that object fragments whose leases are
> older than one year will be deleted from the
> servers.
>
> Refreshing leases is done with
>
> $ tahoe deep-check --add-lease ALIAS:
>
> and it is recommended to run it once a week.
>
> The detailed explanation is here:
>
>
> https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/garbage-collection.rst
>
> One more thing. Shawn wrote:
>
> > >
> > > I wouldn't bother backing up the Tahoe server at all.  The whole
> > > point of Tahoe-LAFS is that the grid is distributed and replicated
> > > for safety, so loss of any one node doesn't result in the loss of
> > > any data.  It's expected that nodes will occasionally die, and
> > > that's okay; everyone should be running the repair process
> > > periodically, and that will repopulate any missing shares.  So if
> > > your node dies, just stand a new, empty, one up in its place.
>
> It is clear to me that one defective server alone will not bring down
> the grid. But it certainly has an influence on availability.
> If one hard disk dies and I replace it and restore $BASEDIR/storage,
> the server will be back working within about two days. This
> is still an availability of 99.4 % (363 out of 365 days).
> If one does not restore the data and everyone runs the repair
> described above within one month, this is a worst-case availability of
> about 92 % (335 out of 365 days, depending on how you count) and one
> would miss the 95 % goal.
>
> I do not have the detailed mathematics at hand but I think this
> can make quite a difference to the overall reliability.
>
> Johannes
>
> (who today almost managed to crash Windows XP by issuing
> "query replace" within a Visual Studio plugin :-P )

-- 
Shawn