[tahoe-dev] [tahoe-lafs] #687: too many "false alarms" in incident reporting
tahoe-lafs
trac at allmydata.org
Sun Apr 26 13:51:39 PDT 2009
#687: too many "false alarms" in incident reporting
--------------------+-------------------------------------------------------
Reporter: zooko | Owner: somebody
Type: defect | Status: new
Priority: major | Milestone: undecided
Component: code | Version: 1.4.1
Keywords: | Launchpad_bug:
--------------------+-------------------------------------------------------
Comment(by warner):
Yeah. Part of the reason for producing Incidents is to learn which
exceptions are happening frequently so we can understand and downgrade
them. If a specific kind of incident is happening a lot, then it needs
either to be fixed or to be ignored (well, the specific log message that
triggers the event should be reduced in severity, below the threshold
which triggers incident reporting).
How many incidents are you seeing? And what is triggering them? The
Foolscap package provides a CLI tool named "flogtool", and you can run
"flogtool dump INCIDENTFILE" to read the contents of the incident file.
grep for "TRIGGER" to see the specific event that triggered the Incident
(the incident record includes events before and after the trigger).
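The same "dump and grep" step can be scripted. This is only a rough sketch:
it assumes "flogtool" is on your PATH and that the trigger event's line in
the dump contains the string "TRIGGER", as described above. The function
names are made up for illustration and are not part of Foolscap.

```python
import subprocess
from typing import Iterable, List


def trigger_lines(dump_lines: Iterable[str]) -> List[str]:
    """Return the lines of a dump that mention TRIGGER (a plain substring match)."""
    return [line.rstrip("\n") for line in dump_lines if "TRIGGER" in line]


def dump_incident(incident_file: str) -> List[str]:
    """Run `flogtool dump INCIDENTFILE` and return its output as a list of lines.

    Assumes the flogtool CLI is installed; raises CalledProcessError if it fails.
    """
    result = subprocess.run(
        ["flogtool", "dump", incident_file],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Usage would be something like trigger_lines(dump_incident(path)) for each
incident file you have collected, where path is wherever your node saved
the file.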
There is an incomplete set of tools for collecting and classifying
Incidents, which includes code to label each incident with a "category"
and sort them that way. The longer term goal is to produce a web page
which shows recent Incidents, and how many Incidents of each category have
been produced, to make it more obvious which ones are false-positives (and
should be fixed by complaining less) and which ones are significantly
unusual (and should be fixed by addressing the bug).
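To make the "category" idea concrete, here is a minimal sketch of labelling
incidents and counting them per category. It is not the actual
classification code; the category names and the substring rules in
classify() are invented for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def classify(trigger_text: str) -> str:
    """Map a trigger message to a category name (hypothetical rules)."""
    text = trigger_text.lower()
    if "corrupt" in text:
        return "corrupt-share"
    if "connection lost" in text or "went away" in text:
        return "server-went-away"
    return "unknown"


def tally(triggers: Iterable[str]) -> List[Tuple[str, int]]:
    """Count incidents per category, most frequent category first."""
    return Counter(classify(t) for t in triggers).most_common()
```

A category that dominates the tally is a likely false positive whose log
message should be downgraded; a rare one is more likely to deserve a
closer look.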
The overall goal is to highlight things that need attention. Getting a
corrupt share from a server seemed to fall into this category: although we
can handle it just fine, the odds of it happening are so low that somebody
should look into it (either to tell the server operator that they're
having disk problems, or to tell the server operator to please stop
scribbling on your shares, or to run a memory tester on your own machine,
or something).
But having a server go away during the download should certainly not
trigger an Incident; that's the sort of log message which needs to be
deprioritized.
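As a sketch of what "deprioritized" means here: an event triggers an
Incident only if its severity is at or above a threshold, so lowering the
severity of one log message stops it from triggering. The level names and
numbers below are illustrative assumptions and are not claimed to match
Foolscap's own constants.

```python
# Hypothetical severity levels, for illustration only.
OPERATIONAL = 20
UNUSUAL = 23
WEIRD = 30

# Hypothetical threshold: events at or above this level trigger an Incident.
INCIDENT_THRESHOLD = WEIRD


def triggers_incident(level: int, threshold: int = INCIDENT_THRESHOLD) -> bool:
    """Return True if an event logged at `level` would trigger an Incident."""
    return level >= threshold


# A "server went away during download" message logged at WEIRD triggers an
# Incident; the same message logged at UNUSUAL does not.
before = triggers_incident(WEIRD)
after = triggers_incident(UNUSUAL)
```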
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/687#comment:1>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid