[tahoe-dev] [tahoe-lafs] #687: too many "false alarms" in incident reporting

Sun Apr 26 13:51:39 PDT 2009

#687: too many "false alarms" in incident reporting
--------------------+-------------------------------------------------------
 Reporter:  zooko   |           Owner:  somebody 
     Type:  defect  |          Status:  new      
 Priority:  major   |       Milestone:  undecided
Component:  code    |         Version:  1.4.1    
 Keywords:          |   Launchpad_bug:           
--------------------+-------------------------------------------------------

Comment(by warner):

 Yeah. Part of the reason for producing Incidents is to learn which
 exceptions are happening frequently so we can understand and downgrade
 them. If a specific kind of Incident is happening a lot, then either the
 underlying problem needs to be fixed or the Incident needs to be ignored
 (that is, the specific log message that triggers it should be reduced in
 severity, below the threshold that triggers incident reporting).
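
 As a rough illustration of the threshold idea, here is a minimal sketch.
 The level names and numeric values are made up for the example and are
 not taken from Foolscap; the point is only that a message logged below
 the threshold no longer produces an Incident.

```python
# Hypothetical severity levels, for illustration only (not Foolscap's constants).
LEVEL_UNUSUAL = 23
LEVEL_WEIRD = 30

# Assumed rule: an event at or above this level triggers an Incident.
INCIDENT_THRESHOLD = LEVEL_WEIRD


def triggers_incident(level, threshold=INCIDENT_THRESHOLD):
    """Return True if an event logged at `level` would trigger an Incident."""
    return level >= threshold


# A noisy message logged at the threshold triggers an Incident...
before = triggers_incident(LEVEL_WEIRD)
# ...and stops doing so once its severity is reduced below the threshold.
after = triggers_incident(LEVEL_UNUSUAL)
```

 Under these assumptions, downgrading a false alarm is a one-line change
 at the site that logs the message.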

 How many incidents are you seeing? And what is triggering them? The
 Foolscap package provides a CLI tool named "flogtool", and you can run
 "flogtool dump INCIDENTFILE" to read the contents of the incident file.
 Grep the output for "TRIGGER" to see the specific event that triggered the
 Incident (the incident record includes events before and after the
 trigger).
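
 If you have many incident files, the dump-and-grep step can be scripted.
 This is a sketch that assumes "flogtool" is on your PATH; the helper
 names are invented for the example, and it does nothing smarter than the
 grep described above (a plain substring match on each line).

```python
import subprocess


def dump_incident(path):
    """Run `flogtool dump PATH` and return its text output."""
    result = subprocess.run(
        ["flogtool", "dump", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def trigger_lines(dump_text, marker="TRIGGER"):
    """Return the lines of a dump that contain the marker string."""
    return [line for line in dump_text.splitlines() if marker in line]
```

 For example, trigger_lines(dump_incident("INCIDENTFILE")) would return
 the same lines that grepping the dump for "TRIGGER" shows.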

 There is an incomplete set of tools for collecting and classifying
 Incidents, which includes code to label each incident with a "category"
 and sort them that way. The longer-term goal is to produce a web page
 which shows recent Incidents, and how many Incidents of each category have
 been produced, to make it more obvious which ones are false positives (and
 should be fixed by complaining less) and which ones are significantly
 unusual (and should be fixed by addressing the bug).
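
 To give a feel for the labelling and counting, here is a small sketch. It
 is not the actual classification code; the rules, category names, and
 function names are all hypothetical, and it matches on substrings of the
 trigger message purely for illustration.

```python
from collections import Counter

# Hypothetical rules: (substring of the trigger message, category label).
RULES = [
    ("corrupt share", "corrupt-share"),
    ("connection lost", "server-went-away"),
]


def classify(trigger_message):
    """Return the category of the first matching rule, else "unknown"."""
    lowered = trigger_message.lower()
    for needle, category in RULES:
        if needle in lowered:
            return category
    return "unknown"


def count_by_category(trigger_messages):
    """Count how many trigger messages fall into each category."""
    return Counter(classify(message) for message in trigger_messages)
```

 A category with a large count is a candidate false positive, and
 anything left in "unknown" still needs a human to look at it.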

 The overall goal is to highlight things that need attention. Getting a
 corrupt share from a server seemed to fall into this category: although we
 can handle it just fine, the odds of it happening are so low that somebody
 should look into it (either to tell the server operator that they're
 having disk problems, or to tell the server operator to please stop
 scribbling on your shares, or to run a memory tester on your own machine,
 or something).

 But having a server go away during a download should certainly not
 trigger an Incident. That's the sort of log message which needs to be
 deprioritized.

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/687#comment:1>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid