[tahoe-lafs-trac-stream] [tahoe-lafs] #1392: if you have fewer than 1000 measurements, return None (meaning "I don't know") when asked for the 99.9% percentile.

Sat Apr 23 10:49:37 PDT 2011

#1392: if you have fewer than 1000 measurements, return None (meaning "I don't
know") when asked for the 99.9% percentile.
-------------------------------+----------------------------------
     Reporter:  arch_o_median  |      Owner:  arch_o_median
         Type:  defect         |     Status:  new
     Priority:  minor          |  Milestone:  undecided
    Component:  code-storage   |    Version:  1.8.2
   Resolution:                 |   Keywords:  design-review-needed
Launchpad Bug:                 |
-------------------------------+----------------------------------

Comment (by arch_o_median):

 Replying to [comment:10 zooko]:
 > Rather than returning {{{None}}} for all queries when fewer than 1000
 samples have been taken, I would like for it to return answers for those
 queries which it can currently answer and return {{{None}}} for those that
 it can't. So if you've taken 100 samples and someone asks you for the 99th
 percentile, you should say "the 99th percentile is 1234", and if they ask
 you for the 99.9th percentile, you should say "I do not know the 99.9th
 percentile".
 >
 > arch_o_median: what do you think?
 >
 > Anyone else reading this: does that make sense?
 >
 > The current behavior is for it to estimate the pickier percentiles: for
 example, if you have taken only one sample and ask it for the 99.9th
 percentile, it will return that one value. This violates one of my
 rules of measurement/monitoring/analysis/visualization, which is: "Never
 give a plausible answer when the truth is that you don't know."
 >
 > Something that made me care a lot about this rule was when a monitoring
 system used by !SimpleGeo showed me a graph of a CPU's load being flatly
 zero for a long time, then suddenly going crazy for a while, then
 gradually returning to almost zero. I spent several hours trying to
 diagnose ongoing problems in the complex, multi-server system and trying
 to understand how it could have exhibited that behavior when as far as I
 could tell the aberrant behavior should have begun much earlier than it
 had.
 >
 > Finally it turned out that it ''had'' been going crazy during the
 initial period, but the measurement tool had been turned off, and the
 visualization represented that to me as a continuous 0 instead of as a
 don't-know, thus wasting my hours by showing me false data.
 >
 > Returning your best guess for a 99.9th percentile when you don't have
 1000 samples isn't as bad as returning 0 when you don't know, but I think
 it is better to return don't-know in this case. The argument that clinched
 it for me is that if it returns an answer then you can't distinguish
 between having 100 samples and the worst one being 1234 versus having 1000
 samples and the 999th one being 1234, so there is a loss of information
 compared with returning don't-know ({{{None}}}).
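
 A minimal sketch of the per-query behavior proposed in the quoted comment,
 not the actual Tahoe-LAFS code. The function names {{{min_samples_for}}}
 and {{{percentile_or_none}}} are hypothetical, and the nearest-rank
 indexing is an assumption chosen so that 1000 samples give "the 999th one"
 for the 99.9th percentile, as in the example above.

```python
from fractions import Fraction
from math import ceil


def min_samples_for(percentile):
    """Return the fewest samples needed to answer for `percentile`.

    Hypothetical helper: 99 -> 100 samples, 99.9 -> 1000 samples.
    Fraction(str(...)) avoids float rounding in 1 / (1 - 0.999).
    """
    p = Fraction(str(percentile)) / 100
    if not 0 < p < 1:
        raise ValueError("percentile must be strictly between 0 and 100")
    return ceil(1 / (1 - p))


def percentile_or_none(samples, percentile):
    """Return the requested percentile, or None meaning "I don't know".

    Assumption: nearest-rank definition, i.e. the value at 1-based
    rank ceil(p * n) in the sorted samples.
    """
    n = len(samples)
    if n < min_samples_for(percentile):
        return None  # too few samples to tell this percentile from the max
    p = Fraction(str(percentile)) / 100
    rank = ceil(p * n)
    return sorted(samples)[rank - 1]
```

 With 100 samples this answers a query for the 99th percentile but returns
 {{{None}}} for the 99.9th, so a caller can tell "the worst of 100 samples"
 apart from "the 999th of 1000 samples".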

-- 
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1392#comment:13>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage