[tahoe-lafs-trac-stream] [tahoe-lafs] #1392: if you have fewer than 1000 measurements, return None (meaning "I don't know") when asked for the 99.9% percentile.
tahoe-lafs
trac at tahoe-lafs.org
Sat Apr 23 10:49:37 PDT 2011
#1392: if you have fewer than 1000 measurements, return None (meaning "I don't
know") when asked for the 99.9% percentile.
-------------------------------+----------------------------------
Reporter: arch_o_median | Owner: arch_o_median
Type: defect | Status: new
Priority: minor | Milestone: undecided
Component: code-storage | Version: 1.8.2
Resolution: | Keywords: design-review-needed
Launchpad Bug: |
-------------------------------+----------------------------------
Comment (by arch_o_median):
Replying to [comment:10 zooko]:
> Rather than returning {{{None}}} for all queries when fewer than 1000
samples have been taken, I would like for it to return answers for those
queries which it can currently answer and return {{{None}}} for those that
it can't. So if you've taken 100 samples and someone asks you for the 99th
percentile, you should say "the 99th percentile is 1234", and if they ask
you for the 99.9th percentile, you should say "I do not know the 99.9th
percentile".
>
> arch_o_median: what do you think?
>
> Anyone else reading this: does that make sense?
>
> The current behavior is for it to estimate the pickier percentiles. For
example, if you have taken only one sample and ask it for the 99.9th
percentile, it will return that one value. This violates one of my
rules of measurement/monitoring/analysis/visualization, which is: "Never
give a plausible answer when the truth is that you don't know."
>
> Something that made me care a lot about this rule was when a monitoring
system used by !SimpleGeo showed me a graph of a CPU's load being flatly
zero for a long time, then suddenly going crazy for a while, then
gradually returning to almost zero. I spent several hours trying to
diagnose ongoing problems in the complex, multi-server system and trying
to understand how it could have exhibited that behavior when as far as I
could tell the aberrant behavior should have begun much earlier than it
had.
>
> Finally it turned out that it ''had'' been going crazy during the
initial period, but the measurement tool had been turned off, and the
visualization represented that to me as a continuous 0 instead of as a
don't-know, thus wasting my hours by showing me false data.
>
> Returning your best guess for a 99.9th percentile when you don't have
1000 samples isn't as bad as returning 0 when you don't know, but I think
it is better to return don't-know in this case. The argument that clinched
it for me is that if it returns an answer then you can't distinguish
between having 100 samples and the worst one being 1234 versus having 1000
samples and the 999th one being 1234, so there is a loss of information
compared with returning don't-know ({{{None}}}).
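
The rule proposed in the quoted comment can be sketched in Python as
follows. This is only an illustration, not the code in Tahoe-LAFS: the
function name percentile_or_none, the nearest-rank method, and the
threshold of 100 / (100 - p) samples are all assumptions, chosen to match
the examples above (100 samples for the 99th percentile, 1000 for the
99.9th).

```python
import math
from fractions import Fraction


def percentile_or_none(samples, percentile):
    """Return the requested percentile, or None if there are too few samples.

    Hypothetical helper: the name and the threshold rule are assumptions
    made for illustration, not the existing Tahoe-LAFS implementation.
    """
    # Go through str() so that 99.9 becomes exactly 999/10 rather than a
    # slightly-off binary float, which would push the threshold to 1001.
    p = Fraction(str(percentile))
    if not 0 < p < 100:
        raise ValueError("percentile must be strictly between 0 and 100")

    # Assumed rule: the p-th percentile needs at least 100 / (100 - p)
    # samples, i.e. 100 samples for p=99 and 1000 samples for p=99.9.
    needed = math.ceil(100 / (100 - p))
    if len(samples) < needed:
        return None  # "I don't know"

    # Nearest-rank percentile: the value at rank ceil(p/100 * n), 1-based.
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]
```

With 100 samples this answers a query for the 99th percentile but returns
None for the 99.9th, and with a single sample it returns None for both, so
a caller can tell "the 99.9th percentile is 1234" apart from "not enough
data yet".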
--
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1392#comment:13>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage