#1392 closed defect (fixed)
if you have fewer than 1000 measurements, return None (meaning "I don't know") when asked for the 99.9% percentile.
| Reported by: | arch_o_median | Owned by: | arch_o_median |
|---|---|---|---|
| Priority: | minor | Milestone: | 1.9.0 |
| Component: | code-storage | Version: | 1.8.2 |
| Keywords: | reviewed | Cc: | |
| Launchpad Bug: | | | |
Description
This patch tests the behavior of get_latencies in two ways:
One:
The get_latencies method is covered.
Two:
In the case of a sample size of less than 1000 in a category, get_latencies should return None as the value for each quantile in that category. If get_latencies simply reports the nth percentile, ignoring small sample sizes, the interpretation of identical values for the 99th and 99.9th percentiles becomes ambiguous. That is, one cannot tell whether they have the same value because there are many latencies with the same value at the 'big' end of the distribution, or because there are simply too few observations and the value at the same index in the list is being reported twice.
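The ambiguity described above can be sketched with a minimal, hypothetical truncating-index percentile helper (this is an illustration, not the actual Tahoe-LAFS code):

```python
def naive_percentile(samples, p):
    """Return the value at quantile p by simple index truncation.

    With fewer than 1000 samples, int(len * 0.99) and int(len * 0.999)
    collapse to the same index, so the two percentiles always coincide.
    """
    ordered = sorted(samples)
    return ordered[int(len(ordered) * p)]

# With 100 samples, the 99th and 99.9th percentiles index the same element,
# so identical reported values are ambiguous.
hundred = list(range(100))
same = naive_percentile(hundred, 0.99) == naive_percentile(hundred, 0.999)
```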
Attachments (13)
Change History (38)
Changed at 2011-04-14T22:56:44Z by arch_o_median
comment:1 Changed at 2011-04-14T23:02:12Z by zooko
- Owner set to zooko
- Status changed from new to assigned
comment:2 Changed at 2011-04-14T23:07:19Z by zooko
- Owner changed from zooko to arch_o_median
- Status changed from assigned to new
Changed at 2011-04-15T00:03:37Z by arch_o_median
comment:3 Changed at 2011-04-15T00:44:07Z by arch_o_median
- Keywords review-needed added; review needed removed
comment:4 Changed at 2011-04-16T19:37:06Z by zooko
- Owner changed from arch_o_median to zooko
- Status changed from new to assigned
comment:5 Changed at 2011-04-18T23:01:03Z by zooko
- Owner changed from zooko to arch_o_median
- Status changed from assigned to new
- Summary changed from Extended testing of server.get_latencies for the null backend case to if you have less than 1000 measurements, return None (meaning "I don't know") when asked for the 99.9% percentile.
comment:6 Changed at 2011-04-18T23:01:23Z by zooko
- Keywords review-needed removed
Unsetting the review-needed keyword pending new cleanly-applying patches.
comment:7 Changed at 2011-04-19T01:28:30Z by zooko
- Summary changed from if you have less than 1000 measurements, return None (meaning "I don't know") when asked for the 99.9% percentile. to if you have fewer than 1000 measurements, return None (meaning "I don't know") when asked for the 99.9% percentile.
Changed at 2011-04-22T00:28:59Z by arch_o_median
comment:8 Changed at 2011-04-22T00:30:23Z by arch_o_median
The patch get_latenciesupgrade implements, in test_storage, the test of get_latencies that fails on ambiguous reporting for small sample sizes.
comment:9 Changed at 2011-04-22T03:09:15Z by zooko
- Keywords review-needed added
- Owner changed from arch_o_median to zooko
- Status changed from new to assigned
comment:10 follow-up: ↓ 13 Changed at 2011-04-22T04:31:38Z by zooko
- Owner changed from zooko to arch_o_median
- Status changed from assigned to new
Rather than returning None for all queries when fewer than 1000 samples have been taken, I would like for it to return answers for those queries which it can currently answer and return None for those that it can't. So if you've taken 100 samples and someone asks you for the 99th percentile, you should say "the 99th percentile is 1234", and if they ask you for the 99.9th percentile, you should say "I do not know the 99.9th percentile".
arch_o_median: what do you think?
Anyone else reading this: does that make sense?
The current behavior is to estimate the pickier percentiles: for example, if you have taken only one sample and ask for the 99.9th percentile, it will return that one value. But this violates one of my rules of measurement/monitoring/analysis/visualization, which is: "Never give a plausible answer when the truth is that you don't know."
Something that made me care a lot about this rule was when a monitoring system used by SimpleGeo showed me a graph of a CPU's load being flatly zero for a long time, then suddenly going crazy for a while, then gradually returning to almost zero. I spent several hours trying to diagnose ongoing problems in the complex, multi-server system and trying to understand how it could have exhibited that behavior when as far as I could tell the aberrant behavior should have begun much earlier than it had.
Finally it turned out that it had been going crazy during the initial period, but the measurement tool had been turned off, and the visualization represented that to me as a continuous 0 instead of as a don't-know, thus wasting my hours by showing me false data.
Returning your best guess for a 99.9th percentile when you don't have 1000 samples isn't as bad as returning 0 when you don't know, but I think it is better to return don't-know in this case. The argument that clinched it for me is that if it returns an answer then you can't distinguish between having 100 samples with the worst one being 1234 versus having 1000 samples with the 999th one being 1234, so there is a loss of information compared with returning don't-know (None).
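zooko's proposal can be sketched roughly as follows; the 1/(1-p) sample-count threshold and the function name are assumptions chosen for illustration, not the actual implementation:

```python
def percentile_or_none(samples, p):
    """Return the p-quantile, or None when there are too few samples.

    A quantile p is answered only once roughly 1/(1-p) samples exist,
    so 100 samples can answer the 99th percentile but not the 99.9th
    (a sketch of the proposal above, not the Tahoe-LAFS code).
    """
    needed = int(round(1.0 / (1.0 - p)))
    if len(samples) < needed:
        return None  # "I don't know"
    ordered = sorted(samples)
    return ordered[int(len(ordered) * p)]
```

With 100 samples this answers the 99th percentile but returns None for the 99.9th, which is exactly the distinction the comment asks for.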
comment:11 Changed at 2011-04-22T21:45:22Z by zooko
- Keywords design-review-needed added; review-needed removed
Changed at 2011-04-23T17:27:08Z by arch_o_median
comment:12 Changed at 2011-04-23T17:45:58Z by arch_o_median
Per Zooko's request, the latest version of test_latencies/get_latencies has behavior that depends on the latency sample size.
Recapitulation:
(The Problem)
The notion of a percentile becomes ambiguous when the precision of the reported percentile is over-specific for the quantity of data provided. For example, if the sample size is less than 10, then the 1st percentile and the 10th percentile refer to the same index (the first) in the sorted list of samples. This matches the definition of a percentile (both 1 and 10 percent of the data are less than the first element) but can be misleading in interpretation. A consumer who believes that the 1st and 10th percentiles should refer to different indices in the list will be mistaken. The intuition that different percentiles refer to different indices is reasonable and should be supported. The degree to which the percentiles _are_ distinct is a function of their precision and the size of the sample: larger samples permit more precise percentiles to be meaningful. I use the word 'resolution' in my head when I think of this concept. Larger sample sizes permit higher 'resolution'.
(The Solution):
Indistinct percentiles indicate insufficient resolution for the specified percentile. 'Indistinct' can be defined simply as multiple percentiles referring to the same index. The fix is quite simple: if percentiles are indistinct, they should report None instead of the value at that index.
Caveat: It is, of course, possible to render all percentiles indistinct by specifying over-precise adjacent percentiles. This fix was created with the given percentile list in mind; that is, I am operating on the assumption that the consumer believes .99 and .999 to be different things but does not need to know whether .999 and .9999 are different quantities.
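A minimal sketch of the "indistinct percentiles report None" rule described above; the percentile list and the truncating-index definition are assumptions for illustration, not copied from the patch:

```python
def latencies_with_resolution(samples,
                              percentiles=(0.01, 0.1, 0.5, 0.9, 0.95, 0.99, 0.999)):
    """Map each requested percentile to a value, or to None when its
    index in the sorted sample list collides with another percentile's
    index (i.e. the percentiles are 'indistinct' at this sample size)."""
    ordered = sorted(samples)
    n = len(ordered)
    indices = [int(n * p) for p in percentiles]
    out = {}
    for p, i in zip(percentiles, indices):
        # an index shared by two different percentiles is indistinct
        out[p] = ordered[i] if indices.count(i) == 1 else None
    return out
```

Note that under this rule both colliding percentiles become None; for 100 samples the .99 and .999 entries are both suppressed, which is the behavior zooko questions in comment 15 below.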
comment:13 in reply to: ↑ 10 Changed at 2011-04-23T17:49:37Z by arch_o_median
Replying to zooko (quoting comment 10 in full).
comment:14 Changed at 2011-04-23T17:51:14Z by arch_o_median
I strongly disapprove of 'statistics' that are over-precise... NNT would agree with me, I think.
comment:15 Changed at 2011-04-25T21:01:30Z by zooko
I asked Brian on IRC if he would be okay with this change. He replied something like "Yeah, but you need to find all consumers which are using that result and change them to handle None instead of numbers in those cases."
So, I'll do that part.
Hm, reviewing the patch now -- if you have 10 samples then you could give an answer for 90th percentile, right? The highest sample would be the 90th percentile, if I understand the definition of "percentile" correctly. The current patch has None for 90th percentile when there are 10 samples. Why is that?
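The index arithmetic behind this question can be checked directly, assuming the truncating int(n * p) definition of a percentile index:

```python
# With 10 samples, the 90th percentile indexes the last (largest) sample,
# so an answer is available even at this small sample size.
n = 10
index_90 = int(n * 0.9)                               # the final position
samples = sorted([7, 3, 9, 1, 4, 8, 2, 6, 5, 10])
p90 = samples[index_90]                               # the largest sample
```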
Changed at 2011-04-26T21:04:54Z by arch_o_median
comment:16 Changed at 2011-04-26T21:07:02Z by zooko
arch_o_median will do the "finding all consumers" part. There are two types of consumers:
- those that invoke the Python API, i.e. any Python code that invokes get_stats() or get_latencies()
- those that use the WUI/WAPI, i.e. any HTTP clients that are reading the HTTP results rendered by web.storage.StorageStatus.render_JSON() or the "stats" magic Nevow variable. Looking at stats.rst, it appears that only code that is looking for the key latencies in the resulting stats dict could be affected. Grepping for latencies in the Tahoe-LAFS code base, I see that misc/operations_helpers/munin/tahoe_server_latency_ is the only such in our source tree. There is documentation about the Munin plugins in stats.rst, and also at the top of the munin/tahoe_server_latency_ file.
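Any such consumer now has to tolerate None where a percentile could not be resolved. A hypothetical consumer-side guard (the function and key names are assumptions, not taken from the Tahoe-LAFS tree) might look like:

```python
def format_latency(latencies, category, quantile):
    """Render one entry of a latencies stats dict for display,
    treating None as 'unknown' rather than as a number."""
    value = latencies.get(category, {}).get(quantile)
    if value is None:
        return "unknown"  # too few samples to resolve this percentile
    return "%.6f s" % value
```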
Changed at 2011-04-26T21:17:36Z by arch_o_median
comment:17 Changed at 2011-04-26T21:18:19Z by arch_o_median
OK, the latest patch _also_ has an updated docstring for get_latencies.
comment:18 Changed at 2011-04-27T06:02:19Z by zooko
- Keywords design-review-needed removed
attachment:test_get_latencies_upgraded01.darcs.2.patch looks good to me. After you hunt down all code and tools that use these stats (i.e. the munin script) and if needed update the docs/stats.rst file, I'll commit the patch(es) to trunk.
Changed at 2011-05-10T20:33:13Z by arch_o_median
comment:19 Changed at 2011-05-10T20:33:25Z by arch_o_median
Okay, here's a grepped list of "def get_stats". This lets me know what the possibilities are for a given invocation.
Changed at 2011-05-17T18:00:43Z by arch_o_median
A warner-compiled list of consumers, in IRC copy-paste format
comment:20 Changed at 2011-05-17T18:42:55Z by zooko
- Keywords reviewed news-needed added
attachment:test_get_latencies_upgraded01.darcs.2.patch and attachment:statsdoc.darcs.patch look good to me. I read through Brian's search for consumers in attachment:statsdoc.darcs.patch and agree that there are probably no consumers that will get tripped up by this change. There are two remaining items:
1. a manual test, by running a server that has this patch and looking at its /statistics page (arch_o_median plans to do this tonight), and
2. an entry in source:NEWS.rst, so that anyone who is using the JSON HTTP interface to get stats will see a note saying that this has changed.
To remind us of the latter, I've added the news-needed tag to this ticket.
Changed at 2011-05-19T14:44:05Z by arch_o_median
adds mention of percentile correction to docs/stats.rst
Changed at 2011-05-23T23:46:40Z by arch_o_median
Contains previous patches and a modification to interfaces to allow None to be among values in get_stats returned dict.
comment:21 Changed at 2011-05-24T00:03:15Z by arch_o_median
When I ran the whole test suite ("python ./setup.py test"), two tests failed. The dictionary returned by get_stats no longer conformed to the constraints expected in interfaces.py, as it contained None values. I modified the ChoiceOf call in the relevant interface to include None as an argument. This allows the test suite to pass. Beyond the fact that the tests now pass, I do not know what effects such a change might have (I've not examined the definition of ChoiceOf), so review is certainly called for.
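The effect of the constraint change can be illustrated with a plain-Python stand-in; this mimics the widened set of accepted stat values but is not the Foolscap ChoiceOf API (the real RIStatsProvider constraint lives in foolscap.schema):

```python
def check_stat_value(value):
    """Accept the value types the widened constraint would allow:
    numbers as before, plus None for unresolved percentiles."""
    if value is None or isinstance(value, (int, float)):
        return value
    raise TypeError("stat values must be numbers or None, got %r" % (value,))
```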
Changed at 2011-05-24T00:10:59Z by arch_o_median
Example of "operational statistics" output with "None" values. (Manual test)
Changed at 2011-05-24T15:50:34Z by arch_o_median
complete patch: teststorage, stats, interfaces, NEWS all modified
comment:22 Changed at 2011-05-24T15:54:57Z by arch_o_median
The last file attached to this ticket contains what I believe to be a complete patch. Among its contents are a unit test, a modified get_latencies function, a modified RIStatsProvider interface, and modified docs in multiple files. Also attached to this ticket is a file "statistics.html" that contains the output of "Operational Statistics" using the patched code and small numbers of operations, with the resultant "None"s.
comment:23 Changed at 2011-05-26T13:17:36Z by zooko
- Keywords news-needed removed
attachment:patchcompletewithdocchanges.darcs.patch looks great!
comment:24 Changed at 2011-05-27T12:09:14Z by wilcoxjg@…
- Resolution set to fixed
- Status changed from new to closed
In 67ad0175cd3e4870:
comment:25 Changed at 2011-05-27T12:57:08Z by zooko
- Milestone changed from undecided to 1.9.0
attachment:testgetlatencies.darcs.patch and/or attachment:testgetlatencies01.darcs.patch don't apply cleanly to trunk. arch_o_median is going to create new cleanly-applying patches for this ticket.
I'm renaming this ticket to reflect the desired (by me and arch_o_median at least) functionality.