[tahoe-dev] [tahoe-lafs] #1264: Performance regression for large values of K

tahoe-lafs trac at tahoe-lafs.org
Wed Nov 24 07:42:34 UTC 2010


#1264: Performance regression for large values of K
------------------------------+---------------------------------------------
     Reporter:  francois      |       Owner:                                
         Type:  defect        |      Status:  new                           
     Priority:  major         |   Milestone:  soon                          
    Component:  code-network  |     Version:  1.8.0                         
   Resolution:                |    Keywords:  performance regression download
Launchpad Bug:                |  
------------------------------+---------------------------------------------

Comment (by zooko):

 > You'll find the [attachment:tahoe.profile profile generated] by
 downloading a 20-of-60 10 MB file and then stopping the client. I'll
 hopefully address the remaining points tomorrow.

 Okay, I did a quick dump of the stats from [attachment:tahoe.profile],
 first sorted by the total time spent in each function excluding time
 spent waiting for its callees to return ("tottime"), and then sorted by
 the total time spent in each function including time spent waiting for
 its callees to return ("cumtime").
 {{{
 >>> p.sort_stats('time').print_stats(8)
 Tue Nov 23 14:36:37 2010    tahoe.profile

          21516265 function calls (21475767 primitive calls) in 240.371 CPU seconds

    Ordered by: internal time
    List reduced from 1491 to 8 due to restriction <8>

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      1406  147.383    0.105  147.492    0.105 {select.select}
   1370101   21.221    0.000   21.221    0.000 {posix.stat}
      50882   14.011    0.000   51.913    0.001 traceback.py:273(extract_stack)
   1370097    7.485    0.000   28.704    0.000 linecache.py:43(checkcache)
     50882    7.292    0.000    8.031    0.000 traceback.py:27(format_list)
   1370097    3.531    0.000    7.492    0.000 linecache.py:13(getline)
      2031    2.392    0.001    2.392    0.001 {built-in method send}
   1370097    1.749    0.000    3.435    0.000 linecache.py:33(getlines)

          21516265 function calls (21475767 primitive calls) in 240.371 CPU seconds

    Ordered by: cumulative time
    List reduced from 1491 to 44 due to restriction <44>

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.000    0.000  240.371  240.371 base.py:1046(run)
         1    0.025    0.025  240.361  240.361 base.py:1051(mainLoop)
       1406    0.062    0.000  190.230    0.135 selectreactor.py:93(doSelect)
      1406  147.383    0.105  147.492    0.105 {select.select}
      50882    1.038    0.000   61.047    0.001 traceback.py:264(format_stack)
      50882   14.011    0.000   51.913    0.001 traceback.py:273(extract_stack)
      1406    0.053    0.000   50.059    0.036 base.py:666(runUntilCurrent)
       874    0.054    0.000   49.797    0.057 eventual.py:18(_turn)
 25381/20221    0.116    0.000   46.592    0.002 defer.py:233(callback)
 25391/20231    0.392    0.000   46.453    0.002 defer.py:291(_startRunCallbacks)
      6377    0.277    0.000   45.856    0.007 share.py:188(loop)
      6377    0.180    0.000   44.419    0.007 share.py:239(_do_loop)
      4672    0.049    0.000   42.653    0.009 log.py:38(callWithLogger)
      4672    0.063    0.000   42.597    0.009 log.py:33(callWithContext)
       4672    0.035    0.000   42.467    0.009 context.py:58(callWithContext)
       4672    0.053    0.000   42.428    0.009 context.py:32(callWithContext)
       4672    0.048    0.000   42.358    0.009 selectreactor.py:144(_doReadOrWrite)
 5200/5143    0.101    0.000   39.445    0.008 defer.py:84(maybeDeferred)
      2212    0.043    0.000   39.311    0.018 tcp.py:345(doRead)
      2136    0.038    0.000   38.325    0.018 tcp.py:116(doRead)
      6377    0.405    0.000   37.374    0.006 share.py:709(_send_requests)
      1893    0.087    0.000   37.301    0.020 banana.py:633(dataReceived)
       4999    0.035    0.000   37.245    0.007 referenceable.py:406(callRemote)
      1893    1.008    0.001   37.210    0.020 banana.py:701(handleData)
       4999    0.273    0.000   37.103    0.007 referenceable.py:418(_callRemote)
      4777    0.029    0.000   34.931    0.007 share.py:741(_send_request)
     25491    0.333    0.000   31.085    0.001 defer.py:166(__init__)
 50890/30329    0.944    0.000   30.678    0.001 defer.py:314(_runCallbacks)
      5000    0.100    0.000   30.631    0.006 banana.py:183(send)
      5000    0.120    0.000   30.525    0.006 root.py:92(send)
   1370097    7.485    0.000   28.704    0.000 linecache.py:43(checkcache)
     10106    0.082    0.000   24.185    0.002 defer.py:30(succeed)
      5582    0.183    0.000   22.692    0.004 banana.py:1049(handleClose)
      4999    0.110    0.000   21.742    0.004 call.py:652(receiveClose)
   1370101   21.221    0.000   21.221    0.000 {posix.stat}
      5072    0.829    0.000   19.518    0.004 banana.py:191(produce)
     10072    0.118    0.000   13.274    0.001 root.py:72(next)
     16851    0.085    0.000   12.370    0.001 banana.py:1035(handleToken)
      9998    0.095    0.000   12.250    0.001 call.py:628(receiveChild)
 50868/44994    0.268    0.000   11.336    0.000 defer.py:172(addCallbacks)
      4999    0.026    0.000    9.349    0.002 call.py:673(_done)
      4999    0.054    0.000    9.323    0.002 call.py:55(complete)
     50882    7.292    0.000    8.031    0.000 traceback.py:27(format_list)
   1370097    3.531    0.000    7.492    0.000 linecache.py:13(getline)
 }}}
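
 (For anyone who wants to reproduce this dump from
 [attachment:tahoe.profile]: the listings above presumably came from a
 pstats session roughly like the following -- a sketch, not the exact
 transcript.)
 {{{
 import pstats

 # Load the profile that cProfile wrote out.
 p = pstats.Stats('tahoe.profile')

 # Sort by time spent inside each function itself ("tottime") ...
 p.sort_stats('time').print_stats(8)

 # ... and then by time including time spent in callees ("cumtime").
 p.sort_stats('cumulative').print_stats(44)
 }}}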

 Here are my comments/questions about this:
  * Only the first few rows of the tottime list are at all significant. The
 total CPU time was 240 seconds, so the 3rd row -- {{{extract_stack}}} --
 used about 6% of the CPU time (14.0s) and the 4th row -- {{{checkcache}}}
 -- used about 3% (7.5s).
  * Why is {{{extract_stack}}} getting called 50,882 times!? Here are all
 of the callers which called {{{extract_stack}}} in this run:
 {{{
 >>> p.print_callers('extract_stack')
    Ordered by: cumulative time
    List reduced from 1491 to 1 due to restriction <'extract_stack'>

 Function                         was called by...
                                      ncalls  tottime  cumtime
 traceback.py:273(extract_stack)  <-   50882   14.011   51.913  traceback.py:264(format_stack)


 <pstats.Stats instance at 0x1004e0950>
 >>> p.print_callers('format_stack')
    Ordered by: cumulative time
    List reduced from 1491 to 1 due to restriction <'format_stack'>

 Function                        was called by...
                                     ncalls  tottime  cumtime
 traceback.py:264(format_stack)  <-   25491    0.591   30.752  defer.py:166(__init__)
                                      25391    0.447   30.295  defer.py:291(_startRunCallbacks)


 <pstats.Stats instance at 0x1004e0950>
 }}}
 Ah, this indicates that {{{twisted.internet.defer.debug=True}}}. Okay. I
 wonder how it got turned on?
  * Okay, and why is {{{posix.stat}}} invoked more than a million times?
 Ah, {{{print_callers()}}} tells me that {{{posix.stat}}} is getting called
 by {{{checkcache}}} which is getting called by {{{extract_stack}}}. So all
 told, around 23% of the CPU time (as measured in "tottime" terms rather
 than "cumtime" terms) was spent doing the debugging work caused by
 {{{defer.debug=True}}}. This makes me wonder what the overall throughput
 would be if {{{defer.debug}}} were turned off (see the sketch below).
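
 For reference, here is a minimal sketch of the knob involved, assuming
 the flag seen above is the standard Twisted Deferred-debugging switch
 toggled by {{{defer.setDebugging()}}} (which is what makes
 {{{defer.py:166(__init__)}}} and {{{defer.py:291(_startRunCallbacks)}}}
 call {{{format_stack}}}); I haven't tracked down what turned it on in
 this run:
 {{{
 from twisted.internet import defer

 # With debugging enabled, every Deferred captures a formatted stack
 # trace when it is created and again when its callbacks start to run --
 # that is the format_stack / extract_stack / linecache / posix.stat
 # activity seen in the profile above.
 defer.setDebugging(False)    # disable it for a comparison run
 assert not defer.getDebugging()
 }}}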

 Next, I turn my attention to the "cumtime" results.

  * Measured this way, the CPU time spent in the {{{defer.debug}}} stuff
 was 25% of all CPU time. (And this is a better measure of it than the
 tottime result is.)
  *
 [source:trunk/src/allmydata/immutable/downloader/share.py?annotate=blame&rev=4707#L188
 share.py:188(loop)] was called 6377 times and took up a total of 19% of
 all CPU time. This might be a smoking gun showing that Brian's analysis in
 #1268 is correct and explains the slowdown that this ticket (#1264) is
 about.
  *
 [source:trunk/src/allmydata/immutable/downloader/share.py?annotate=blame&rev=4707#L709
 share.py:709(_send_requests)] was called 6377 times for a total of 37.374
 seconds (15.5% of all CPU time), i.e. about 6 milliseconds per call. Is
 this more evidence in favor of the #1268 theory? What functions does
 {{{_send_requests}}} call, and how long do they take?
 {{{
 >>> p.strip_dirs().print_callees('_send_requests')
    Random listing order was used
    List reduced from 1491 to 1 due to restriction <'_send_requests'>

 Function                      called...
                                   ncalls  tottime  cumtime
 share.py:709(_send_requests)  ->    9554    0.024    0.067  defer.py:189(addCallback)
                                     9554    0.022    0.064  defer.py:197(addErrback)
                                    11154    0.207    0.652  log.py:144(msg)
                                     4777    0.029   34.931  share.py:741(_send_request)
                                     4777    0.083    0.210  spans.py:46(add)
                                    19131    0.132    0.196  spans.py:142(dump)
                                    11154    0.014    0.014  spans.py:152(__iter__)
                                    12754    0.060    0.605  spans.py:170(__sub__)
                                     6377    0.021    0.069  spans.py:256(get_spans)
                                     4777    0.049    0.066  status.py:97(add_request_sent)
                                    11154    0.034    0.093  {repr}
                                     4777    0.002    0.002  {time.time}
 }}}
 Uh-oh, I think this shows that cProfile is giving us bad data! The
 cumulative time spent in {{{_send_request}}}, including waiting for its
 callees to return, was 34.931s, and the total time spent in
 {{{_send_request}}} not waiting for its callees to return was 0.029s,
 which means it was waiting for its callees to return for almost 35s. But
 the total time spent in all of its callees adds up to only about 2
 seconds. That doesn't make sense! :-(
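
 (A rough way to cross-check that inconsistency directly from the saved
 profile: sum, for each direct callee of {{{_send_request}}}, the
 cumulative time charged to calls made from it, and compare against
 {{{_send_request}}}'s own tottime and cumtime. The sketch below assumes
 the profile was produced by cProfile, so pstats' per-caller entries are
 {{{(cc, nc, tt, ct)}}} tuples; {{{find_func}}} is just an ad-hoc helper,
 not part of pstats.)
 {{{
 import pstats

 p = pstats.Stats('tahoe.profile')
 p.calc_callees()   # builds p.all_callees: caller -> {callee: (cc, nc, tt, ct)}

 def find_func(stats, name):
     # ad-hoc helper: look up (file, lineno, funcname) keys by function name
     return [k for k in stats.stats if k[2] == name]

 [key] = find_func(p, '_send_request')
 cc, nc, tottime, cumtime, callers = p.stats[key]

 # Cumulative time charged to each direct callee for calls made from
 # _send_request itself.
 callee_cumtime = sum(ct for (ccc, ncc, tt, ct) in p.all_callees[key].values())

 # If the bookkeeping were self-consistent we would expect, ignoring
 # recursion:  cumtime ~= tottime + callee_cumtime
 print "cumtime=%.3f tottime=%.3f callees=%.3f" % (cumtime, tottime,
                                                   callee_cumtime)
 }}}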

 Well, it is obviously time for me to go to bed. Hopefully it will make
 more sense when I get back to this next weekend.

 One last comment: it is surprising how much time is spent in banana.py and
 its children! (That's the code in foolscap to serialize/deserialize
 things.)
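
 (For anyone who wants to dig into that: pstats can restrict a listing to
 just the banana.py rows, since a string argument to {{{print_stats()}}}
 is treated as a regular expression matched against the
 "filename:lineno(function)" column.)
 {{{
 >>> p.strip_dirs().sort_stats('cumulative').print_stats('banana')
 }}}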

-- 
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1264#comment:9>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage

