[tahoe-dev] [tahoe-lafs] #1264: Performance regression for large values of K

tahoe-lafs trac at tahoe-lafs.org
Tue Nov 23 14:48:54 UTC 2010


#1264: Performance regression for large values of K
------------------------------+---------------------------------------------
     Reporter:  francois      |       Owner:                                
         Type:  defect        |      Status:  new                           
     Priority:  major         |   Milestone:  soon                          
    Component:  code-network  |     Version:  1.8.0                         
    Resolution:                |    Keywords:  performance regression download
Launchpad Bug:                |  
------------------------------+---------------------------------------------

Comment (by zooko):

 Replying to [comment:5 francois]:
 >
 > A quick analysis of [attachment:"Tahoe-LAFS - File Download
 > Status.html"] shows that some segment requests were delayed because a
 > single share request was taking too long. The average time for a share
 > request during this download was about 100 ms, but it rose to a bit more
 > than one second in a few cases, such as this one.
 >
 > {{{
 > qnsglseg      21      [6590:+6554]    +0.896405s      +1.985190s    6554    1.09s
 > }}}
 >
 > I think that those share requests taking too long are probably due to
 > the overall load on the grid and are probably difficult to avoid.

 I don't understand why a few of the share requests would take ten times as
 long as normal. Is the delay on the client, the server, or the network?
 Brian hypothesized that it had something to do with how the spans data
 structure gets used more when K is higher. Maybe running the client
 (gateway, downloader) under a Python profiler would give us a clue.
 (David-Sarah has pointed out that tahoe has support for that!
 comment:2:ticket:1267.)
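
 A rough sketch of what I mean, using the standard-library {{{cProfile}}}
 module (the {{{profile_download}}} helper and the zero-argument callable
 it wraps are illustrative, not the downloader's actual API):

{{{
import cProfile
import pstats

def profile_download(download_callable):
    # Run a zero-argument download function under cProfile and report
    # the twenty functions with the most cumulative time.  CPU-hungry
    # spots in the downloader should float to the top of this report.
    profiler = cProfile.Profile()
    profiler.runcall(download_callable)
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
}}}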

 You could also try running oprofile to get information about what the CPU
 is doing at the native-code (x86/amd64) level. You could try either or
 both of these on the client and on the servers.

 But if the slowdown is due to waiting for disk or for network rather than
 to spinning the CPU, these tools won't show it. Also, if you get unlucky
 and don't encounter any of these 1-second-long pauses while you are
 profiling, or if you encounter so few of them that the functions
 responsible for the delay don't accumulate much CPU time over the whole
 profiling run, then they could evade detection.
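
 One cheap complement to a CPU profiler is wall-clock timing of the
 individual share requests. Here is a minimal sketch of the kind of
 instrumentation I mean; the wrapped {{{request_fn}}} is hypothetical,
 standing in for whatever actually issues the remote read (in the real
 downloader the requests are asynchronous Deferreds, so you would record
 the end time in a callback instead, but the idea is the same):

{{{
import time

def timed_request(request_fn, *args):
    # Log the wall-clock duration of a (hypothetical) share-request call.
    # Time spent blocked on the network or disk shows up here even though
    # a CPU profiler would attribute almost nothing to it.
    start = time.time()
    try:
        return request_fn(*args)
    finally:
        elapsed = time.time() - start
        if elapsed > 0.5:  # well above the ~100 ms average in this download
            print("slow share request: %.3f s" % elapsed)
}}}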

 On the gripping hand, a good profiling result is almost always
 interesting. It either tells us about some functions that are wasting CPU
 cycles or it tells us that there are no functions that waste CPU cycles
 (under this load). So I would encourage you to try it.


 > Now, I don't really understand if and why this issue was not present in
 > v1.7.1, because it seems that segment pipelining wasn't already
 > implemented in the previous downloader.

 Yes, this is definitely one of the mysteries. What changed between v1.7.1
 and v1.8? The addition of the spans data structure is one thing...


 Oh! This would probably help:

 http://tahoe-lafs.org/trac/tahoe-lafs/attachment/ticket/1170/debuggery-trace-spans.dpatch.txt

 That patch adds logging every time {{{Share._received}}}, one of the most
 heavily-used spans objects, gets touched.

 This tool reads logs like that and benchmarks how fast a given spans
 implementation can handle such usage:
 [source:trunk/misc/simulators/bench_spans.py]

 Applying that patch to your downloader could tell us whether the
 {{{Share._received}}} object is getting used a lot, and then running
 bench_spans.py on the resulting log could tell us whether the current
 implementation of spans has some inefficient algorithm.
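
 To illustrate the kind of algorithmic problem bench_spans.py is meant to
 catch (this toy class is not the real Spans implementation, just a
 deliberately naive stand-in): a structure that re-sorts and re-merges its
 whole list of ranges on every add makes N small adds cost O(N^2), which
 is exactly the sort of behavior that would get worse as K grows and the
 download is split into more, smaller share reads.

{{{
class NaiveSpans(object):
    # Toy stand-in for a spans structure: a sorted list of (start, length)
    # ranges, merged after every add.  Each add re-scans the whole list,
    # so N adds cost O(N^2) overall -- fine for a few ranges, painful for
    # the thousands of small reads a high-K download can generate.
    def __init__(self):
        self._spans = []

    def add(self, start, length):
        self._spans.append((start, length))
        self._spans.sort()
        merged = []
        for s, l in self._spans:
            if merged and s <= merged[-1][0] + merged[-1][1]:
                prev_s, prev_l = merged.pop()
                new_end = max(prev_s + prev_l, s + l)
                merged.append((prev_s, new_end - prev_s))
            else:
                merged.append((s, l))
        self._spans = merged
}}}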

-- 
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1264#comment:6>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage

