Opened at 2010-08-12T06:15:42Z
Last modified at 2010-11-23T23:38:50Z
#1170 closed defect
new-downloader performs badly when downloading a lot of data from a file — at Version 69
Reported by: | zooko | Owned by: | |
---|---|---|---|
Priority: | critical | Milestone: | 1.8.0 |
Component: | code-network | Version: | 1.8β |
Keywords: | immutable download performance regression | Cc: | |
Launchpad Bug: |
Description (last modified by zooko)
I tried to download a movie file from Test Grid with Tahoe-LAFS v1.8.0c2 (new-downloader) and the file wouldn't download fast enough or smoothly enough to watch in a streaming fashion during download.
If you look at the download status that I posted on #1169: http://tahoe-lafs.org/trac/tahoe-lafs/attachment/ticket/1169/down-0.html , you'll see that it often took a second or two to fetch a single segment. Also, I don't exactly understand the meaning of the fields (hence #1169), but it looks like the downloader was scheduling a large number of 32-byte reads. Is that right? Were those getting coalesced before being sent?
It could just be that the current test grid has servers that take seconds to respond to requests and the downloader can't do much about that. Is that what happened? Could the downloader have done something about it, such as by choosing different servers instead of the ones that responded so slowly?
I guess I should compare old-downloader to new-downloader behavior on the Test Grid, but right now I have to go to sleep!
Oh, other people could do that comparison with this file if they like:
Some of Zooko's most recent measurements:
run | version | downloaded | download KBps | flags |
---|---|---|---|---|
101 | 1.8.0c2 | 100 MB | 91 | |
102 | 1.7.1 | 100 MB | 182 | |
103 | 1.7.1 | 100 MB | 207 | |
104 | 1.8.0c2 | 100 MB | 82 | |
105 | 1.7.1 | 100 MB | 211 | |
109 | 1.7.1 | 100 MB | 228 | cProfile |
110 | 1.8.0c2 | 100 MB | 113 | cProfile |
111 | 1.8.0c2 | 413 MB | 19 | spanstrace |
112 | 1.8.0c2 | 456 MB | 113 | spanstrace, spans.py.diff |
113 | 1.7.1 | 543 MB | 241 | on office network |
114 | 1.8.0c2 | 456 MB | 147 | spans.py.diff + comment:54 |
115 | 1.7.1 | 456 MB | 224 | |
116 | 1.8.0c2 | 314 MB | 154 | spans.py.diff + comment:54 |
117 | 1.8.0c2 | 36 MB | 160 | cProfile spans.py.diff + comment:54 |
118 | 1.8.0c2 | 1490 MB | 158 | cProfile spans.py.diff + comment:54 |
119 | 1.7.1 | 543 MB | 99 | on office network, noflog |
120 | 1.8.0c2 | 543 MB | 319 | on office network, noflog, spans.py.diff + comment:54 |
121 | 1.7.1 | 543 MB | 252 | on office network, noflog |
122 | 1.8.0c2 | 1490 MB | 157 | noflog, spans.py.diff + comment:54, nologging in immutable/download/share.py |
123 | 1.8.0c2 | 100 MB | 242 | noflog, spans.py.diff + comment:54, nologging in immutable/download/*.py |
124 | 1.7.1 | 100 MB | 219 | noflog |
125 | 1.8.0c2 | 100 MB | 161 | noflog, spans.py.diff + comment:54, nologging in immutable/download/*.py |
126 | 1.7.1 | 100 MB | 223 | noflog |
127 | 1.8.0c2 | 100 MB | 155 | noflog, spans.py.diff + comment:54, nologging in immutable/download/*.py |
Change History (107)
comment:1 Changed at 2010-08-12T16:28:41Z by zooko
comment:2 follow-ups: ↓ 3 ↓ 4 ↓ 11 Changed at 2010-08-12T18:15:54Z by warner
yeah, the 32/64-byte reads are hashtree nodes. The spans structure only coalesces adjacent/overlapping reads (the 64-byte reads are the result of two neighboring 32-byte hashtree nodes being fetched), but all requests are pipelined (note the "txtime" column in the "Requests" table, which tracks remote-bucket-read requests), and the overhead of each message is fairly small (also note the close proximity of the "rxtime" for those batches of requests). So I'm not particularly worried about merging these requests further.
My longer-term goal is to extend the Spans data structure with some sort of "close enough" merging feature: given a Spans bitmap, return a new bitmap with all the small holes filled in, so e.g. a 32-byte gap between two hashtree nodes (which might not be strictly needed until a later segment is read) would be retrieved early. The max-hole-size would need to be tuned to match the overhead of each remote-read message (probably on the order of 30-40 bytes): there's a breakeven point somewhere in there.
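A minimal sketch of such a hole-filling pass, assuming nothing more than a list of sorted, non-overlapping (start, length) ranges (the fill_small_holes name and the max_hole parameter are illustrative, not existing Spans API):

def fill_small_holes(ranges, max_hole):
    """Given sorted, non-overlapping (start, length) ranges, return a new
    list in which any gap of max_hole bytes or less between neighboring
    ranges has been filled in, so the two ranges merge into one request."""
    merged = []
    for start, length in sorted(ranges):
        if merged:
            prev_start, prev_length = merged[-1]
            gap = start - (prev_start + prev_length)
            if gap <= max_hole:
                # absorb the small hole: extend the previous range to cover
                # both it and the current range
                merged[-1] = (prev_start, (start + length) - prev_start)
                continue
        merged.append((start, length))
    return merged

# e.g. two 32-byte hashtree nodes separated by a 32-byte gap become one read:
# fill_small_holes([(0, 32), (64, 32)], max_hole=40) == [(0, 96)]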
Another longer-term goal is to add a readv()-type API to the remote share-read protocol, so we could fetch multiple ranges in a single call. This doesn't shave much overhead off of just doing multiple pipelined read() requests, so again it's low-priority.
And yes, a cleverer which-share-should-I-use-now algorithm might reduce stalls like that. I'm working on visualization tools to show the raw download-status events in a Gantt-chart-like form, which should make it easier to develop such an algorithm. For now, you want to look at the Request table for correlations between reads that occur at the same time. For example, at the +1.65s point, I see several requests that take 1.81s/2.16s/2.37s. One clear improvement would be to fetch shares 0 and 5 from different servers: whatever slowed down the reads of sh0 also slowed down sh5. But note that sh8 (from the other server) took even longer: this suggests that the congestion was on your end of the line, not theirs, especially since the next segment arrived in less than half a second.
comment:3 in reply to: ↑ 2 Changed at 2010-08-12T18:38:22Z by zooko
Replying to warner:
yeah, the 32/64-byte reads are hashtree nodes. The spans structure only coalesces adjacent/overlapping reads (the 64-byte reads are the result of two neighboring 32-byte hashtree nodes being fetched), but all requests are pipelined (note the "txtime" column in the "Requests" table, which tracks remote-bucket-read requests), and the overhead of each message is fairly small (also note the close proximity of the "rxtime" for those batches of requests).
I don't understand what those columns mean (see #1169 (documentation for the new download status page)).
comment:4 in reply to: ↑ 2 Changed at 2010-08-12T19:12:36Z by zooko
Replying to warner:
For now, you want to look at the Request table for correlations between reads that occur at the same time.
I'm having trouble interpreting it (re: #1169).
For example, at the +1.65s point, I see several requests that take 1.81s/2.16s/2.37s . One clear improvement would be to fetch shares 0 and 5 from different servers: whatever slowed down the reads of sh0 also slowed down sh5. But note that sh8 (from the other server) took even longer: this suggests that the congestion was on your end of the line, not theirs, especially since the next segment arrived in less than half a second.
I tried to watch the same movie from my office network and got similarly unwatchable results, download status page attached. Could it be a problem with the way my client, VLC.app, is reading?
Changed at 2010-08-12T19:13:18Z by zooko
Changed at 2010-08-12T21:04:16Z by zooko
comment:5 Changed at 2010-08-12T21:04:27Z by zooko
Well, it wasn't the VLC.app client. I did another download of the same file using wget. The performance was bad--38 KB/s:
$ wget http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg
--2010-08-12 12:54:47--  http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg
Resolving localhost... ::1, fe80::1, 127.0.0.1
Connecting to localhost|::1|:3456... failed: Connection refused.
Connecting to localhost|fe80::1|:3456... failed: Connection refused.
Connecting to localhost|127.0.0.1|:3456... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1490710513 (1.4G) [application/ogg]
Saving to: `bbb-360p24.i420.lossless.drc.ogg.fixed.ogg+bbb-24fps.flac.via-ffmpeg.ogg'
16% [=========================>                  ] 247,203,678  38.0K/s  eta 8h 0m
^C
Here is the download status page for this download (attached). Note that one server had a DYHB RTT of 3 minutes and another had a DYHB RTT of 8 minutes! There were no incident report files or twistd.log entries.
comment:6 Changed at 2010-08-12T21:10:31Z by zooko
The two servers with dramatically higher DYHB RTTs introduced themselves as:
Service Name | Nickname | PeerID | Connected? | Since | First Announced | Version |
---|---|---|---|---|---|---|
storage | linuxpal | rpiw4n3ffzygwyzlkjb55upikk6mewtv | Yes: to 18.62.1.14:55058 | 13:57:08 12-Aug-2010 | 21:26:57 11-Aug-2010 | allmydata-tahoe/1.7.1 |
storage | sunpal7 | 62nlabgfiuzbfseufd2yoymbjzsenbuz | Yes: to 18.62.6.169:64882 | 14:25:33 12-Aug-2010 | 21:26:57 11-Aug-2010 | allmydata-tahoe/1.7.1 |
I pinged their IP addresses:
--- 18.62.6.169 ping statistics ---
21 packets transmitted, 21 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 74.049/78.417/91.016/3.839 ms
--- 18.62.1.14 ping statistics ---
21 packets transmitted, 21 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 73.153/78.478/92.159/4.260 ms
Changed at 2010-08-12T21:33:50Z by zooko
comment:7 Changed at 2010-08-12T21:34:16Z by zooko
- Keywords regression performance added
- Priority changed from major to critical
Okay, I've finally realized that this is a regression of the feature that we added in v1.6.0 to start fetching blocks as soon as you've learned about enough shares and to use the lowest-latency servers. Attached is the download status page from v1.7.1 trying to download this same file from the same test grid; it performs much better:
$ wget http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg
--2010-08-12 15:06:55--  http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg
Resolving localhost... ::1, fe80::1, 127.0.0.1
Connecting to localhost|::1|:3456... failed: Connection refused.
Connecting to localhost|fe80::1|:3456... failed: Connection refused.
Connecting to localhost|127.0.0.1|:3456... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1490710513 (1.4G) [application/ogg]
Saving to: `bbb-360p24.i420.lossless.drc.ogg.fixed.ogg+bbb-24fps.flac.via-ffmpeg.ogg.1'
1% [=>                                           ] 25,182,400   260K/s  eta 1h 43m
^C
We can't release Tahoe-LAFS v1.8.0 with this behavior because it is a significant regression: people who use grids with slow or occasionally slow servers, such as the public Test Grid, would be ill-advised to upgrade from v1.7.1 to v1.8.0, and we don't like to release new versions that some users are ill-advised to upgrade to.
comment:8 Changed at 2010-08-12T21:46:52Z by zooko
I've noticed that when tickets get more than one attachment it becomes confusing for the reader to understand what is what, so here's a quick recap:
- http://tahoe-lafs.org/trac/tahoe-lafs/attachment/ticket/1169/down-0.html ; attached to #1169, download using 1.8.0c2 with VLC.app from my home (initial ticket)
- attachment:down-1.html ; download using 1.8.0c2 with VLC.app from my office (comment:4)
- attachment:down-2.html ; download using 1.8.0c2 with wget from my office (comment:5)
- attachment:down-0.html ; download using 1.7.1 with wget from my office (comment:7)
comment:9 Changed at 2010-08-12T21:53:39Z by zooko
The feature that we released in v1.6.0 was ticket #928, and we did add some sort of unit tests for it, by making some servers not respond to DYHB at all: http://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/src/allmydata/test/test_hung_server.py?rev=37a242e01af6cf76
(In the pre-1.6.0 version, that situation would cause download to stall indefinitely, so that was our primary goal at that time and that is what the tests ensure no longer happens.)
comment:10 Changed at 2010-08-12T21:54:35Z by zooko
Note: the wget speed indicator reports the "current" speed, so it varies a lot during a download. To get reliable speed measurements I guess I should let the wget finish, which means, I suppose, that I should download a smaller file! I would use the download status page's timings as an indicator of performance instead of the wget speed indicator.
comment:11 in reply to: ↑ 2 Changed at 2010-08-12T22:10:16Z by davidsarah
Replying to warner:
One clear improvement would be to fetch shares 0 and 5 from different servers: whatever slowed down the reads of sh0 also slowed down sh5.
Yes, I was going to point that out. Given that the DYHB responses were:
serverid | sent | received | shnums | RTT |
---|---|---|---|---|
lv3fqmev | +0.001393s | +0.105560s | | 104ms |
tavrk54e | +0.003111s | +0.211572s | | 208ms |
xflgj7cg | +0.004095s | +0.111008s | | 107ms |
sp26qyqc | +0.006173s | +0.117722s | 0,5 | 112ms |
sroojqcx | +0.007326s | +0.297761s | 1,6 | 290ms |
4rk5oqd7 | +0.008324s | +0.212271s | 2 | 204ms |
nszizgf5 | +0.009295s | +0.204480s | 3,7,8 | 195ms |
62nlabgf | +0.010490s | +0.203262s | | 193ms |
fp3xjndg | +0.011982s | +0.242262s | 4,9 | 230ms |
rpiw4n3f | +0.013246s | +0.113830s | | 101ms |
- Should the downloader have chosen to get shares from sp26qyqc, nszizgf5 and 4rk5oqd7, rather than getting two shares from sp26qyqc and one from nszizgf5?
- The file has a happiness of 5. Shouldn't the uploader have distributed the shares more evenly?
comment:12 Changed at 2010-08-12T22:32:06Z by warner
Yeah, I'd like to see some more quantifiable data. It's a pity that the old-downloader doesn't provide as much information as the new one (a flog might help), but obviously I learned from experience with old-downloader while building the instrumentation on the new-downloader :).
The status data you show from both downloaders show a server in common, and the other server responded to the DYHB very quickly, so for at least the beginning of the download, I don't think the downloader has enough information to do any better.
Many of the new-downloader block-requests (I'm looking at the +179s to +181s mark) show correlated stalls of both the "fast" server (sp26) and the other "slow" server (nszi). If the problem were a single slow server, I'd expect to see big differences between the response times.
Interesting. So, the main known-problem with the new-downloader (at least the one on the top of my personal list) is its willingness to pull multiple shares from the same server (a "diversity failure"), which obviously has the potential to be slower than getting each share from a different server.
This is plausibly acceptable for the first segment, because the moment we receive the DYHB response that takes us above "k" shares, we're faced with a choice: start downloading now, or wait a while (how long??) in the hopes that new responses will increase diversity and result in a faster download.
But after the first segment, specifically after we've received the other DYHB responses, the downloader really ought to get as much diversity as it can, so pulling multiple shares from the same server (when there's an alternative) isn't excusable after that point.
The fix for this is to implement the next stage of the new-downloader project, which is to rank servers (and which-share-from-which-server mappings) according to some criteria (possibly speed, possibly cost, fairness, etc), and reevaluate that list after each segment is fetched. This is closely tied into the "OVERDUE" work, which is tied into the notion of cross-file longer-term server quality/reputation tracking, which is loosely tied into the notion of alternative backend server classes.
And I can't get that stage finished and tested in the next week, nor is a change that big a very stable thing to land this close to a release. So I'm hoping that further investigation will reveal something convenient, like maybe that 1.7.1 is actually just as variable as new-downloader on this grid, or that the two-shares-from-one-server problem isn't as bad as it first appears.
I *do* have a quick-and-dirty patch that might improve matters, which is worth experimenting with. I'll have to dig it out of a dark corner of my laptop, but IIRC it added an artificial half-second delay after receiving >=k shares from fewer than k servers. If new shares were found before that timer expired, the download would proceed with good diversity. If not, the download would begin with bad diversity after a small delay.
It fixed the basic problem, but I don't like arbitrary delays, and didn't address the deeper issue (you could still wind up pulling shares from slow servers even after you have evidence that there are faster ones available), so I didn't include it in #798.
comment:13 Changed at 2010-08-12T22:50:46Z by warner
RE davidsarah's comment:
Yeah, that's the sort of heuristic that I didn't want to guess at. It'll be easier to see this stuff when I land the visualization code. The arrival order of positive responses is:
- sp26 +117ms (sh0+sh5)
- nszi +204ms (sh3+sh7+sh8)
- 4rk5 +212ms (sh2)
- fp3x +242ms (sh4+sh9)
- sroo +298ms (sh1+sh6)
At +117ms, we don't have enough shares to download. At +204ms, we have enough shares but we'd like more diversity: we can't know that we could achieve our ideal diversity by waiting another 8 milliseconds, so we start downloading the first segment immediately.
By the time the second segment is started (at +977ms), we have a clearer picture of the available shares. We also have about 40kB of experience with each server (or 80kB for sp26, since we happened to fetch two shares from it), which we might use to make some guesses about speeds. When the second segment is started, at the very least we should prefer an arrangement that gives us one share from each server. We might also like to prefer shares that we've already been using (since we'll have to fetch fewer hash-tree nodes to validate them); note that these two goals are already in conflict. We should prefer servers which gave us faster responses, if we believe that they're more likely to give fast responses in the future. But if we only hit up the really fast servers, they'll be using more bandwidth, which might cost them money, so they might prefer that we spread some of the load onto the slower servers, whatever we mutually think is fair.
And we need serendipity too: we should occasionally download a share from a random server, because it might be faster than any of the ones we're currently using, although maybe it won't be, so a random server may slow us down. All five of these goals conflict with each other, so there are weights and heuristics involved, which will change over time.
And we should remember some of this information beyond the end of a single download, rather than starting with an open mind each time, to improve overall efficiency.
So yeah, it's a small thread that, when tugged, pulls a giant elephant into the room. "No no, don't tug on that, you never know what it might be attached to".
So I'm hoping to find a quicker smaller solution for the short term.
comment:14 Changed at 2010-08-13T05:45:37Z by zooko
Brian asked for better measurements, and I ran quite a few (appended below). I think these results are of little use, as they are very noisy, and as far as I can tell I was just wrong when I thought, earlier today, that 1.8.0c2 was downloading this file slower than 1.7.1 did.
On the other hand I think these numbers are trying to tell us that something is wrong. Why does it occasionally take 40s to download 100K?
After I post this comment I will attach some status reports and flogs.
With v1.7.1 and no flogtool tail:
$ time curl --range 0-100000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-100K.ogg % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 97k 100 97k 0 0 8398 0 0:00:11 0:00:11 --:--:-- 25621 real 0m11.913s user 0m0.004s sys 0m0.006s
With 1.7.1 and flogtool tail:
$ time curl --range 0-100000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-100Kb.ogg % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 97k 100 97k 0 0 11044 0 0:00:09 0:00:09 --:--:-- 24679 real 0m9.062s user 0m0.003s sys 0m0.006s
v1.7.1 without tail:
$ time curl --range 0-100000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-100Kc.ogg % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 97k 100 97k 0 0 8134 0 0:00:12 0:00:12 --:--:-- 23310 real 0m12.301s user 0m0.004s sys 0m0.006s
v1.7.1 and flogtool tail:
$ time curl --range 0-100000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-100Kd.ogg % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 97k 100 97k 0 0 9050 0 0:00:11 0:00:11 --:--:-- 24716 real 0m11.057s user 0m0.004s sys 0m0.006s
Now switched from office to home. v1.7.1 and no flogtool tail, 1M:
$ time curl --range 0-1000000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-1M.ogg % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 976k 100 976k 0 0 33899 0 0:00:29 0:00:29 --:--:-- 287k real 0m29.509s user 0m0.006s sys 0m0.013s
v1.7.1 and flogtool tail, 1M:
$ time curl --range 0-1000000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-1M.ogg % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 976k 100 976k 0 0 51857 0 0:00:19 0:00:19 --:--:-- 228k real 0m19.294s user 0m0.005s sys 0m0.012s
v1.7.1 and no flogtool tail, 100K:
$ time curl --range 0-100000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-100K.ogg % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 97k 100 97k 0 0 2499 0 0:00:40 0:00:40 --:--:-- 25031 real 0m40.018s user 0m0.005s sys 0m0.011s
v1.7.1 and flogtool tail, 100K:
$ time curl --range 0-100000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-100K.ogg % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 97k 100 97k 0 0 5249 0 0:00:19 0:00:19 --:--:-- 24746 real 0m19.059s user 0m0.005s sys 0m0.009s
v1.8.0c2 and no flogtool tail, 100K:
$ time curl --range 0-100000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-100K.ogg % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 97k 100 97k 0 0 80318 0 0:00:01 0:00:01 --:--:-- 80516 real 0m1.253s user 0m0.004s sys 0m0.005s
v1.8.0c2 and no flogtool tail, 1M:
$ time curl --range 0-1000000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-1M.ogg % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 976k 100 976k 0 0 98k 0 0:00:09 0:00:09 --:--:-- 118k real 0m9.961s user 0m0.005s sys 0m0.015s
v1.8.0c2 and flogtool tail, 1M:
$ time curl --range 0-1000000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-1M.ogg % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 976k 100 976k 0 0 210k 0 0:00:04 0:00:04 --:--:-- 268k real 0m4.645s user 0m0.004s sys 0m0.009s
Changed at 2010-08-13T05:47:09Z by zooko
Changed at 2010-08-13T05:47:25Z by zooko
Changed at 2010-08-13T05:49:41Z by zooko
comment:15 Changed at 2010-08-13T05:57:27Z by zooko
I just did one more download with 1.7.1 of 1M in order to get both the status page and the flog. I named this download "run 99" so that I could keep its status page, flog, and stdout separate from all the others on this ticket. Here is run 99, at my home, with Tahoe-LAFS v1.7.1, the first 1M of http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg :
$ time curl --range 0-1000000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-1M.ogg % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 976k 100 976k 0 0 38477 0 0:00:25 0:00:25 --:--:-- 245k' real 0m25.999s user 0m0.005s sys 0m0.012s
I will now attach the status output and flog of run 99.
Changed at 2010-08-13T05:57:55Z by zooko
Changed at 2010-08-13T05:59:08Z by zooko
comment:16 Changed at 2010-08-13T06:14:54Z by zooko
I just did one more download with 1.8.0c2 of 1M in order to get both the status page and the flog. I named this download "run 100" so that I could keep its status page, flog, and stdout separate from all the others on this ticket. Here is run 100, at my home, with Tahoe-LAFS v1.8.0c2, the first 1M of http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg :
$ time curl --range 0-1000000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-1M.ogg % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 976k 100 976k 0 0 139k 0 0:00:06 0:00:06 --:--:-- 169k real 0m7.377s user 0m0.004s sys 0m0.010s
Changed at 2010-08-13T06:15:25Z by zooko
Changed at 2010-08-13T06:17:47Z by zooko
comment:17 Changed at 2010-08-13T06:21:12Z by zooko
- Keywords regression removed
- Milestone changed from 1.8.0 to eventually
- Priority changed from critical to major
Okay there's no solid evidence that there is a regression from 1.7.1. I think Brian should use this ticket to analyze my flogs and status pages if he wants and then change it to be a ticket about download server selection. :-) Removing "regression".
comment:18 Changed at 2010-08-13T06:52:56Z by zooko
I just did one more download with 1.8.0c2 of 100M in order to get both the status page and the flog. I named this download "run 101" so that I could keep its status page, flog, and stdout separate from all the others on this ticket. Here is run 101, at my home, with Tahoe-LAFS v1.8.0c2, the first 100M of http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg :
$ echo this is run 101 ; time curl --range 0-100000000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-100M.ogg this is run 101 % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 95.3M 100 95.3M 0 0 90900 0 0:18:20 0:18:20 --:--:-- 96978 real 18m20.118s user 0m0.097s sys 0m0.536s
Changed at 2010-08-13T06:54:30Z by zooko
Changed at 2010-08-13T06:59:15Z by zooko
comment:19 Changed at 2010-08-13T07:18:54Z by zooko
I just did one more download with 1.7.1 of 100M in order to get both the status page and the flog. I named this download "run 102" so that I could keep its status page, flog, and stdout separate from all the others on this ticket. Here is run 102, at my home, with Tahoe-LAFS v1.7.1, the first 100M of http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg :
$ echo this is run 102 ; time curl --range 0-100000000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-100M.ogg this is run 102 % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 95.3M 100 95.3M 0 0 182k 0 0:08:54 0:08:54 --:--:-- 26.3M real 8m54.910s user 0m0.066s sys 0m0.478s
comment:20 Changed at 2010-08-13T07:19:10Z by zooko
Interesting that 1.7.1 was twice as fast as 1.8.0c2 this time.
Changed at 2010-08-13T07:19:48Z by zooko
comment:21 Changed at 2010-08-13T07:20:37Z by zooko
Annoyingly, 1.7.1 has a bug where it doesn't show downloads in the status page sometimes, and that happened this time, so I can't show you the status page for run 102.
comment:22 Changed at 2010-08-13T15:51:30Z by zooko
run 103
1.7.1
the first 100M
http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg
$ echo this is run 103 ; time curl --range 0-100000000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-100M.ogg this is run 103 % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 95.3M 100 95.3M 0 0 207k 0 0:07:50 0:07:50 --:--:-- 27.1M real 7m50.696s user 0m0.063s sys 0m0.469s
Changed at 2010-08-13T15:52:22Z by zooko
comment:23 Changed at 2010-08-13T17:51:36Z by zooko
run 104
1.8.0rc2-4698
the first 100M
$ echo this is run 104 ; time curl --range 0-100000000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-100M.ogg this is run 104 % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 95.3M 100 95.3M 0 0 81763 0 0:20:23 0:20:23 --:--:-- 57095 real 20m23.119s user 0m0.102s sys 0m0.554s
Changed at 2010-08-13T17:54:57Z by zooko
comment:24 Changed at 2010-08-13T17:56:14Z by zooko
- Keywords regression added
- Milestone changed from eventually to 1.8.0
- Priority changed from major to critical
Hm, okay it really looks like there is a substantial (2X) slowdown for using Tahoe-LAFS v1.8.0c2 instead of v1.7.1 on today's (and yesterday's) Test Grid. I'm re-adding the regression tag which means I think this issue should block 1.8.0 release until we at least understand it better.
Changed at 2010-08-13T17:57:41Z by zooko
comment:25 Changed at 2010-08-13T18:45:51Z by zooko
run 105
1.7.1
the first 100M
$ echo this is run 105 tahoe-lafs v1.7.1 ; time curl --range 0-100000000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-100M.ogg this is run 105 tahoe-lafs v1.7.1 % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 95.3M 100 95.3M 0 0 211k 0 0:07:40 0:07:40 --:--:-- 19.8M real 7m41.179s user 0m0.061s sys 0m0.481s
Changed at 2010-08-13T18:46:45Z by zooko
comment:26 Changed at 2010-08-13T19:10:39Z by zooko
run 106
1.8.0c2-r4698
the first 100M
$ echo this is run 106 tahoe-lafs v1.8.0c2-r4698 ; time curl --range 0-100000000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-100M.ogg this is run 106 tahoe-lafs v1.8.0c2-r4698 % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 95.3M 100 95.3M 0 0 112k 0 0:14:29 0:14:29 --:--:-- 94083 real 14m29.309s user 0m0.082s sys 0m0.528s
Changed at 2010-08-13T19:11:57Z by zooko
Changed at 2010-08-13T19:12:57Z by zooko
comment:27 follow-up: ↓ 29 Changed at 2010-08-14T01:52:35Z by warner
I had an idea for a not-too-complex share-selection algorithm this morning:
- first, have the ShareFinder report all shares as soon as it learns about them, instead of its current behavior of withholding them until someone says that they're hungry
- also record the DYHB RTT in each Share object, so it can be found later. Keep the list of Shares sorted by this RTT (with share-number as the secondary sort key).
- then, each time the SegmentFetcher needs to start using a new share, use the following algorithm:
sharemap = {} # maps shnum to Share instance
num_shares_per_server = {} # maps Server to a count of shares

for max_shares_per_server in itertools.count(1):
    progress = False
    for sh in shares:
        if sh.shnum in sharemap:
            continue
        if num_shares_per_server[sh.server] >= max_shares_per_server:
            continue
        sharemap[sh.shnum] = sh
        num_shares_per_server[sh.server] += 1
        progress = True
        if len(sharemap) >= k:
            return SUCCESS
    if not progress:
        return FAIL
The general idea is to cycle through all the shares we know about, but first try to build a sharemap that only uses one share per server (i.e. perfect diversity). That might fail because the shares are not diverse enough, so we can walk through the loop a second time and be willing to accept two shares per server. If that fails, we raise our willingness to three shares per server, etc. If we ever finish a loop without adding at least one share to our sharemap, we declare failure: this indicates that there are not enough distinct shares (that we know about so far) to succeed.
If this returns FAIL, that really means we should declare "hunger" and ask the ShareFinder to look for more shares. If we return SUCCESS but max_shares_per_server > 1, then we should ask for more shares too (but start the segment anyways: new shares may help the next segment do better).
This is still vulnerable to certain pathological situations, like if everybody has a copy of sh0 but only the first server has a copy of sh1: this will use sh0 from the first server then circle around and have to use sh1 from that server as well. A smarter algorithm would peek ahead, realize the scarcity of sh1, and add sh1 from the first server so it could get sh0 from one of the other servers instead.
But I think this might improve the diversity of downloads without going down the full itertools.combinations-enumerating route that represents the "complete" way to approach this problem.
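For concreteness, here is a self-contained sketch of that loop as a function, driven by the share layout from the DYHB responses in comment:11. The Share namedtuple, the defaultdict counter, and the True/False return values stand in for the real downloader's Share/Server objects and SUCCESS/FAIL; this is an illustration of the idea, not the actual implementation:

import itertools
from collections import defaultdict, namedtuple

Share = namedtuple("Share", ["shnum", "server", "dyhb_rtt"])

def select_shares(shares, k):
    """Pick k distinct share numbers, preferring at most one share per
    server, then at most two, and so on. Returns (ok, sharemap)."""
    # keep the candidate shares sorted by DYHB RTT, shnum as tie-breaker
    shares = sorted(shares, key=lambda sh: (sh.dyhb_rtt, sh.shnum))
    sharemap = {}                             # shnum -> Share
    num_shares_per_server = defaultdict(int)  # server -> count of chosen shares
    for max_shares_per_server in itertools.count(1):
        progress = False
        for sh in shares:
            if sh.shnum in sharemap:
                continue
            if num_shares_per_server[sh.server] >= max_shares_per_server:
                continue
            sharemap[sh.shnum] = sh
            num_shares_per_server[sh.server] += 1
            progress = True
            if len(sharemap) >= k:
                return True, sharemap
        if not progress:
            return False, sharemap  # not enough distinct shares known yet

# share layout from the comment:11 DYHB responses (RTTs in seconds):
shares = [Share(0, "sp26", 0.112), Share(5, "sp26", 0.112),
          Share(1, "sroo", 0.290), Share(6, "sroo", 0.290),
          Share(2, "4rk5", 0.204),
          Share(3, "nszi", 0.195), Share(7, "nszi", 0.195), Share(8, "nszi", 0.195),
          Share(4, "fp3x", 0.230), Share(9, "fp3x", 0.230)]
ok, sharemap = select_shares(shares, k=3)
# -> picks sh0 from sp26, sh3 from nszi, and sh2 from 4rk5: one share per server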
comment:28 follow-up: ↓ 30 Changed at 2010-08-14T03:47:11Z by zooko
This seems promising. It sounds like you might think that the slowdown of 1.8.0c2 vs. 1.7.1 on the current Test Grid might be due to one server being used to serve two shares in 1.8.0c2 when two different servers would be used—one for each share—in 1.7.1. Is that what you think? Have you had a chance to look at my flogs attached to this ticket to confirm that this is what is happening?
comment:29 in reply to: ↑ 27 Changed at 2010-08-14T05:18:37Z by zooko
Replying to warner:
But I think this might improve the diversity of downloads without going down the full itertools.combinations-enumerating route that represents the "complete" way to approach this problem.
(Parenthetical historical observation which is pleasurable to me: Your heuristic algorithm for server selection (for download) in comment:27, and your observation that it is susceptible to failure in certain cases, is similar to my proposed heuristic algorithm for server selection for upload in #778 (comment:156:ticket:778, for the benefit of future cyborg archaeologist historians). David-Sarah then observed that finding the optimal solution was a standard graph theory problem named "maximum matching of a bipartite graph". Kevan then implemented it and thus we were able to finish #778.)
My copy of Cormen, Leiserson, Rivest 1st Ed. says (chapter 27.3) that the Ford-Fulkerson solution requires computation O(V * E) where V is the number of vertices (num servers plus num shares) and E is the number of edges (number of (server, share) tuples).
Now what Kevan actually implemented in happinessutil.py just returns the size of the maximum matching, and what we want here is an actual matching. I'm not 100% sure, but I think if you save all the paths that are returned from augmenting_path_for() in servers_of_happiness() and return the resulting set of paths, then you'll have your set of server->share mappings.
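For reference, a generic sketch of that textbook algorithm (maximum bipartite matching via augmenting paths) applied to the server/share graph. This is an illustration of the idea only, not the actual happinessutil.py code; the max_matching name and the input format are invented:

def max_matching(shares_by_server):
    """shares_by_server maps each server to the set of share numbers it
    holds. Returns a dict mapping shnum -> server for a maximum matching,
    i.e. as many distinct shares as possible, each from a different server."""
    matched = {}  # shnum -> server currently assigned to it

    def try_assign(server, candidates, visited):
        # depth-first search for an augmenting path starting at `server`
        for shnum in candidates.get(server, set()):
            if shnum in visited:
                continue
            visited.add(shnum)
            # take a free share, or re-route the server currently holding it
            if shnum not in matched or try_assign(matched[shnum], candidates, visited):
                matched[shnum] = server
                return True
        return False

    for server in shares_by_server:
        try_assign(server, shares_by_server, set())
    return matched

# with the comment:11 layout, every server ends up serving a distinct share:
# max_matching({"sp26": {0, 5}, "sroo": {1, 6}, "4rk5": {2},
#               "nszi": {3, 7, 8}, "fp3x": {4, 9}})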
comment:30 in reply to: ↑ 28 Changed at 2010-08-14T06:08:34Z by zooko
Replying to zooko:
This seems promising. It sounds like you might think that the slowdown of 1.8.0c2 vs. 1.7.1 on the current Test Grid might be due to one server being used to serve two shares in 1.8.0c2 when two different servers would be used—one for each share—in 1.7.1.
Okay this does appear to be happening in at least one of the slow v1.8.0c2 downloads attached to this ticket. I looked at attachment:1.8.0c2-r4698-run-106-down-0.html and every request-block in it (for three different shares) went to the same server -- nszizgf5 -- which was the first server to respond to the DYHB (barely) and which happened to be the only server that had three shares. So at least for that run, Brian's idea that fetching blocks of different shares from the same server is a significant slowdown seems to be true.
comment:31 Changed at 2010-08-14T07:02:04Z by zooko
- Summary changed from does new-downloader perform badly for certain situations (such as today's Test Grid)? to new-downloader performs badly when the first server to reply to DYHB has K shares
comment:32 Changed at 2010-08-15T04:08:45Z by zooko
In http://tahoe-lafs.org/pipermail/tahoe-dev/2010-August/004998.html I wrote:
Hey waitasecond. As far as I understand, Tahoe-LAFS v1.7.1 should also—just like v1.8.0c2—start downloading all three shares from Greg's server as soon as that server is the first responder to the DYHB:
Am I misunderstanding? So the question of why 1.7.1 seems to download 2 or 3 times as fast as 1.8.0c2 on this grid remains open.
comment:33 Changed at 2010-08-16T05:25:21Z by zooko
run 107
1.7.1
the first 10MB
with cProfile profiling running but no flog running
$ echo this is run 107 tahoe-lafs v1.7.1 cProfile ; time curl --range 0-10000000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-10M.ogg this is run 107 tahoe-lafs v1.7.1 cProfile % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 9765k 100 9765k 0 0 171k 0 0:00:57 0:00:57 --:--:-- 2424k real 0m57.139s user 0m0.010s sys 0m0.059s
Changed at 2010-08-16T05:28:11Z by zooko
comment:34 Changed at 2010-08-16T05:36:27Z by zooko
run 108
1.8.0c2
the first 10MB
with cProfile profiling running but no flog running
$ echo this is run 108 tahoe-lafs v1.8.0c2 cProfile ; time curl --range 0-10000000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-10M.ogg this is run 108 tahoe-lafs v1.8.0c2 cProfile % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 9765k 100 9765k 0 0 247k 0 0:00:39 0:00:39 --:--:-- 264k real 0m39.496s user 0m0.009s sys 0m0.052s
Changed at 2010-08-16T05:37:29Z by zooko
comment:35 Changed at 2010-08-16T05:53:28Z by zooko
run 109
1.7.1
the first 100MB
with cProfile profiling running but no flog running
$ echo this is run 109 tahoe-lafs v1.7.1 cProfile ; time curl --range 0-100000000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-100M.ogg this is run 109 tahoe-lafs v1.7.1 cProfile % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 95.3M 100 95.3M 0 0 228k 0 0:07:06 0:07:06 --:--:-- 21.5M real 7m6.626s user 0m0.059s sys 0m0.460s
Changed at 2010-08-16T05:53:54Z by zooko
comment:36 Changed at 2010-08-16T06:13:37Z by zooko
run 110
1.8.0c2
the first 100MB
with cProfile profiling running but no flog running
$ echo this is run 110 tahoe-lafs v1.8.0c2 cProfile ; time curl --range 0-100000000 http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb-100M.ogg this is run 110 tahoe-lafs v1.8.0c2 cProfile % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 95.3M 100 95.3M 0 0 113k 0 0:14:19 0:14:19 --:--:-- 98k real 14m19.256s user 0m0.079s sys 0m0.504s
Changed at 2010-08-16T06:14:33Z by zooko
Changed at 2010-08-16T06:17:45Z by zooko
comment:37 follow-up: ↓ 44 Changed at 2010-08-16T06:22:55Z by zooko
- Summary changed from new-downloader performs badly when the first server to reply to DYHB has K shares to new-downloader performs badly when downloading a lot of data from a file
Okay, the problem with the current downloader in 1.8.0c2 is that it goes slower and slower as it downloads more and more data from a file. It consistently beats (or at least ties) 1.7.1 on downloads of <= 10 MB but consistently loses badly at 100 MB. Also the profiling result in attachment:prof-run-110-dump.txt shows major CPU usage in spans:
661324561 function calls (661269650 primitive calls) in 919.130 CPU seconds

Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    91950    0.464    0.000    0.646    0.000 spans.py:142(dump)
  1374341    0.310    0.000    0.310    0.000 spans.py:152(__iter__)
    41352    0.062    0.000    0.149    0.000 spans.py:156(__nonzero__)
   142497    0.194    0.000    0.253    0.000 spans.py:159(len)
    18390    0.059    0.000    0.438    0.000 spans.py:164(__add__)
    22962    1.083    0.000    8.079    0.000 spans.py:170(__sub__)
     9195    0.034    0.000    0.346    0.000 spans.py:186(__and__)
153765314  121.874    0.000  197.580    0.000 spans.py:203(overlap)
153189988   54.374    0.000   54.374    0.000 spans.py:215(adjacent)
    18390    1.032    0.000    1.343    0.000 spans.py:238(len)
    18390    5.727    0.000    7.413    0.000 spans.py:248(dump)
    87347    1.430    0.000  460.111    0.005 spans.py:25(__init__)
     9195    0.633    0.000  459.852    0.050 spans.py:256(get_spans)
     6906    0.114    0.000    0.115    0.000 spans.py:260(assert_invariants)
     9193    0.305    0.000    0.372    0.000 spans.py:271(get)
     6906    2.174    0.000    3.233    0.000 spans.py:295(add)
  2678832   32.014    0.000   46.032    0.000 spans.py:34(_check)
     6904    0.742    0.000    1.630    0.000 spans.py:389(remove)
     3436    0.007    0.000    0.042    0.000 spans.py:434(pop)
  1315865  140.193    0.000  458.353    0.000 spans.py:46(add)
  1275620    2.471    0.000    6.287    0.000 spans.py:82(remove)
comment:38 Changed at 2010-08-16T07:01:29Z by zooko
Looking at immutable/downloader/share.py, I have the following review comments:
- This comment confuses me:
        self._pending = Spans() # request sent but no response received yet
        self._received = DataSpans() # ACK response received, with data
        self._unavailable = Spans() # NAK response received, no data

        # any given byte of the share can be in one of four states:
        # in: _wanted, _requested, _received
        #  FALSE  FALSE  FALSE : don't care about it at all
        #  TRUE   FALSE  FALSE : want it, haven't yet asked for it
        #  TRUE   TRUE   FALSE : request is in-flight
        #                        or didn't get it
        #  FALSE  TRUE   TRUE  : got it, haven't used it yet
        #  FALSE  TRUE   FALSE : got it and used it
        #  FALSE  FALSE  FALSE : block consumed, ready to ask again
Are _wanted, _requested, and _received old names for _pending, _received, and _unavailable? Or perhaps from a different design entirely? And that's six states, not four.
- A span is add'ed to _pending in _send_requests() and removed from _pending in _got_data() but is not removed if the request errbacks instead of callbacks. That would be a bug for it still to be marked as "pending" after the request errbacked, wouldn't it?
- We shouldn't give the author of a file the ability to raise AssertionError from immutable/downloader/share.py line 416 _satisfy_share_hash_tree(); instead, they should only be able to cause _satisfy_offsets() to raise a LayoutInvalid exception (see related ticket #1085 (we shouldn't use "assert" to validate incoming data in introducer client))
comment:39 Changed at 2010-08-16T07:18:19Z by zooko
This looks like a bug: share.py _got_data():
self._received.add(start, data)
That could explain the slowdown -- the items added to _received here are never removed, because the removal code in _satisfy_block_data() is:
block = self._received.pop(blockstart, blocklen)
I added the following assertions to trunk/src/allmydata/util/spans.py@4666:
hunk ./src/allmydata/util/spans.py 47
         raise
     def add(self, start, length):
+        assert isinstance(start, (int, long))
+        assert isinstance(length, (int, long))
         assert start >= 0
         assert length > 0
         #print " ADD [%d+%d -%d) to %s" % (start, length, start+length, self.dump())
And indeed these assertions fail because data is not an integer. However, when I then add this patch:
hunk ./src/allmydata/immutable/downloader/share.py 741
                  share=repr(self), start=start, length=length, datalen=len(data),
                  level=log.NOISY, parent=lp, umid="5Qn6VQ")
         self._pending.remove(start, length)
-        self._received.add(start, data)
+        self._received.add(start, length)
         # if we ask for [a:c], and we get back [a:b] (b<c), that means we're
         # never going to get [b:c]. If we really need that data, this block
This causes a bunch of tests to fail in ways that I don't understand.
comment:40 Changed at 2010-08-16T07:35:04Z by zooko
more review notes:
- In Spans.add() it would be more efficient to use bisect.insort() (see the sketch after this list)
- In Spans.remove() it would be more efficient to insert the new span directly, e.g. replace
self._spans[i] = (left_start, left_length)
self._spans.append( (right_start, right_length) )
self._spans.sort()
with
self._spans[i] = (left_start, left_length)
self._spans.insert(i+1, (right_start, right_length))
- DataSpans.add() calls assert_invariants() which iterates over all the spans. Also DataSpans.add() itself searches for where to make modifications by iterating from the beginning, which seems unnecessary. Couldn't it do a binary search to find the place it needs to modify and then modify only a local neighborhood there?
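A rough sketch of the binary-search idea from the first and last bullets above, assuming only that the ranges live in a sorted list of (start, length) tuples as the snippets above suggest (the _find_overlap_window helper name is invented, not existing spans.py API):

import bisect

def _find_overlap_window(sorted_spans, start, length):
    """Binary-search for the first existing span that could overlap or be
    adjacent to [start, start+length), instead of scanning from index 0.
    sorted_spans is a sorted list of (start, length) tuples."""
    # rightmost existing span whose start is <= the new start
    i = bisect.bisect_right(sorted_spans, (start, length)) - 1
    if i >= 0:
        prev_start, prev_length = sorted_spans[i]
        if prev_start + prev_length < start:
            i += 1  # previous span ends before us and isn't adjacent: skip it
    else:
        i = 0
    return i  # caller merges/inserts starting at this index only

Spans.add() and DataSpans.add() could then examine just the local neighborhood around that index rather than iterating over every span.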
comment:41 Changed at 2010-08-16T07:37:16Z by zooko
In immutable/downloader/share.py _unavailable can have things added to it (in case of intentional over-read or in case of failing/corrupted server) but never has things removed from it. Does that matter? I suspect that it is intentional and doesn't hurt but I'm not sure.
comment:42 Changed at 2010-08-16T14:17:54Z by zooko
Also Brian discovered another bug in new-downloader last night. Here are some excerpts from IRC.
<warner> hm, ok, so my new share-selection algorithm is, I think, revealing a bug in the rest of the downloader code [23:54]
<warner> there's a funny place where the number of segments in the file (needed to build the block-hash-tree) is not known right away, because we have to get the UEB to know for sure [23:55]
<warner> so the CommonShare object speculatively creates an IncompleteHashTree, with a guessed value for numsegs
<warner> and then replaces it later
<warner> (in hindsight, it's probably not a good idea to do that.. I should just leave it as None until we know for sure) [23:56]
<warner> the test failure is that the first segment uses sh0,sh1,sh5 , and all have their numsegs updated, but the second segment then switches to some different shares, and those new ones are still using the guessed value for numsegs [23:57]
<warner> seg1 used sh0,1,6
<zooko> Hm. [23:58]
<zooko> Good find!
<warner> and sh6 threw an exception because of the wrong-sized hashtree, so it fell back to sh5
<warner> each time it tried to use a share that wasn't part of the original set, it got that error and returned back to sh0,1,5 [23:59]
comment:43 follow-up: ↓ 46 Changed at 2010-08-16T18:14:04Z by warner
Responses to earlier comments:
- spans.py definitely looks O()-suspicious. Does attachment:prof-run-110-dump.txt suggest that in a 100MB file download, we spent half of the total time in Spans.__init__?
- yes, the comment about _wanted, _requested, _received is stale.
- the failure to remove data from _pending upon errback is a bug
- LayoutInvalid is better than assert, yeah
- the self._received.add(start,data) is correct: _received is a DataSpans instance, not Spans, and it holds strings, not booleans. _received holds the data that comes back from the server until the "satisfy" code consumes it. It has methods like get and pop, whereas the simpler Spans class merely has methods for is-range-in-span.
- that said, I don't understand how assertions added to spans.py#L47 would ever fire. If those same assertions were added to spans.py#L295, I'd get it. What types were start/length in your observed assertion failures? And what was the stack trace?
- The patch to call self._received.add(start, length) is wrong; it must be called with (int,str).
- all the comments about efficiency improvements in Spans are probably correct
- adding data to _unavailable should be benign: the amount of unavailable data is small and constant (if the share is intact, we should only add to _unavailable during the first few reads if we've guessed the segsize wrong).
Now some new ideas. I've found a couple of likely issues.
- looking at timestamps from flog-run-106, the inter-segment timing definitely is growing over the course of the download. It's noisy, but it goes from about 0.8s at the start (seg0), to about 1.5s-2.0s at the end (seg762). I haven't looked at smaller deltas (i.e. only inside the "desire" code) to rule out network variations, but it certainly points to a leak or complexity increase of some sort that gets worse as the download progresses.
- comparing downloads of the first 100MB, the middle 100MB, and the last 100MB would rule out anything that's influenced by the absolute segment number.
- looking at the Spans.dump() strings in the flog, I see that two of the three shares (sh7+sh8) have an ever-growing .received DataSpans structure. A cursory check suggests they are growing by 64 bytes and one range per segment. By the end of the download (seg762), sh7 is holding 37170 bytes in 378 ranges (whereas sh3 only has 1636 bytes in 22 ranges, and remains mostly constant)
- This means that we're asking for data which we then don't end up using. We keep it around in _received because it might be useful later: maybe we ask for the wrong data because our guess of the segsize (and thus numsegs, and thus the size/placement of the hashtrees) was wrong. But later we might take advantage of whatever we fetched by mistake.
- I have two theories:
- the IncompleteHashTree.needed_hashes() call, when asked what hashes we need to validate leaf 0, might tell us we need the hash for leaf 0 too. However, that hash can be computed from the block of data that makes up the leaf, so we don't really need to fetch it. (whereas we *do* need the hash for leaf 1, since it sits on the "uncle chain" for leaf0). If the desire-side code is conservatively/incorrectly asking for the leaf0 hash, but the satisfy-side code doesn't use it, then we'll add a single 32-byte hash node per segment.
- ciphertext hash tree nodes: when we start working on a segment, the desire-side code will ask for ciphertext hash tree nodes from each segment we're using. However, the satisfy-side code will only use the hashes from the first response: by the time the second response arrives, the ciphertext hash tree is satisfied, so that clause isn't reached. This means that we'll leave that data in ._received forever. This seems most likely: it would explain why the first share (sh3) doesn't grow, whereas the later two shares do, and why I saw a 64-byte increment (the actual growth would depend upon the segment number, and how many new uncle-chain nodes are needed, but 2-nodes is a fairly common value).
- The .received leftover-data issue shouldn't be such a big deal; N=378 is not a huge number, but the measured increase in inter-segment time suggests that whatever the O() complexity is, N=378 is enough to cause problems.
So I think the next directions to pursue are:
- build some kind of test framework to exercise a large download without using real remote_read calls, ideally 100MB or 1GB in a few seconds. This would use a Share subclass that returns data immediately (well, after a eventual-send) rather than ever touching a server. It might also need to stub out some of the hashtree checks, but does need real needed_hashes computations. Then we fix the code until this test finishes in a reasonable amount of time. While I wouldn't have the test case assert anything about runtime, I would have it assert things like ._received doesn't grow over the course of the test.
- measure the CPU seconds needed to download a 100MB file from both old-downloader and new-downloader: if we're looking at a O(n3) problem, it will manifest as a much heavier CPU load. (if we were merely looking at a pipelining failure, the CPU time would be the same, but wallclock time would be higher).
- stare at Spans (and specifically DataSpans) for computational-complexity problems. Build some tests of these with N=400ish and see how efficient they are (see the benchmark sketch after this list). They're supposed to be linear wrt number-of-ranges, but either they aren't, or they're being called in a way which makes it worse
- consider commenting out the dump() calls for a test, or some assert_invariants calls, to see if we're hitting that old problem where the data structure is efficient unless we leave in the self-checks or debugging messages
- (hard) figure out how to drop the unused ciphertext-hash-tree nodes from the second and later shares
- look at IncompleteHashTree.needed_hashes and see if we're actually requesting the leaf node that we don't really need.
- consider an entirely different DataSpans structure. The perhaps-too-clever overlap/merge behavior is mostly just exercised during the fetch of the first segment, before we're sure about the correct number of segments (we fetch some data speculatively to reduce roundtrips; if we guess wrong, we'll get the wrong data, but DataSpans lets us easily use that data later if it turns out to be what we needed for some other purpose). Perhaps a data structure which was less tuned for merging adjacent ranges would be better, maybe one which has an explicit merge() method that's only called just before the requests are sent out. Or maybe the value of holding on to that data isn't enough to justify the complexity.
- a related thought (mostly to remind me of the idea later): for pipelining purposes, I'd like to be able to label the bits in a Spans with their purpose: if we send parallel requests for both seg2 and seg3, I'd like the seg2 data to arrive first, so e.g. the hashes needed to validate seg2 should arrive before the bulk block data for seg3. A label on the bits like "this is for seg2" would let us order the requests in such a way to reduce our memory footprint. A label like this might also be useful for handling the unused-ciphertext-hash-tree-nodes problem, if we could remove data from a DataSpans that's labelled with an already-complete segnum.
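A throwaway benchmark along the lines of the DataSpans test suggested above might look like the following. It assumes the trunk module path allmydata.util.spans, an argument-less DataSpans() constructor, and the add(start, data)/get_spans() signatures quoted elsewhere in this ticket; the doubling sizes and the 1-byte gaps are arbitrary:

import time
from allmydata.util.spans import DataSpans

def time_get_spans(num_ranges):
    """Build a DataSpans holding num_ranges disjoint 64-byte ranges
    (separated by 1-byte holes so nothing merges), then time one
    get_spans() call."""
    ds = DataSpans()
    for i in range(num_ranges):
        ds.add(i * 65, "x" * 64)
    start = time.time()
    ds.get_spans()
    return time.time() - start

for n in (100, 200, 400, 800):
    # if get_spans() is quadratic in the number of ranges, each doubling of n
    # should roughly quadruple the elapsed time
    print("%d %.4f" % (n, time_get_spans(n)))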
Finally, the bug zooko mentioned in comment:42 is real. I'm still working on it, but basically it prevents us from using shares that arrive after the initial batch of requests: they are not initialized properly and don't get a correct block hash tree. I'm working on a fix. The symptom is that we fall back to the initial shares, but if those have died, the download will fail, which is wrong.
And I'm still working on the new share-selection algorithm. The code works, and my basic unit tests pass, but certain tests require the comment:42 bug to be fixed before the new algorithm is safe to use (the bug hurts current downloads too, but occurs less frequently).
Changed at 2010-08-16T21:54:35Z by davidsarah
Short-term hack to test for asymptotic inefficiency of DataSpans.get_spans
comment:44 in reply to: ↑ 37 ; follow-up: ↓ 45 Changed at 2010-08-16T22:13:42Z by davidsarah
Replying to zooko:
Okay, the problem with the current downloader in 1.8.0c2 is that it goes slower and slower as it downloads more and more data from a file. It consistently beats (or at least ties) 1.7.1 on downloads <= 10 MB, but consistently loses badly at 100 MB. Also, the profiling result in attachment:prof-run-110-dump.txt shows major CPU usage in spans:
 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
    ...
  87347    1.430    0.000  460.111    0.005  spans.py:25(__init__)
   9195    0.633    0.000  459.852    0.050  spans.py:256(get_spans)
    ...
1315865  140.193    0.000  458.353    0.000  spans.py:46(add)
This is the smoking gun. The code of DataSpans.get_spans is:
    def get_spans(self):
        """Return a Spans object with a bit set for each byte I hold"""
        return Spans([(start, len(data)) for (start,data) in self.spans])
and the Spans constructor has the loop:
        for (start,length) in _span_or_start:
            self.add(start, length)
Spans.add does a linear search (plus a sort, if there is no overlap, but Timsort takes linear time for an already-sorted array), so the overall complexity of DataSpans.get_spans is Θ(n^2) where n is the number of spans.
Since Spans uses essentially the same invariant as DataSpans for its array of spans (they are sorted with no overlaps or adjacency), it is possible to implement get_spans in Θ(1) time. However I suspect that the important difference here is between Θ(n^2) and Θ(n).
The diff's implementation of get_spans includes a call to s._check. It may also be worth doing another profile run without that call.
(Some of my comments in ticket:798#comment:18 would reduce the number of calls to overlap and eliminate calls to adjacent, but I don't think that's the critical issue by itself.)
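For concreteness, here is a standalone illustration (my own sketch, not Tahoe code) of the complexity difference described above: rebuilding a normalized span list through repeated add() calls costs Θ(n^2) because every insertion scans the existing list, whereas copying ranges that are already sorted and non-overlapping is Θ(n).

# Standalone illustration (not Tahoe code).  add_span() mimics the shape of
# Spans.add: a linear scan over a sorted, non-overlapping span list.  Calling
# it once per range, as DataSpans.get_spans effectively does via the Spans
# constructor, is Theta(n^2); copying an already-normalized list is Theta(n).

def add_span(spans, start, length):
    """Insert (start, length), merging overlapping or adjacent neighbours."""
    end = start + length
    out = []
    for (s, l) in spans:                  # linear scan on every call
        if s + l < start or end < s:      # strictly disjoint: keep as-is
            out.append((s, l))
        else:                             # overlapping or adjacent: merge
            start, end = min(start, s), max(end, s + l)
    out.append((start, end - start))
    out.sort()
    return out

def quadratic_build(ranges):
    spans = []
    for (start, length) in ranges:        # n calls, each Theta(n)
        spans = add_span(spans, start, length)
    return spans

def linear_build(ranges):                 # ranges already sorted and disjoint
    return list(ranges)

if __name__ == "__main__":
    ranges = [(i * 64, 32) for i in range(400)]   # ~400 spans, like N=378
    assert quadratic_build(ranges) == linear_build(ranges)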
comment:45 in reply to: ↑ 44 Changed at 2010-08-16T23:06:37Z by davidsarah
Replying to davidsarah:
Spans.add does a linear search (plus a sort, if there is no overlap, but Timsort takes linear time for an already-sorted array), so the overall complexity of DataSpans.get_spans is Θ(n^2) where n is the number of spans.
Since Spans uses essentially the same invariant as DataSpans for its array of spans (they are sorted with no overlaps or adjacency), it is possible to implement get_spans in Θ(1) time. However I suspect that the important difference here is between Θ(n^2) and Θ(n).
Note that, given this problem and Brian's observations in comment:43, the overall time for a download will be Θ(n^3). So maybe we do need a better data structure (some sort of balanced tree or heap, maybe) if we want to get to Θ(n log n) rather than Θ(n^2) for the whole download. But maybe that can wait until after releasing 1.8.
(Actually, just logging the output of Spans.dump calls will by itself cause Θ(n^2) behaviour for the whole download, although with a fairly small constant.)
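As a rough sketch of the kind of data-structure change that could help here (my illustration, not a proposed patch): use binary search via the stdlib bisect module to locate the merge point instead of scanning linearly. Python list insertion and deletion are themselves O(n), so a true Θ(log n) update would need a balanced tree; this sketch only removes the scan.

import bisect

class SortedSpans(object):
    """Sorted, non-overlapping (start, length) ranges; add() finds its
    place by binary search rather than a linear scan."""
    def __init__(self):
        self.starts = []    # sorted range starts
        self.lengths = []   # parallel range lengths

    def add(self, start, length):
        i = bisect.bisect_left(self.starts, start)
        # merge with the previous range if it overlaps or touches us
        if i > 0 and self.starts[i-1] + self.lengths[i-1] >= start:
            i -= 1
            length = max(start + length,
                         self.starts[i] + self.lengths[i]) - self.starts[i]
            start = self.starts[i]
            del self.starts[i], self.lengths[i]
        # absorb any following ranges that we now cover or touch
        while i < len(self.starts) and self.starts[i] <= start + length:
            length = max(start + length,
                         self.starts[i] + self.lengths[i]) - start
            del self.starts[i], self.lengths[i]
        self.starts.insert(i, start)
        self.lengths.insert(i, length)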
comment:46 in reply to: ↑ 43 Changed at 2010-08-17T05:19:10Z by zooko
Replying to warner:
- that said, I don't understand how assertions added to spans.py#L47 would ever fire. If those same assertions were added to spans.py#L295, I'd get it. What types were start/length in your observed assertion failures? And what was the stack trace?
This was my mistake. I must have confused it with a different test run. Those assertions never fire.
comment:47 Changed at 2010-08-18T12:57:33Z by zooko
run 111
1.8.0c2
requesting all of the file
with flog running
$ echo this is run 111 tahoe-lafs v1.8.0c2-r4699+ ; time curl http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb.ogg
this is run 111 tahoe-lafs v1.8.0c2-r4699+
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 27 1421M   27  394M    0     0  18995      0 21:47:59  6:02:29 15:45:30     0^C

real    362m29.965s
user    0m1.123s
sys     0m4.187s
comment:48 Changed at 2010-08-18T12:58:38Z by zooko
Oh, and in run 111 (comment:47) I had added log messages for all events which touched the Share._received Spans object so the resulting flogfile is a trace of everything that affects that object.
comment:49 Changed at 2010-08-18T14:07:45Z by zooko
The following run has patch attachment:spans.py.diff.
run 112
1.8.0c2
requesting all of the file
with flog running
$ echo this is run 112 tahoe-lafs v1.8.0c2-r4699+ ; time curl http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb.ogg
this is run 112 tahoe-lafs v1.8.0c2-r4699+
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 30 1421M   30  435M    0     0   113k      0  3:33:42  1:05:29  2:28:13  139k^C

real    65m29.907s
user    0m0.350s
sys     0m2.302s
comment:50 Changed at 2010-08-18T15:57:23Z by zooko
The patch helped a lot (compare run 112 to run 111), but not enough to make trunk as fast as 1.7.1 on large downloads (compare run 112 to runs 102, 103, 105, and 109).
I intend to write a tool which reads the traces of what was done to the Share._received Spans object and replays those operations against a Spans object, so that we can benchmark and profile it in isolation.
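A minimal sketch of what such a replay tool could look like (my illustration; it assumes a simplified one-operation-per-line "add/get/remove offset length" trace format, whereas the real misc/simulators/bench_spans.py that appears later in this ticket parses flogtool dump output):

# Replay sketch: feed recorded operations back into a DataSpans and time the
# whole run.  The "add/get/remove offset length" trace format here is made up
# for illustration; the real trace is a flogtool dump.
import sys, time
from allmydata.util.spans import DataSpans

def replay(path):
    ds = DataSpans()
    started = time.time()
    for line in open(path):
        fields = line.split()
        if len(fields) != 3:
            continue
        op, offset, length = fields[0], int(fields[1]), int(fields[2])
        if op == "add":
            ds.add(offset, "x" * length)   # dummy bytes of the recorded size
        elif op == "get":
            ds.get(offset, length)
        elif op == "remove":
            ds.remove(offset, length)
    return time.time() - started

if __name__ == "__main__":
    print "replayed %s in %.3f seconds" % (sys.argv[1], replay(sys.argv[1]))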
run | version | downloaded | download KBps | flags |
---|---|---|---|---|
101 | 1.8.0c2 | 100 MB | 91 | |
102 | 1.7.1 | 100 MB | 182 | |
103 | 1.7.1 | 100 MB | 207 | |
104 | 1.8.0c2 | 100 MB | 82 | |
105 | 1.7.1 | 100 MB | 211 | |
109 | 1.7.1 | 100 MB | 228 | cProfile |
110 | 1.8.0c2 | 100 MB | 113 | cProfile |
111 | 1.8.0c2 | 413 MB | 19 | spanstrace |
112 | 1.8.0c2 | 456 MB | 113 | spanstrace, patched |
Changed at 2010-08-18T15:58:51Z by zooko
Changed at 2010-08-18T16:01:33Z by zooko
Changed at 2010-08-18T16:03:04Z by zooko
comment:51 Changed at 2010-08-18T16:04:20Z by zooko
attachment:debuggery-trace-spans.dpatch.txt adds logging of all events that touched Share._received at loglevel CURIOUS. attachment:run-111-above28-flog.pickle.bz2 and attachment:run-112-above28-flog.pickle.bz2 are the flogs from run 111 and run 112 with only events logged at level CURIOUS or above.
comment:52 Changed at 2010-08-18T17:02:52Z by zooko
run | version | downloaded | download KBps | flags |
---|---|---|---|---|
113 | 1.7.1 | 543 MB | 241 | on office network |
$ echo this is run 113 tahoe-lafs v1.7.1 ; time curl http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb.ogg
this is run 113 tahoe-lafs v1.7.1
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 36 1421M   36  518M    0     0   241k      0  1:40:26  0:36:37  1:03:49  257k^C

real    36m39.238s
user    0m0.329s
sys     0m2.733s
comment:53 Changed at 2010-08-18T18:56:04Z by warner
BTW, be sure to pay attention to the DataSpans too, specifically Share._received. That's the one that I observed growing linearly with the number of segments read.
I'm close to finishing my rework of the way Shares are handled. If we can make new-downloader fast enough by fixing complexity issues in spans.py, we should stick with that for 1.8.0, because those are probably smaller and less intrusive changes. If not, here are the properties of my Share-handling changes:
- use a new diversity-seeking Share selection algorithm, as described in comment:27 (a rough sketch of this kind of selection appears below). This should distribute the download load evenly among all known servers when they have an equal number of shares, and as evenly as possible (while still getting k shares) when not. If more shares are discovered later, the algorithm will recalculate the sharemap and take advantage of the new shares, and we'll keep looking for new shares as long as we don't have the diversity that we want (one share per server).
- fix the problem in which late shares (not used for the first segment, but located and added later) were not given the right sized hashtree and threw errors, causing them to be dropped. I think this completely broke the "tolerate loss of servers" feature, but the problem might have been caused by the diversity-seeking algorithm change, rather than something that was in new-downloader originally.
- deliver all shares to the SegmentFetcher as soon as we learn about them, instead of waiting for the fetcher to tell us it's hungry. This gives the fetcher more information to work with.
I might be able to attach a patch tomorrow.. there are still some bugs in it, and I haven't finished implementing the last point (push shares on discovery, not pull on hunger).
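For reference, here is a rough sketch (my illustration, not warner's actual implementation) of what a diversity-seeking selection could look like: repeatedly take a still-needed share number from the server that currently contributes the fewest chosen shares, so the k block fetches end up spread across as many servers as possible.

# Sketch only: sharemap maps serverid -> set of share numbers that server
# holds (the sets of shnums reported by each server's DYHB response).
# Returns up to k (serverid, shnum) pairs, preferring the least-loaded
# server at each step.
def pick_shares(sharemap, k):
    chosen = []                                  # (serverid, shnum) pairs
    load = dict((server, 0) for server in sharemap)
    needed = set()
    for shnums in sharemap.values():
        needed.update(shnums)
    while len(chosen) < k and needed:
        for server in sorted(sharemap, key=lambda s: load[s]):
            available = sharemap[server] & needed
            if available:
                shnum = min(available)
                chosen.append((server, shnum))
                needed.discard(shnum)
                load[server] += 1
                break
        else:
            break                                # nobody holds a needed share
    return chosen

# e.g. pick_shares({"nszi": set([3, 7, 8]), "4rk5": set([2]),
#                   "fp3x": set([4, 9])}, 3) spreads the three block
# fetches over three different servers instead of taking all three
# shares from one server.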
comment:54 Changed at 2010-08-18T23:38:51Z by warner
Oh, hey, here's a simple patch to try out:
diff --git a/src/allmydata/immutable/downloader/share.py b/src/allmydata/immutable/downloader/share.py
index f7ed4e8..413f907 100644
--- a/src/allmydata/immutable/downloader/share.py
+++ b/src/allmydata/immutable/downloader/share.py
@@ -531,6 +531,9 @@ class Share:
             for o in observers:
                 # goes to SegmentFetcher._block_request_activity
                 o.notify(state=COMPLETE, block=block)
+            # now clear our received data, to dodge the #1170 spans.py
+            # complexity bug
+            self._received = DataSpans()
         except (BadHashError, NotEnoughHashesError), e:
             # rats, we have a corrupt block. Notify our clients that they
             # need to look elsewhere, and advise the server. Unlike
Since self._received is supposed to be empty after each segment is complete (unless we guess the segsize wrong), this patch simply manually empties it at that point. No data is retained from one segment to the next: any mistakes will just cause us to ask for more data next time.
If the problem in this bug is a computational complexity in DataSpans, this should bypass it, by making sure we never add more than 3 or 4 ranges to one, since even O(n^3) is small when n is only 3 or 4. (we should still fix the problem, but maybe the fix can wait for 1.8.1). If the problem is in Spans, or elsewhere, then this won't help.
Changed at 2010-08-19T02:26:36Z by zooko
comment:55 follow-up: ↓ 56 Changed at 2010-08-19T02:31:18Z by zooko
attachment:run-112-above28-flog-dump-sh8-on-nsziz.txt is a flogtool dump of attachment:run-112-above28-flog.pickle.bz2 grepped for just one particular share (sh8 on nsziz). It is suitable as the input file for misc/simulators/bench_spans.py.
comment:56 in reply to: ↑ 55 Changed at 2010-08-19T02:31:47Z by zooko
attachment:run-112-above28-flog-dump-sh8-on-nsziz.txt is a flogtool dump of attachment:run-112-above28-flog.pickle.bz2 grepped for just one particular share (sh8 on nsziz). It is suitable as the input file for misc/simulators/bench_spans.py.
The output that I get on my Macbook Pro is:
MAIN Zooko-Ofsimplegeos-MacBook-Pro:~/playground/tahoe-lafs/trunk$ PYTHONPATH=~/playground/tahoe-lafs/trunk/support/lib/python2.6/site-packages/ python ~/playground/tahoe-lafs/trunk/misc/simulators/bench_spans.py ~/Public/Drop\ Box/run-112-above28-flog-dump-sh8-on-nsziz.txt
all results are in time units per N
time units per second: 1000000; seconds per time unit: 0.000001 (microseconds)
  600 best: 2.265e+01, 3th-best: 2.402e+01, mean: 2.462e+01, 3th-worst: 2.502e+01, worst: 2.585e+01 (of 10)
 6000 best: 1.069e+02, 3th-best: 1.119e+02, mean: 1.137e+02, 3th-worst: 1.149e+02, worst: 1.201e+02 (of 10)
60000 best: 2.916e+01, 1th-best: 2.916e+01, mean: 5.080e+02, 1th-worst: 9.868e+02, worst: 9.868e+02 (of 2)
This is even though I have attachment:spans.py.diff applied.
comment:57 Changed at 2010-08-19T04:27:20Z by zooko
Okay, the patch from comment:54 seems to have improved performance significantly. I just performed run 114:
MUSI Zooko-Ofsimplegeos-MacBook-Pro:~/Desktop$ echo this is run 114 tahoe-lafs v1.8.0c2-r4699+comment:54 ; time curl http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb.ogg
this is run 114 tahoe-lafs v1.8.0c2-r4699+comment:54
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 10 1421M   10  146M    0     0   149k      0  2:42:42  0:16:42  2:26:00  159k^R
 30 1421M   30  435M    0     0   147k      0  2:44:53  0:50:30  1:54:23  144k^C

real    50m30.207s
user    0m0.290s
sys     0m2.112s
Here is the full table:
run | version | downloaded | download KBps | flags |
---|---|---|---|---|
101 | 1.8.0c2 | 100 MB | 91 | |
102 | 1.7.1 | 100 MB | 182 | |
103 | 1.7.1 | 100 MB | 207 | |
104 | 1.8.0c2 | 100 MB | 82 | |
105 | 1.7.1 | 100 MB | 211 | |
109 | 1.7.1 | 100 MB | 228 | cProfile |
110 | 1.8.0c2 | 100 MB | 113 | cProfile |
111 | 1.8.0c2 | 413 MB | 19 | spanstrace |
112 | 1.8.0c2 | 456 MB | 113 | spanstrace, spans.py.diff |
113 | 1.7.1 | 543 MB | 241 | on office network |
114 | 1.8.0c2 | 456 MB | 147 | spans.py.diff + comment:54 |
I'm not sure if v1.8.0c2 is now good enough to be considered "not a significant regression" vs. v1.7.1 for downloading large files. I'll go download a large file with v1.7.1 now on my home network for comparison...
comment:58 Changed at 2010-08-19T05:07:19Z by zooko
Hm, it seems like v1.7.1 is still substantially faster than v1.8.0c2+comment:54:
run | version | downloaded | download KBps | flags |
---|---|---|---|---|
115 | 1.7.1 | 456 MB | 224 | |
MUSI Zooko-Ofsimplegeos-MacBook-Pro:~/Desktop$ echo this is run 115 tahoe-lafs v1.7.1 ; time curl http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb.ogg
this is run 115 tahoe-lafs v1.7.1
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 30 1421M   30  435M    0     0   224k      0  1:48:09  0:33:06  1:15:03  240k^C

real    33m6.746s
user    0m0.287s
sys     0m2.307s
comment:59 Changed at 2010-08-19T05:44:58Z by zooko
Well the good news is that comment:54 fixes the problem that downloads go slower the bigger they are (as expected). The bad news is that even with comment:54 Tahoe-LAFS v1.8.0c2 is substantially slower than v1.7.1 for large files:
run | version | downloaded | download KBps | flags |
---|---|---|---|---|
116 | 1.8.0c2 | 314 MB | 154 | spans.py.diff + comment:54 |
MUSI Zooko-Ofsimplegeos-MacBook-Pro:~/Desktop$ echo this is run 115 tahoe-lafs v1.8.0c2+comment:54 ; time curl http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb.ogg
this is run 115 tahoe-lafs v1.8.0c2+comment:54
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 21 1421M   21  300M    0     0   154k      0  2:37:11  0:33:16  2:03:55  160k^C

real    33m16.529s
user    0m0.203s
sys     0m1.507s
I'm going to start another run with v1.8.0c2, this time with the cProfile tool running, and go to sleep.
Changed at 2010-08-19T06:01:17Z by zooko
comment:60 Changed at 2010-08-19T06:01:41Z by zooko
I ran 1.8.0c2 under the profiler for a few minutes and then stopped it in order to get the profiling stats (attached). Unfortunately, they do not show any further smoking gun of CPU usage, so the remaining slowdown from v1.7.1 to v1.8.0c2 is likely to be one of the network-scheduling issues that Brian has been thinking about (server selection, pipelining), or else some other sort of subtle timing issue...
Here are the profiling stats for a brief (~4 minute) run of 1.8.0c2:
MUSI Zooko-Ofsimplegeos-MacBook-Pro:~/Desktop$ echo this is run 117 tahoe-lafs v1.8.0c2+comment:54 cProfile ; time curl http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb.ogg
this is run 117 tahoe-lafs v1.8.0c2+comment:54 cProfile
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  2 1421M    2 35.1M    0     0   160k      0  2:30:45  0:03:43  2:27:02  169k^C

real    3m43.778s
user    0m0.027s
sys     0m0.210s
The functions with the most "cumtime" (time spent in the function or in any of the functions that it called) are:
     ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
          1    0.000    0.000  275.676  275.676  base.py:1156(run)
          1    0.175    0.175  275.675  275.675  base.py:1161(mainLoop)
      30070    0.576    0.000  267.477    0.009  selectreactor.py:93(doSelect)
      30070  259.463    0.009  260.149    0.009  {select.select}
      30070    0.191    0.000    7.795    0.000  base.py:751(runUntilCurrent)
       3724    0.031    0.000    7.205    0.002  eventual.py:18(_turn)
      27682    0.300    0.000    6.634    0.000  log.py:71(callWithLogger)
      27682    0.242    0.000    6.297    0.000  log.py:66(callWithContext)
       3453    0.075    0.000    5.742    0.002  share.py:187(loop)
      27682    0.116    0.000    5.719    0.000  context.py:58(callWithContext)
      27682    0.199    0.000    5.587    0.000  context.py:32(callWithContext)
      27681    0.228    0.000    5.310    0.000  selectreactor.py:144(_doReadOrWrite)
       3453    0.089    0.000    5.299    0.002  share.py:238(_do_loop)
      26522    0.347    0.000    4.397    0.000  tcp.py:443(doRead)
      26506    0.306    0.000    4.376    0.000  tcp.py:114(doRead)
28407/17703    0.257    0.000    3.770    0.000  defer.py:453(_runCallbacks)
13880/11193    0.031    0.000    2.824    0.000  defer.py:338(callback)
13983/11295    0.031    0.000    2.795    0.000  defer.py:433(_startRunCallbacks)
  3064/3052    0.014    0.000    2.658    0.001  defer.py:108(maybeDeferred)
       3453    0.092    0.000    2.654    0.001  share.py:701(_send_requests)
       2639    0.007    0.000    2.032    0.001  referenceable.py:406(callRemote)
       2608    0.034    0.000    2.025    0.001  banana.py:633(dataReceived)
       2639    0.052    0.000    2.012    0.001  referenceable.py:418(_callRemote)
       2604    0.005    0.000    2.009    0.001  share.py:733(_send_request)
       2608    0.344    0.000    1.988    0.001  banana.py:701(handleData)
       2640    0.008    0.000    1.791    0.001  banana.py:183(send)
       2640    0.021    0.000    1.783    0.001  root.py:92(send)
       2651    0.199    0.000    1.683    0.001  banana.py:191(produce)
      26506    1.680    0.000    1.680    0.000  {built-in method recv}
28693/25918    0.072    0.000    1.651    0.000  defer.py:266(addCallbacks)
       4299    0.044    0.000    1.494    0.000  share.py:267(_get_satisfaction)
      12418    0.125    0.000    1.052    0.000  hashtree.py:298(needed_hashes)
I'll go ahead and leave a download running under the profiler overnight just in case something turns up.
Changed at 2010-08-19T06:04:37Z by zooko
Changed at 2010-08-19T06:06:35Z by zooko
comment:61 follow-up: ↓ 67 Changed at 2010-08-19T06:11:10Z by zooko
If you wanted to investigate why 1.8.0c2 is so much slower than 1.7.1 at downloading a large file even after applying the comment:54 patch, then you could use attachment:run-115-flog.pickle.bz2 and attachment:run-116-flog.pickle.bz2 as evidence. Hm, hey waitasecond, in my earlier testing (recorded in this ticket), 1.8.0c2 was faster than 1.7.1 for small files (<= 10 MB). This was also the case for Nathan Eisenberg's benchmarks (posted to tahoe-dev). But currently it looks to me like the average download speed (as reported by curl during its operation) is the same at the beginning of the download as at the end, i.e. even during the first 10 MB or so 1.8.0c2 is only getting about 150 KBps where 1.7.1 is getting more than 200 KBps. Did something change?
I guess I (or someone) should run 1.7.1 vs. 1.8.0c2+comment:54 on 10 MB files. But I'm way too tired to start that again right now.
Man, I'm really worn out from staying up night after night poking at this and then having to get up early the next morning to help my children get ready for school and myself ready for work. I could use more help!
comment:62 follow-up: ↓ 63 Changed at 2010-08-19T06:44:51Z by davidsarah
Perhaps the remaining issue is server selection. Let's try Brian's comment:27 diversity-seeking algorithm, combined with the comment:54 fix.
comment:63 in reply to: ↑ 62 Changed at 2010-08-19T14:07:15Z by zooko
Replying to davidsarah:
Perhaps the remaining issue is server selection. Let's try Brian's comment:27 diversity-seeking algorithm, combined with the comment:54 fix.
I'm willing to try the comment:27 diversity-seeking algorithm, but I also would like to verify whether or not server-selection is one of the factors by inspecting the flogs...
comment:64 Changed at 2010-08-19T14:10:46Z by zooko
Yes, the overnight run yielded no smoking gun (smoking CPU?) that I can see. I'll attach the full profiling results as an attachment.
MUSI Zooko-Ofsimplegeos-MacBook-Pro:~/Desktop$ echo this is run 118 tahoe-lafs v1.8.0c2+comment:54 cProfile ; time curl http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg > bbb.ogg
this is run 118 tahoe-lafs v1.8.0c2+comment:54 cProfile
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1421M  100 1421M    0     0   158k      0  2:32:39  2:32:39 --:--:--  168k

real    152m39.145s
user    0m1.024s
sys     0m7.661s
275630510 function calls (274835891 primitive calls) in 28165.965 CPU seconds

          ncalls    tottime  percall    cumtime    percall  filename:lineno(function)
               1      0.000    0.000  28165.965  28165.965  base.py:1156(run)
               1      8.156    8.156  28165.964  28165.964  base.py:1161(mainLoop)
         1226198     24.746    0.000  27834.397      0.023  selectreactor.py:93(doSelect)
         1226198  27531.442    0.022  27547.532      0.022  {select.select}
         1226198      8.846    0.000    313.488      0.000  base.py:751(runUntilCurrent)
          148051      1.222    0.000    285.941      0.002  eventual.py:18(_turn)
         1107126     12.628    0.000    257.892      0.000  log.py:71(callWithLogger)
         1107126     10.433    0.000    243.790      0.000  log.py:66(callWithContext)
          136521      3.020    0.000    229.335      0.002  share.py:187(loop)
         1107126      5.102    0.000    218.355      0.000  context.py:58(callWithContext)
         1107126      8.483    0.000    212.571      0.000  context.py:32(callWithContext)
          136521      1.831    0.000    211.328      0.002  share.py:238(_do_loop)
         1107126      9.274    0.000    200.718      0.000  selectreactor.py:144(_doReadOrWrite)
         1064200     13.055    0.000    178.141      0.000  tcp.py:114(doRead)
         1064262     15.145    0.000    165.679      0.000  tcp.py:443(doRead)
  1100168/689294     10.315    0.000    135.808      0.000  defer.py:453(_runCallbacks)
   536780/433736      1.023    0.000    110.261      0.000  defer.py:338(callback)
   540512/437468      1.259    0.000    109.033      0.000  defer.py:433(_startRunCallbacks)
          136521      3.623    0.000    107.058      0.001  share.py:701(_send_requests)
   118185/118168      0.513    0.000     94.733      0.001  defer.py:108(maybeDeferred)
          102399      0.197    0.000     81.560      0.001  share.py:733(_send_request)
          102514      0.290    0.000     81.467      0.001  referenceable.py:406(callRemote)
          102514      1.867    0.000     80.688      0.001  referenceable.py:418(_callRemote)
          104846      1.446    0.000     79.737      0.001  banana.py:633(dataReceived)
          104846     14.309    0.000     78.171      0.001  banana.py:701(handleData)
          102515      0.320    0.000     72.324      0.001  banana.py:183(send)
          102515      0.785    0.000     72.004      0.001  root.py:92(send)
         1064200     70.123    0.000     70.123      0.000  {built-in method recv}
          102562      7.417    0.000     68.190      0.001  banana.py:191(produce)
          170643      1.742    0.000     60.005      0.000  share.py:267(_get_satisfaction)
 1114150/1011232      2.898    0.000     50.939      0.000  defer.py:266(addCallbacks)
          494839      5.175    0.000     42.363      0.000  hashtree.py:298(needed_hashes)
         4010897      6.757    0.000     41.319      0.000  tcp.py:413(write)
          494839      6.477    0.000     37.139      0.000  hashtree.py:128(needed_for)
         4010897     18.303    0.000     34.562      0.000  abstract.py:177(write)
          102809      1.151    0.000     33.768      0.000  banana.py:1049(handleClose)
          136521      2.571    0.000     33.606      0.000  share.py:556(_desire)
           45498      0.163    0.000     32.926      0.001  hashutil.py:51(tagged_hash)
           45499      0.083    0.000     32.782      0.001  fetcher.py:83(loop)
         1230099      3.737    0.000     32.762      0.000  banana.py:22(int2b128)
           45499      0.451    0.000     32.699      0.001  fetcher.py:91(_do_loop)
          227571      0.386    0.000     31.989      0.000  hashutil.py:31(update)
          102514      1.152    0.000     31.951      0.000  call.py:652(receiveClose)
          227571     31.557    0.000     31.557      0.000  {method 'update' of '_sha256.SHA256' objects}
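For anyone reproducing these listings: a cumtime-sorted table like the one above can be produced from a saved cProfile stats file with the stdlib pstats module. The stats filename below is hypothetical.

import pstats

# sort by cumulative time and show the top 40 entries
p = pstats.Stats("tahoe-run-118.prof")   # hypothetical stats filename
p.strip_dirs().sort_stats("cumulative").print_stats(40)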
Changed at 2010-08-19T14:11:27Z by zooko
Changed at 2010-08-19T17:27:57Z by warner
patch to prefer share diversity, forget leftover data after each segment, and fix handling of numsegs
comment:65 Changed at 2010-08-19T17:29:47Z by warner
the "1170-combo.diff" patch combines the approaches as suggested in comment:62 . Please give it a try and see if it helps. I'll try to look at the flogs to see what servers were used, to see if that run has a diversity issue or not.
comment:66 Changed at 2010-08-19T18:27:51Z by zooko
Okay, I investigated server selection on the bus to work this morning. attachment:run-115-flog.pickle.bz2 shows:
22:10:56.877 [649]: <CiphertextDownloader #1>(u6h6p4mlr3j7): starting download
22:10:56.881 [650]: sending DYHB to [tavrk54e]
22:10:56.882 [651]: sending DYHB to [xflgj7cg]
22:10:56.883 [652]: sending DYHB to [sp26qyqc]
22:10:56.884 [653]: sending DYHB to [zkj6swl2]
22:10:56.885 [654]: sending DYHB to [sroojqcx]
22:10:56.886 [655]: sending DYHB to [4rk5oqd7]
22:10:56.887 [656]: sending DYHB to [nszizgf5]
22:10:56.887 [657]: sending DYHB to [62nlabgf]
22:10:56.888 [658]: sending DYHB to [fp3xjndg]
22:10:56.888 [659]: sending DYHB to [rpiw4n3f]
22:10:57.006 [660]: got results from [xflgj7cg]: shnums []
22:10:57.010 [661]: got results from [rpiw4n3f]: shnums []
22:10:57.017 [662]: got results from [62nlabgf]: shnums []
22:10:57.058 [663]: got results from [nszizgf5]: shnums [3, 7, 8]
22:10:57.066 [664]: got results from [4rk5oqd7]: shnums [2]
22:10:57.110 [665]: got results from [fp3xjndg]: shnums [4, 9]
22:10:57.126 [666]: got results from [tavrk54e]: shnums []
22:10:57.181 [667]: got results from [zkj6swl2]: shnums []
22:10:57.205 [668]: got results from [sroojqcx]: shnums [1, 6]
22:10:57.474 [669]: got results from [sp26qyqc]: shnums [0, 5]
The 1.7.1 flog doesn't show which servers are actually being used for Request Blocks, but we know that 1.7.1 will always choose to get all three shares from nszizgf5 in a case like this.
Therefore I don't think that 1.8's share-selection can be part of the explanation for why 1.8 is slower than 1.7.
(This doesn't mean that improved share selection wouldn't make 1.9 faster than 1.8 is now.)
comment:67 in reply to: ↑ 61 Changed at 2010-08-19T19:13:37Z by warner
Replying to zooko:
Hm, hey waitasecond, in my earlier testing (recorded in this ticket), 1.8.0c2 was faster then 1.7.1 for small files (<= 10 MB). This was also the case for Nathan Eisenberg's benchmarks (posted to tahoe-dev). But currently it looks to me like the average download speed (as reported by curl during its operation) is the same at the beginning of the download as at the end, i.e. even during the first 10 MB or so 1.8.0c2 is only getting about 150 KBps where 1.7.1 is getting more than 200 KBps. Did something change?
There's a sizeable startup time in 1.7.1 (lots of roundtrips), which went away in 1.8.0c2. I think we're all in agreement about the small-file speedups that provides (i.e. we've not seen any evidence to the contrary). The change is on the order of a few seconds, though, so I think a 10MB file (or portion of a file) that takes 10MB/150kBps = ~60s to complete won't be affected very much. I don't think you'll be able to see its effects in the curl output.
Nathan's tests were on hundreds or thousands of small files.
From my tests, the new-downloader sees about 500ms more taken to complete the first segment than the second and following ones. I believe that's the time spent doing server selection, UEB fetches, and the large hash chain fetches.
Changed at 2010-08-20T00:07:51Z by zooko
I ran three more measurements today at the office: runs 119, 120, and 121. These are the curl stdout from those runs. I will update a table with these results and put it into the original opening comment of this ticket.
comment:68 Changed at 2010-08-20T00:13:38Z by zooko
- Description modified (diff)
Changed at 2010-08-20T16:31:25Z by zooko
I ran several more measurements from home, intended to test whether the logging in new-downloader is partially responsible for new-downloader's slowness. These are the curl stdout from those runs. I will update the table in the opening comment of this ticket to include these runs.
Changed at 2010-08-20T16:35:42Z by zooko
comment:69 Changed at 2010-08-20T16:43:19Z by zooko
- Description modified (diff)
Oh, all those 32-byte reads must have been all the hashes in the Merkle Trees. I assume that those are indeed coalesced using the clever spans structure src/allmydata/util/spans.py@4666. Nevertheless we should investigate the very poor performance shown in this download status file.