#392 closed enhancement (fixed)
pipeline upload segments to make upload faster
| Reported by: | warner | Owned by: | warner |
|---|---|---|---|
| Priority: | major | Milestone: | 1.5.0 |
| Component: | code-performance | Version: | 1.0.0 |
| Keywords: | speed | Cc: | |
| Launchpad Bug: | | | |
Description
In ticket #252 we decided to reduce the max segment size from 1MiB to 128KiB. But this caused in-colo upload speed to drop by at least 50%.
We should see if we can pipeline two segments for upload, to get back the extra round-trip times that we lost with having more segments.
It's also possible that some of the slowdown is just the extra overhead of computing more hashes, but I suspect the turnaround time matters more than the hashing overhead.
We need to do something similar for download too, since the download speed was reduced drastically by the segsize change too.
Attachments (1)
Change History (6)
comment:1 Changed at 2008-05-14T18:09:52Z by warner
Oh, and I just thought of the right place to do this: in the WriteBucketProxy. It should be allowed to keep a Nagle-like cache of write vectors and send them out in a batch once the cache grows beyond some particular size (coalescing small writes into a single call and saving round trips). In addition, it should be allowed to have multiple calls outstanding as long as the total amount of data it has sent (and which might therefore still be in the transport buffer) is below some amount, say 128KiB. With k=3, that would allow three segments to be on the wire at once, mitigating the slowdown due to round trips. As long as the RTT is less than windowsize/bandwidth (i.e. the window exceeds the bandwidth-delay product), this should keep the pipe full.
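As a rough illustration of that idea, here is a sketch (made-up class name and simplified API, not the real WriteBucketProxy; the "writev" remote call is an assumption along the lines of the #320 protocol changes) of a proxy that coalesces small write vectors and keeps a bounded window of unacknowledged data on the wire:
```python
from twisted.internet import defer

class BatchingWriteProxy:
    """Sketch only: coalesce small writes, keep a bounded send window."""
    def __init__(self, rref, batch_size=16*1024, window=128*1024):
        self.rref = rref                # remote bucket reference
        self.batch_size = batch_size    # coalesce writes up to this many bytes
        self.window = window            # max unacknowledged bytes on the wire
        self.cache = []                 # pending (offset, data) write vectors
        self.cached_bytes = 0
        self.outstanding = 0            # bytes sent but not yet acknowledged
        self.blocked = []               # Deferreds of callers waiting for room

    def write(self, offset, data):
        self.cache.append((offset, data))
        self.cached_bytes += len(data)
        if self.cached_bytes < self.batch_size:
            return defer.succeed(None)      # still coalescing, no remote call yet
        return self._send_batch()

    def _send_batch(self):
        vectors, size = self.cache, self.cached_bytes
        self.cache, self.cached_bytes = [], 0
        self.outstanding += size
        # "writev" is an assumed remote method name, not the current protocol
        d = self.rref.callRemote("writev", vectors)
        d.addBoth(self._retire, size)
        if self.outstanding <= self.window:
            return defer.succeed(None)      # window has room: keep pipelining
        w = defer.Deferred()                # window full: caller must wait
        self.blocked.append(w)
        return w

    def _retire(self, result, size):
        self.outstanding -= size
        while self.blocked and self.outstanding <= self.window:
            self.blocked.pop(0).callback(None)
        return result

    def flush(self):
        # push out any partially-filled batch; a real implementation would
        # also wait for self.outstanding to reach zero before reporting done
        if self.cache:
            return self._send_batch()
        return defer.succeed(None)
```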
comment:2 Changed at 2008-06-01T22:09:39Z by warner
#320 is related, since the storage-server protocol changes we talked about would make it easier to implement the pipelining.
comment:3 Changed at 2009-04-15T20:10:20Z by warner
So, using the attached patch, I added pipelined writes to the immutable upload operation. The Pipeline class allows up to 50KB of unacknowledged data in the pipe before it starts blocking the sender: calls to WriteBucketProxy._write return defer.succeed until more than 50KB of unacknowledged data is in the pipe, after which they return regular Deferreds until some of those writes get retired. A terminal flush() call makes the Upload wait for the pipeline to drain before it is considered complete.
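A minimal sketch of that flow-control behavior, assuming Twisted Deferreds (the class and attribute names here are made up and error handling is omitted; the actual Pipeline class in the patch differs in detail):
```python
from twisted.internet import defer

class PipelineSketch:
    def __init__(self, capacity=50000):
        self.capacity = capacity   # bytes allowed in flight before blocking
        self.inflight = 0          # unacknowledged bytes currently in the pipe
        self.blocked = []          # Deferreds for callers waiting for room
        self.flushers = []         # Deferreds for callers waiting for drain

    def add(self, size, send_fn, *args):
        # send_fn() starts the remote write and returns a Deferred that
        # fires when the server acknowledges it.
        self.inflight += size
        send_fn(*args).addBoth(self._retire, size)
        if self.inflight <= self.capacity:
            return defer.succeed(None)   # caller may keep sending immediately
        d = defer.Deferred()             # over capacity: caller must wait
        self.blocked.append(d)
        return d

    def _retire(self, result, size):
        self.inflight -= size
        if self.inflight <= self.capacity:
            while self.blocked:
                self.blocked.pop(0).callback(None)
        if self.inflight == 0:
            while self.flushers:
                self.flushers.pop(0).callback(None)
        return result

    def flush(self):
        # wait until every write in the pipe has been acknowledged
        if self.inflight == 0:
            return defer.succeed(None)
        d = defer.Deferred()
        self.flushers.append(d)
        return d
```
In this scheme WriteBucketProxy._write would route each remote write through add(), and the upload's final step would call flush() before declaring the share complete.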
A quick performance test (in the same environments that we do the buildbot performance tests on: my home DSL line and tahoecs2 in colo) showed a significant improvement in the DSL per-file overhead, but only about a 10% improvement in the overall upload rate (for both DSL and colo).
Basically, the 7 writes used to store a small file (header, segment 0, crypttext_hashtree, block_hashtree, share_hashtree, uri_extension, close) are all put on the wire together, so they take bandwidth plus 1 RTT instead of bandwidth plus 7 RTT. Eliminating those 6 RTTs appears to save us about 1.8 seconds over my DSL line (my ping time to the servers is about 11ms, but there is kernel/python/twisted/foolscap/tahoe overhead on top of that).
For a larger file, pipelining might increase the utilization of the wire, particularly if you have a "long fat" pipe (high bandwidth but high latency). However, with 10 shares going out at the same time, the wire is probably pretty full already: the ratio of interest is (segsize*N/k)/BW versus RTT. You send N blocks for a single segment at once, then you wait for all the replies to come back, then you generate the next blocks. If the time it takes to send a single block is greater than the server's turnaround time, then N-1 responses will be received before the last block is finished sending, so you've only got one RTT of idle time (while you wait for the last server to respond). Pipelining will fill this last RTT, but my guess is that this isn't much of a help, and that something else is needed to explain the performance hit we saw in colo when we moved to smaller segments.
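To make that ratio concrete, here is a rough model under the reasoning above; the bandwidth figure is an assumption for illustration, not a measurement from this ticket:
```python
segsize = 128 * 1024   # bytes per segment (the post-#252 default)
k, N = 3, 10           # FEC parameters: N blocks of segsize/k bytes each
wire_bw = 40 * 1024    # bytes/sec of raw upstream (assumed for illustration)
rtt = 0.011            # seconds (the DSL ping time quoted above)

# time to push one segment's N blocks, vs. the trailing RTT spent waiting
# for the last server's ack before starting the next segment
send_time = segsize * N / (k * float(wire_bw))
idle_fraction = rtt / (send_time + rtt)
print("per-segment send time: %.1f s" % send_time)
print("idle time pipelining could recover: %.2f%%" % (100 * idle_fraction))
# -> roughly 10.7 s of sending vs. 11 ms of idle: on a link like this the
#    single trailing RTT per segment is negligible, so pipelining alone
#    does not explain the post-#252 slowdown.
```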
DSL no pipelining:
TIME (startup): 2.36461615562 up, 0.719145059586 down
TIME (1x 200B): 2.38471603394 up, 0.734190940857 down
TIME (10x 200B): 21.7909920216 up, 8.98366594315 down
TIME (1MB): 45.8974239826 up, 5.21775698662 down
TIME (10MB): 449.196600914 up, 34.1318571568 down
upload per-file time: 2.179s
upload speed (1MB): 22.87kBps
upload speed (10MB): 22.37kBps
DSL with pipelining:
TIME (startup): 0.437352895737 up, 0.185742139816 down
TIME (1x 200B): 0.493880987167 up, 0.202013969421 down
TIME (10x 200B): 5.15211510658 up, 2.04516386986 down
TIME (1MB): 43.141931057 up, 2.09753513336 down
TIME (10MB): 416.777194977 up, 19.6058299541 down
upload per-file time: 0.515s
upload speed (1MB): 23.46kBps
upload speed (10MB): 24.02kBps
The in-colo tests showed roughly the same improvement to upload speed, but very little change in the per-file time. The RTT there is much shorter (ping time is about 120us), which might explain the difference, but I think the slowdown lies elsewhere. Pipelining shaves about 30ms off each file and increases the overall upload speed by about 10%.
colo no pipelining:
TIME (startup): 0.29696393013 up, 0.0784759521484 down
TIME (1x 200B): 0.285771131516 up, 0.0790619850159 down
TIME (10x 200B): 3.23165798187 up, 0.849181175232 down
TIME (100x 200B): 31.7827451229 up, 8.95765590668 down
TIME (1MB): 1.00738477707 up, 0.347244977951 down
TIME (10MB): 7.12743496895 up, 2.9827849865 down
TIME (100MB): 70.9683670998 up, 25.6454920769 down
upload per-file time: 0.318s
upload per-file times-avg-RTT: 83.833386
upload per-file times-total-RTT: 20.958347
upload speed (1MB): 1.45MBps
upload speed (10MB): 1.47MBps
upload speed (100MB): 1.42MBps
colo with pipelining:
TIME (startup): 0.262734889984 up, 0.0758249759674 down
TIME (1x 200B): 0.271718025208 up, 0.0812950134277 down
TIME (10x 200B): 2.80361104012 up, 0.838641881943 down
TIME (100x 200B): 28.4790999889 up, 9.36092710495 down
TIME (1MB): 0.853738069534 up, 0.337486028671 down
TIME (10MB): 6.6658270359 up, 2.67381596565 down
TIME (100MB): 64.6233050823 up, 26.5593090057 down
upload per-file time: 0.285s
upload per-file times-avg-RTT: 77.205647
upload per-file times-total-RTT: 19.301412
upload speed (1MB): 1.76MBps
upload speed (10MB): 1.57MBps
upload speed (100MB): 1.55MBps
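As a back-of-the-envelope check of those per-file numbers: the patch collapses the 7 serialized writes into roughly one batched round trip, so dividing the measured per-file improvement by the 6 saved round trips gives an effective cost per request/response turnaround (my term; it lumps together the kernel/python/twisted/foolscap/tahoe overhead mentioned above):
```python
saved_rtts = 6                    # 7 serialized writes -> 1 batched round trip

dsl_saving = 2.179 - 0.515        # per-file seconds, from the two DSL runs above
colo_saving = 0.318 - 0.285       # per-file seconds, from the two colo runs above

print("DSL:  %.0f ms per turnaround (raw ping ~11 ms)"
      % (dsl_saving / saved_rtts * 1000))
print("colo: %.1f ms per turnaround (raw ping ~0.12 ms)"
      % (colo_saving / saved_rtts * 1000))
# -> roughly 277 ms and 5.5 ms: in both environments the per-write software
#    overhead dwarfs the raw wire latency.
```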
I want to run some more tests before landing this patch, to make sure it's really doing what I thought it should be doing. I'd also like to improve the automated speed-test to do a simple TCP transfer to measure the available upstream bandwidth, so we can compare tahoe's upload speed against the actual wire.
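Such a raw-bandwidth check could look something like the following sketch (not part of the existing speed-test; the host/port of a discard-style sink are placeholders):
```python
import socket, time

def measure_upstream(host, port, nbytes=10 * 1024 * 1024, chunk=64 * 1024):
    """Push nbytes over a plain TCP connection and return bytes/sec."""
    payload = b"\x00" * chunk
    sent = 0
    s = socket.create_connection((host, port))
    try:
        start = time.time()
        while sent < nbytes:
            s.sendall(payload)
            sent += len(payload)
        # sendall() returning only means the kernel accepted the data;
        # shutting down and waiting for the peer to close gives a better
        # bound on when the bytes actually left the machine.
        s.shutdown(socket.SHUT_WR)
        s.recv(1)
        elapsed = time.time() - start
    finally:
        s.close()
    return sent / elapsed

# example (placeholder host/port):
# print("%.1f kBps" % (measure_upstream("tahoecs2.example", 9999) / 1024.0))
```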
comment:4 Changed at 2009-05-18T23:46:26Z by warner
- Milestone changed from eventually to 1.5.0
- Resolution set to fixed
- Status changed from new to closed
I pushed this patch anyway. I think it'll help, just not as much as I was hoping for.
comment:5 Changed at 2017-01-11T00:35:09Z by Brian Warner <warner@…>
In 5e1d464/trunk: