Changes between Version 32 and Version 33 of Performance


Timestamp:
2011-04-11T16:22:21Z
Author:
zooko
Comment:

link to Performance/Old

  • Performance

    v32 v33  
    1 Some basic notes on performance:
    2 
    3 DISCLAIMER: the memory footprint measurements documented on this page and graphed (see the hyperlinks below) are based on !VmSize on Linux. !VmSize almost certainly doesn't correlate with what you care about. For example, it doesn't correlate very well at all with whether your server will go into swap thrash, or how much RAM you need to provision for your server, or, well, anything that you care about. Yes, in case it isn't clear, I (Zooko) consider this measurement to be useless. Please see ticket #227, in which I go into more detail about this.
    4  
    5 == Memory Footprint ==
    6 
    7 We try to keep the Tahoe memory footprint low by continuously monitoring the
    8 memory consumed by common operations like upload and download.
    9 
    10 For each currently active upload or download, we never handle more than a
    11 single segment of data at a time. This serves to keep the data-driven
    12 footprint down to something like 4MB or 5MB per active upload/download.
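
The one-segment-at-a-time idea can be illustrated with a minimal sketch. This is not Tahoe's upload code: the `SEGMENT_SIZE` value and the `iter_segments` and `process_stream` names are hypothetical, and hashing stands in for the real per-segment work.

```python
import hashlib
import io

SEGMENT_SIZE = 1024 * 1024  # illustrative value only, not taken from Tahoe


def iter_segments(stream, segment_size=SEGMENT_SIZE):
    """Yield successive segments so that only one is held in memory at a time."""
    while True:
        segment = stream.read(segment_size)
        if not segment:
            break
        yield segment


def process_stream(stream, segment_size=SEGMENT_SIZE):
    """Hash a stream segment by segment.

    Returns (hex digest, size in bytes of the largest segment seen).
    """
    hasher = hashlib.sha256()
    peak = 0
    for segment in iter_segments(stream, segment_size):
        peak = max(peak, len(segment))
        hasher.update(segment)
    return hasher.hexdigest(), peak
```

The data-driven footprint in this sketch is bounded by the segment size rather than by the file size, which is the property the paragraph above describes.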
    13 
    14 Some other notes on memory footprint:
    15 
    16  * importing sqlite (for the share-lease database) raised the static
    17    footprint by about 7MB, going from 24.3MB to 31.5MB (as evidenced by the munin
    18    graph from 2007-08-29 to 2007-09-02).
    19 
    20  * importing nevow and twisted.web (for the web interface) raises the static
    21    footprint by about 3MB (from 12.8MB to 15.7MB).
    22 
    23  * importing pycryptopp (which began on 2007-11-09) raises the static footprint
    24    (on a 32-bit machine) by about 6MB (from 19MB to 25MB). The 64-bit machine
    25    footprint was raised by 17MB (from 122MB to 139MB).
    26 
    27 The
    28 [http://allmydata.org/tahoe-figleaf-graph/hanford.allmydata.com-tahoe_memstats.html 32-bit memory usage graph]
    29 shows our static memory footprint on a 32-bit machine (starting a node but not doing
    30 anything with it) to be about 24MB. Uploading one file at a time gets the
    31 node to about 29MB. (We only process one segment at a time, so peak memory
    32 consumption occurs when the file is a few MB in size and does not grow beyond
    33 that.) Uploading multiple files at once would increase this.
    34 
    35 We also have a
    36 [http://allmydata.org/tahoe-figleaf-graph/hanford.allmydata.com-tahoe_memstats_64.html 64-bit memory usage graph], which currently shows a disturbingly large static footprint.
    37 We've determined that simply importing a few of our support libraries (such
    38 as Twisted) results in most of this expansion, before the node is ever even
    39 started. The cause for this is still being investigated: we can think of plenty
    40 of reasons for it to be 2x, but the results show something closer to 6x.
    41 
    42 == Network Speed ==
    43 
    44 === Test Results ===
    45 
    46 Using a 3-server testnet in colo and an uploading node at home (on a DSL line
    47 that gets about 78kBps upstream and has a 14ms ping time to colo) using
    48 0.5.1-34 takes 820ms-900ms per 1kB file uploaded (80-90s for 100 files, 819s
    49 for 1000 files). The DSL speed results are occasionally worse than usual,
    50 when the owner of the DSL line is using it for other purposes while a test is
    51 taking place.
    52 
    53 'scp' of 3.3kB files (simulating expansion) takes 8.3s for 100 files and 79s
    54 for 1000 files, 80ms each.
    55 
    56 Doing the same uploads locally on my laptop (both the uploading node and the
    57 storage nodes are local) takes 46s for 100 1kB files and 369s for 1000 files.
    58 
    59 Small files seem to be limited by a per-file overhead. Large files are limited
    60 by the link speed.
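
As a quick check on the per-file figures, the 1000-file totals quoted above can be converted to milliseconds per file. The dictionary keys and the `per_file_ms` helper are names made up for this sketch; the numbers are the ones reported in the text.

```python
def per_file_ms(total_seconds, file_count):
    """Average time per file in milliseconds."""
    return total_seconds * 1000.0 / file_count


# (total seconds, number of files) as reported in the text above
measurements = {
    "tahoe over DSL": (819, 1000),
    "scp over DSL": (79, 1000),
    "tahoe local": (369, 1000),
}

per_file = {name: per_file_ms(s, n) for name, (s, n) in measurements.items()}

# How much slower a small Tahoe upload is than scp over the same link
slowdown_vs_scp = per_file["tahoe over DSL"] / per_file["scp over DSL"]
```

The local-only run still costs several hundred milliseconds per file, which is consistent with the overhead being per-file rather than bandwidth-bound.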
    61 
    62 The munin
    63 [http://allmydata.org/tahoe-figleaf-graph/hanford.allmydata.com-tahoe_speedstats_delay.html delay graph] and
    64 [http://allmydata.org/tahoe-figleaf-graph/hanford.allmydata.com-tahoe_speedstats_rate.html rate graph] show these Ax+B numbers for a node in colo and a node behind a DSL line.
    65 
    66 The
    67 [http://allmydata.org/tahoe-figleaf-graph/hanford.allmydata.com-tahoe_speedstats_delay_rtt.html delay*RTT graph] shows this per-file delay as a multiple of the average round-trip
    68 time between the client node and the testnet. Much of the work done to upload
    69 a file involves waiting for messages to make a round-trip, so expressing the
    70 per-file delay in units of RTT helps to compare the observed performance
    71 against the predicted value.
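
A small sketch of the unit conversion, using the 14ms ping time and the 820ms-900ms per-file range from the test results above. The `delay_in_rtts` name is hypothetical, and this is not how the munin graph itself is computed.

```python
def delay_in_rtts(delay_ms, rtt_ms):
    """Express a per-file delay as a multiple of the round-trip time."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    return delay_ms / rtt_ms


RTT_MS = 14.0
low = delay_in_rtts(820.0, RTT_MS)   # fastest measured per-file time
high = delay_in_rtts(900.0, RTT_MS)  # slowest measured per-file time
```

Under these numbers the per-file delay is several dozen RTTs, far more than the handful of protocol round trips would predict on their own.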
    72 
    73 === Mutable Files ===
    74 
    75 Tahoe's mutable files (sometimes known as "SSK" files) are encoded
    76 differently than the immutable ones (aka "CHK" files). Creating these mutable
    77 file slots currently (in release 0.7.0) requires an RSA keypair generation.
    78 [http://allmydata.org/tahoe-figleaf-graph/hanford.allmydata.com-tahoe_speedstats_SSK_creation.html This graph]
    79 tracks the amount of time it takes to perform
    80 this step.
    81 
    82 There is also per-file overhead for upload and download, just like with CHK
    83 files, mostly involving the queries to find out which servers are holding
    84 which versions of the file. The
    85 [http://allmydata.org/tahoe-figleaf-graph/hanford.allmydata.com-tahoe_speedstats_delay_SSK.html mutable-file delay graph]
    86 shows this "B" per-file latency value.
    87 
    88 The "A" transfer rate for SSK files is also tracked in this
    89 [http://allmydata.org/tahoe-figleaf-graph/hanford.allmydata.com-tahoe_speedstats_rate_SSK.html SSK rate graph].
    90 
    91 === Roundtrips ===
    92 
    93 The 0.5.1 release requires about 9 roundtrips for each share it uploads. The
    94 upload algorithm sends data to all shareholders in parallel, but these 9
    95 phases are done sequentially. The phases are:
    96 
    97  1. allocate_buckets
    98  2. send_subshare (once per segment)
    99  3. send_plaintext_hash_tree
    100  4. send_crypttext_hash_tree
    101  5. send_subshare_hash_trees
    102  6. send_share_hash_trees
    103  7. send_UEB
    104  8. close
    105  9. dirnode update
    106 
    107 We need to keep the send_subshare calls sequential (to keep our memory
    108 footprint down), and we need a barrier between the close and the dirnode
    109 update (for robustness and clarity), but the others could be pipelined.
    110 9*14ms=126ms, which accounts for about 15% of the measured upload time.
    111 
    112 Doing steps 2-8 in parallel (using the attached pipeline-sends.diff patch)
    113 does indeed seem to bring the time-per-file down from 900ms to about 800ms,
    114 although the results aren't conclusive.
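
The round-trip arithmetic can be sketched as follows. The pipelined figure is an idealized estimate that assumes a single-segment file and perfect overlap of steps 2-8; the function names are hypothetical and nothing here is taken from the pipeline-sends.diff patch.

```python
RTT_MS = 14  # ping time from the DSL line to colo, from the test results above

PHASES = [
    "allocate_buckets",
    "send_subshare",
    "send_plaintext_hash_tree",
    "send_crypttext_hash_tree",
    "send_subshare_hash_trees",
    "send_share_hash_trees",
    "send_UEB",
    "close",
    "dirnode update",
]


def sequential_ms(phases, rtt_ms):
    """Each phase waits one full round trip before the next one starts."""
    return len(phases) * rtt_ms


def pipelined_ms(phases, rtt_ms, first=1, last=7):
    """Phases first..last (0-based, inclusive) overlap and together cost one round trip."""
    overlapped = last - first + 1
    return (len(phases) - overlapped + 1) * rtt_ms


sequential = sequential_ms(PHASES, RTT_MS)
pipelined = pipelined_ms(PHASES, RTT_MS)
saving = sequential - pipelined
```

Under these assumptions the estimated saving is 84ms per file, the same order of magnitude as the observed drop from about 900ms to about 800ms.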
    115 
    116 With the pipeline-sends patch, my uploads take A+B*size time, where A is 790ms
    117 and B is 1/(23.4kBps). (Here A is the per-file constant and B the per-byte cost, the reverse of the Ax+B labeling used for the graphs above.) 3.3/B is about 77kBps, the same speed that basic 'scp' gets, which
    118 ought to be my upstream bandwidth. This suggests that the main limitations on
    119 upload speed are the constant per-file overhead and the FEC expansion factor.
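
The linear model can be written out as a short sketch using the constants quoted above. The function and variable names are hypothetical, and sizes are in kB of plaintext.

```python
A_MS = 790.0      # constant per-file overhead, in milliseconds
RATE_KBPS = 23.4  # 1/B: plaintext kB uploaded per second
EXPANSION = 3.3   # FEC expansion factor used in the scp comparison above


def upload_time_ms(size_kb):
    """Predicted upload time for a file of size_kb kilobytes of plaintext."""
    return A_MS + (size_kb / RATE_KBPS) * 1000.0


# Bytes actually sent on the wire per second, after expansion
wire_rate_kbps = EXPANSION * RATE_KBPS

one_kb = upload_time_ms(1.0)
one_mb = upload_time_ms(1000.0)
```

For a 1kB file the model is dominated by the 790ms constant, while for a 1000kB file the size-dependent term is far larger than the constant.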
    120 
    121 == Storage Servers ==
    122 
    123 == System Load ==
    124 
    125 The source:src/allmydata/test/check_load.py tool can be used to generate
    126 random upload/download traffic, to see how much load a Tahoe grid imposes on
    127 its hosts.
    128 
    129 === test one: 10kB mean file size ===
    130 
    131 Preliminary results on the Allmydata test grid (14 storage servers spread
    132 across four machines (each a 3ishGHz P4), two web servers): we used three
    133 check_load.py clients running with 100ms delay between requests, an
    134 80%-download/20%-upload traffic mix, and file sizes distributed exponentially
    135 with a mean of 10kB. These three clients get about 8-15kBps downloaded,
    136 2.5kBps uploaded, doing about one download per second and 0.25 uploads per
    137 second. These traffic rates were higher at the beginning of the process (when
    138 the directories were smaller and thus faster to traverse).
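
A traffic mix like the one described can be sketched as follows. This is not the code in check_load.py: the `next_operation` name and its parameters are hypothetical, 10kB is taken to mean 10,000 bytes, and the 100ms delay between requests is left to the caller.

```python
import random


def next_operation(rng, download_fraction=0.8, mean_size_bytes=10_000):
    """Pick the next request: a download, or an upload with an exponential size.

    Returns ("download", None) or ("upload", size_in_bytes).
    """
    if rng.random() < download_fraction:
        return ("download", None)
    # expovariate takes the rate (1/mean), not the mean itself
    size = int(rng.expovariate(1.0 / mean_size_bytes))
    return ("upload", size)


rng = random.Random(0)  # fixed seed so the run is repeatable
ops = [next_operation(rng) for _ in range(10000)]
upload_sizes = [size for kind, size in ops if kind == "upload"]
download_share = sum(1 for kind, _ in ops if kind == "download") / len(ops)
mean_upload = sum(upload_sizes) / len(upload_sizes)
```

An exponential distribution produces many small files and an occasional much larger one, so short runs can show noticeably different byte rates for the same request rate.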
    139 
    140 The storage servers were minimally loaded. Each storage node was consuming
    141 about 9% of its CPU at the start of the test, 5% at the end. These nodes were
    142 receiving about 50kbps throughout, and sending 50kbps initially (increasing
    143 to 150kbps as the dirnodes got larger). Memory usage was trivial, about 35MB
    144 !VmSize per node, 25MB RSS. The load average on a 4-node box was about 0.3.
    145 
    146 The two machines serving as web servers (performing all encryption, hashing,
    147 and erasure-coding) were the most heavily loaded. The clients distribute
    148 their requests randomly between the two web servers. Each server was
    149 averaging 60%-80% CPU usage. Memory consumption is minor, 37MB !VmSize and
    150 29MB RSS on one server, 45MB/33MB on the other. Load average grew from about
    151 0.6 at the start of the test to about 0.8 at the end. Network traffic
    152 (including both client-side plaintext and server-side shares) outbound was
    153 about 600kbps for the whole test, while the inbound traffic started at
    154 200kbps and rose to about 1Mbps at the end.
    155 
    156 === test two: 1MB mean file size ===
    157 
    158 Same environment as before, but the mean file size was set to 1MB instead of
    159 10kB.
    160 
    161 {{{
    162 clients: 2MBps down, 340kBps up, 1.37 fps down, .36 fps up
    163 tahoecs2: 60% CPU, 14Mbps out, 11Mbps in, load avg .74  (web server)
    164 tahoecs1: 78% CPU, 7Mbps out, 17Mbps in, load avg .91  (web server)
    165 tahoebs4: 26% CPU, 4.7Mbps out, 3Mbps in, load avg .50  (storage server)
    166 tahoebs5: 34% CPU, 4.5Mbps out, 3Mbps in  (storage server)
    167 }}}
    168 
    169 Load is about the same as before, but of course the bandwidths are larger.
    170 For this file size, the per-file overhead seems to be more of a limiting
    171 factor than per-byte overhead.
    172 
    173 === test three: 80% upload, 20% download, 1MB mean file size ===
    174 
    175 Same environment as test 2, but 80% of the operations are uploads.
    176 
    177 {{{
    178 clients: 150kBps down, 680kBps up, .14 fps down, .67 fps up
    179 tahoecs1: 62% CPU, 11Mbps out, 2.9Mbps in, load avg .85
    180 tahoecs2: 57% CPU, 10Mbps out, 4Mbps in, load avg .76
    181 tahoebs4: 16% CPU, 700kBps out, 5.4Mbps in, load avg 0.4ish
    182 tahoebs5: 21% CPU, 870kBps out, 5.1Mbps in, load avg about 0.35
    183 }}}
    184 
    185 Overall throughput in files per second is about half of that in the download-heavy case (0.81 fps versus 1.73 fps). Either uploading files
    186 or modifying the dirnodes looks to be more expensive than downloading. The
    187 CPU usage on the web servers was lower, suggesting that the expense might be
    188 in round trips rather than actual computation.
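
The comparison between test two and test three can be checked from the client-side figures in the two result blocks. The dictionary layout and function names are made up for this sketch, and 2MBps is treated as 2000kBps.

```python
# Client-side totals from the result blocks for tests two and three
test_two = {"down_fps": 1.37, "up_fps": 0.36, "down_kBps": 2000.0, "up_kBps": 340.0}
test_three = {"down_fps": 0.14, "up_fps": 0.67, "down_kBps": 150.0, "up_kBps": 680.0}


def total_fps(result):
    """Files per second, uploads and downloads combined."""
    return result["down_fps"] + result["up_fps"]


def total_kBps(result):
    """Client bytes per second, uploads and downloads combined."""
    return result["down_kBps"] + result["up_kBps"]


fps_ratio = total_fps(test_three) / total_fps(test_two)
byte_ratio = total_kBps(test_three) / total_kBps(test_two)
```

Measured in files per second the upload-heavy run does a bit under half the work of the download-heavy run; measured in bytes per second the gap is somewhat larger.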
    189 
    190 === initial conclusions ===
    191 
    192 So far, Tahoe is scaling as designed: the client nodes are the ones doing
    193 most of the work, since these are the easiest to scale. In a deployment where
    194 central machines are doing encoding work, CPU on these machines will be the
    195 first bottleneck. Profiling can be used to determine how the upload process
    196 might be optimized: we don't yet know if encryption, hashing, or encoding is
    197 a primary CPU consumer. We can change the upload/download ratio to examine
    198 upload and download separately.
    199 
    200 Deploying large networks in which clients are not doing their own encoding
    201 will require sufficient CPU resources. Storage servers use minimal CPU, so
    202 having all storage servers also be web/encoding servers is a natural
    203 approach.
     1 (See also copious notes and data about performance of older versions of Tahoe-LAFS, archived at Performance/Old.)