
== System Load ==

The source:src/allmydata/test/check_load.py tool can be used to generate
random upload/download traffic, to see how much load a Tahoe grid imposes on
its hosts.

Preliminary results on the Allmydata test grid (14 storage servers spread
across four machines, each a roughly 3GHz P4, plus two web servers): we ran
three check_load.py clients with a 100ms delay between requests, an
80%-download/20%-upload traffic mix, and file sizes distributed exponentially
with a mean of 10kB. These three clients saw about 8-15kBps of download and
2.5kBps of upload traffic, doing about one download per second and 0.25
uploads per second. These rates were higher at the beginning of the test,
when the directories were smaller and thus faster to traverse.

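For reference, the traffic mix described above can be reproduced with a loop
like the following. This is only an illustrative sketch of the kind of
request loop check_load.py runs, not its actual code; the do_download() and
do_upload() helpers are hypothetical stand-ins for whatever HTTP calls the
real tool makes against the web servers.

{{{#!python
import random
import time

MEAN_FILE_SIZE = 10 * 1000      # bytes; file sizes are exponential, mean 10kB
DOWNLOAD_FRACTION = 0.80        # 80% downloads, 20% uploads
DELAY = 0.100                   # 100ms pause between requests

def generate_traffic(do_download, do_upload):
    """Issue a random 80/20 download/upload mix, pausing 100ms between requests.

    do_download() and do_upload(data) are hypothetical callables that talk to
    a Tahoe web server; they are placeholders, not part of check_load.py.
    """
    while True:
        if random.random() < DOWNLOAD_FRACTION:
            do_download()
        else:
            # draw an exponentially distributed file size with the given mean
            size = int(random.expovariate(1.0 / MEAN_FILE_SIZE))
            do_upload(b"\x00" * size)
        time.sleep(DELAY)
}}}
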
The storage servers were minimally loaded. Each storage node was consuming
about 9% of its CPU at the start of the test, 5% at the end. These nodes were
receiving about 50kbps throughout, and sending 50kbps initially (increasing
to 150kbps as the dirnodes got larger). Memory usage was trivial, about 35MB
VmSize per node, 25MB RSS. The load average on a 4-node box was about 0.3.

The two machines serving as web servers (performing all encryption, hashing,
and erasure-coding) were the most heavily loaded. The clients distributed
their requests randomly between the two web servers. Each server averaged
60%-80% CPU usage. Memory consumption was minor: 37MB VmSize and 29MB RSS on
one server, 45MB/33MB on the other. Load average grew from about 0.6 at the
start of the test to about 0.8 at the end. Outbound network traffic
(including both client-side plaintext and server-side shares) was about
600kbps for the whole test, while inbound traffic started at 200kbps and rose
to about 1Mbps by the end.

=== initial conclusions ===

So far, Tahoe is scaling as designed: the client nodes are the ones doing
most of the work, since these are the easiest to scale. In a deployment where
central machines are doing the encoding work, CPU on those machines will be
the first bottleneck. Profiling can be used to determine how the upload
process might be optimized: we don't yet know whether encryption, hashing, or
erasure coding is the primary CPU consumer. We can change the upload/download
ratio to examine uploads and downloads separately.

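As a starting point for that profiling, something like the following could be
wrapped around an upload call on one of the web/encoding servers. This is a
generic sketch using Python's standard cProfile module; upload_file() is a
hypothetical stand-in for whatever entry point actually performs the
encryption, hashing, and encoding, not a real Tahoe API.

{{{#!python
import cProfile
import pstats

def profile_upload(upload_file, data):
    """Profile a single upload and print the functions that used the most CPU.

    upload_file is a hypothetical callable standing in for the real
    encryption/hashing/encoding path; replace it with the actual entry point.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    upload_file(data)
    profiler.disable()

    # show the 20 most expensive functions by cumulative time
    stats = pstats.Stats(profiler)
    stats.sort_stats("cumulative").print_stats(20)
}}}
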
Deploying large networks in which clients do not do their own encoding will
require provisioning enough CPU on the central web/encoding machines. Since
storage servers use minimal CPU, having every storage server also act as a
web/encoding server is a natural approach.