[tahoe-dev] Fwd: incident report
Zooko Wilcox-O'Hearn
zooko at zooko.com
Thu Nov 15 11:57:16 UTC 2012
On Thu, Nov 15, 2012 at 12:11 AM, Iantcho Vassilev <ianchov at gmail.com> wrote:
> Here is the incident archive as from the server..
Thanks, Iantcho!
This is interesting. Let's see... I see the hostnames of all the
storage servers from your grid. Oh, I see that Peter Secor is running
one of them. Cool.
It says that your tahoe-lafs node was overloaded with work so that it
took 1, 7, or even 10 seconds to do a simple "how long did it take me
to do nothing?" test which should have taken < 1 second. Was this
because the server machine as a whole was overloaded or, more likely,
was the overload specific to the tahoe-lafs process?
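For anyone curious what a "how long did it take me to do nothing?" test
looks like, here is a minimal sketch. It is not the tahoe-lafs code; the
function name measure_scheduling_delay and its parameters are made up for
illustration, and it uses asyncio where the real node uses a different
event loop. It asks the event loop to wake it after a fixed interval and
records how late the wake-up actually was. A busy process shows large
delays even when the machine itself is mostly idle.

```python
import asyncio
import time


async def measure_scheduling_delay(interval=1.0, samples=3):
    """Return how late each `interval`-second sleep woke up, in seconds.

    Hypothetical helper for illustration only.
    """
    delays = []
    for _ in range(samples):
        start = time.monotonic()
        # "Do nothing" for `interval` seconds; a busy event loop wakes us late.
        await asyncio.sleep(interval)
        elapsed = time.monotonic() - start
        # Clamp at zero in case of timer granularity effects.
        delays.append(max(0.0, elapsed - interval))
    return delays


if __name__ == "__main__":
    delays = asyncio.run(measure_scheduling_delay(interval=0.1, samples=3))
    print("avg delay: %.4f s, max delay: %.4f s"
          % (sum(delays) / len(delays), max(delays)))
```

Delays of whole seconds from a test like this would suggest the process
was busy with other work when the timer came due.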
It looks like you're using the same node for storage server and as a
gateway to upload and download files. Nothing wrong with that, just
talking to myself out loud about what I see in here.
Okay, I don't see any unusual things in this log. Below are the
statistics from the log about the function of your storage server.
My guess is that the memory leak has something to do with this node
acting as a gateway (uploading and downloading files to remote
servers) rather than as a server. (Just because the gateway does a lot
more complicated work than the server does.) That doesn't mean the
memory leak is okay (I still want to fix it), but maybe you could
help track it down by running the gateway in a separate process and
seeing whether the gateway or the server is the one that starts using
too much memory next time around.
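If it helps, here is a rough sketch of one way to watch the memory use of
the two processes once they are split. It assumes a Linux machine, where
/proc/<pid>/status contains a VmRSS line; the function names are made up
for illustration, and you would have to supply the real PIDs of your
gateway and server yourself.

```python
def rss_kib_from_status(status_text):
    """Extract the VmRSS value (in KiB) from /proc/<pid>/status text.

    Returns None if no VmRSS line is present. Hypothetical helper.
    """
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     12345 kB".
            return int(line.split()[1])
    return None


def read_rss_kib(pid):
    """Read the resident set size of process `pid` (Linux only)."""
    with open("/proc/%d/status" % pid) as f:
        return rss_kib_from_status(f.read())


if __name__ == "__main__":
    # Demonstrate the parser on a made-up status excerpt.
    example = "Name:\tpython\nVmRSS:\t   12345 kB\n"
    print("RSS in KiB:", rss_kib_from_status(example))
```

Calling read_rss_kib on each PID every few minutes and logging the result
should show which of the two keeps growing.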
Regards,
Zooko
{'counters': {'downloader.bytes_downloaded': 323424955,
'downloader.files_downloaded': 54,
'mutable.bytes_published': 16519313,
'mutable.bytes_retrieved': 53957072,
'mutable.files_published': 205,
'mutable.files_retrieved': 744,
'storage_server.abort': 3,
'storage_server.add-lease': 1375,
'storage_server.allocate': 1610,
'storage_server.bytes_added': 4641488250,
'storage_server.close': 1248,
'storage_server.get': 40867,
'storage_server.read': 7625,
'storage_server.readv': 12895,
'storage_server.write': 236804,
'storage_server.writev': 745,
'uploader.bytes_uploaded': 4701532493,
'uploader.files_uploaded': 205},
'stats': {'chk_upload_helper.active_uploads': 0,
'chk_upload_helper.encoded_bytes': 0,
'chk_upload_helper.encoding_count': 0,
'chk_upload_helper.encoding_size': 0,
'chk_upload_helper.encoding_size_old': 0,
'chk_upload_helper.fetched_bytes': 0,
'chk_upload_helper.incoming_count': 0,
'chk_upload_helper.incoming_size': 0,
'chk_upload_helper.incoming_size_old': 0,
'chk_upload_helper.resumes': 0,
'chk_upload_helper.upload_already_present': 0,
'chk_upload_helper.upload_need_upload': 0,
'chk_upload_helper.upload_requests': 0,
'cpu_monitor.15min_avg': 0.011144453469539533,
'cpu_monitor.1min_avg': 0.018499974711772733,
'cpu_monitor.5min_avg': 0.01586668887236763,
'cpu_monitor.total': 2622.33,
'load_monitor.avg_load': 0.0435554305712382,
'load_monitor.max_load': 1.8420119285583496,
'node.uptime': 312675.6783568859,
'storage_server.accepting_immutable_shares': 1,
'storage_server.allocated': 0,
'storage_server.disk_avail': 400010970112,
'storage_server.disk_free_for_nonroot': 950010970112,
'storage_server.disk_free_for_root': 1004986490880,
'storage_server.disk_total': 1090848337920,
'storage_server.disk_used': 85861847040,
'storage_server.latencies.add-lease.01_0_percentile':
0.0001361370086669922,
'storage_server.latencies.add-lease.10_0_percentile':
0.00014400482177734375,
'storage_server.latencies.add-lease.50_0_percentile':
0.0005128383636474609,
'storage_server.latencies.add-lease.90_0_percentile':
0.01019287109375,
'storage_server.latencies.add-lease.95_0_percentile':
0.019355058670043945,
'storage_server.latencies.add-lease.99_0_percentile':
0.03629708290100098,
'storage_server.latencies.add-lease.99_9_percentile':
0.49553394317626953,
'storage_server.latencies.add-lease.mean': 0.003135397434234619,
'storage_server.latencies.add-lease.samplesize': 1000,
'storage_server.latencies.allocate.01_0_percentile':
0.00038504600524902344,
'storage_server.latencies.allocate.10_0_percentile':
0.0006988048553466797,
'storage_server.latencies.allocate.50_0_percentile':
0.0010409355163574219,
'storage_server.latencies.allocate.90_0_percentile':
0.015402078628540039,
'storage_server.latencies.allocate.95_0_percentile':
0.020718097686767578,
'storage_server.latencies.allocate.99_0_percentile':
0.040006160736083984,
'storage_server.latencies.allocate.99_9_percentile':
1.5722789764404297,
'storage_server.latencies.allocate.mean': 0.005430383205413818,
'storage_server.latencies.allocate.samplesize': 1000,
'storage_server.latencies.close.01_0_percentile':
9.608268737792969e-05,
'storage_server.latencies.close.10_0_percentile':
0.00010800361633300781,
'storage_server.latencies.close.50_0_percentile':
0.0002491474151611328,
'storage_server.latencies.close.90_0_percentile':
0.00026607513427734375,
'storage_server.latencies.close.95_0_percentile':
0.0002830028533935547,
'storage_server.latencies.close.99_0_percentile':
0.02466297149658203,
'storage_server.latencies.close.99_9_percentile':
0.44386720657348633,
'storage_server.latencies.close.mean': 0.00146563982963562,
'storage_server.latencies.close.samplesize': 1000,
'storage_server.latencies.get.01_0_percentile':
0.00011420249938964844,
'storage_server.latencies.get.10_0_percentile':
0.00019693374633789062,
'storage_server.latencies.get.50_0_percentile':
0.0003380775451660156,
'storage_server.latencies.get.90_0_percentile': 0.0416719913482666,
'storage_server.latencies.get.95_0_percentile': 0.06571078300476074,
'storage_server.latencies.get.99_0_percentile': 0.24586009979248047,
'storage_server.latencies.get.99_9_percentile': 2.745851993560791,
'storage_server.latencies.get.mean': 0.021070765972137452,
'storage_server.latencies.get.samplesize': 1000,
'storage_server.latencies.read.01_0_percentile': 1.9073486328125e-05,
'storage_server.latencies.read.10_0_percentile':
2.002716064453125e-05,
'storage_server.latencies.read.50_0_percentile':
2.2172927856445312e-05,
'storage_server.latencies.read.90_0_percentile':
4.601478576660156e-05,
'storage_server.latencies.read.95_0_percentile':
6.29425048828125e-05,
'storage_server.latencies.read.99_0_percentile':
0.00017309188842773438,
'storage_server.latencies.read.99_9_percentile': 0.01799607276916504,
'storage_server.latencies.read.mean': 0.0001399552822113037,
'storage_server.latencies.read.samplesize': 1000,
'storage_server.latencies.readv.01_0_percentile':
0.00012803077697753906,
'storage_server.latencies.readv.10_0_percentile':
0.00014901161193847656,
'storage_server.latencies.readv.50_0_percentile':
0.00024318695068359375,
'storage_server.latencies.readv.90_0_percentile':
0.00039505958557128906,
'storage_server.latencies.readv.95_0_percentile':
0.00046706199645996094,
'storage_server.latencies.readv.99_0_percentile':
0.03636002540588379,
'storage_server.latencies.readv.99_9_percentile': 1.7137258052825928,
'storage_server.latencies.readv.mean': 0.0028906521797180174,
'storage_server.latencies.readv.samplesize': 1000,
'storage_server.latencies.write.01_0_percentile':
5.984306335449219e-05,
'storage_server.latencies.write.10_0_percentile':
8.296966552734375e-05,
'storage_server.latencies.write.50_0_percentile':
9.393692016601562e-05,
'storage_server.latencies.write.90_0_percentile':
0.00013208389282226562,
'storage_server.latencies.write.95_0_percentile':
0.00014901161193847656,
'storage_server.latencies.write.99_0_percentile':
0.0001728534698486328,
'storage_server.latencies.write.99_9_percentile': 1.1543080806732178,
'storage_server.latencies.write.mean': 0.0019269251823425292,
'storage_server.latencies.write.samplesize': 1000,
'storage_server.latencies.writev.01_0_percentile':
0.00039196014404296875,
'storage_server.latencies.writev.10_0_percentile':
0.0004100799560546875,
'storage_server.latencies.writev.50_0_percentile':
0.0005409717559814453,
'storage_server.latencies.writev.90_0_percentile':
0.0009508132934570312,
'storage_server.latencies.writev.95_0_percentile':
0.0010209083557128906,
'storage_server.latencies.writev.99_0_percentile':
0.016919851303100586,
'storage_server.latencies.writev.99_9_percentile': None,
'storage_server.latencies.writev.mean': 0.0010515379425663277,
'storage_server.latencies.writev.samplesize': 745,
'storage_server.reserved_space': 550000000000,
'storage_server.total_bucket_count': 67873}}
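To scan a dump like this for outliers, something along these lines may
help. It is only a sketch: the function names are made up, the sample
dict holds just a few values copied from the dump above, and the CPU
calculation assumes that cpu_monitor.total is CPU seconds used by the
process and node.uptime is wall-clock seconds.

```python
def slow_tail_latencies(stats, threshold=1.0):
    """Return {operation: seconds} for 99.9th percentiles above `threshold`."""
    prefix = "storage_server.latencies."
    suffix = ".99_9_percentile"
    slow = {}
    for key, value in stats.items():
        if key.startswith(prefix) and key.endswith(suffix):
            # Some entries are None (writev in the dump above); skip those.
            if value is not None and value > threshold:
                slow[key[len(prefix):-len(suffix)]] = value
    return slow


def cpu_fraction(stats):
    """Fraction of uptime spent on CPU (assumes both values are in seconds)."""
    return stats["cpu_monitor.total"] / stats["node.uptime"]


# A few values copied from the dump above.
sample = {
    "cpu_monitor.total": 2622.33,
    "node.uptime": 312675.6783568859,
    "storage_server.latencies.get.99_9_percentile": 2.745851993560791,
    "storage_server.latencies.read.99_9_percentile": 0.01799607276916504,
    "storage_server.latencies.writev.99_9_percentile": None,
}

print("slow operations:", sorted(slow_tail_latencies(sample)))
print("cpu fraction: %.4f" % cpu_fraction(sample))
```

Under those assumptions the node spent under one percent of its uptime on
CPU, which is consistent with the low cpu_monitor averages in the dump.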