Context Navigation

source: trunk/docs/stats.rst

Visit:

Last change on this file was 6310774, checked in by Florian Sesser <florian@…>, at 2022-09-08T17:50:58Z

Add documentation on OpenMetrics? statistics endpoint.

references ticket:3786

Property mode set to 100644

File size: 11.5 KB

Line
1	.. -- coding: utf-8-with-signature --
2
3	================
4	Tahoe Statistics
5	================
6
7	1. `Overview`_
8	2. `Statistics Categories`_
9	3. `Using Munin To Graph Stats Values`_
10
11	Overview
12	========
13
14	Each Tahoe node collects and publishes statistics about its operations as it
15	runs. These include counters of how many files have been uploaded and
16	downloaded, CPU usage information, performance numbers like latency of
17	storage server operations, and available disk space.
18
19	The easiest way to see the stats for any given node is use the web interface.
20	From the main "Welcome Page", follow the "Operational Statistics" link inside
21	the small "This Client" box. If the welcome page lives at
22	http://localhost:3456/, then the statistics page will live at
23	http://localhost:3456/statistics . This presents a summary of the stats
24	block, along with a copy of the raw counters. To obtain just the raw counters
25	(in JSON format), use /statistics?t=json instead.
26
27	Statistics Categories
28	=====================
29
30	The stats dictionary contains two keys: 'counters' and 'stats'. 'counters'
31	are strictly counters: they are reset to zero when the node is started, and
32	grow upwards. 'stats' are non-incrementing values, used to measure the
33	current state of various systems. Some stats are actually booleans, expressed
34	as '1' for true and '0' for false (internal restrictions require all stats
35	values to be numbers).
36
37	Under both the 'counters' and 'stats' dictionaries, each individual stat has
38	a key with a dot-separated name, breaking them up into groups like
39	'cpu_monitor' and 'storage_server'.
40
41	The currently available stats (as of release 1.6.0 or so) are described here:
42
43	counters.storage_server.\*
44
45	this group counts inbound storage-server operations. They are not provided
46	by client-only nodes which have been configured to not run a storage server
47	(with [storage]enabled=false in tahoe.cfg)
48
49	allocate, write, close, abort
50	these are for immutable file uploads. 'allocate' is incremented when a
51	client asks if it can upload a share to the server. 'write' is
52	incremented for each chunk of data written. 'close' is incremented when
53	the share is finished. 'abort' is incremented if the client abandons
54	the upload.
55
56	get, read
57	these are for immutable file downloads. 'get' is incremented
58	when a client asks if the server has a specific share. 'read' is
59	incremented for each chunk of data read.
60
61	readv, writev
62	these are for immutable file creation, publish, and retrieve. 'readv'
63	is incremented each time a client reads part of a mutable share.
64	'writev' is incremented each time a client sends a modification
65	request.
66
67	add-lease, renew, cancel
68	these are for share lease modifications. 'add-lease' is incremented
69	when an 'add-lease' operation is performed (which either adds a new
70	lease or renews an existing lease). 'renew' is for the 'renew-lease'
71	operation (which can only be used to renew an existing one). 'cancel'
72	is used for the 'cancel-lease' operation.
73
74	bytes_freed
75	this counts how many bytes were freed when a 'cancel-lease'
76	operation removed the last lease from a share and the share
77	was thus deleted.
78
79	bytes_added
80	this counts how many bytes were consumed by immutable share
81	uploads. It is incremented at the same time as the 'close'
82	counter.
83
84	stats.storage_server.\*
85
86	allocated
87	this counts how many bytes are currently 'allocated', which
88	tracks the space that will eventually be consumed by immutable
89	share upload operations. The stat is increased as soon as the
90	upload begins (at the same time the 'allocated' counter is
91	incremented), and goes back to zero when the 'close' or 'abort'
92	message is received (at which point the 'disk_used' stat should
93	incremented by the same amount).
94
95	disk_total, disk_used, disk_free_for_root, disk_free_for_nonroot, disk_avail, reserved_space
96	these all reflect disk-space usage policies and status.
97	'disk_total' is the total size of disk where the storage
98	server's BASEDIR/storage/shares directory lives, as reported
99	by /bin/df or equivalent. 'disk_used', 'disk_free_for_root',
100	and 'disk_free_for_nonroot' show related information.
101	'reserved_space' reports the reservation configured by the
102	tahoe.cfg [storage]reserved_space value. 'disk_avail'
103	reports the remaining disk space available for the Tahoe
104	server after subtracting reserved_space from disk_avail. All
105	values are in bytes.
106
107	accepting_immutable_shares
108	this is '1' if the storage server is currently accepting uploads of
109	immutable shares. It may be '0' if a server is disabled by
110	configuration, or if the disk is full (i.e. disk_avail is less than
111	reserved_space).
112
113	total_bucket_count
114	this counts the number of 'buckets' (i.e. unique
115	storage-index values) currently managed by the storage
116	server. It indicates roughly how many files are managed
117	by the server.
118
119	latencies..
120	these stats keep track of local disk latencies for
121	storage-server operations. A number of percentile values are
122	tracked for many operations. For example,
123	'storage_server.latencies.readv.50_0_percentile' records the
124	median response time for a 'readv' request. All values are in
125	seconds. These are recorded by the storage server, starting
126	from the time the request arrives (post-deserialization) and
127	ending when the response begins serialization. As such, they
128	are mostly useful for measuring disk speeds. The operations
129	tracked are the same as the counters.storage_server.* counter
130	values (allocate, write, close, get, read, add-lease, renew,
131	cancel, readv, writev). The percentile values tracked are:
132	mean, 01_0_percentile, 10_0_percentile, 50_0_percentile,
133	90_0_percentile, 95_0_percentile, 99_0_percentile,
134	99_9_percentile. (the last value, 99.9 percentile, means that
135	999 out of the last 1000 operations were faster than the
136	given number, and is the same threshold used by Amazon's
137	internal SLA, according to the Dynamo paper).
138	Percentiles are only reported in the case of a sufficient
139	number of observations for unambiguous interpretation. For
140	example, the 99.9th percentile is (at the level of thousandths
141	precision) 9 thousandths greater than the 99th
142	percentile for sample sizes greater than or equal to 1000,
143	thus the 99.9th percentile is only reported for samples of 1000
144	or more observations.
145
146
147	counters.uploader.files_uploaded
148
149	counters.uploader.bytes_uploaded
150
151	counters.downloader.files_downloaded
152
153	counters.downloader.bytes_downloaded
154
155	These count client activity: a Tahoe client will increment these when it
156	uploads or downloads an immutable file. 'files_uploaded' is incremented by
157	one for each operation, while 'bytes_uploaded' is incremented by the size of
158	the file.
159
160	counters.mutable.files_published
161
162	counters.mutable.bytes_published
163
164	counters.mutable.files_retrieved
165
166	counters.mutable.bytes_retrieved
167
168	These count client activity for mutable files. 'published' is the act of
169	changing an existing mutable file (or creating a brand-new mutable file).
170	'retrieved' is the act of reading its current contents.
171
172	counters.chk_upload_helper.\*
173
174	These count activity of the "Helper", which receives ciphertext from clients
175	and performs erasure-coding and share upload for files that are not already
176	in the grid. The code which implements these counters is in
177	src/allmydata/immutable/offloaded.py .
178
179	upload_requests
180	incremented each time a client asks to upload a file
181	upload_already_present: incremented when the file is already in the grid
182
183	upload_need_upload
184	incremented when the file is not already in the grid
185
186	resumes
187	incremented when the helper already has partial ciphertext for
188	the requested upload, indicating that the client is resuming an
189	earlier upload
190
191	fetched_bytes
192	this counts how many bytes of ciphertext have been fetched
193	from uploading clients
194
195	encoded_bytes
196	this counts how many bytes of ciphertext have been
197	encoded and turned into successfully-uploaded shares. If no
198	uploads have failed or been abandoned, encoded_bytes should
199	eventually equal fetched_bytes.
200
201	stats.chk_upload_helper.\*
202
203	These also track Helper activity:
204
205	active_uploads
206	how many files are currently being uploaded. 0 when idle.
207
208	incoming_count
209	how many cache files are present in the incoming/ directory,
210	which holds ciphertext files that are still being fetched
211	from the client
212
213	incoming_size
214	total size of cache files in the incoming/ directory
215
216	incoming_size_old
217	total size of 'old' cache files (more than 48 hours)
218
219	encoding_count
220	how many cache files are present in the encoding/ directory,
221	which holds ciphertext files that are being encoded and
222	uploaded
223
224	encoding_size
225	total size of cache files in the encoding/ directory
226
227	encoding_size_old
228	total size of 'old' cache files (more than 48 hours)
229
230	stats.node.uptime
231	how many seconds since the node process was started
232
233	stats.cpu_monitor.\*
234
235	1min_avg, 5min_avg, 15min_avg
236	estimate of what percentage of system CPU time was consumed by the
237	node process, over the given time interval. Expressed as a float, 0.0
238	for 0%, 1.0 for 100%
239
240	total
241	estimate of total number of CPU seconds consumed by node since
242	the process was started. Ticket #472 indicates that .total may
243	sometimes be negative due to wraparound of the kernel's counter.
244
245
246	Using Munin To Graph Stats Values
247	=================================
248
249	The misc/operations_helpers/munin/ directory contains various plugins to
250	graph stats for Tahoe nodes. They are intended for use with the Munin_
251	system-management tool, which typically polls target systems every 5 minutes
252	and produces a web page with graphs of various things over multiple time
253	scales (last hour, last month, last year).
254
255	Most of the plugins are designed to pull stats from a single Tahoe node, and
256	are configured with the e.g. http://localhost:3456/statistics?t=json URL. The
257	"tahoe_stats" plugin is designed to read from the JSON file created by the
258	stats-gatherer. Some plugins are to be used with the disk watcher, and a few
259	(like tahoe_nodememory) are designed to watch the node processes directly
260	(and must therefore run on the same host as the target node).
261
262	Please see the docstrings at the beginning of each plugin for details, and
263	the "tahoe-conf" file for notes about configuration and installing these
264	plugins into a Munin environment.
265
266	.. _Munin: http://munin-monitoring.org/
267
268
269	Scraping Stats Values in OpenMetrics Format
270	===========================================
271
272	Time Series DataBase (TSDB) software like Prometheus_ and VictoriaMetrics_ can
273	parse statistics from the e.g. http://localhost:3456/statistics?t=openmetrics
274	URL in OpenMetrics_ format. Software like Grafana_ can then be used to graph
275	and alert on these numbers. You can find a pre-configured dashboard for
276	Grafana at https://grafana.com/grafana/dashboards/16894-tahoe-lafs/.
277
278	.. _OpenMetrics: https://openmetrics.io/
279	.. _Prometheus: https://prometheus.io/
280	.. _VictoriaMetrics: https://victoriametrics.com/
281	.. _Grafana: https://grafana.com/

Note: See TracBrowser for help on using the repository browser.

Download in other formats: