source: trunk/docs/stats.rst

Last change on this file was 6310774, checked in by Florian Sesser <florian@…>, at 2022-09-08T17:50:58Z

Add documentation on OpenMetrics statistics endpoint.

references ticket:3786

.. -*- coding: utf-8-with-signature -*-

================
Tahoe Statistics
================

1. `Overview`_
2. `Statistics Categories`_
3. `Using Munin To Graph Stats Values`_
4. `Scraping Stats Values in OpenMetrics Format`_

Overview
========

Each Tahoe node collects and publishes statistics about its operations as it
runs. These include counters of how many files have been uploaded and
downloaded, CPU usage information, performance numbers like latency of
storage server operations, and available disk space.

The easiest way to see the stats for any given node is to use the web
interface. From the main "Welcome Page", follow the "Operational Statistics"
link inside the small "This Client" box. If the welcome page lives at
http://localhost:3456/, then the statistics page will live at
http://localhost:3456/statistics . This presents a summary of the stats
block, along with a copy of the raw counters. To obtain just the raw counters
(in JSON format), use /statistics?t=json instead.
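
As a minimal sketch of reading the raw counters programmatically, the
following Python splits a ``?t=json`` response body into its two top-level
dictionaries. The function names ``split_stats`` and ``fetch_stats`` are
hypothetical helpers (not part of Tahoe), and the default URL is simply the
example address used above:

.. code-block:: python

    import json
    from urllib.request import urlopen


    def split_stats(raw_json):
        """Return the (counters, stats) dictionaries from a ?t=json body."""
        data = json.loads(raw_json)
        # Missing keys are treated as empty dictionaries (an assumption).
        return data.get("counters", {}), data.get("stats", {})


    def fetch_stats(base_url="http://localhost:3456/"):
        """Fetch and split the raw counters of a running node (needs network)."""
        url = base_url.rstrip("/") + "/statistics?t=json"
        with urlopen(url) as response:
            return split_stats(response.read().decode("utf-8"))

Calling ``fetch_stats()`` against a local node would return the two
dictionaries; ``split_stats`` can also be used on a saved response.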

Statistics Categories
=====================

The stats dictionary contains two keys: 'counters' and 'stats'. 'counters'
are strictly counters: they are reset to zero when the node is started, and
grow upwards. 'stats' are non-incrementing values, used to measure the
current state of various systems. Some stats are actually booleans, expressed
as '1' for true and '0' for false (internal restrictions require all stats
values to be numbers).
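
Because counters restart from zero whenever the node restarts, a consumer
that computes growth between two polls has to tolerate a reset. One possible
way to do that is sketched below; ``counter_delta`` is a hypothetical helper,
not something Tahoe provides:

.. code-block:: python

    def counter_delta(previous, current):
        """Growth of a counter between two samples, tolerating a node restart."""
        if current < previous:
            # The counter went backwards, so the node was restarted in between
            # and everything in `current` was accumulated since that restart.
            return current
        return current - previous

This undercounts whatever was accumulated between the last poll and the
restart, which is unavoidable when only two samples are available.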

Under both the 'counters' and 'stats' dictionaries, each individual stat has
a key with a dot-separated name, breaking them up into groups like
'cpu_monitor' and 'storage_server'.
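
The dot-separated names make it easy to regroup a flat dictionary by its
leading component. A small sketch, where ``group_by_prefix`` is a hypothetical
helper name and the sample keys are only illustrative:

.. code-block:: python

    from collections import defaultdict


    def group_by_prefix(flat):
        """Group {'group.rest.of.name': value} entries by their first component."""
        groups = defaultdict(dict)
        for key, value in flat.items():
            # Split only at the first dot; the remainder may contain more dots.
            group, _, name = key.partition(".")
            groups[group][name] = value
        return dict(groups)

For example, ``group_by_prefix(stats)["storage_server"]`` would hold every
storage-server stat, keyed by the rest of its name.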

The currently available stats (as of release 1.6.0 or so) are described here:

**counters.storage_server.\***

    this group counts inbound storage-server operations. They are not provided
    by client-only nodes which have been configured to not run a storage server
    (with [storage]enabled=false in tahoe.cfg).

    allocate, write, close, abort
        these are for immutable file uploads. 'allocate' is incremented when a
        client asks if it can upload a share to the server. 'write' is
        incremented for each chunk of data written. 'close' is incremented when
        the share is finished. 'abort' is incremented if the client abandons
        the upload.

    get, read
        these are for immutable file downloads. 'get' is incremented
        when a client asks if the server has a specific share. 'read' is
        incremented for each chunk of data read.

    readv, writev
        these are for mutable file creation, publish, and retrieve. 'readv'
        is incremented each time a client reads part of a mutable share.
        'writev' is incremented each time a client sends a modification
        request.

    add-lease, renew, cancel
        these are for share lease modifications. 'add-lease' is incremented
        when an 'add-lease' operation is performed (which either adds a new
        lease or renews an existing lease). 'renew' is for the 'renew-lease'
        operation (which can only be used to renew an existing one). 'cancel'
        is used for the 'cancel-lease' operation.

    bytes_freed
        this counts how many bytes were freed when a 'cancel-lease'
        operation removed the last lease from a share and the share
        was thus deleted.

    bytes_added
        this counts how many bytes were consumed by immutable share
        uploads. It is incremented at the same time as the 'close'
        counter.

**stats.storage_server.\***

    allocated
        this counts how many bytes are currently 'allocated', which
        tracks the space that will eventually be consumed by immutable
        share upload operations. The stat is increased as soon as the
        upload begins (at the same time the 'allocate' counter is
        incremented), and goes back to zero when the 'close' or 'abort'
        message is received (at which point the 'disk_used' stat should
        be incremented by the same amount).

    disk_total, disk_used, disk_free_for_root, disk_free_for_nonroot, disk_avail, reserved_space
        these all reflect disk-space usage policies and status.
        'disk_total' is the total size of disk where the storage
        server's BASEDIR/storage/shares directory lives, as reported
        by /bin/df or equivalent. 'disk_used', 'disk_free_for_root',
        and 'disk_free_for_nonroot' show related information.
        'reserved_space' reports the reservation configured by the
        tahoe.cfg [storage]reserved_space value. 'disk_avail'
        reports the remaining disk space available for the Tahoe
        server after subtracting reserved_space from
        disk_free_for_nonroot. All values are in bytes.

    accepting_immutable_shares
        this is '1' if the storage server is currently accepting uploads of
        immutable shares. It may be '0' if a server is disabled by
        configuration, or if the disk is full (i.e. disk_avail is less than
        reserved_space).

    total_bucket_count
        this counts the number of 'buckets' (i.e. unique
        storage-index values) currently managed by the storage
        server. It indicates roughly how many files are managed
        by the server.

    latencies.*.*
        these stats keep track of local disk latencies for
        storage-server operations. A number of percentile values are
        tracked for many operations. For example,
        'storage_server.latencies.readv.50_0_percentile' records the
        median response time for a 'readv' request. All values are in
        seconds. These are recorded by the storage server, starting
        from the time the request arrives (post-deserialization) and
        ending when the response begins serialization. As such, they
        are mostly useful for measuring disk speeds. The operations
        tracked are the same as the counters.storage_server.* counter
        values (allocate, write, close, get, read, add-lease, renew,
        cancel, readv, writev). The values tracked are:
        mean, 01_0_percentile, 10_0_percentile, 50_0_percentile,
        90_0_percentile, 95_0_percentile, 99_0_percentile,
        99_9_percentile. (The last value, the 99.9th percentile, means
        that 999 out of the last 1000 operations were faster than the
        given number, and is the same threshold used by Amazon's
        internal SLA, according to the Dynamo paper.)
        Percentiles are only reported when there are enough
        observations to interpret them unambiguously. For example,
        the 99.9th percentile differs from the 99th percentile by
        9 thousandths, a difference that can only be resolved with
        1000 or more observations, so the 99.9th percentile is only
        reported for samples of 1000 or more observations.
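
The latency stat names described above encode both the operation and the
percentile. As an illustration of that naming scheme only,
``parse_latency_key`` below is a hypothetical helper that splits such a key;
it assumes the key layout shown in the 'readv' example and nothing more:

.. code-block:: python

    def parse_latency_key(key):
        """Split 'storage_server.latencies.<op>.<name>' into (op, percentile).

        The percentile is returned as a float, or None for the 'mean' value.
        """
        prefix = "storage_server.latencies."
        if not key.startswith(prefix):
            raise ValueError("not a latency stat: %r" % (key,))
        operation, _, name = key[len(prefix):].rpartition(".")
        if name == "mean":
            return operation, None
        # '99_9_percentile' -> '99', '9', 'percentile'
        whole, tenths, _ = name.split("_")
        return operation, float("%s.%s" % (whole, tenths))

Since a percentile is omitted when there are too few observations, a consumer
should treat a missing latency key as "not enough data" rather than as zero.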


**counters.uploader.files_uploaded**

**counters.uploader.bytes_uploaded**

**counters.downloader.files_downloaded**

**counters.downloader.bytes_downloaded**

    These count client activity: a Tahoe client will increment these when it
    uploads or downloads an immutable file. 'files_uploaded' is incremented by
    one for each operation, while 'bytes_uploaded' is incremented by the size of
    the file.

**counters.mutable.files_published**

**counters.mutable.bytes_published**

**counters.mutable.files_retrieved**

**counters.mutable.bytes_retrieved**

    These count client activity for mutable files. 'published' is the act of
    changing an existing mutable file (or creating a brand-new mutable file).
    'retrieved' is the act of reading its current contents.

**counters.chk_upload_helper.\***

    These count activity of the "Helper", which receives ciphertext from clients
    and performs erasure-coding and share upload for files that are not already
    in the grid. The code which implements these counters is in
    src/allmydata/immutable/offloaded.py .

    upload_requests
        incremented each time a client asks to upload a file

    upload_already_present
        incremented when the file is already in the grid

    upload_need_upload
        incremented when the file is not already in the grid

    resumes
        incremented when the helper already has partial ciphertext for
        the requested upload, indicating that the client is resuming an
        earlier upload

    fetched_bytes
        this counts how many bytes of ciphertext have been fetched
        from uploading clients

    encoded_bytes
        this counts how many bytes of ciphertext have been
        encoded and turned into successfully-uploaded shares. If no
        uploads have failed or been abandoned, encoded_bytes should
        eventually equal fetched_bytes.
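
The relationship between fetched_bytes and encoded_bytes can be turned into a
simple backlog figure. The sketch below assumes the counters appear under the
keys ``chk_upload_helper.fetched_bytes`` and
``chk_upload_helper.encoded_bytes`` in the 'counters' dictionary;
``helper_unencoded_bytes`` is a hypothetical name:

.. code-block:: python

    def helper_unencoded_bytes(counters):
        """Bytes fetched from clients that have not (yet) become uploaded shares."""
        fetched = counters.get("chk_upload_helper.fetched_bytes", 0)
        encoded = counters.get("chk_upload_helper.encoded_bytes", 0)
        return fetched - encoded

A value that stays above zero while the Helper is idle suggests uploads that
failed or were abandoned, rather than work still in progress.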

**stats.chk_upload_helper.\***

    These also track Helper activity:

    active_uploads
        how many files are currently being uploaded. 0 when idle.

    incoming_count
        how many cache files are present in the incoming/ directory,
        which holds ciphertext files that are still being fetched
        from the client

    incoming_size
        total size of cache files in the incoming/ directory

    incoming_size_old
        total size of 'old' cache files (more than 48 hours old) in the
        incoming/ directory

    encoding_count
        how many cache files are present in the encoding/ directory,
        which holds ciphertext files that are being encoded and
        uploaded

    encoding_size
        total size of cache files in the encoding/ directory

    encoding_size_old
        total size of 'old' cache files (more than 48 hours old) in the
        encoding/ directory

**stats.node.uptime**
    how many seconds since the node process was started

**stats.cpu_monitor.\***

    1min_avg, 5min_avg, 15min_avg
        estimate of the fraction of system CPU time consumed by the
        node process over the given time interval, expressed as a float:
        0.0 for 0%, 1.0 for 100%

    total
        estimate of total number of CPU seconds consumed by node since
        the process was started. Ticket #472 indicates that .total may
        sometimes be negative due to wraparound of the kernel's counter.


Using Munin To Graph Stats Values
=================================

The misc/operations_helpers/munin/ directory contains various plugins to
graph stats for Tahoe nodes. They are intended for use with the Munin_
system-management tool, which typically polls target systems every 5 minutes
and produces a web page with graphs of various things over multiple time
scales (last hour, last month, last year).

Most of the plugins are designed to pull stats from a single Tahoe node, and
are configured with that node's statistics URL (for example
http://localhost:3456/statistics?t=json). The "tahoe_stats" plugin is
designed to read from the JSON file created by the stats-gatherer. Some
plugins are to be used with the disk watcher, and a few (like
tahoe_nodememory) are designed to watch the node processes directly (and must
therefore run on the same host as the target node).

Please see the docstrings at the beginning of each plugin for details, and
the "tahoe-conf" file for notes about configuration and installing these
plugins into a Munin environment.
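
For orientation, a Munin plugin prints graph metadata when invoked with the
``config`` argument and ``<field>.value <number>`` lines otherwise. The sketch
below only illustrates that output format for values taken from the 'stats'
dictionary. It is not how the bundled plugins are written, and the function
names, field names, and graph title are all hypothetical:

.. code-block:: python

    def munin_config(title, fields):
        """Lines a plugin would print when Munin calls it with 'config'."""
        lines = ["graph_title %s" % title]
        for field_name, (_stat_key, label) in sorted(fields.items()):
            lines.append("%s.label %s" % (field_name, label))
        return "\n".join(lines)


    def munin_values(stats, fields):
        """Lines a plugin would print on a normal (value-fetching) run."""
        lines = []
        for field_name, (stat_key, _label) in sorted(fields.items()):
            if stat_key in stats:  # skip stats this node does not report
                lines.append("%s.value %s" % (field_name, stats[stat_key]))
        return "\n".join(lines)


    # Hypothetical mapping: Munin field name -> (stats key, human-readable label)
    FIELDS = {"disk_used": ("storage_server.disk_used", "bytes used")}

A real plugin would fetch the statistics JSON first and then print one of the
two outputs depending on its command-line argument.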

.. _Munin: http://munin-monitoring.org/


Scraping Stats Values in OpenMetrics Format
===========================================

Time Series DataBase (TSDB) software like Prometheus_ and VictoriaMetrics_ can
scrape statistics in OpenMetrics_ format from a node's statistics URL (for
example http://localhost:3456/statistics?t=openmetrics). Software like
Grafana_ can then be used to graph and alert on these numbers. You can find a
pre-configured dashboard for Grafana at
https://grafana.com/grafana/dashboards/16894-tahoe-lafs/.
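
To inspect the endpoint's output without a TSDB, the text can be read
directly. The following is a deliberately simplified sketch, not a full
OpenMetrics parser: it handles only unlabelled ``name value`` sample lines and
skips comment lines such as ``# TYPE`` and ``# EOF``. The function name
``parse_openmetrics`` is hypothetical, and the metric names a node actually
emits are not assumed here:

.. code-block:: python

    def parse_openmetrics(text):
        """Return {metric_name: float_value} for unlabelled sample lines."""
        samples = {}
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # blank line, metadata, or the closing '# EOF'
            name, _, rest = line.partition(" ")
            value = rest.split()[0]  # ignore an optional trailing timestamp
            samples[name] = float(value)
        return samples

Samples that carry labels (``name{label="x"} value``) would need a real
parser, such as the one a Prometheus client library provides.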

.. _OpenMetrics: https://openmetrics.io/
.. _Prometheus: https://prometheus.io/
.. _VictoriaMetrics: https://victoriametrics.com/
.. _Grafana: https://grafana.com/