source: trunk/docs/stats.rst

Last change on this file was 93bb3e9, checked in by Brian Warner <warner@…>, at 2016-05-05T00:58:45Z

stats-gatherer: add --hostname/--location/--port

Updates docs, tests, explains how to update an old gatherer.

.. -*- coding: utf-8-with-signature -*-

================
Tahoe Statistics
================

1. `Overview`_
2. `Statistics Categories`_
3. `Running a Tahoe Stats-Gatherer Service`_
4. `Using Munin To Graph Stats Values`_

Overview
========

Each Tahoe node collects and publishes statistics about its operations as it
runs. These include counters of how many files have been uploaded and
downloaded, CPU usage information, performance numbers like latency of
storage server operations, and available disk space.

The easiest way to see the stats for any given node is to use the web
interface. From the main "Welcome Page", follow the "Operational Statistics"
link inside the small "This Client" box. If the welcome page lives at
http://localhost:3456/, then the statistics page will live at
http://localhost:3456/statistics . This presents a summary of the stats
block, along with a copy of the raw counters. To obtain just the raw counters
(in JSON format), use /statistics?t=json instead.

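As a rough illustration, the JSON form can be fetched and decoded with the
Python standard library. This is only a sketch: the function names
``fetch_raw_stats`` and ``decode_stats`` are hypothetical, the base URL is the
example address used above, and the sample text is made up for the demonstration.

```python
import json
from urllib.request import urlopen


def decode_stats(text):
    """Decode the JSON text returned by /statistics?t=json into a dict."""
    return json.loads(text)


def fetch_raw_stats(base_url="http://localhost:3456/"):
    """Fetch and decode the raw stats from a node's web interface.

    The default base_url is the example address from this document;
    substitute the address of your own node.
    """
    url = base_url.rstrip("/") + "/statistics?t=json"
    with urlopen(url) as response:
        return decode_stats(response.read().decode("utf-8"))


# Made-up sample text, so the decoding step can be tried without a node.
SAMPLE_TEXT = (
    '{"counters": {"uploader.files_uploaded": 3},'
    ' "stats": {"node.uptime": 120.5}}'
)
sample = decode_stats(SAMPLE_TEXT)
```

Calling ``fetch_raw_stats()`` requires a running node; ``decode_stats`` can be
exercised on any saved copy of the JSON output.
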
Statistics Categories
=====================

The stats dictionary contains two keys: 'counters' and 'stats'. 'counters'
are strictly counters: they are reset to zero when the node is started, and
grow upwards. 'stats' are non-incrementing values, used to measure the
current state of various systems. Some stats are actually booleans, expressed
as '1' for true and '0' for false (internal restrictions require all stats
values to be numbers).

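Because counters restart from zero whenever the node restarts, a tool that
compares two successive samples has to allow for the value going backwards.
The following is a minimal sketch of one way to do that; ``counter_delta`` is a
hypothetical helper, not part of Tahoe.

```python
def counter_delta(previous, current):
    """Return how much a counter grew between two samples.

    If the current value is smaller than the previous one, assume the
    node was restarted in between and the counter began again from zero,
    so everything counted so far is new growth.
    """
    if current < previous:
        return current
    return current - previous


grew = counter_delta(10, 25)        # normal growth between samples
after_restart = counter_delta(25, 4)  # counter was reset in between
```

This heuristic cannot detect a restart after which the counter has already
grown past its earlier value; comparing 'node.uptime' between the two samples
would be one way to catch that case.
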
Under both the 'counters' and 'stats' dictionaries, each individual stat has
a key with a dot-separated name, breaking them up into groups like
'cpu_monitor' and 'storage_server'.

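A consumer that wants to work with one group at a time can split each key at
its first dot. This is a sketch with a hypothetical helper name
(``group_by_prefix``) and made-up sample values.

```python
from collections import defaultdict


def group_by_prefix(flat):
    """Split dot-separated stat names into {group: {remainder: value}}."""
    groups = defaultdict(dict)
    for name, value in flat.items():
        # "storage_server.disk_avail" -> ("storage_server", "disk_avail")
        group, _, rest = name.partition(".")
        groups[group][rest] = value
    return dict(groups)


# Made-up values, for illustration only.
flat_stats = {
    "cpu_monitor.total": 1.5,
    "storage_server.disk_avail": 100,
    "storage_server.allocated": 0,
}
grouped = group_by_prefix(flat_stats)
```
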
The currently available stats (as of release 1.6.0 or so) are described here:

**counters.storage_server.\***

    this group counts inbound storage-server operations. They are not provided
    by client-only nodes which have been configured to not run a storage server
    (with [storage]enabled=false in tahoe.cfg).

    allocate, write, close, abort
        these are for immutable file uploads. 'allocate' is incremented when a
        client asks if it can upload a share to the server. 'write' is
        incremented for each chunk of data written. 'close' is incremented when
        the share is finished. 'abort' is incremented if the client abandons
        the upload.

    get, read
        these are for immutable file downloads. 'get' is incremented
        when a client asks if the server has a specific share. 'read' is
        incremented for each chunk of data read.

    readv, writev
        these are for mutable file creation, publish, and retrieve. 'readv'
        is incremented each time a client reads part of a mutable share.
        'writev' is incremented each time a client sends a modification
        request.

    add-lease, renew, cancel
        these are for share lease modifications. 'add-lease' is incremented
        when an 'add-lease' operation is performed (which either adds a new
        lease or renews an existing lease). 'renew' is for the 'renew-lease'
        operation (which can only be used to renew an existing one). 'cancel'
        is used for the 'cancel-lease' operation.

    bytes_freed
        this counts how many bytes were freed when a 'cancel-lease'
        operation removed the last lease from a share and the share
        was thus deleted.

    bytes_added
        this counts how many bytes were consumed by immutable share
        uploads. It is incremented at the same time as the 'close'
        counter.

**stats.storage_server.\***

    allocated
        this counts how many bytes are currently 'allocated', which
        tracks the space that will eventually be consumed by immutable
        share upload operations. The stat is increased as soon as the
        upload begins (at the same time the 'allocate' counter is
        incremented), and goes back to zero when the 'close' or 'abort'
        message is received (at which point the 'disk_used' stat should
        be incremented by the same amount).

    disk_total, disk_used, disk_free_for_root, disk_free_for_nonroot, disk_avail, reserved_space
        these all reflect disk-space usage policies and status.
        'disk_total' is the total size of disk where the storage
        server's BASEDIR/storage/shares directory lives, as reported
        by /bin/df or equivalent. 'disk_used', 'disk_free_for_root',
        and 'disk_free_for_nonroot' show related information.
        'reserved_space' reports the reservation configured by the
        tahoe.cfg [storage]reserved_space value. 'disk_avail'
        reports the remaining disk space available for the Tahoe
        server after subtracting reserved_space from
        disk_free_for_nonroot. All values are in bytes.

    accepting_immutable_shares
        this is '1' if the storage server is currently accepting uploads of
        immutable shares. It may be '0' if a server is disabled by
        configuration, or if the disk is full (i.e. disk_free_for_nonroot is
        less than reserved_space).

    total_bucket_count
        this counts the number of 'buckets' (i.e. unique
        storage-index values) currently managed by the storage
        server. It indicates roughly how many files are managed
        by the server.

    latencies.*.*
        these stats keep track of local disk latencies for
        storage-server operations. A number of percentile values are
        tracked for many operations. For example,
        'storage_server.latencies.readv.50_0_percentile' records the
        median response time for a 'readv' request. All values are in
        seconds. These are recorded by the storage server, starting
        from the time the request arrives (post-deserialization) and
        ending when the response begins serialization. As such, they
        are mostly useful for measuring disk speeds. The operations
        tracked are the same as the counters.storage_server.* counter
        values (allocate, write, close, get, read, add-lease, renew,
        cancel, readv, writev). The values tracked are:
        mean, 01_0_percentile, 10_0_percentile, 50_0_percentile,
        90_0_percentile, 95_0_percentile, 99_0_percentile,
        99_9_percentile. (The last value, the 99.9th percentile, means
        that 999 out of the last 1000 operations were faster than the
        given number, and is the same threshold used by Amazon's
        internal SLA, according to the Dynamo paper.)
        Percentiles are only reported in the case of a sufficient
        number of observations for unambiguous interpretation. For
        example, the 99.9th percentile is (at the level of thousandths
        precision) 9 thousandths greater than the 99th
        percentile for sample sizes greater than or equal to 1000,
        thus the 99.9th percentile is only reported for samples of 1000
        or more observations.


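The latency key names described above can be assembled mechanically from an
operation name and a percentile name. The sketch below assumes that a
percentile which is not reported (too few observations) is either absent from
the dictionary or present with a null value; that detail, the helper name
``latency_seconds``, and the sample values are assumptions made for illustration.

```python
def latency_seconds(stats, operation, which="50_0_percentile"):
    """Look up one storage-server latency value, in seconds.

    Returns None when the value is not reported. How an unreported
    percentile appears in the data is an assumption of this sketch:
    both a missing key and a null value are treated the same way.
    """
    key = "storage_server.latencies.%s.%s" % (operation, which)
    return stats.get(key)


# Made-up sample of the 'stats' dictionary.
sample_stats = {
    "storage_server.latencies.readv.50_0_percentile": 0.25,
    "storage_server.latencies.readv.99_9_percentile": None,
}
median_readv = latency_seconds(sample_stats, "readv")
tail_readv = latency_seconds(sample_stats, "readv", "99_9_percentile")
```
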
**counters.uploader.files_uploaded**

**counters.uploader.bytes_uploaded**

**counters.downloader.files_downloaded**

**counters.downloader.bytes_downloaded**

    These count client activity: a Tahoe client will increment these when it
    uploads or downloads an immutable file. 'files_uploaded' is incremented by
    one for each operation, while 'bytes_uploaded' is incremented by the size of
    the file.

**counters.mutable.files_published**

**counters.mutable.bytes_published**

**counters.mutable.files_retrieved**

**counters.mutable.bytes_retrieved**

    These count client activity for mutable files. 'published' is the act of
    changing an existing mutable file (or creating a brand-new mutable file).
    'retrieved' is the act of reading its current contents.

**counters.chk_upload_helper.\***

    These count activity of the "Helper", which receives ciphertext from clients
    and performs erasure-coding and share upload for files that are not already
    in the grid. The code which implements these counters is in
    src/allmydata/immutable/offloaded.py .

    upload_requests
        incremented each time a client asks to upload a file

    upload_already_present
        incremented when the file is already in the grid

    upload_need_upload
        incremented when the file is not already in the grid

    resumes
        incremented when the helper already has partial ciphertext for
        the requested upload, indicating that the client is resuming an
        earlier upload

    fetched_bytes
        this counts how many bytes of ciphertext have been fetched
        from uploading clients

    encoded_bytes
        this counts how many bytes of ciphertext have been
        encoded and turned into successfully-uploaded shares. If no
        uploads have failed or been abandoned, encoded_bytes should
        eventually equal fetched_bytes.

**stats.chk_upload_helper.\***

    These also track Helper activity:

    active_uploads
        how many files are currently being uploaded. 0 when idle.

    incoming_count
        how many cache files are present in the incoming/ directory,
        which holds ciphertext files that are still being fetched
        from the client

    incoming_size
        total size of cache files in the incoming/ directory

    incoming_size_old
        total size of 'old' cache files (more than 48 hours)

    encoding_count
        how many cache files are present in the encoding/ directory,
        which holds ciphertext files that are being encoded and
        uploaded

    encoding_size
        total size of cache files in the encoding/ directory

    encoding_size_old
        total size of 'old' cache files (more than 48 hours)

**stats.node.uptime**
    how many seconds since the node process was started

**stats.cpu_monitor.\***

    1min_avg, 5min_avg, 15min_avg
        estimate of what percentage of system CPU time was consumed by the
        node process, over the given time interval. Expressed as a float, 0.0
        for 0%, 1.0 for 100%

    total
        estimate of total number of CPU seconds consumed by node since
        the process was started. Ticket #472 indicates that .total may
        sometimes be negative due to wraparound of the kernel's counter.

**stats.load_monitor.\***

    When enabled, the "load monitor" continually schedules a one-second
    callback, and measures how late the response is. This estimates system load
    (if the system is idle, the response should be on time). This is only
    enabled if a stats-gatherer is configured.

    avg_load
        average "load" value (seconds late) over the last minute

    max_load
        maximum "load" value over the last minute


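The idea behind the load monitor can be illustrated with a small calculation:
given the times at which callbacks were scheduled to fire and the times at
which they actually fired, the "load" is how late each one was. This is a
sketch of the arithmetic only, not Tahoe's implementation; the function name
``load_summary`` and the sample timestamps are made up.

```python
def load_summary(scheduled, actual):
    """Summarize callback lateness (in seconds) as avg_load and max_load.

    'scheduled' and 'actual' are equal-length lists of timestamps for
    the same callbacks. A callback that fires on time, or early because
    of clock jitter, counts as zero lateness.
    """
    lateness = [max(0.0, a - s) for s, a in zip(scheduled, actual)]
    if not lateness:
        return None
    return {
        "avg_load": sum(lateness) / len(lateness),
        "max_load": max(lateness),
    }


# Four one-second callbacks, two of which fired half a second late.
summary = load_summary([1.0, 2.0, 3.0, 4.0], [1.0, 2.5, 3.0, 4.5])
```
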
Running a Tahoe Stats-Gatherer Service
======================================

The "stats-gatherer" is a simple daemon that periodically collects stats from
several tahoe nodes. It could be useful, e.g., in a production environment,
where you want to monitor dozens of storage servers from a central management
host. It merely gathers statistics from many nodes into a single place: it
does not do any actual analysis.

The stats gatherer listens on a network port using the same Foolscap_
connection library that Tahoe clients use to connect to storage servers.
Tahoe nodes can be configured to connect to the stats gatherer and publish
their stats on a periodic basis. (In fact, what happens is that nodes connect
to the gatherer and offer it a second FURL which points back to the node's
"stats port", which the gatherer then uses to pull stats on a periodic basis.
The initial connection is flipped to allow the nodes to live behind NAT
boxes, as long as the stats-gatherer has a reachable IP address.)

.. _Foolscap: https://foolscap.lothar.com/trac

The stats-gatherer is created in the same fashion as regular tahoe client
nodes and introducer nodes. Choose a base directory for the gatherer to live
in (but do not create the directory). Choose the hostname that should be
advertised in the gatherer's FURL. Then run:

::

   tahoe create-stats-gatherer --hostname=HOSTNAME $BASEDIR

and start it with "tahoe start $BASEDIR". Once running, the gatherer will
write a FURL into $BASEDIR/stats_gatherer.furl .

To configure a Tahoe client/server node to contact the stats gatherer, copy
this FURL into the node's tahoe.cfg file, in a section named "[client]",
under a key named "stats_gatherer.furl", like so:

::

    [client]
    stats_gatherer.furl = pb://qbo4ktl667zmtiuou6lwbjryli2brv6t@HOSTNAME:PORTNUM/wxycb4kaexzskubjnauxeoptympyf45y

or simply copy the stats_gatherer.furl file into the node's base directory
(next to the tahoe.cfg file): it will be interpreted in the same way.

When the gatherer is created, it will allocate a random unused TCP port, so
it should not conflict with anything else that you have running on that host
at that time. To explicitly control which port it uses, run the creation
command with ``--location=`` and ``--port=`` instead of ``--hostname=``. If
you use a hostname of ``example.org`` and a port number of ``1234``, then
run::

  tahoe create-stats-gatherer --location=tcp:example.org:1234 --port=tcp:1234

``--location=`` is a Foolscap FURL hints string (so it can be a
comma-separated list of connection hints), and ``--port=`` is a Twisted
"server endpoint specification string", as described in :doc:`configuration`.

Once running, the stats gatherer will create a standard JSON file in
``$BASEDIR/stats.json``. Once a minute, the gatherer will pull stats
information from every connected node and write them into the file. The file
will contain a dictionary, in which node identifiers (known as "tubid"
strings) are the keys, and the values are a dict with 'timestamp',
'nickname', and 'stats' keys. d[tubid]['stats'] will contain the stats
dictionary as made available at http://localhost:3456/statistics?t=json . The
file will only contain the most recent update from each node.

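A tool that consumes this file might start by listing which nodes have
reported and when. The following sketch relies only on the layout described
above; the helper names (``load_gatherer_file``, ``list_nodes``), the tubid
strings, and the sample values are invented for illustration.

```python
import json


def load_gatherer_file(path):
    """Load the gatherer's stats.json file from the given path."""
    with open(path) as f:
        return json.load(f)


def list_nodes(gathered):
    """Return (tubid, nickname, timestamp) tuples, sorted by tubid."""
    return [
        (tubid, entry["nickname"], entry["timestamp"])
        for tubid, entry in sorted(gathered.items())
    ]


# Made-up data in the layout described above.
sample_gathered = {
    "tubid-b": {"timestamp": 1462409990.0, "nickname": "server-2", "stats": {}},
    "tubid-a": {"timestamp": 1462409925.0, "nickname": "server-1", "stats": {}},
}
nodes = list_nodes(sample_gathered)
```

In practice the dictionary would come from
``load_gatherer_file("$BASEDIR/stats.json")`` with the real base directory
substituted.
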
Other tools can be built to examine these stats and render them into
something useful. For example, a tool could sum the
'storage_server.disk_avail' values from all servers to compute a
total-disk-available number for the entire grid (however, the "disk watcher"
daemon, in misc/operations_helpers/spacetime/, is better suited for this
specific task).

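The summing example could look roughly like the sketch below. It assumes the
file layout described for stats.json, in which each node's 'stats' entry holds
the same block that /statistics?t=json returns (with its own 'counters' and
'stats' keys). The function name ``total_disk_avail`` and the sample data are
hypothetical.

```python
def total_disk_avail(gathered):
    """Sum 'storage_server.disk_avail' (bytes) over all reporting nodes.

    Nodes that do not report the value, such as client-only nodes,
    are skipped.
    """
    total = 0
    for entry in gathered.values():
        # entry["stats"] is the node's stats block; the non-incrementing
        # values live under its own "stats" key.
        node_stats = entry.get("stats", {}).get("stats", {})
        value = node_stats.get("storage_server.disk_avail")
        if value is not None:
            total += value
    return total


# Made-up data: two storage servers and one client-only node.
sample_grid = {
    "tubid-a": {"stats": {"counters": {}, "stats": {"storage_server.disk_avail": 1000}}},
    "tubid-b": {"stats": {"counters": {}, "stats": {"storage_server.disk_avail": 2500}}},
    "tubid-c": {"stats": {"counters": {}, "stats": {"node.uptime": 42}}},
}
grid_total = total_disk_avail(sample_grid)
```
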
Using Munin To Graph Stats Values
=================================

The misc/operations_helpers/munin/ directory contains various plugins to
graph stats for Tahoe nodes. They are intended for use with the Munin_
system-management tool, which typically polls target systems every 5 minutes
and produces a web page with graphs of various things over multiple time
scales (last hour, last month, last year).

Most of the plugins are designed to pull stats from a single Tahoe node, and
are configured with a URL such as http://localhost:3456/statistics?t=json .
The "tahoe_stats" plugin is designed to read from the JSON file created by the
stats-gatherer. Some plugins are to be used with the disk watcher, and a few
(like tahoe_nodememory) are designed to watch the node processes directly
(and must therefore run on the same host as the target node).

Please see the docstrings at the beginning of each plugin for details, and
the "tahoe-conf" file for notes about configuration and installing these
plugins into a Munin environment.

.. _Munin: http://munin-monitoring.org/