source: trunk/docs/logging.rst

Last change on this file was f81900e, checked in by Brian Warner <warner@…>, at 2016-03-30T08:26:47Z

format docs for Sphinx

Added indexes, fixed cross-references.

Also a few pip-related cleanups I noticed along the way.

  • Property mode set to 100644
File size: 12.9 KB
Line 
1.. -*- coding: utf-8-with-signature -*-
2
3=============
4Tahoe Logging
5=============
6
71.  `Overview`_
82.  `Realtime Logging`_
93.  `Incidents`_
104.  `Working with flogfiles`_
115.  `Gatherers`_
12
13    1.  `Incident Gatherer`_
14    2.  `Log Gatherer`_
15
166.  `Adding log messages`_
177.  `Log Messages During Unit Tests`_
18
19Overview
20========
21
22Tahoe uses the Foolscap logging mechanism (known as the "flog" subsystem) to
23record information about what is happening inside the Tahoe node. This is
24primarily for use by programmers and grid operators who want to find out what
25went wrong.
26
27The Foolscap logging system is documented at
28`<https://github.com/warner/foolscap/blob/latest-release/doc/logging.rst>`__.
29
30The Foolscap distribution includes a utility named "``flogtool``" that is
31used to get access to many Foolscap logging features. ``flogtool`` should get
32installed into the same virtualenv as the ``tahoe`` command.
33
34
35Realtime Logging
36================
37
38When you are working on Tahoe code, and want to see what the node is doing,
39the easiest tool to use is "``flogtool tail``". This connects to the Tahoe
40node and subscribes to hear about all log events. These events are then
41displayed to stdout, and optionally saved to a file.
42
43"``flogtool tail``" connects to the "logport", for which the FURL is stored
44in ``BASEDIR/private/logport.furl`` . The following command will connect to
45this port and start emitting log information::
46
47  flogtool tail BASEDIR/private/logport.furl
48
49The ``--save-to FILENAME`` option will save all received events to a file,
50where then can be examined later with "``flogtool dump``" or "``flogtool
51web-viewer``". The ``--catch-up`` option will ask the node to dump all stored
52events before subscribing to new ones (without ``--catch-up``, you will only
53hear about events that occur after the tool has connected and subscribed).
54
55Incidents
56=========
57
58Foolscap keeps a short list of recent events in memory. When something goes
59wrong, it writes all the history it has (and everything that gets logged in
60the next few seconds) into a file called an "incident". These files go into
61``BASEDIR/logs/incidents/`` , in a file named
62"``incident-TIMESTAMP-UNIQUE.flog.bz2``". The default definition of
63"something goes wrong" is the generation of a log event at the ``log.WEIRD``
64level or higher, but other criteria could be implemented.
65
66The typical "incident report" we've seen in a large Tahoe grid is about 40kB
67compressed, representing about 1800 recent events.
68
69These "flogfiles" have a similar format to the files saved by "``flogtool
70tail --save-to``". They are simply lists of log events, with a small header
71to indicate which event triggered the incident.
72
73The "``flogtool dump FLOGFILE``" command will take one of these ``.flog.bz2``
74files and print their contents to stdout, one line per event. The raw event
75dictionaries can be dumped by using "``flogtool dump --verbose FLOGFILE``".
76
77The "``flogtool web-viewer``" command can be used to examine the flogfile in
78a web browser. It runs a small HTTP server and emits the URL on stdout.  This
79view provides more structure than the output of "``flogtool dump``": the
80parent/child relationships of log events is displayed in a nested format.
81"``flogtool web-viewer``" is still fairly immature.
82
83Working with flogfiles
84======================
85
86The "``flogtool filter``" command can be used to take a large flogfile
87(perhaps one created by the log-gatherer, see below) and copy a subset of
88events into a second file. This smaller flogfile may be easier to work with
89than the original. The arguments to "``flogtool filter``" specify filtering
90criteria: a predicate that each event must match to be copied into the target
91file. ``--before`` and ``--after`` are used to exclude events outside a given
92window of time. ``--above`` will retain events above a certain severity
93level. ``--from`` retains events send by a specific tubid.
94``--strip-facility`` removes events that were emitted with a given facility
95(like ``foolscap.negotiation`` or ``tahoe.upload``).
96
97Gatherers
98=========
99
100In a deployed Tahoe grid, it is useful to get log information automatically
101transferred to a central log-gatherer host. This offloads the (admittedly
102modest) storage requirements to a different host and provides access to
103logfiles from multiple nodes (web-API, storage, or helper) in a single place.
104
105There are two kinds of gatherers: "log gatherer" and "stats gatherer". Each
106produces a FURL which needs to be placed in the ``NODEDIR/tahoe.cfg`` file of
107each node that is to publish to the gatherer, under the keys
108"log_gatherer.furl" and "stats_gatherer.furl" respectively. When the Tahoe
109node starts, it will connect to the configured gatherers and offer its
110logport: the gatherer will then use the logport to subscribe to hear about
111events.
112
113The gatherer will write to files in its working directory, which can then be
114examined with tools like "``flogtool dump``" as described above.
115
116Incident Gatherer
117-----------------
118
119The "incident gatherer" only collects Incidents: records of the log events
120that occurred just before and slightly after some high-level "trigger event"
121was recorded. Each incident is classified into a "category": a short string
122that summarizes what sort of problem took place. These classification
123functions are written after examining a new/unknown incident. The idea is to
124recognize when the same problem is happening multiple times.
125
126A collection of classification functions that are useful for Tahoe nodes are
127provided in ``misc/incident-gatherer/support_classifiers.py`` . There is
128roughly one category for each ``log.WEIRD``-or-higher level event in the
129Tahoe source code.
130
131The incident gatherer is created with the "``flogtool
132create-incident-gatherer WORKDIR``" command, and started with "``tahoe
133start``". The generated "``gatherer.tac``" file should be modified to add
134classifier functions.
135
136The incident gatherer writes incident names (which are simply the relative
137pathname of the ``incident-\*.flog.bz2`` file) into ``classified/CATEGORY``.
138For example, the ``classified/mutable-retrieve-uncoordinated-write-error``
139file contains a list of all incidents which were triggered by an
140uncoordinated write that was detected during mutable file retrieval (caused
141when somebody changed the contents of the mutable file in between the node's
142mapupdate step and the retrieve step). The ``classified/unknown`` file
143contains a list of all incidents that did not match any of the classification
144functions.
145
146At startup, the incident gatherer will automatically reclassify any incident
147report which is not mentioned in any of the ``classified/\*`` files. So the
148usual workflow is to examine the incidents in ``classified/unknown``, add a
149new classification function, delete ``classified/unknown``, then bound the
150gatherer with "``tahoe restart WORKDIR``". The incidents which can be
151classified with the new functions will be added to their own
152``classified/FOO`` lists, and the remaining ones will be put in
153``classified/unknown``, where the process can be repeated until all events
154are classifiable.
155
156The incident gatherer is still fairly immature: future versions will have a
157web interface and an RSS feed, so operations personnel can track problems in
158the storage grid.
159
160In our experience, each incident takes about two seconds to transfer from the
161node that generated it to the gatherer. The gatherer will automatically catch
162up to any incidents which occurred while it is offline.
163
164Log Gatherer
165------------
166
167The "Log Gatherer" subscribes to hear about every single event published by
168the connected nodes, regardless of severity. This server writes these log
169events into a large flogfile that is rotated (closed, compressed, and
170replaced with a new one) on a periodic basis. Each flogfile is named
171according to the range of time it represents, with names like
172"``from-2008-08-26-132256--to-2008-08-26-162256.flog.bz2``". The flogfiles
173contain events from many different sources, making it easier to correlate
174things that happened on multiple machines (such as comparing a client node
175making a request with the storage servers that respond to that request).
176
177Create the Log Gatherer with the "``flogtool create-gatherer WORKDIR``"
178command, and start it with "``tahoe start``". Then copy the contents of the
179``log_gatherer.furl`` file it creates into the ``BASEDIR/tahoe.cfg`` file
180(under the key ``log_gatherer.furl`` of the section ``[node]``) of all nodes
181that should be sending it log events. (See :doc:`configuration`)
182
183The "``flogtool filter``" command, described above, is useful to cut down the
184potentially large flogfiles into a more focussed form.
185
186Busy nodes, particularly web-API nodes which are performing recursive
187deep-size/deep-stats/deep-check operations, can produce a lot of log events.
188To avoid overwhelming the node (and using an unbounded amount of memory for
189the outbound TCP queue), publishing nodes will start dropping log events when
190the outbound queue grows too large. When this occurs, there will be gaps
191(non-sequential event numbers) in the log-gatherer's flogfiles.
192
193Adding log messages
194===================
195
196When adding new code, the Tahoe developer should add a reasonable number of
197new log events. For details, please see the Foolscap logging documentation,
198but a few notes are worth stating here:
199
200* use a facility prefix of "``tahoe.``", like "``tahoe.mutable.publish``"
201
202* assign each severe (``log.WEIRD`` or higher) event a unique message
203  identifier, as the ``umid=`` argument to the ``log.msg()`` call. The
204  ``misc/coding_tools/make_umid`` script may be useful for this purpose.
205  This will make it easier to write a classification function for these
206  messages.
207
208* use the ``parent=`` argument whenever the event is causally/temporally
209  clustered with its parent. For example, a download process that involves
210  three sequential hash fetches could announce the send and receipt of those
211  hash-fetch messages with a ``parent=`` argument that ties them to the
212  overall download process. However, each new web-API download request should
213  be unparented.
214
215* use the ``format=`` argument in preference to the ``message=`` argument.
216  E.g. use ``log.msg(format="got %(n)d shares, need %(k)d", n=n, k=k)``
217  instead of ``log.msg("got %d shares, need %d" % (n,k))``. This will allow
218  later tools to analyze the event without needing to scrape/reconstruct the
219  structured data out of the formatted string.
220
221* Pass extra information as extra keyword arguments, even if they aren't
222  included in the ``format=`` string. This information will be displayed in
223  the "``flogtool dump --verbose``" output, as well as being available to
224  other tools. The ``umid=`` argument should be passed this way.
225
226* use ``log.err`` for the catch-all ``addErrback`` that gets attached to the
227  end of any given Deferred chain. When used in conjunction with
228  ``LOGTOTWISTED=1``, ``log.err()`` will tell Twisted about the error-nature
229  of the log message, causing Trial to flunk the test (with an "ERROR"
230  indication that prints a copy of the Failure, including a traceback).
231  Don't use ``log.err`` for events that are ``BAD`` but handled (like hash
232  failures: since these are often deliberately provoked by test code, they
233  should not cause test failures): use ``log.msg(level=BAD)`` for those
234  instead.
235
236
237Log Messages During Unit Tests
238==============================
239
240If a test is failing and you aren't sure why, start by enabling
241``FLOGTOTWISTED=1`` like this::
242
243  make test FLOGTOTWISTED=1
244
245With ``FLOGTOTWISTED=1``, sufficiently-important log events will be written
246into ``_trial_temp/test.log``, which may give you more ideas about why the
247test is failing.
248
249By default, ``_trial_temp/test.log`` will not receive messages below the
250``level=OPERATIONAL`` threshold. You can change the threshold via the ``FLOGLEVEL``
251variable, e.g.::
252
253  make test FLOGLEVEL=10 FLOGTOTWISTED=1
254
255(The level numbers are listed in src/allmydata/util/log.py.)
256
257To look at the detailed foolscap logging messages, run the tests like this::
258
259  make test FLOGFILE=flog.out.bz2 FLOGLEVEL=1 FLOGTOTWISTED=1
260
261The first environment variable will cause foolscap log events to be written
262to ``./flog.out.bz2`` (instead of merely being recorded in the circular
263buffers for the use of remote subscribers or incident reports). The second
264will cause all log events to be written out, not just the higher-severity
265ones. The third will cause twisted log events (like the markers that indicate
266when each unit test is starting and stopping) to be copied into the flogfile,
267making it easier to correlate log events with unit tests.
268
269Enabling this form of logging appears to roughly double the runtime of the
270unit tests. The ``flog.out.bz2`` file is approximately 2MB.
271
272You can then use "``flogtool dump``" or "``flogtool web-viewer``" on the
273resulting ``flog.out`` file.
274
275("``flogtool tail``" and the log-gatherer are not useful during unit tests,
276since there is no single Tub to which all the log messages are published).
277
278It is possible for setting these environment variables to cause spurious test
279failures in tests with race condition bugs. All known instances of this have
280been fixed as of Tahoe-LAFS v1.7.1.
Note: See TracBrowser for help on using the repository browser.