tics" link inside<br>
the small "This Client" box. If the welcome page lives at<br>
<a href=3D"http://localhost:3456/" target=3D"_blank">http://localhost:3456/=
</a>, then the statistics page will live at<br>
<a href=3D"http://localhost:3456/statistics" target=3D"_blank">http://local=
host:3456/statistics</a> . This presents a summary of the stats<br>
block, along with a copy of the raw counters. To obtain just the raw counte=
rs<br>
(in JSON format), use /statistics?t=3Djson instead.<br>
<br>
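As a quick illustration, the JSON form can be fetched and parsed with a few
lines of Python (a sketch only: the URL assumes the default local web port
used throughout this document, and the two top-level keys are described in
the next section):

  import json
  import urllib.request

  # Fetch the raw counters and stats from a local node; adjust host/port
  # to match the node's configured web port.
  url = "http://localhost:3456/statistics?t=json"
  with urllib.request.urlopen(url) as response:
      data = json.load(response)

  print(sorted(data["counters"]))   # e.g. 'storage_server.allocate', ...
  print(sorted(data["stats"]))      # e.g. 'node.uptime', ...
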
= Statistics Categories =

The stats dictionary contains two keys: 'counters' and 'stats'. 'counters'
are strictly counters: they are reset to zero when the node is started, and
grow upwards. 'stats' are non-incrementing values, used to measure the
current state of various systems. Some stats are actually booleans, expressed
as '1' for true and '0' for false (internal restrictions require all stats
values to be numbers).

Under both the 'counters' and 'stats' dictionaries, each individual stat has
a key with a dot-separated name, breaking them up into groups like
'cpu_monitor' and 'storage_server'.

The currently available stats (as of release 1.6.0 or so) are described here:

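Concretely, the JSON document returned by /statistics?t=json therefore has
roughly the following shape (the numbers are invented; only the two top-level
keys and the dotted group.name scheme come from this document):

  {
    "counters": {
      "storage_server.allocate": 12,
      "storage_server.write": 341,
      "uploader.files_uploaded": 5
    },
    "stats": {
      "node.uptime": 86400.0,
      "storage_server.disk_avail": 1234567890
    }
  }
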
counters.storage_server.*: this group counts inbound storage-server
            operations. They are not provided by client-only nodes which have
            been configured to not run a storage server (with
            [storage]enabled=false in tahoe.cfg)
 allocate, write, close, abort: these are for immutable file uploads.
            'allocate' is incremented when a client asks if it can upload a
            share to the server. 'write' is incremented for each chunk of
            data written. 'close' is incremented when the share is finished.
            'abort' is incremented if the client abandons the upload.
 get, read: these are for immutable file downloads. 'get' is incremented
            when a client asks if the server has a specific share. 'read' is
            incremented for each chunk of data read.
 readv, writev: these are for mutable file creation, publish, and retrieve.
            'readv' is incremented each time a client reads part of a mutable
            share. 'writev' is incremented each time a client sends a
            modification request.
 add-lease, renew, cancel: these are for share lease modifications.
            'add-lease' is incremented when an 'add-lease' operation is
            performed (which either adds a new lease or renews an existing
            lease). 'renew' is for the 'renew-lease' operation (which can
            only be used to renew an existing one). 'cancel' is used for the
            'cancel-lease' operation.
 bytes_freed: this counts how many bytes were freed when a 'cancel-lease'
            operation removed the last lease from a share and the share was
            thus deleted.
 bytes_added: this counts how many bytes were consumed by immutable share
            uploads. It is incremented at the same time as the 'close'
            counter.

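To show how these dotted counter names are used in practice, here is a small
continuation of the earlier fetch sketch (illustrative only; missing keys
just mean the operation has not happened since the node started):

  counters = data["counters"]

  # Each successfully uploaded share ends with a 'close'; an abandoned
  # one ends with an 'abort'.
  started = counters.get("storage_server.allocate", 0)
  finished = counters.get("storage_server.close", 0)
  abandoned = counters.get("storage_server.abort", 0)
  print("share uploads started=%d finished=%d abandoned=%d"
        % (started, finished, abandoned))
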
stats.storage_server.*:
 allocated: this counts how many bytes are currently 'allocated', which
            tracks the space that will eventually be consumed by immutable
            share upload operations. The stat is increased as soon as the
            upload begins (at the same time the 'allocate' counter is
            incremented), and goes back to zero when the 'close' or 'abort'
            message is received (at which point the 'disk_used' stat should
            be incremented by the same amount).
 disk_total
 disk_used
 disk_free_for_root
 disk_free_for_nonroot
 disk_avail
 reserved_space: these all reflect disk-space usage policies and status.
            'disk_total' is the total size of the disk where the storage
            server's BASEDIR/storage/shares directory lives, as reported by
            /bin/df or equivalent. 'disk_used', 'disk_free_for_root', and
            'disk_free_for_nonroot' show related information.
            'reserved_space' reports the reservation configured by the
            tahoe.cfg [storage]reserved_space value. 'disk_avail' reports the
            remaining disk space available for the Tahoe server after
            subtracting reserved_space from disk_free_for_nonroot. All values
            are in bytes.
 accepting_immutable_shares: this is '1' if the storage server is currently
            accepting uploads of immutable shares. It may be '0' if a server
            is disabled by configuration, or if the disk is full (i.e.
            disk_avail is less than reserved_space).
 total_bucket_count: this counts the number of 'buckets' (i.e. unique
            storage-index values) currently managed by the storage server. It
            indicates roughly how many files are managed by the server.
 latencies.*.*: these stats keep track of local disk latencies for
            storage-server operations. A number of percentile values are
            tracked for many operations. For example,
            'storage_server.latencies.readv.50_0_percentile' records the
            median response time for a 'readv' request. All values are in
            seconds. These are recorded by the storage server, starting from
            the time the request arrives (post-deserialization) and ending
            when the response begins serialization. As such, they are mostly
            useful for measuring disk speeds. The operations tracked are the
            same as the counters.storage_server.* counter values (allocate,
            write, close, get, read, add-lease, renew, cancel, readv,
            writev). The percentile values tracked are: mean,
            01_0_percentile, 10_0_percentile, 50_0_percentile,
            90_0_percentile, 95_0_percentile, 99_0_percentile,
            99_9_percentile. (The last value, 99.9 percentile, means that 999
            out of the last 1000 operations were faster than the given
            number, and is the same threshold used by Amazon's internal SLA,
            according to the Dynamo paper.)

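Continuing the same fetch sketch, the disk-space stats and a latency
percentile are read the same way (nothing below is a Tahoe API, just
dictionary lookups on the documented key names):

  stats = data["stats"]

  # Disk-space status; all values are in bytes.
  avail = stats.get("storage_server.disk_avail", 0)
  reserved = stats.get("storage_server.reserved_space", 0)
  accepting = stats.get("storage_server.accepting_immutable_shares", 0)
  print("disk_avail=%d reserved=%d accepting_shares=%s"
        % (avail, reserved, bool(accepting)))

  # Median response time (in seconds) of 'readv' operations, if any have
  # been recorded yet.
  readv_median = stats.get("storage_server.latencies.readv.50_0_percentile")
  if readv_median is not None:
      print("median readv latency: %.3f s" % readv_median)
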
counters.uploader.files_uploaded
counters.uploader.bytes_uploaded
counters.downloader.files_downloaded
counters.downloader.bytes_downloaded

 These count client activity: a Tahoe client will increment these when it
 uploads or downloads an immutable file. 'files_uploaded' is incremented by
 one for each operation, while 'bytes_uploaded' is incremented by the size of
 the file.

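For example, the average size of uploaded files follows directly from these
two counters (continuing the earlier sketch; the division guard covers an
idle node):

  files = data["counters"].get("uploader.files_uploaded", 0)
  nbytes = data["counters"].get("uploader.bytes_uploaded", 0)
  if files:
      print("average uploaded file size: %.1f bytes" % (nbytes / files))
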
counters.mutable.files_published
counters.mutable.bytes_published
counters.mutable.files_retrieved
counters.mutable.bytes_retrieved

 These count client activity for mutable files. 'published' is the act of
 changing an existing mutable file (or creating a brand-new mutable file).
 'retrieved' is the act of reading its current contents.

counters.chk_upload_helper.*

 These count activity of the "Helper", which receives ciphertext from clients
 and performs erasure-coding and share upload for files that are not already
 in the grid. The code which implements these counters is in
 src/allmydata/immutable/offloaded.py .

 upload_requests: incremented each time a client asks to upload a file
 upload_already_present: incremented when the file is already in the grid
 upload_need_upload: incremented when the file is not already in the grid
 resumes: incremented when the helper already has partial ciphertext for
           the requested upload, indicating that the client is resuming an
           earlier upload
 fetched_bytes: this counts how many bytes of ciphertext have been fetched
           from uploading clients
 encoded_bytes: this counts how many bytes of ciphertext have been encoded
           and turned into successfully-uploaded shares. If no uploads have
           failed or been abandoned, encoded_bytes should eventually equal
           fetched_bytes.

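Because encoded_bytes trails fetched_bytes only by work that is still in
progress (or was abandoned), the difference is a cheap health indicator for a
Helper node. A sketch, again reusing the data dictionary fetched earlier:

  counters = data["counters"]
  fetched = counters.get("chk_upload_helper.fetched_bytes", 0)
  encoded = counters.get("chk_upload_helper.encoded_bytes", 0)
  # Ciphertext fetched from clients that has not (yet) been turned into
  # successfully-uploaded shares.
  print("ciphertext not yet encoded: %d bytes" % (fetched - encoded))
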
stats.chk_upload_helper.*

 These also track Helper activity:

 active_uploads: how many files are currently being uploaded. 0 when idle.
 incoming_count: how many cache files are present in the incoming/ directory,
           which holds ciphertext files that are still being fetched from the
           client
 incoming_size: total size of cache files in the incoming/ directory
 incoming_size_old: total size of 'old' cache files (more than 48 hours)
 encoding_count: how many cache files are present in the encoding/ directory,
           which holds ciphertext files that are being encoded and uploaded
 encoding_size: total size of cache files in the encoding/ directory
 encoding_size_old: total size of 'old' cache files (more than 48 hours)

stats.node.uptime: how many seconds since the node process was started

stats.cpu_monitor.*:
 .1min_avg, 5min_avg, 15min_avg: estimate of what percentage of system CPU
           time was consumed by the node process, over the given time
           interval. Expressed as a float, 0.0 for 0%, 1.0 for 100%
 .total: estimate of total number of CPU seconds consumed by the node since
           the process was started. Ticket #472 indicates that .total may
           sometimes be negative due to wraparound of the kernel's counter.

stats.load_monitor.*:
 When enabled, the "load monitor" continually schedules a one-second
 callback, and measures how late the response is. This estimates system load
 (if the system is idle, the response should be on time). This is only
 enabled if a stats-gatherer is configured.

 .avg_load: average "load" value (seconds late) over the last minute
 .max_load: maximum "load" value over the last minute

= Running a Tahoe Stats-Gatherer Service =

The "stats-gatherer" is a simple daemon that periodically collect=
s stats from<br>
several tahoe nodes. It could be useful, e.g., in a production environment,=
<br>
where you want to monitor dozens of storage servers from a central manageme=
nt<br>
host.<br>
<br>
The stats gatherer listens on a network port using the same Foolscap<br>
connection library that Tahoe clients use to connect to storage servers.<br=
>
Tahoe nodes can be configured to connect to the stats gatherer and publish<=
br>
their stats on a periodic basis. (in fact, what happens is that nodes conne=
ct<br>
to the gatherer and offer it a second FURL which points back to the node=
9;s<br>
"stats port", which the gatherer then uses to pull stats on a per=
iodic basis.<br>
The initial connection is flipped to allow the nodes to live behind NAT<br>
boxes, as long as the stats-gatherer has a reachable IP address)<br>
<br>
The stats-gatherer is created in the same fashion as regular tahoe client
nodes and introducer nodes. Choose a base directory for the gatherer to live
in (but do not create the directory). Then run:

 tahoe create-stats-gatherer $BASEDIR

and start it with "tahoe start $BASEDIR". Once running, the gatherer will
write a FURL into $BASEDIR/stats_gatherer.furl .

To configure a Tahoe client/server node to contact the stats gatherer, copy
this FURL into the node's tahoe.cfg file, in a section named "[client]",
under a key named "stats_gatherer.furl", like so:

 [client]
 stats_gatherer.furl = pb://qbo4ktl667zmtiuou6lwbjryli2brv6t@192.168.0.8:49997/wxycb4kaexzskubjnauxeoptympyf45y

or simply copy the stats_gatherer.furl file into the node's base directory
(next to the tahoe.cfg file): it will be interpreted in the same way.

Once running, the stats gatherer will create a standard python "pickle" file
in $BASEDIR/stats.pickle . Once a minute, the gatherer will pull stats
information from every connected node and write them into the pickle. The
pickle will contain a dictionary, in which node identifiers (known as "tubid"
strings) are the keys, and the values are a dict with 'timestamp',
'nickname', and 'stats' keys. d[tubid]['stats'] will contain the stats
dictionary as made available at http://localhost:3456/statistics?t=json . The
pickle file will only contain the most recent update from each node.

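A small sketch of reading that pickle (the file location is whatever $BASEDIR
you chose for the gatherer; the 'nickname', 'timestamp', and 'stats' keys are
as described above):

  import pickle

  # Path under the stats-gatherer's base directory; adjust as needed.
  with open("/path/to/gatherer-basedir/stats.pickle", "rb") as f:
      nodes = pickle.load(f)

  for tubid, record in nodes.items():
      print("%s (%s): last update %s"
            % (record["nickname"], tubid, record["timestamp"]))
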
Other tools can be built to examine these stats and render them into
something useful. For example, a tool could sum the
'storage_server.disk_avail' values from all servers to compute a
total-disk-available number for the entire grid (however, the "disk watcher"
daemon, in misc/spacetime/, is better suited for this specific task).

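Continuing the pickle-reading sketch above, that grid-wide total could be
computed like this (purely illustrative; this is not how the disk watcher is
implemented):

  total_avail = 0
  for record in nodes.values():
      # record['stats'] holds the same {'counters': ..., 'stats': ...}
      # structure that the /statistics?t=json page returns.
      node_stats = record["stats"].get("stats", {})
      total_avail += node_stats.get("storage_server.disk_avail", 0)
  print("grid-wide disk_avail: %d bytes" % total_avail)
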
= Using Munin To Graph Stats Values =

The misc/munin/ directory contains various plugins to graph stats for Tahoe
nodes. They are intended for use with the Munin system-management tool, which
typically polls target systems every 5 minutes and produces a web page with
graphs of various things over multiple time scales (last hour, last month,
last year).

Most of the plugins are designed to pull stats from a single Tahoe node, and
are configured with the http://localhost:3456/statistics?t=json URL. The
"tahoe_stats" plugin is designed to read from the pickle file created by the
stats-gatherer. Some are to be used with the disk watcher, and a few (like
tahoe_nodememory) are designed to watch the node processes directly (and must
therefore run on the same host as the target node).

Please see the docstrings at the beginning of each plugin for details, and
the "tahoe-conf" file for notes about configuration and installing these
plugins into a Munin environment.

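For orientation, a Munin plugin is just a small executable that prints a
configuration stanza when invoked with a "config" argument and metric values
otherwise. The following is a simplified, hypothetical sketch of that
protocol; it is not one of the plugins shipped in misc/munin/:

  #!/usr/bin/env python
  # Hypothetical minimal Munin plugin graphing one Tahoe counter.
  import json
  import sys
  import urllib.request

  URL = "http://localhost:3456/statistics?t=json"

  if len(sys.argv) > 1 and sys.argv[1] == "config":
      print("graph_title Tahoe files uploaded")
      print("graph_vlabel files")
      print("uploaded.label files_uploaded")
  else:
      with urllib.request.urlopen(URL) as response:
          data = json.load(response)
      print("uploaded.value %d"
            % data["counters"].get("uploader.files_uploaded", 0))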