tics" link inside<br>
the small "This Client" box. If the welcome page lives at<br>
<a href=3D"http://localhost:3456/" target=3D"_blank">http://localhost:3456/=
</a>, then the statistics page will live at<br>
<a href=3D"http://localhost:3456/statistics" target=3D"_blank">http://local=
host:3456/statistics</a> . This presents a summary of the stats<br>
block, along with a copy of the raw counters. To obtain just the raw counte=
rs<br>
(in JSON format), use /statistics?t=3Djson instead.<br>
<br>
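As a quick illustration, the JSON form can be fetched and parsed with a few
lines of Python (a sketch only: the URL assumes the default local web port
used throughout this document, and the two top-level keys are described in
the next section):

  import json
  import urllib.request

  # Fetch the raw counters and stats from a local node; adjust host/port
  # to match the node's configured web port.
  url = "http://localhost:3456/statistics?t=json"
  with urllib.request.urlopen(url) as response:
      data = json.load(response)

  print(sorted(data["counters"]))   # e.g. 'storage_server.allocate', ...
  print(sorted(data["stats"]))      # e.g. 'node.uptime', ...
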
= Statistics Categories =

The stats dictionary contains two keys: 'counters' and 'stats'. 'counters'
are strictly counters: they are reset to zero when the node is started, and
grow upwards. 'stats' are non-incrementing values, used to measure the
current state of various systems. Some stats are actually booleans, expressed
as '1' for true and '0' for false (internal restrictions require all stats
values to be numbers).

Under both the 'counters' and 'stats' dictionaries, each individual stat has
a key with a dot-separated name, breaking them up into groups like
'cpu_monitor' and 'storage_server'.

The currently available stats (as of release 1.6.0 or so) are described here:

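Concretely, the JSON document returned by /statistics?t=json therefore has
roughly the following shape (the numbers are invented; only the two top-level
keys and the dotted group.name scheme come from this document):

  {
    "counters": {
      "storage_server.allocate": 12,
      "storage_server.write": 341,
      "uploader.files_uploaded": 5
    },
    "stats": {
      "node.uptime": 86400.0,
      "storage_server.disk_avail": 1234567890
    }
  }
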
counters.storage_server.*: this group counts inbound storage-server
            operations. They are not provided by client-only nodes which have
            been configured to not run a storage server (with
            [storage]enabled=false in tahoe.cfg)
 allocate, write, close, abort: these are for immutable file uploads.
            'allocate' is incremented when a client asks if it can upload a
            share to the server. 'write' is incremented for each chunk of
            data written. 'close' is incremented when the share is finished.
            'abort' is incremented if the client abandons the upload.
 get, read: these are for immutable file downloads. 'get' is incremented
            when a client asks if the server has a specific share. 'read' is
            incremented for each chunk of data read.
 readv, writev: these are for mutable file creation, publish, and retrieve.
            'readv' is incremented each time a client reads part of a mutable
            share. 'writev' is incremented each time a client sends a
            modification request.
 add-lease, renew, cancel: these are for share lease modifications.
            'add-lease' is incremented when an 'add-lease' operation is
            performed (which either adds a new lease or renews an existing
            lease). 'renew' is for the 'renew-lease' operation (which can
            only be used to renew an existing one). 'cancel' is used for the
            'cancel-lease' operation.
 bytes_freed: this counts how many bytes were freed when a 'cancel-lease'
            operation removed the last lease from a share and the share was
            thus deleted.
 bytes_added: this counts how many bytes were consumed by immutable share
            uploads. It is incremented at the same time as the 'close'
            counter.

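To show how these dotted counter names are used in practice, here is a small
continuation of the earlier fetch sketch (illustrative only; missing keys
just mean the operation has not happened since the node started):

  counters = data["counters"]

  # Each successfully uploaded share ends with a 'close'; an abandoned
  # one ends with an 'abort'.
  started = counters.get("storage_server.allocate", 0)
  finished = counters.get("storage_server.close", 0)
  abandoned = counters.get("storage_server.abort", 0)
  print("share uploads started=%d finished=%d abandoned=%d"
        % (started, finished, abandoned))
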
stats.storage_server.*:
 allocated: this counts how many bytes are currently 'allocated', which
            tracks the space that will eventually be consumed by immutable
            share upload operations. The stat is increased as soon as the
            upload begins (at the same time the 'allocate' counter is
            incremented), and goes back to zero when the 'close' or 'abort'
            message is received (at which point the 'disk_used' stat should
            be incremented by the same amount).
 disk_total
 disk_used
 disk_free_for_root
 disk_free_for_nonroot
 disk_avail
 reserved_space: these all reflect disk-space usage policies and status.
            'disk_total' is the total size of the disk where the storage
            server's BASEDIR/storage/shares directory lives, as reported by
            /bin/df or equivalent. 'disk_used', 'disk_free_for_root', and
            'disk_free_for_nonroot' show related information.
            'reserved_space' reports the reservation configured by the
            tahoe.cfg [storage]reserved_space value. 'disk_avail' reports the
            remaining disk space available for the Tahoe server after
            subtracting reserved_space from disk_free_for_nonroot. All values
            are in bytes.
 accepting_immutable_shares: this is '1' if the storage server is currently
            accepting uploads of immutable shares. It may be '0' if a server
            is disabled by configuration, or if the disk is full (i.e.
            disk_avail is less than reserved_space).
 total_bucket_count: this counts the number of 'buckets' (i.e. unique
            storage-index values) currently managed by the storage server. It
            indicates roughly how many files are managed by the server.
 latencies.*.*: these stats keep track of local disk latencies for
            storage-server operations. A number of percentile values are
            tracked for many operations. For example,
            'storage_server.latencies.readv.50_0_percentile' records the
            median response time for a 'readv' request. All values are in
            seconds. These are recorded by the storage server, starting from
            the time the request arrives (post-deserialization) and ending
            when the response begins serialization. As such, they are mostly
            useful for measuring disk speeds. The operations tracked are the
            same as the counters.storage_server.* counter values (allocate,
            write, close, get, read, add-lease, renew, cancel, readv,
            writev). The percentile values tracked are: mean,
            01_0_percentile, 10_0_percentile, 50_0_percentile,
            90_0_percentile, 95_0_percentile, 99_0_percentile,
            99_9_percentile. (The last value, 99.9 percentile, means that 999
            out of the last 1000 operations were faster than the given
            number, and is the same threshold used by Amazon's internal SLA,
            according to the Dynamo paper.)

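Continuing the same fetch sketch, the disk-space stats and a latency
percentile are read the same way (nothing below is a Tahoe API, just
dictionary lookups on the documented key names):

  stats = data["stats"]

  # Disk-space status; all values are in bytes.
  avail = stats.get("storage_server.disk_avail", 0)
  reserved = stats.get("storage_server.reserved_space", 0)
  accepting = stats.get("storage_server.accepting_immutable_shares", 0)
  print("disk_avail=%d reserved=%d accepting_shares=%s"
        % (avail, reserved, bool(accepting)))

  # Median response time (in seconds) of 'readv' operations, if any have
  # been recorded yet.
  readv_median = stats.get("storage_server.latencies.readv.50_0_percentile")
  if readv_median is not None:
      print("median readv latency: %.3f s" % readv_median)
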
counters.uploader.files_uploaded
counters.uploader.bytes_uploaded
counters.downloader.files_downloaded
counters.downloader.bytes_downloaded

 These count client activity: a Tahoe client will increment these when it
 uploads or downloads an immutable file. 'files_uploaded' is incremented by
 one for each operation, while 'bytes_uploaded' is incremented by the size of
 the file.

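For example, the average size of uploaded files follows directly from these
two counters (continuing the earlier sketch; the division guard covers an
idle node):

  files = data["counters"].get("uploader.files_uploaded", 0)
  nbytes = data["counters"].get("uploader.bytes_uploaded", 0)
  if files:
      print("average uploaded file size: %.1f bytes" % (nbytes / files))
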
counters.mutable.files_published
counters.mutable.bytes_published
counters.mutable.files_retrieved
counters.mutable.bytes_retrieved

 These count client activity for mutable files. 'published' is the act of
 changing an existing mutable file (or creating a brand-new mutable file).
 'retrieved' is the act of reading its current contents.

counters.chk_upload_helper.*

 These count activity of the "Helper", which receives ciphertext from clients
 and performs erasure-coding and share upload for files that are not already
 in the grid. The code which implements these counters is in
 src/allmydata/immutable/offloaded.py .

 upload_requests: incremented each time a client asks to upload a file
 upload_already_present: incremented when the file is already in the grid
 upload_need_upload: incremented when the file is not already in the grid
 resumes: incremented when the helper already has partial ciphertext for
           the requested upload, indicating that the client is resuming an
           earlier upload
 fetched_bytes: this counts how many bytes of ciphertext have been fetched
           from uploading clients
 encoded_bytes: this counts how many bytes of ciphertext have been encoded
           and turned into successfully-uploaded shares. If no uploads have
           failed or been abandoned, encoded_bytes should eventually equal
           fetched_bytes.

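Because encoded_bytes trails fetched_bytes only by work that is still in
progress (or was abandoned), the difference is a cheap health indicator for a
Helper node. A sketch, again reusing the data dictionary fetched earlier:

  counters = data["counters"]
  fetched = counters.get("chk_upload_helper.fetched_bytes", 0)
  encoded = counters.get("chk_upload_helper.encoded_bytes", 0)
  # Ciphertext fetched from clients that has not (yet) been turned into
  # successfully-uploaded shares.
  print("ciphertext not yet encoded: %d bytes" % (fetched - encoded))
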
stats.chk_upload_helper.*

 These also track Helper activity:

 active_uploads: how many files are currently being uploaded. 0 when idle.
 incoming_count: how many cache files are present in the incoming/ directory,
           which holds ciphertext files that are still being fetched from the
           client
 incoming_size: total size of cache files in the incoming/ directory
 incoming_size_old: total size of 'old' cache files (more than 48 hours)
 encoding_count: how many cache files are present in the encoding/ directory,
           which holds ciphertext files that are being encoded and uploaded
 encoding_size: total size of cache files in the encoding/ directory
 encoding_size_old: total size of 'old' cache files (more than 48 hours)

stats.node.uptime: how many seconds since the node process was started

stats.cpu_monitor.*:
 .1min_avg, 5min_avg, 15min_avg: estimate of what percentage of system CPU
           time was consumed by the node process, over the given time
           interval. Expressed as a float, 0.0 for 0%, 1.0 for 100%
 .total: estimate of total number of CPU seconds consumed by the node since
           the process was started. Ticket #472 indicates that .total may
           sometimes be negative due to wraparound of the kernel's counter.

stats.load_monitor.*:
 When enabled, the "load monitor" continually schedules a one-second
 callback, and measures how late the response is. This estimates system load
 (if the system is idle, the response should be on time). This is only
 enabled if a stats-gatherer is configured.

 .avg_load: average "load" value (seconds late) over the last minute
 .max_load: maximum "load" value over the last minute

= Running a Tahoe Stats-Gatherer Service =

The "stats-gatherer" is a simple daemon that periodically collect=
s stats from<br>
several tahoe nodes. It could be useful, e.g., in a production environment,=
<br>
where you want to monitor dozens of storage servers from a central manageme=
nt<br>
host.<br>
<br>
The stats gatherer listens on a network port using the same Foolscap<br>
connection library that Tahoe clients use to connect to storage servers.<br=
>
Tahoe nodes can be configured to connect to the stats gatherer and publish<=
br>
their stats on a periodic basis. (in fact, what happens is that nodes conne=
ct<br>
to the gatherer and offer it a second FURL which points back to the node=
9;s<br>
"stats port", which the gatherer then uses to pull stats on a per=
iodic basis.<br>
The initial connection is flipped to allow the nodes to live behind NAT<br>
boxes, as long as the stats-gatherer has a reachable IP address)<br>
<br>
The stats-gatherer is created in the same fashion as regular tahoe client
nodes and introducer nodes. Choose a base directory for the gatherer to live
in (but do not create the directory). Then run:

 tahoe create-stats-gatherer $BASEDIR

and start it with "tahoe start $BASEDIR". Once running, the gatherer will
write a FURL into $BASEDIR/stats_gatherer.furl .

To configure a Tahoe client/server node to contact the stats gatherer, copy
this FURL into the node's tahoe.cfg file, in a section named "[client]",
under a key named "stats_gatherer.furl", like so:

 [client]
 stats_gatherer.furl = pb://qbo4ktl667zmtiuou6lwbjryli2brv6t@192.168.0.8:49997/wxycb4kaexzskubjnauxeoptympyf45y

or simply copy the stats_gatherer.furl file into the node's base directory
(next to the tahoe.cfg file): it will be interpreted in the same way.

Once running, the stats gatherer will create a standard python "pickle" file
in $BASEDIR/stats.pickle . Once a minute, the gatherer will pull stats
information from every connected node and write them into the pickle. The
pickle will contain a dictionary, in which node identifiers (known as "tubid"
strings) are the keys, and the values are a dict with 'timestamp',
'nickname', and 'stats' keys. d[tubid]['stats'] will contain the stats
dictionary as made available at http://localhost:3456/statistics?t=json . The
pickle file will only contain the most recent update from each node.

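A small sketch of reading that pickle (the file location is whatever $BASEDIR
you chose for the gatherer; the 'nickname', 'timestamp', and 'stats' keys are
as described above):

  import pickle

  # Path under the stats-gatherer's base directory; adjust as needed.
  with open("/path/to/gatherer-basedir/stats.pickle", "rb") as f:
      nodes = pickle.load(f)

  for tubid, record in nodes.items():
      print("%s (%s): last update %s"
            % (record["nickname"], tubid, record["timestamp"]))
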
Other tools can be built to examine these stats and render them into
something useful. For example, a tool could sum the
'storage_server.disk_avail' values from all servers to compute a
total-disk-available number for the entire grid (however, the "disk watcher"
daemon, in misc/spacetime/, is better suited for this specific task).

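Continuing the pickle-reading sketch above, that grid-wide total could be
computed like this (purely illustrative; this is not how the disk watcher is
implemented):

  total_avail = 0
  for record in nodes.values():
      # record['stats'] holds the same {'counters': ..., 'stats': ...}
      # structure that the /statistics?t=json page returns.
      node_stats = record["stats"].get("stats", {})
      total_avail += node_stats.get("storage_server.disk_avail", 0)
  print("grid-wide disk_avail: %d bytes" % total_avail)
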
= Using Munin To Graph Stats Values =

The misc/munin/ directory contains various plugins to graph stats for Tahoe
nodes. They are intended for use with the Munin system-management tool, which
typically polls target systems every 5 minutes and produces a web page with
graphs of various things over multiple time scales (last hour, last month,
last year).

Most of the plugins are designed to pull stats from a single Tahoe node, and
are configured with the http://localhost:3456/statistics?t=json URL. The
"tahoe_stats" plugin is designed to read from the pickle file created by the
stats-gatherer. Some are to be used with the disk watcher, and a few (like
tahoe_nodememory) are designed to watch the node processes directly (and must
therefore run on the same host as the target node).

Please see the docstrings at the beginning of each plugin for details, and
the "tahoe-conf" file for notes about configuration and installing these
plugins into a Munin environment.

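For orientation, a Munin plugin is just a small executable that prints a
configuration stanza when invoked with a "config" argument and metric values
otherwise. The following is a simplified, hypothetical sketch of that
protocol; it is not one of the plugins shipped in misc/munin/:

  #!/usr/bin/env python
  # Hypothetical minimal Munin plugin graphing one Tahoe counter.
  import json
  import sys
  import urllib.request

  URL = "http://localhost:3456/statistics?t=json"

  if len(sys.argv) > 1 and sys.argv[1] == "config":
      print("graph_title Tahoe files uploaded")
      print("graph_vlabel files")
      print("uploaded.label files_uploaded")
  else:
      with urllib.request.urlopen(URL) as response:
          data = json.load(response)
      print("uploaded.value %d"
            % data["counters"].get("uploader.files_uploaded", 0))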