[tahoe-lafs-trac-stream] [tahoe-lafs] #484: client feedback channel
tahoe-lafs
trac at tahoe-lafs.org
Wed Jul 3 16:58:09 UTC 2013
#484: client feedback channel
-----------------------------+--------------------------------------------
Reporter: warner | Owner: somebody
Type: enhancement | Status: new
Priority: major | Milestone: undecided
Component: operational | Version: 1.1.0
Resolution: | Keywords: performance statistics logging
Launchpad Bug: |
-----------------------------+--------------------------------------------
Description:
It would be nice if clients had a way to report errors and performance
results to a central gatherer. This would be configured by dropping a
"client-feedback.furl" file into the client's basedir. The client would
then use this to send the following information to a gatherer at that
FURL (a connection sketch follows the list):

* foolscap log "Incidents": severe errors, along with the log events
  that immediately preceded them
* speeds/latencies of each network operation: upload/download
  performance
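
A minimal connection sketch, assuming foolscap's Tub API: the furl file
name comes from this ticket, but the gatherer's "report_incident" remote
method and its payload are purely hypothetical.

{{{
import os

from twisted.internet import reactor
from foolscap.api import Tub

BASEDIR = os.path.expanduser("~/.tahoe")  # assumed client basedir

def maybe_connect_feedback_gatherer():
    # The feature is off unless the operator drops the furl file in place.
    furlfile = os.path.join(BASEDIR, "client-feedback.furl")
    if not os.path.exists(furlfile):
        return None
    with open(furlfile) as f:
        furl = f.read().strip()
    tub = Tub()
    tub.startService()
    d = tub.getReference(furl)
    # Once connected, invoke a remote method on the gatherer; the method
    # name and payload here are illustrative, not an existing interface.
    d.addCallback(lambda gatherer:
                  gatherer.callRemote("report_incident", {"example": True}))
    return d

maybe_connect_feedback_gatherer()
reactor.run()
}}}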

The issue is that, since Tahoe is such a resilient system, there are a
lot of failure modes that wouldn't be visible to users. If a client
reads a share and sees the wrong hash, it just uses a different one, and
users don't see any problems unless there are many simultaneous
failures. However, from the grid-admin/server point of view, a bad hash
is a massively unlikely event and indicates serious problems: a disk is
failing, a server has bad RAM, a file is corrupted, etc. The server/grid
admin wants to know about these, even though the user does not.
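
A sketch of that client-side pattern, assuming foolscap's logging module
(check_share and its arguments are hypothetical stand-ins for Tahoe
internals): the wrong hash is survivable, but logging it at WEIRD
severity is what triggers a foolscap Incident for a gatherer to collect.

{{{
from foolscap.logging import log

def check_share(share, expected_hash, compute_hash):
    if compute_hash(share.data) != expected_hash:
        # Severe-but-survivable: record the event, then let the caller
        # fall back to a different share.
        log.msg(format="bad share hash from server %(server)s",
                server=share.server_id, level=log.WEIRD)
        return None
    return share
}}}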

Similarly, there are a number of grid tuning issues that are best
addressed by learning about the client experience and watching it change
over time. When you add new servers to a grid, clients that ask all
servers for their shares will take longer to do peer selection. How much
longer? The best way to find out is to have clients report their peer-
selection times to a gatherer on each operation, so that the gatherer
can graph them over time. The grid admins might want to make their
servers faster if it would improve user experience, but they need to
find out what the user experience is before they can make that decision.
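
One way to capture that number (all names here are hypothetical) is a
thin timing wrapper around the peer-selection step that queues a sample
for later delivery instead of sending it immediately:

{{{
import time

def timed_peer_selection(select_peers, storage_index, report_queue):
    start = time.time()
    peers = select_peers(storage_index)   # the real peer-selection work
    elapsed = time.time() - start
    report_queue.append(("peer-selection", storage_index, elapsed))
    return peers
}}}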

We might want to make these two separate reporting channels, with
separate FURLs. Also, we can batch the reporting of the performance
numbers: we don't have to report every single operation. We could cut
down on active network connections by only trying to connect once a day
and dumping incidents if and when we establish a connection. We need to
keep several issues in mind: thundering herd, overloading the gatherer,
bounding the queue size.
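
A sketch of that batching discipline (the parameters are assumptions): a
bounded queue keeps memory use flat, and a jittered once-a-day interval
keeps a large population of clients from reconnecting in unison.

{{{
import random
from collections import deque

MAX_QUEUED_REPORTS = 1000        # bound the queue size
BASE_INTERVAL = 24 * 60 * 60     # roughly one connection attempt a day

report_queue = deque(maxlen=MAX_QUEUED_REPORTS)  # oldest samples fall off

def next_connect_delay():
    # +/-10% jitter spreads clients out, avoiding a thundering herd.
    return BASE_INTERVAL * random.uniform(0.9, 1.1)

def flush_reports(gatherer):
    # Dump everything queued since the last successful connection;
    # "report_batch" is an illustrative remote method name.
    batch = list(report_queue)
    report_queue.clear()
    return gatherer.callRemote("report_batch", batch)
}}}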
--
Comment (by zooko):
Does [source:trunk/docs/logging.rst?rev=861892983369c0e96dc1e73420c1d9609724d752#log-gatherer the log-gatherer]
satisfy this ticket?
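
For context, per docs/logging.rst, a log gatherer is created with
"flogtool create-gatherer <workdir>" and hooked up by adding its furl to
each node's tahoe.cfg; the furl below is a placeholder:

{{{
[node]
log_gatherer.furl = pb://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx@example.com:3117/yyyyyyyy
}}}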
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/484#comment:4>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage