[tahoe-lafs-trac-stream] [tahoe-lafs] #484: client feedback channel
tahoe-lafs
trac at tahoe-lafs.org
Wed Jul 3 16:58:09 UTC 2013
#484: client feedback channel
-----------------------------+--------------------------------------------
Reporter: warner | Owner: somebody
Type: enhancement | Status: new
Priority: major | Milestone: undecided
Component: operational | Version: 1.1.0
Resolution: | Keywords: performance statistics logging
Launchpad Bug: |
-----------------------------+--------------------------------------------
Description:
It would be nice if clients had a way to report errors and performance
results to a central gatherer. This would be configured by dropping a
"client-feedback.furl" file into the client's basedir. The client would
then use this to send the following information to a gatherer at that
FURL (a connection sketch follows the list):

* foolscap log "Incidents": severe errors, along with the log events
  that immediately preceded them
* speeds/latencies of each network operation: upload/download
  performance
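
A minimal connection sketch, assuming foolscap's Tub API: the furl file
name comes from this ticket, but the gatherer's "report_incident" remote
method and its payload are purely hypothetical.

{{{
import os

from twisted.internet import reactor
from foolscap.api import Tub

BASEDIR = os.path.expanduser("~/.tahoe")  # assumed client basedir

def maybe_connect_feedback_gatherer():
    # The feature is off unless the operator drops the furl file in place.
    furlfile = os.path.join(BASEDIR, "client-feedback.furl")
    if not os.path.exists(furlfile):
        return None
    with open(furlfile) as f:
        furl = f.read().strip()
    tub = Tub()
    tub.startService()
    d = tub.getReference(furl)
    # Once connected, invoke a remote method on the gatherer; the method
    # name and payload here are illustrative, not an existing interface.
    d.addCallback(lambda gatherer:
                  gatherer.callRemote("report_incident", {"example": True}))
    return d

maybe_connect_feedback_gatherer()
reactor.run()
}}}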

The issue is that, since Tahoe is such a resilient system, there are a
lot of failure modes that wouldn't be visible to users. If a client
reads a share and sees the wrong hash, it just uses a different one, and
users don't see any problems unless there are many simultaneous
failures. However, from the grid-admin/server point of view, a bad hash
is a massively unlikely event and indicates serious problems: a disk is
failing, a server has bad RAM, a file is corrupted, etc. The server/grid
admin wants to know about these, even though the user does not.
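
A sketch of that client-side pattern, assuming foolscap's logging module
(check_share and its arguments are hypothetical stand-ins for Tahoe
internals): the wrong hash is survivable, but logging it at WEIRD
severity is what triggers a foolscap Incident for a gatherer to collect.

{{{
from foolscap.logging import log

def check_share(share, expected_hash, compute_hash):
    if compute_hash(share.data) != expected_hash:
        # Severe-but-survivable: record the event, then let the caller
        # fall back to a different share.
        log.msg(format="bad share hash from server %(server)s",
                server=share.server_id, level=log.WEIRD)
        return None
    return share
}}}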

Similarly, there are a number of grid tuning issues that are best
addressed by learning about the client experience and watching it change
over time. When you add new servers to a grid, clients that ask all
servers for their shares will take longer to do peer selection. How much
longer? The best way to find out is to have clients report their peer-
selection times to a gatherer on each operation, so that the gatherer
can graph them over time. The grid admins might want to make their
servers faster if it would improve user experience, but they need to
find out what the user experience is before they can make that decision.
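
One way to capture that number (all names here are hypothetical) is a
thin timing wrapper around the peer-selection step that queues a sample
for later delivery instead of sending it immediately:

{{{
import time

def timed_peer_selection(select_peers, storage_index, report_queue):
    start = time.time()
    peers = select_peers(storage_index)   # the real peer-selection work
    elapsed = time.time() - start
    report_queue.append(("peer-selection", storage_index, elapsed))
    return peers
}}}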

We might want to make these two separate reporting channels, with
separate FURLs. Also, we can batch the reporting of the performance
numbers: we don't have to report every single operation. We could cut
down on active network connections by only trying to connect once a day
and dumping incidents if and when we establish a connection. We need to
keep several issues in mind: thundering herd, overloading the gatherer,
bounding the queue size.
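
A sketch of that batching discipline (the parameters are assumptions): a
bounded queue keeps memory use flat, and a jittered once-a-day interval
keeps a large population of clients from reconnecting in unison.

{{{
import random
from collections import deque

MAX_QUEUED_REPORTS = 1000        # bound the queue size
BASE_INTERVAL = 24 * 60 * 60     # roughly one connection attempt a day

report_queue = deque(maxlen=MAX_QUEUED_REPORTS)  # oldest samples fall off

def next_connect_delay():
    # +/-10% jitter spreads clients out, avoiding a thundering herd.
    return BASE_INTERVAL * random.uniform(0.9, 1.1)

def flush_reports(gatherer):
    # Dump everything queued since the last successful connection;
    # "report_batch" is an illustrative remote method name.
    batch = list(report_queue)
    report_queue.clear()
    return gatherer.callRemote("report_batch", batch)
}}}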
--
Comment (by zooko):
Does [source:trunk/docs/logging.rst?rev=861892983369c0e96dc1e73420c1d9609724d752#log-gatherer the log-gatherer]
satisfy this ticket?
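
For context, per docs/logging.rst, a log gatherer is created with
"flogtool create-gatherer <workdir>" and hooked up by adding its furl to
each node's tahoe.cfg; the furl below is a placeholder:

{{{
[node]
log_gatherer.furl = pb://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx@example.com:3117/yyyyyyyy
}}}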
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/484#comment:4>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage