#484 new enhancement

client feedback channel

Reported by: warner Owned by: somebody
Priority: major Milestone: undecided
Component: operational Version: 1.1.0
Keywords: performance statistics logging Cc:
Launchpad Bug:

Description (last modified by zooko)

It would be nice if clients had a way to report errors and performance results to a central gatherer. This would be configured by dropping a "client-feedback.furl" file into the client's basedir. The client would then use this to send the following information to a gatherer at that FURL:

  • foolscap log "Incidents": severe errors, along with the log events that immediately preceded them
  • speeds/latencies of each network operation: upload/download performance
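A minimal sketch of the opt-in configuration described above, assuming the file holds a single FURL on one line. The helper name `read_feedback_furl` is hypothetical and is not an existing Tahoe function; only the filename "client-feedback.furl" comes from this ticket.

```python
import os

FEEDBACK_FURL_FILE = "client-feedback.furl"  # filename proposed in this ticket


def read_feedback_furl(basedir):
    """Return the gatherer FURL from basedir, or None if feedback is not configured.

    Hypothetical helper: a missing or empty file means reporting stays off.
    """
    path = os.path.join(basedir, FEEDBACK_FURL_FILE)
    try:
        with open(path, "r") as f:
            furl = f.read().strip()
    except OSError:
        # No file (or unreadable file): the client simply does not report.
        return None
    return furl or None
```

With this shape, reporting stays disabled unless the admin deliberately drops the file in place.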

The issue is that, since Tahoe is such a resilient system, there are a lot of failure modes that wouldn't be visible to users. If a client reads a share and sees the wrong hash, it just uses a different one, and users don't see any problems unless there are many simultaneous failures. However, from the grid-admin/server point of view, a bad hash is a massively unlikely event and indicates serious problems: a disk is failing, a server has bad RAM, a file is corrupted, etc. The server/grid admin wants to know about these, even though the user does not.

Similarly, there are a number of grid tuning issues that are best addressed by learning about the client experience and watching how it changes over time. When you add new servers to a grid, clients who ask all servers for their shares will take longer to do peer selection. How much longer? The best way to find out is to have clients report their peer-selection time to a gatherer on each operation, so that the gatherer can graph it over time. The grid admins might want to make their servers faster if it would improve user experience, but they need to find out what the user experience is before they can make that decision.
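As a rough illustration of the client-side measurement, the sketch below times an operation and summarizes the samples before they would be sent. The names `timed` and `summarize` and the "peer-selection" label are assumptions for illustration, not existing Tahoe APIs.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(samples, name, clock=time.monotonic):
    """Append (name, elapsed_seconds) to samples when the block exits."""
    start = clock()
    try:
        yield
    finally:
        samples.append((name, clock() - start))


def summarize(samples):
    """Collapse raw samples into per-operation count, mean and max."""
    grouped = {}
    for name, elapsed in samples:
        grouped.setdefault(name, []).append(elapsed)
    return {
        name: {
            "count": len(values),
            "mean": sum(values) / len(values),
            "max": max(values),
        }
        for name, values in grouped.items()
    }
```

A client could wrap peer selection in `with timed(samples, "peer-selection"):` and ship the output of `summarize(samples)` instead of every raw sample.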

We might want to make these two separate reporting channels, with separate FURLs. Also, we can batch the reporting of the performance numbers: we don't have to report every single operation. We could cut down on active network connections by only trying to connect once a day and dumping incidents if and when we establish a connection. We need to keep several issues in mind: thundering herd, overloading the gatherer, bounding the queue size.
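One possible way to address the three concerns listed above, sketched under assumptions: a bounded in-memory queue that drops the oldest reports when full, drained in fixed-size batches, with a randomized delay around the once-a-day connection attempt so that clients do not all connect at the same moment. The class name, function name and default sizes are hypothetical.

```python
import collections
import random


class FeedbackQueue:
    """Bounded buffer of pending reports; the oldest entries are dropped first."""

    def __init__(self, max_items=1000):
        # A deque with maxlen discards the oldest item when a new one is appended.
        self._items = collections.deque(maxlen=max_items)
        self.dropped = 0  # number of reports lost to the bound

    def add(self, report):
        if len(self._items) == self._items.maxlen:
            self.dropped += 1
        self._items.append(report)

    def drain(self, batch_size=100):
        """Remove and return up to batch_size reports, oldest first."""
        batch = []
        while self._items and len(batch) < batch_size:
            batch.append(self._items.popleft())
        return batch

    def __len__(self):
        return len(self._items)


def next_connect_delay(base_seconds=24 * 60 * 60, jitter_fraction=0.5, rng=random):
    """Seconds until the next connection attempt, spread around base_seconds."""
    jitter = rng.uniform(-jitter_fraction, jitter_fraction)
    return base_seconds * (1.0 + jitter)
```

Sending a limited batch per connection also caps how much work any single client can hand the gatherer at once.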

Change History (4)

comment:1 Changed at 2009-12-04T05:00:04Z by davidsarah

  • Component changed from code-performance to code-frontend-cli
  • Keywords performance added

comment:2 Changed at 2009-12-13T05:10:57Z by davidsarah

  • Keywords statistics logging added

comment:3 Changed at 2010-01-27T22:11:04Z by warner

  • Component changed from code-frontend-cli to operational
  • Owner set to somebody

comment:4 Changed at 2013-07-03T16:58:09Z by zooko

  • Description modified (diff)

Does the log-gatherer satisfy this ticket?
