[tahoe-lafs-trac-stream] [Tahoe-LAFS] #905: gather information about historical server performance

Thu Oct 29 02:00:16 UTC 2015

#905: gather information about historical server performance
------------------------------+------------------------------------
     Reporter:  warner        |      Owner:
         Type:  enhancement   |     Status:  new
     Priority:  major         |  Milestone:  undecided
    Component:  code-network  |    Version:  1.5.0
   Resolution:                |   Keywords:  performance statistics
Launchpad Bug:                |
------------------------------+------------------------------------
Changes (by lpirl):

 * cc: tahoe-lafs.org@… (added)

New description:

 While patiently uploading some relatively small (3MB) files to the
 testgrid, I found myself wishing for more specific information about how
 long each server was taking to respond. Some part of me wanted to blame
 specific servers for being slow, but I realize that my upstream bandwidth
 is limited, and I'd like to compare the upload time of each segment
 against the minimum possible upload time given my available bandwidth. So
 I want more information about server performance.

 The download status web page (via "recent uploads and downloads") reports
 per-server per-segment fetch response times as a big list of numbers. It's
 possible to eyeball this and look for trends, for example in the following
 download:

 * Per-Server Segment Fetch Response Times:
     * {{{hk475awa}}}: 1.56s, 876ms, 714ms, 705ms, 545ms, 545ms, 539ms,
       579ms, 392ms, 386ms, 377ms, 371ms, 372ms, 387ms, 555ms, 388ms, 373ms,
       373ms, 372ms, 372ms, 370ms, 373ms, 370ms, 370ms
     * {{{lwkv6cji}}}: 160ms, 103ms, 105ms, 106ms, 100ms, 119ms, 117ms,
       105ms, 106ms, 109ms, 102ms, 107ms, 103ms, 1.02s, 91ms, 93ms, 90ms,
       93ms, 90ms, 92ms, 89ms, 93ms, 89ms, 80ms
     * {{{2gn6njsm}}}: 763ms, 536ms, 520ms, 537ms, 537ms, 548ms, 537ms,
       534ms, 527ms, 535ms, 528ms, 525ms, 628ms, 532ms, 541ms, 528ms, 530ms,
       528ms, 534ms, 527ms, 528ms, 527ms, 529ms, 448ms

 you can tell that {{{lwkv6cji}}} was pretty quick, and that {{{2gn6njsm}}}
 was 5-6x slower.
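
 The eyeball comparison can also be done mechanically. The following is a
 minimal sketch, not part of Tahoe-LAFS: it parses the listing above and
 compares per-server medians. The helper names {{{to_ms}}} and
 {{{median_ms}}} are made up for illustration, and the median is just one
 reasonable summary that is not thrown off by the occasional 1s outlier.

```python
from statistics import median

# Response times copied from the download status listing above.
RAW = {
    "hk475awa": "1.56s, 876ms, 714ms, 705ms, 545ms, 545ms, 539ms, 579ms, "
                "392ms, 386ms, 377ms, 371ms, 372ms, 387ms, 555ms, 388ms, "
                "373ms, 373ms, 372ms, 372ms, 370ms, 373ms, 370ms, 370ms",
    "lwkv6cji": "160ms, 103ms, 105ms, 106ms, 100ms, 119ms, 117ms, 105ms, "
                "106ms, 109ms, 102ms, 107ms, 103ms, 1.02s, 91ms, 93ms, "
                "90ms, 93ms, 90ms, 92ms, 89ms, 93ms, 89ms, 80ms",
    "2gn6njsm": "763ms, 536ms, 520ms, 537ms, 537ms, 548ms, 537ms, 534ms, "
                "527ms, 535ms, 528ms, 525ms, 628ms, 532ms, 541ms, 528ms, "
                "530ms, 528ms, 534ms, 527ms, 528ms, 527ms, 529ms, 448ms",
}


def to_ms(text):
    """Convert a duration such as '1.56s' or '876ms' to milliseconds."""
    text = text.strip()
    if text.endswith("ms"):
        return float(text[:-2])
    if text.endswith("s"):
        return float(text[:-1]) * 1000.0
    raise ValueError("unrecognized duration: %r" % (text,))


def median_ms(listing):
    """Median response time, in milliseconds, of one comma-separated listing."""
    return median(to_ms(item) for item in listing.split(","))


medians = {server: median_ms(listing) for server, listing in RAW.items()}

# Print servers from fastest to slowest.
for server, value in sorted(medians.items(), key=lambda kv: kv[1]):
    print("%s: median %.1f ms" % (server, value))
```

 For this download the medians come out at 102.5ms for {{{lwkv6cji}}},
 386.5ms for {{{hk475awa}}} and 531ms for {{{2gn6njsm}}}, a ratio of
 roughly 5.2 between the slowest and the fastest server.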

 I'd like three improvements on this:

  * collect data on uploads, not just downloads
  * display data in some sort of graphical form, to make it easier to spot
    trends. Maybe segnum should be the X axis, response time the Y axis,
    and serverid the color, with each sample drawn as a dot. Servers which
    gave consistent service would show up as horizontal stripes.
    Common-mode delays would appear as spikes.
  * connect and display data across multiple uploads and downloads. This
    would involve storing some history about each server, maybe in a sqlite
    database or something.
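
 To make the storage idea concrete, here is one possible shape for such a
 history, sketched with Python's standard {{{sqlite3}}} module. The table
 name, column names, and helper functions are all hypothetical; nothing like
 this exists in Tahoe-LAFS as far as this ticket says. The query returns
 rows in the (segnum, response time, serverid) form that the proposed dot
 plot would consume.

```python
import sqlite3
import time

# Hypothetical schema: one row per request, for uploads and downloads alike.
SCHEMA = """
CREATE TABLE IF NOT EXISTS response_times (
    serverid    TEXT    NOT NULL,
    direction   TEXT    NOT NULL CHECK (direction IN ('upload', 'download')),
    segnum      INTEGER NOT NULL,
    seconds     REAL    NOT NULL,
    recorded_at REAL    NOT NULL
)
"""


def open_history(path=":memory:"):
    """Open (or create) the history database and make sure the table exists."""
    db = sqlite3.connect(path)
    db.execute(SCHEMA)
    return db


def record(db, serverid, direction, segnum, seconds, when=None):
    """Store the response time of a single request."""
    recorded_at = time.time() if when is None else when
    db.execute(
        "INSERT INTO response_times VALUES (?, ?, ?, ?, ?)",
        (serverid, direction, segnum, seconds, recorded_at),
    )
    db.commit()


def points_for_plot(db, direction):
    """Return (segnum, seconds, serverid) rows: X, Y and color of each dot."""
    cursor = db.execute(
        "SELECT segnum, seconds, serverid FROM response_times "
        "WHERE direction = ? ORDER BY serverid, segnum",
        (direction,),
    )
    return cursor.fetchall()
```

 Keeping uploads and downloads in one table with a {{{direction}}} column
 is a design guess; it makes cross-operation queries per server simple, at
 the cost of mixing two kinds of measurement.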

 I've also been thinking about this in the context of a new Downloader
 (#287), in which I want to sort potential servers according to their
 likely download speeds (favoring fast servers, but experimenting with
 slower ones to spread the load and learn about alternatives). Mainly I
 want the Downloader to be able to tell the difference between a server
 that's running at normal speed, one that's running unusually slowly, and
 one that's disconnected, and I think historical data will help here.
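
 One simple way to "favor fast servers, but experiment with slower ones" is
 an epsilon-greedy ordering. The sketch below is an illustration of that
 general technique, not the design of the Downloader in #287; the function
 name {{{order_servers}}} and the {{{explore_probability}}} parameter are
 assumptions. The sample figures are medians computed from the example
 download earlier in this description.

```python
import random


def order_servers(avg_response, explore_probability=0.1, rng=random):
    """Order servers fastest-first, occasionally promoting a slower one.

    avg_response maps serverid -> historical response time in seconds.
    With probability explore_probability, one randomly chosen server other
    than the fastest is moved to the front, so that slower servers still
    get sampled and their history stays current.
    """
    ranked = sorted(avg_response, key=avg_response.get)
    if len(ranked) > 1 and rng.random() < explore_probability:
        index = rng.randrange(1, len(ranked))  # any server but the fastest
        ranked.insert(0, ranked.pop(index))
    return ranked


# Medians (in seconds) of the three servers in the example download.
history = {"hk475awa": 0.3865, "lwkv6cji": 0.1025, "2gn6njsm": 0.531}
print(order_servers(history))
```

 With {{{explore_probability=0}}} this degenerates to a plain
 fastest-first sort; larger values spread more load onto slower servers.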

 Also, with historical data, we might be able to deduce each server's local
 upload/download speed limits by looking for a minimum response time for
 given-sized messages. We might use this to build up a list of download
 servers with an aggregate bandwidth that matches our own download
 bandwidth (exactly filling all the pipes), or to influence share placement
 during upload.
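
 As a rough illustration of the minimum-response-time idea, the sketch
 below assumes the fastest observed response for a message of a given size
 follows {{{latency + size / bandwidth}}}, and fits a straight line through
 the per-size minima. Both the model and the function name
 {{{estimate_link}}} are assumptions made for this example, and the sample
 numbers are synthetic, not measurements.

```python
def estimate_link(samples):
    """Estimate (latency_seconds, bytes_per_second) from response times.

    samples is an iterable of (size_bytes, seconds). Only the fastest
    response seen for each message size is used, on the assumption that it
    is the one least disturbed by contention and server load.
    """
    fastest = {}
    for size, seconds in samples:
        fastest[size] = min(seconds, fastest.get(size, seconds))
    if len(fastest) < 2:
        raise ValueError("need at least two distinct message sizes")

    sizes = list(fastest)
    times = [fastest[size] for size in sizes]
    mean_size = sum(sizes) / len(sizes)
    mean_time = sum(times) / len(times)

    # Ordinary least-squares fit of: seconds = latency + size / bandwidth
    covariance = sum((s - mean_size) * (t - mean_time)
                     for s, t in zip(sizes, times))
    variance = sum((s - mean_size) ** 2 for s in sizes)
    slope = covariance / variance  # seconds per byte
    if slope <= 0:
        raise ValueError("larger messages were not slower; cannot estimate")
    return mean_time - slope * mean_size, 1.0 / slope


# Synthetic samples: 50 ms latency, 100 kB/s, plus two slower outliers.
samples = [(10000, 0.15), (10000, 0.40), (40000, 0.45), (40000, 0.90)]
latency, bandwidth = estimate_link(samples)
print("latency %.3f s, bandwidth %.0f bytes/s" % (latency, bandwidth))
```

 Note that the measured bottleneck may be the client's own link rather than
 the server's, which is the ambiguity raised at the top of this ticket.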

 Another challenge is that currently the only data we get is the response
 time for each request, which combines outbound request size (and
 contention for the wire), server response turnaround time (e.g. CPU load
 and disk IO latency), and response size (and wire contention). In some
 ways, we want to be able to distinguish between those components. In other
 ways, we really only care about the sum.

 But mainly, distinguishing between "slow" and "disconnected" might work
 better if we had finer-grained information, like how many bytes have
 arrived at the socket since the last complete message was received, and
 how long ago the last byte arrived. Likewise, when sending data, it would
 be useful to know that the transmit buffer still has pending data, and how
 long it's been since the last socket write was allowed. This would help
 tell the difference between a connection that is alive but only trickling
 data through slowly, versus one that has been stopped for the last 10
 seconds. With only coarser per-message data (which, for 3-of-10 and 128KiB
 segments, means about 40kB per message), it might be hard to confidently
 declare a disconnect within a reasonable multiple of the expected response
 time.
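
 The receive side of this could be tracked with very little state. The
 class below is a hypothetical sketch (the name {{{ConnectionProgress}}},
 its methods, and the three state labels are invented here); it assumes the
 transport layer can report each chunk of bytes as it arrives, which is
 exactly the finer-grained information described above, and that a response
 is currently being awaited.

```python
import time

# For 3-of-10 encoding, each block is one third of a 128 KiB segment,
# ignoring hashes and protocol overhead: roughly the "about 40kB" above.
BLOCK_BYTES = 128 * 1024 // 3


class ConnectionProgress:
    """Track byte-level progress while waiting for a response."""

    def __init__(self, now=None):
        self.last_byte_at = time.monotonic() if now is None else now
        self.bytes_since_message = 0

    def bytes_received(self, count, now=None):
        """Call whenever any bytes arrive on the socket."""
        self.bytes_since_message += count
        self.last_byte_at = time.monotonic() if now is None else now

    def message_complete(self):
        """Call when a whole message has been parsed."""
        self.bytes_since_message = 0

    def classify(self, stall_timeout=10.0, now=None):
        """Return 'stalled', 'trickling' or 'idle'."""
        now = time.monotonic() if now is None else now
        if now - self.last_byte_at >= stall_timeout:
            return "stalled"    # nothing at all for stall_timeout seconds
        if self.bytes_since_message > 0:
            return "trickling"  # partial message, bytes still arriving
        return "idle"           # recently active, no partial message
```

 The 10 second default simply mirrors the example in the text; a real
 threshold would presumably be derived from each server's history.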

 I don't know exactly what sorts of data I'd want or how much of it to
 keep. This ticket is to collect ideas on what to collect and what to do
 with it. The first concrete improvement would be to record per-request
 response times for upload just like we do for download. The second would
 be to graph them.

--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/905#comment:2>
Tahoe-LAFS <https://Tahoe-LAFS.org>
secure decentralized storage