Troubleshooting system performance with time series scatter plots

Time series scatter plots can provide far more insight into system performance than the usual average latencies and cumulative throughput rates. Because every transaction is plotted individually, you can see multiple populations of transactions that summary statistics hide. Here’s an example:
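As a quick illustration of how an average can hide two populations, here is a minimal sketch using synthetic numbers (not the client's data). The `summarize` helper and the 990/10 split between fast and slow transactions are assumptions made purely for illustration.

```python
import statistics


def summarize(latencies):
    """Return common summary statistics for a list of latencies (seconds)."""
    ordered = sorted(latencies)
    n = len(ordered)

    def percentile(p):
        # Nearest-rank percentile using integer arithmetic (p is an int 1..100).
        rank = max(1, (p * n + 99) // 100)
        return ordered[rank - 1]

    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p99": percentile(99),
        "max": ordered[-1],
    }


# Synthetic data (assumption): 990 fast transactions and 10 stalled ones.
latencies = [0.05] * 990 + [60.0] * 10
summary = summarize(latencies)
print(summary)
```

With these made-up numbers the mean comes out at roughly 0.65 seconds, a value no transaction actually experienced, while the median and the 99th percentile both sit at 0.05 seconds. Only the maximum hints at the stalled population, and it says nothing about when the stalls happened. A scatter plot shows both.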

In the scatter plot, the x-axis is the time at which each transaction ran and the y-axis is its transaction time (log scale). Each point is a specific transaction executed against an application server backed by a database on another system. This is an OLTP-style application with relatively short (sub-second) transaction times. The client was experiencing elevated average transaction times, with latency spikes of 60 seconds or more. In this case, a full scatter plot analysis clearly illustrates gaps in transactional processing.
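A plot like the one described can be sketched as follows, assuming you have one start timestamp and one latency per transaction. The function names, the gap threshold, and the output file name are hypothetical. The plotting function assumes matplotlib is installed, while the gap finder uses only the standard library.

```python
def find_gaps(start_times, min_gap_seconds):
    """Return (gap_start, gap_end) pairs where no transaction started
    for at least min_gap_seconds."""
    ordered = sorted(start_times)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev >= min_gap_seconds:
            gaps.append((prev, cur))
    return gaps


def plot_latency_scatter(start_times, latencies, path="latency_scatter.png"):
    """Plot one point per transaction: start time vs. latency (log scale).

    Latencies must be positive, because zero cannot be drawn on a log axis.
    """
    # Imported here so find_gaps can be used without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 4))
    ax.scatter(start_times, latencies, s=4, alpha=0.5)
    ax.set_yscale("log")
    ax.set_xlabel("transaction start time (s)")
    ax.set_ylabel("transaction time (s, log scale)")
    fig.savefig(path, dpi=150)
    plt.close(fig)


# Hypothetical timestamps: steady traffic, then a 30-second hole.
starts = [0.0, 0.5, 1.0, 31.0, 31.5]
print(find_gaps(starts, min_gap_seconds=10.0))
```

Small, semi-transparent markers keep dense bands readable, and the log scale lets sub-second transactions and multi-second outliers share one plot. The gap finder is a numeric cross-check for the empty vertical bands that the eye picks out in the scatter plot.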

Further instrumentation identified the source of the latency spikes as blocking on context pools in a service layer. A harness was used to bypass this service layer, and two tests were run: one with the service layer (gray line) and one without (tan line), as illustrated in the following plot:

This simple plot of throughput (y-axis) versus test duration (x-axis) illustrates the drop in throughput when the service layer was in play. Full scatter plot analysis, coupled with detailed latency fencing in the transactional stack, led to the eventual resolution of this issue.
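A throughput-versus-duration series of this kind can be derived from the completion timestamps of each test run. The sketch below is one way to do it under stated assumptions: `throughput_series` and the one-second bucket width are illustrative choices, not the harness used in the tests described above.

```python
from collections import Counter


def throughput_series(completion_times, bucket_seconds=1.0):
    """Return (elapsed_seconds, transactions_per_second) pairs, one per bucket.

    Elapsed time is measured from the earliest completion. Buckets in which
    nothing completed are reported with a throughput of zero.
    """
    if not completion_times:
        return []
    start = min(completion_times)
    counts = Counter(
        int((t - start) // bucket_seconds) for t in completion_times
    )
    last_bucket = max(counts)
    return [
        (bucket * bucket_seconds, counts.get(bucket, 0) / bucket_seconds)
        for bucket in range(last_bucket + 1)
    ]


# Hypothetical completion times (seconds) for one test run.
series = throughput_series([0.0, 0.2, 0.4, 1.1, 3.5], bucket_seconds=1.0)
print(series)
```

Computing one series per run (for example, with and without the service layer) and plotting them on the same axes gives the comparison described above. Keeping the empty buckets matters: a stall shows up as throughput falling to zero rather than as a missing point.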