How Plumbr will help you
Plumbr can monitor for a large and growing set of root causes. As of August 2016, we were able to link more than 80% of slow transactions to an explicit root cause.
The real world, unfortunately, contains an unlimited number of ways in which an application can perform poorly, so a fallback was needed to cover the cases where the specific root cause cannot be determined. In such situations, the Plumbr Agent takes thread dump(s) from the threads servicing slow transactions and aggregates the dumps into an easy-to-digest format, similar to the example above.
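Plumbr's agent internals are not shown here, but the basic operation of capturing a stack snapshot from live threads can be sketched with nothing but the JDK's management API. All class and method names below are illustrative, not Plumbr's actual code:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: capture a stack snapshot of every live thread,
// roughly what a monitoring agent would do for threads servicing
// slow transactions.
public class ThreadSnapshot {

    // Returns one formatted stack trace per live thread.
    public static List<String> capture() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> dumps = new ArrayList<>();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            StringBuilder sb = new StringBuilder(info.getThreadName()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    ").append(frame).append('\n');
            }
            dumps.add(sb.toString());
        }
        return dumps;
    }

    public static void main(String[] args) {
        capture().forEach(System.out::println);
    }
}
```

A real agent would of course sample only the threads tied to slow transactions and on a schedule, rather than dumping everything on demand.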
The example above shows that during the monitored period, there were 60,535 slow transactions consuming the /struts/doReceive endpoint across all the JVMs in this account for which the root cause was not explicitly detected.
Plumbr captured snapshots from all such transactions and aggregated them into the tree-like structure visible in the example above. Call stacks occurring most frequently are ranked higher in the tree.
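The aggregation idea can be illustrated with a small frequency-counting tree keyed by stack frames: identical frame sequences share a path, each node counts how often it was seen, and the hottest branch floats to the top. This is a sketch under my own assumptions, not Plumbr's implementation:

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch: aggregate repeated call-stack snapshots into a tree
// whose branches can be ranked by how often each frame sequence was captured.
public class StackTree {
    final Map<String, StackTree> children = new LinkedHashMap<>();
    int hits;

    // Insert one snapshot, frames listed from the root of the call chain down.
    void add(List<String> frames) {
        StackTree node = this;
        node.hits++;
        for (String frame : frames) {
            node = node.children.computeIfAbsent(frame, f -> new StackTree());
            node.hits++;
        }
    }

    // The child frame seen most often, i.e. the hottest branch at this level.
    Optional<String> hottestChild() {
        return children.entrySet().stream()
                .max(Comparator.comparingInt(e -> e.getValue().hits))
                .map(Map.Entry::getKey);
    }

    // Small demo: two snapshots share a branch, one diverges.
    static String demoHottest() {
        StackTree root = new StackTree();
        root.add(List.of("doReceive", "Streams.copy", "InputStream.read"));
        root.add(List.of("doReceive", "Streams.copy", "InputStream.read"));
        root.add(List.of("doReceive", "render"));
        return root.children.get("doReceive").hottestChild().orElse("none");
    }

    public static void main(String[] args) {
        System.out.println(demoHottest());
    }
}
```

Walking such a tree top-down naturally surfaces the dominant call path first, which is exactly how the ranked view described above is read.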
In addition, Plumbr exposes a more detailed view via a latency distribution chart. The following chart is extracted from the same example, in which 60,535 transactions were slower than expected. Just how slow is best seen from the chart itself:
As seen from the chart above, the median response time for transactions consuming the /struts/doReceive service was just above 11 seconds, but for 10% of them, or around 6,000 transactions, it took more than 40 seconds to complete. In the worst case, the user had to wait for more than half an hour.
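Figures like the median and the 90th percentile on such a chart come from the distribution of raw transaction durations. As a sketch, here is percentile extraction using the simple nearest-rank method (the aggregation Plumbr actually uses may differ):

```java
import java.util.Arrays;

// Sketch: derive the percentile figures behind a latency distribution
// chart from raw transaction durations, using the nearest-rank method.
public class LatencyPercentiles {

    // durations in milliseconds; p in (0, 100]
    static long percentile(long[] durations, double p) {
        long[] sorted = durations.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length); // nearest rank
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Illustrative samples only, not the data behind the chart above.
        long[] ms = {900, 11_200, 11_400, 42_000, 1_900_000};
        System.out.println("median = " + percentile(ms, 50) + " ms");
        System.out.println("p90    = " + percentile(ms, 90) + " ms");
    }
}
```

On a real data set of 60,535 durations, percentile(ms, 50) would yield the ~11-second median and percentile(ms, 90) the 40-second-plus tail described above.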
Solving root causes exposed via transaction snapshots involves extracting information from the composed call stack tree. Your focus should first be on the top branches of the tree, where the most frequently captured snapshots are aggregated. If a snapshot is captured frequently, and the sample size is statistically significant, you can be confident that this is the source of your performance issue.
In the example above it is immediately clear that for 92% of the transactions, or around 55,000 of them, the snapshot was taken from:
org.apache.commons.fileupload.util.Streams.copy():96
java.io.InputStream.read():82
org.apache.commons.fileupload.MultipartStream$ItemInputStream.read():886
org.apache.commons.fileupload.MultipartStream$ItemInputStream.makeAvailable():976
...
org.apache.coyote.http11.InternalInputBuffer$InputStreamInputBuffer.doRead():781
org.apache.coyote.http11.InternalInputBuffer.fill():751
java.net.SocketInputStream.read():129
This effectively means that a file upload operation via Apache Commons was being carried out in 55,000 of the total of 60,535 transactions. This sample size is definitely large enough to claim it as the actual root cause.
This was indeed the case, and the solution involved both capping the file upload size and introducing partitioned SSD-based storage for faster upload processing, effectively removing the issue at hand.
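With Apache Commons FileUpload the cap is typically configured via the library's size limits; framework-independently, the idea reduces to aborting a stream copy once a byte budget is exceeded instead of reading the upload whole. A hypothetical sketch (method and class names are mine, not from the fix described above):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical sketch of the capping idea: stop reading an upload stream
// as soon as it exceeds a configured maximum, instead of copying it whole.
public class CappedCopy {

    // Copies at most maxBytes; throws once the stream exceeds the cap.
    static long copyCapped(InputStream in, OutputStream out, long maxBytes)
            throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
            if (total > maxBytes) {
                throw new IOException("upload exceeds cap of " + maxBytes + " bytes");
            }
            out.write(buf, 0, n);
        }
        return total;
    }

    // Helper for demonstration: does this payload blow the cap?
    static boolean exceedsCap(byte[] data, long cap) {
        try {
            copyCapped(new ByteArrayInputStream(data), new ByteArrayOutputStream(), cap);
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(exceedsCap(new byte[10_000], 4_096)); // oversized upload
        System.out.println(exceedsCap(new byte[1_000], 4_096));  // within the cap
    }
}
```

Rejecting the oversized request early keeps the servicing thread from spending tens of seconds in Streams.copy(), which is precisely the hot branch the snapshot tree pointed at.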