
How Plumbr will help you
Plumbr can monitor for a large and growing number of root causes. As early as 2016, we were able to link more than 80% of distributed traces to an explicit root cause.
Unfortunately, there is an unlimited number of ways in which a real-world application can perform poorly. A fallback is needed to cover cases where the specific root cause cannot be determined. In such situations, the Plumbr Java Agent periodically takes stack traces from threads servicing slow API calls and aggregates the collected traces into an easy-to-digest format, similar to the example above. Collected traces are grouped by the service and by the JVM/cluster they were captured in.
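The periodic sampling described above can be illustrated with plain JDK calls. The following is a minimal sketch, not Plumbr's actual agent code: the class name StackSampler, its methods, and the choice of keying counts by the full trace text are all assumptions made for illustration. It relies only on Thread.getStackTrace() and a scheduled executor.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sampler: illustrates the idea only, not Plumbr's implementation.
class StackSampler {
    // How many times each distinct stack trace was observed.
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    // Single daemon thread that takes the periodic samples.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "stack-sampler");
                t.setDaemon(true);
                return t;
            });

    // Take one sample of the given thread and record its stack trace.
    void sample(Thread worker) {
        StackTraceElement[] frames = worker.getStackTrace();
        if (frames.length == 0) {
            return; // thread has not started yet or has already finished
        }
        StringBuilder key = new StringBuilder();
        for (StackTraceElement f : frames) {
            key.append(f.getClassName()).append('.').append(f.getMethodName())
               .append("():").append(f.getLineNumber()).append('\n');
        }
        counts.computeIfAbsent(key.toString(), k -> new LongAdder()).increment();
    }

    // Sample the worker thread every periodMillis milliseconds until stop() is called.
    void start(Thread worker, long periodMillis) {
        scheduler.scheduleAtFixedRate(
                () -> sample(worker), periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }

    Map<String, LongAdder> counts() {
        return counts;
    }
}
```

In this sketch a caller would invoke start() once a request is known to be slow and stop() when the request completes. How a real agent decides which threads to sample and when is outside the scope of this example.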
The example above shows that during the monitored period, 6,984 stack traces were captured in the /struts/doReceive endpoint in the production-cluster cluster, in cases where the root cause was not explicitly detected but the API call into this endpoint lasted over a second.
Plumbr captured stack traces from all such API calls and aggregated them into the tree-like structure visible in the example above. Call stacks that occur most frequently are ranked higher in the tree.
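One way to picture this aggregation is as a tree in which every node counts the sampled traces that pass through it, with children ordered by that count. The sketch below assumes traces are supplied as lists of frame names ordered from the outermost caller inwards. The class CallTreeNode and its methods are hypothetical and do not reflect Plumbr's internal data structures.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical call tree: a sketch of frequency-ranked aggregation.
class CallTreeNode {
    final String frame;
    long hits; // number of sampled traces passing through this frame
    final Map<String, CallTreeNode> children = new HashMap<>();

    CallTreeNode(String frame) {
        this.frame = frame;
    }

    // Add one trace, ordered from the outermost caller to the innermost frame.
    void add(List<String> trace) {
        hits++;
        CallTreeNode node = this;
        for (String f : trace) {
            node = node.children.computeIfAbsent(f, CallTreeNode::new);
            node.hits++;
        }
    }

    // Children sorted so that the most frequently sampled branch comes first.
    List<CallTreeNode> ranked() {
        List<CallTreeNode> sorted = new ArrayList<>(children.values());
        sorted.sort(Comparator.comparingLong((CallTreeNode n) -> n.hits).reversed());
        return sorted;
    }
}
```

Calling add() on a root node for every captured trace and then walking ranked() from the root downwards yields the branches in the order in which they deserve attention.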
The Solution
The solution for the request processing root causes involves extracting the information from the composed call stack tree. Focus first on the top branches of the tree, where the most frequently captured stack traces are aggregated. If the sample size is statistically significant and one stack trace dominates it, you can be confident that this is the source of your performance issues.
For the example above, it is immediately clear that 92%, or around 6,400 stack traces, were taken from
org.apache.commons.fileupload.util.Streams.copy():96
java.io.InputStream.read():82
org.apache.commons.fileupload.MultipartStream$ItemInputStream.read():886
org.apache.commons.fileupload.MultipartStream$ItemInputStream.makeAvailable():976
...
org.apache.coyote.http11.InternalInputBuffer$InputStreamInputBuffer.doRead():781
org.apache.coyote.http11.InternalInputBuffer.fill():751
java.net.SocketInputStream.read():129
This effectively means that a file upload operation using Apache Commons was being carried out in around 6,400 of the 6,984 traces taken from the /struts/doReceive endpoint. This sample is definitely large enough to claim it to be the actual root cause.
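The share quoted above is simple arithmetic on the two sample counts. As a small illustration, the helper below (BranchShare is a hypothetical name, not part of any Plumbr API) computes the percentage of traces that fall into a given branch.

```java
// Hypothetical helper: share of all sampled traces that belong to one branch.
class BranchShare {
    static double percent(long branchSamples, long totalSamples) {
        if (totalSamples <= 0) {
            throw new IllegalArgumentException("totalSamples must be positive");
        }
        return 100.0 * branchSamples / totalSamples;
    }
}
```

With the rounded figures from the example, BranchShare.percent(6400, 6984) comes out at roughly 91.6, which is consistent with the 92% quoted above given that 6,400 is itself an approximation.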
This was indeed the case. The solution involved both capping the file upload size and introducing partitioned SSD-based storage for faster upload processing, which effectively removed the issue.