Transaction snapshots become request processing
Today, we are happy to announce an update that completes the automated root cause analysis feature within the Plumbr Java Agent. From now on, Plumbr can automatically explain every bottleneck in your Java code that makes your application or API slow. This means that we can relieve software engineers of enormous amounts of overhead created by the need to troubleshoot and reproduce issues and instead allow them to directly proceed to fixing them.
As you might know, automated root cause detection is one of Plumbr’s unique features. We detect many types of root causes in Java applications – for example, slow database queries, connection retrieval delays, locked threads, delays caused by garbage collection, and many others.
We’ve been developing this feature over the last few years and released support for these different types of root causes one by one. To decide which specific root causes to support first, our engineering team used what we call transaction snapshots to analyze which types of problems are most common among our customers.
Our transaction snapshots are essentially stack traces that are taken periodically from a thread processing a user request if there is no specific root cause already detected. For our engineering team, snapshots have always been invaluable sources of information. For our users, however, snapshots from all monitored applications were dumped in an anonymous pile with a root cause type of “Uncategorized” and without any reasonable associated impact. Until last week they used to look like this:
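To make the idea of a snapshot more concrete, here is a minimal sketch of periodic stack sampling in plain Java. It is not the Plumbr agent's implementation: the class name SnapshotSampler, the helper startDemoRequestThread, and the parameters are hypothetical and exist only for illustration. It also leaves out the check for an already detected root cause.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the Plumbr agent.
class SnapshotSampler {

    // Captures the stack of the target thread every periodMillis until the
    // thread terminates or maxSnapshots have been collected.
    static List<StackTraceElement[]> sample(Thread target, long periodMillis, int maxSnapshots) {
        List<StackTraceElement[]> snapshots = new ArrayList<>();
        while (target.isAlive() && snapshots.size() < maxSnapshots) {
            StackTraceElement[] stack = target.getStackTrace();
            if (stack.length > 0) {
                snapshots.add(stack);
            }
            try {
                Thread.sleep(periodMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop sampling.
                Thread.currentThread().interrupt();
                break;
            }
        }
        return snapshots;
    }

    // Starts a daemon thread that stands in for a thread processing a request.
    static Thread startDemoRequestThread(long workMillis) {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(workMillis); // simulated request processing
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "demo-request-thread");
        worker.setDaemon(true);
        worker.start();
        return worker;
    }
}
```

Each captured array shows where the thread was at the moment of sampling, so frames that keep appearing across snapshots hint at where the time is going.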
Experienced (or, mostly, lucky) users could sometimes find real gems among those snapshots and pin down complex performance issues. But for most users, they were pretty much useless.
However, we are happy to announce that we have completely overhauled how we present transaction snapshots, and they now look like this:
The Processing Request bottlenecks are further grouped by service name and cluster/JVM name. This allows us to clearly distinguish the services that need the most attention from services that have only rare hiccups.
But the most important change is that from now on these bottlenecks will display user/API impact. By default, the impact of a Processing Request bottleneck is equal to the duration of the span in which the snapshots were taken:
However, a common pattern in modern distributed applications is to dispatch a request to a remote service (or even multiple services) in one or more separate threads, so that the main thread can await the response but can also abort the request on timeout. Previously, transaction snapshots were reported as bottlenecks even if the thread where the snapshot was taken was just waiting for such an asynchronous operation to complete. Often, another distinct root cause was reported at the same time as a bottleneck from the asynchronous operation itself, which made the thread with the snapshots irrelevant from a performance tuning standpoint.
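The dispatch-and-wait pattern described above can be sketched with the standard java.util.concurrent API. This is a simplified illustration, not code taken from any monitored application: the class name RemoteCallWithTimeout, the method callWithTimeout, and the returned strings are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration of dispatching a remote call to a separate thread.
class RemoteCallWithTimeout {

    // Runs remoteCall on a worker thread while the calling thread waits for
    // the result, giving up after timeoutMillis.
    static String callWithTimeout(Callable<String> remoteCall, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<String> response = worker.submit(remoteCall);
        try {
            // The calling thread is parked here. A stack snapshot taken now
            // shows only Future.get, not the code doing the actual work.
            return response.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            response.cancel(true); // abort the request on timeout
            return "timed out";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            response.cancel(true);
            return "interrupted";
        } catch (ExecutionException e) {
            return "failed: " + e.getCause();
        } finally {
            worker.shutdownNow();
        }
    }
}
```

A snapshot of the calling thread during the wait points at Future.get, while the real bottleneck, if there is one, lives in the worker thread. That is why such snapshots were misleading when reported as bottlenecks on their own.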
The new Processing Request bottlenecks account for the durations of both synchronous and asynchronous operations that were initiated by the thread. As a result, only the true impact of the original thread remains, which means that you can accurately aggregate the impact and compare it to that of other bottlenecks in your application.
What does this mean for our customers? With all code bottlenecks automatically discovered and their impact accounted for by Plumbr, your engineering teams can always be sure that they are addressing the bottlenecks that affect end users the most. As a result, the team becomes more efficient and, because the most serious bottlenecks get resolved first, end users get a better service.
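One way to picture this accounting is as interval arithmetic: take the span of the original thread and subtract the time covered by the operations it initiated. The sketch below rests on that assumption. The exact formula Plumbr uses is not described here, and the names ImpactSketch, Interval, and ownImpactMillis are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; the real impact calculation may differ.
class ImpactSketch {

    // A half-open time interval [startMillis, endMillis).
    static final class Interval {
        final long startMillis;
        final long endMillis;

        Interval(long startMillis, long endMillis) {
            this.startMillis = startMillis;
            this.endMillis = endMillis;
        }
    }

    // Returns the span duration minus the time covered by child operations.
    // Overlapping child operations are counted once, and any part of an
    // operation outside the span is ignored.
    static long ownImpactMillis(Interval span, List<Interval> childOperations) {
        List<Interval> sorted = new ArrayList<>(childOperations);
        sorted.sort(Comparator.comparingLong((Interval op) -> op.startMillis));

        long covered = 0;
        long cursor = span.startMillis; // end of the time already counted
        for (Interval op : sorted) {
            long start = Math.max(op.startMillis, cursor);
            long end = Math.min(op.endMillis, span.endMillis);
            if (end > start) {
                covered += end - start;
                cursor = end;
            }
        }
        return (span.endMillis - span.startMillis) - covered;
    }
}
```

For example, a span from 0 to 1000 ms with child operations at 100 to 400 ms, 300 to 700 ms, and 900 to 1200 ms has 700 ms covered by those operations, which leaves 300 ms attributed to the original thread.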
If you want to know more about the new Processing Request bottleneck, see here. For a more general overview of different root causes that Plumbr detects, click here. Finally, if some of your questions are not answered on these linked pages, you can always write to firstname.lastname@example.org and we’ll be happy to help.