Feature focus: Demoting errors and Transaction snapshots
Building a product is hard. Perhaps the biggest challenge is thinking through all the different ways in which a user might interact with the app. Anticipating users' expectations and fulfilling them is what separates great products from good ones. Here are a couple of features we have baked into Plumbr, based on where we anticipated our users would step off the happy path.
Feature #1: Demoting errors
The vision of Plumbr is to make software more reliable for users. We achieve this by monitoring application usage, detecting failures, communicating them back to engineering teams, and acknowledging when errors have been fixed. Plumbr observes traffic and flags errors by checking response codes.
A binary classifier has four possible outcomes: a hit, a miss, a false alarm, and a correct rejection. In Plumbr, every transaction is classified as either an error or a success. Ambiguity arises because errors are flagged on the basis of response codes alone. Owing to engineering decisions, some transactions could be classified as errors when they actually are not. The reason: custom response codes. There are cases where a user interaction in a monitored application is not handled correctly in the technical sense, but is still considered a success from the end user's point of view.
Typically, web applications follow the IANA standard for response codes, and Plumbr is built upon this premise. However, many business applications use custom codes, or interpret IANA response codes differently, to denote certain events. Plumbr flags these as errors because they don't conform to the IANA standard success codes.
If we treat a detected error as a positive outcome, these are false positives. Plumbr's users have the option to flag them once they are convinced that they don't impact the end user's experience. Flagged errors no longer appear in the list of captured errors and aren't aggregated in the dashboards. In the product we call this Demoting errors.
We've seen this feature used for reporting, classification, and separation of concerns. In one case reported back to us, one team was responsible for a web service and another for authentication. The web service team wanted to keep the errors thrown by the authentication service off their dashboards. They simply demoted those errors and got a much clearer picture from their Plumbr dashboard.
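To make the idea concrete, here is a minimal sketch of response-code classification with demotion. It is not Plumbr's implementation: the `Transaction` and `ErrorClassifier` names, the rule that anything outside the 2xx/3xx ranges counts as an error, and the custom code 460 are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transaction:
    service: str       # hypothetical service name
    status_code: int   # response code observed for the transaction


def is_error(status_code: int) -> bool:
    # Assumed rule: only 2xx and 3xx responses count as success.
    return not (200 <= status_code < 400)


@dataclass
class ErrorClassifier:
    # (service, status_code) pairs that a user has demoted.
    demoted: set = field(default_factory=set)

    def demote(self, service: str, status_code: int) -> None:
        self.demoted.add((service, status_code))

    def classify(self, tx: Transaction) -> str:
        if not is_error(tx.status_code):
            return "success"
        if (tx.service, tx.status_code) in self.demoted:
            return "demoted"  # kept out of error lists and dashboards
        return "error"


classifier = ErrorClassifier()
classifier.demote("auth", 460)  # hypothetical custom code that is harmless to users

transactions = [
    Transaction("web", 200),
    Transaction("auth", 460),
    Transaction("web", 500),
]
results = [classifier.classify(tx) for tx in transactions]
# Only genuine errors reach the dashboard.
dashboard_errors = [tx for tx, r in zip(transactions, results) if r == "error"]
```

Keying the demotion on a (service, code) pair rather than on the code alone is a design choice in this sketch: it lets the same code stay an error for services where it really does hurt users.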
Feature #2: Transaction snapshots
One of the major benefits provided by Plumbr is the capturing of call traces that run many levels deep. Periodically, Plumbr also captures thread dumps. By combining these, Plumbr has all the information required to ascertain the cause of failures and bottlenecks.
On occasion, interfacing with legacy applications can cause performance bottlenecks. These codebases usually integrate using old remote method invocation mechanisms. As a result, Plumbr may not have enough contextual information to provide the same sophisticated support as it does for other integration methods. This constraint can be overcome by a fallback mechanism known as Transaction snapshots.
Plumbr periodically captures thread dumps from the environment that is executing the transaction. These snapshots are taken at increasing intervals, and are purged if the transaction neither fails nor turns out to be slow or stuck. Plumbr adds value by assembling the call stacks from the thread dumps into a tree structure. The branches are ordered by frequency of occurrence.
This way, there is enough information available for engineers to begin fixing issues. A screenshot follows, which shows the same kind of contextual information as is available for other bottlenecks.
We're investigating further how we can improve classification. We're also planning to integrate deeply with many more tech stacks such as Node.js, PHP, Django, .NET and others. Watch out for these and many more configuration options in future releases.
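The merging step can be sketched roughly as follows. This is an illustration under assumptions, not Plumbr's code: each snapshot is reduced to a list of frame names ordered from outermost to innermost, and the frame names, `build_tree`, and `render` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    frame: str
    count: int = 0                                # snapshots that passed through this frame
    children: dict = field(default_factory=dict)  # frame name -> Node


def build_tree(stacks):
    """Merge call stacks (outermost frame first) into a single tree."""
    root = Node("<root>")
    for stack in stacks:
        root.count += 1
        node = root
        for frame in stack:
            if frame not in node.children:
                node.children[frame] = Node(frame)
            node = node.children[frame]
            node.count += 1
    return root


def render(node, depth=0):
    """List the branches, most frequently seen first, indented by depth."""
    lines = []
    for child in sorted(node.children.values(), key=lambda n: -n.count):
        lines.append(f"{'  ' * depth}{child.frame} ({child.count})")
        lines.extend(render(child, depth + 1))
    return lines


# Three hypothetical snapshots of the same slow transaction.
snapshots = [
    ["Servlet.service", "OrderService.place", "LegacyClient.call"],
    ["Servlet.service", "OrderService.place", "LegacyClient.call"],
    ["Servlet.service", "OrderService.place", "Db.query"],
]
tree_lines = render(build_tree(snapshots))
print("\n".join(tree_lines))
```

In this made-up data the legacy call appears in two of the three snapshots, so it sorts to the top of its branch, which is where an engineer would start looking.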