The Sense And Sensibilities Of Bug Triage

December 12, 2018 by Ivo Mägi Filed under: Plumbr

When I began to program professionally, I never looked forward to meetings. Of all the meetings we had, the Bug Triage was the one I resented most. As an entry-level programmer, my attitude was "assign it to me and I will take care of it". I was blissfully unaware of the routine and the nuanced procedures that make up good bug-fixing practice. Over time, I matured and warmed up to meetings. I began to appreciate their importance, got better at managing my time, and learned to collaborate better. About ten years have passed since then, and I have also moved about ten steps away from regularly contributing to production code. However, the occasional Bug Triage meeting I have to attend still bothers me.

For one, there is no clear distinction between Issues, Bugs, Defects, Errors, Faults, and Failures. At these meetings, we are shown an email from a user reporting a snag in live software. There is a clamour and a long deliberation over whether it is a bug or an error. I can neither sense nor appreciate the distinction. A textbook on software testing [1] attempts to define these terms. I opine that its definitions are 'murky' at best.

Typically, the only evidence of an error is a ticket raised by a user. On lucky occasions, we have answers to questions like "What browser were they using?", "What type of device was the user on?", and "What was the quality of their Internet connection?". Beyond that, there is very little information about product variation, session breadcrumbs, usage patterns, or other details that could serve as evidence.

Another part of the discussion is the attempt to prioritize errors as High, Medium, Low, or Now. What data do we use to make the selection? I have been given three important parameters: Intuition, Intrinsic Insight, and Inclination. We could just as well draw straws and end up with similar outcomes.

Then comes the dimension of Severity. The options are Critical, Major, Moderate, and Minor. With no guarantee of being able to reproduce the error, lack of user context, negligible traces, no aggregated metrics about how many users this is affecting, and nominal investigation into the root cause, stakeholders around the table are expected to attach a degree of importance to an error report.

Why is priority distinct from severity? What about the complexity of the resolution? What about downstream dependencies? The whole process of triaging issues requires a profound mastery of the software system: functional knowledge of the relevant systems, subsystems, and their interdependencies. All this triage effort results in a shoddy roadmap, cobbled together to serve a process rather than a purpose.

My conviction that Plumbr will do exceedingly well in the market comes from the product's ability to address all of these concerns. Plumbr is the culmination of efforts by engineers who have married technology and technique to reduce the uncertainty in bug triage.

Every data dimension required to verify an error is recorded and made available. Errors due to failed browser handshakes, memory issues, uncaught JavaScript errors, and script errors from third-party domains are all faithfully captured from the application. User information such as browser type and version, geo-location, Internet connection class, and other contextual information is recorded. Root causes are exposed as call traces when available.

Imagine a mature process in which every developer has access to information about the exact experience of a user. Developers can correlate errors with the code they contributed and establish causation. Any error reported by a user becomes a trivial lookup in a system that has recorded the failures in user interactions. Managers and business stakeholders have exact metrics on how latency and availability actually affect end users. There is no need for pre-meetings, meetings, and post-meetings to triage issues. A transparent view of the issues occurring in production is available in real time, and the list can be sorted by frequency or by degree of impact.
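To make "sorted by frequency or by degree of impact" concrete, here is a minimal sketch of how raw error events might be aggregated into such a list. It is not Plumbr's implementation or API. It assumes each captured failure arrives as a hypothetical `ErrorEvent` with a `fingerprint` (a key identifying "the same" error), a user ID, and a browser, and it approximates "impact" as the number of distinct users affected.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    # Hypothetical record of one captured failure (not Plumbr's data model).
    fingerprint: str  # identifies "the same" error, e.g. message + top stack frame
    user_id: str
    browser: str


@dataclass
class ErrorSummary:
    fingerprint: str
    occurrences: int     # how often the error fired
    affected_users: int  # how many distinct users saw it
    browsers: set        # which browsers it was seen on


def summarize(events):
    """Group raw events by fingerprint and aggregate frequency and reach."""
    counts = defaultdict(int)
    users = defaultdict(set)
    browsers = defaultdict(set)
    for e in events:
        counts[e.fingerprint] += 1
        users[e.fingerprint].add(e.user_id)
        browsers[e.fingerprint].add(e.browser)
    return [
        ErrorSummary(fp, counts[fp], len(users[fp]), browsers[fp])
        for fp in counts
    ]


def by_frequency(summaries):
    # Most frequently occurring errors first.
    return sorted(summaries, key=lambda s: s.occurrences, reverse=True)


def by_impact(summaries):
    # Impact approximated as distinct users affected (an assumption);
    # ties are broken by raw occurrence count.
    return sorted(summaries, key=lambda s: (s.affected_users, s.occurrences), reverse=True)


events = [
    ErrorEvent("TypeError@checkout.js:42", "alice", "Firefox"),
    ErrorEvent("TypeError@checkout.js:42", "alice", "Firefox"),
    ErrorEvent("TypeError@checkout.js:42", "alice", "Firefox"),
    ErrorEvent("Timeout@/api/cart", "bob", "Chrome"),
    ErrorEvent("Timeout@/api/cart", "carol", "Safari"),
]

summaries = summarize(events)
print([s.fingerprint for s in by_frequency(summaries)])
# → ['TypeError@checkout.js:42', 'Timeout@/api/cart']
print([s.fingerprint for s in by_impact(summaries)])
# → ['Timeout@/api/cart', 'TypeError@checkout.js:42']
```

The two orderings disagree on purpose: an error that fires repeatedly for one user ranks first by frequency, while an error that reaches more users ranks first by impact. With such data at hand, that trade-off is an explicit choice rather than a matter of intuition.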

[1] Dorothy Graham, Erik Van Veenendaal, Isabel Evans; Foundations of Software Testing: ISTQB Certification; Gardners Books 2008; ISBN 9781844809899

Thumbnail courtesy of Smashicons from www.flaticon.com, licensed under CC BY 3.0.