Outage == Outrage
Let’s own up – we’ve all been part of teams that have faced failures or slow performance in web applications. The Google SRE book highlights how common such incidents are across organizations, and stories like The Phoenix Project abound across the internet. As a (former) engineer, I’ve faced tough times during school season (edu apps), and when my team migrated to a new PaaS provider. Agitated users, random escalations from customer-facing teams, and uncomfortable discussions with the “boss” are the norm when it comes to dealing with issues. Evidence of failure is met with a shrug of the shoulders, and triage is wearying.
The trouble areas in incident management come down to two questions that everyone asks:
- How are my users feeling about this incident?
- Where do I begin the process of resolution?
Our new product direction stems from this. The organization as a whole is responsible for ensuring that a user can complete a transaction and have a pleasant experience doing so. Engineers, product owners, and support systems all come together under an abstracted “interface” that helps a user accomplish their goal. When multiple subsystems are involved in a given flow, attributing failures and assigning responsibilities becomes tricky.
Part I: Identifying the problem as experienced by a user
Too much of anything is a bane. No statement is more true, especially with regard to data. There is so much data available right now that absorbing it gets overwhelming very quickly. Memory footprints, disk space usage, server resources, network resources, logs, CPU metrics, spikes, traffic, and power are some of these data points. There is one major disconnect – these are aggregate metrics from the server side, and are not directly attributable to individual users.
With Plumbr, we decided to monitor usage data from the browser. We took the common approach engineers take (sifting through logs and extracting correlations to determine the cause of failures) and encapsulated it into sessions. The sessions, and consequently the interactions within them, are mapped to failures or bottlenecks.
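To make the idea of sessions and interactions concrete, here is a minimal sketch in Python. It is not Plumbr's implementation: the `Interaction` record, its field names, and the 2000 ms slowness threshold are all assumptions made up for illustration. It only shows the general shape of grouping browser interactions by session and labelling each one as fine, slow, or failed.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical threshold: anything slower than this counts as a bottleneck.
SLOW_THRESHOLD_MS = 2000


@dataclass
class Interaction:
    """One user interaction as reported by the browser (assumed shape)."""
    session_id: str
    action: str
    duration_ms: int
    failed: bool


def group_by_session(interactions):
    """Collect interactions under the session they belong to."""
    sessions = defaultdict(list)
    for interaction in interactions:
        sessions[interaction.session_id].append(interaction)
    return dict(sessions)


def classify(interaction):
    """Label a single interaction as 'failed', 'slow', or 'ok'."""
    if interaction.failed:
        return "failed"
    if interaction.duration_ms > SLOW_THRESHOLD_MS:
        return "slow"
    return "ok"


def summarize(sessions):
    """Count ok, slow, and failed interactions for each session."""
    summary = {}
    for session_id, interactions in sessions.items():
        counts = {"ok": 0, "slow": 0, "failed": 0}
        for interaction in interactions:
            counts[classify(interaction)] += 1
        summary[session_id] = counts
    return summary
```

Calling `summarize(group_by_session(events))` on a list of `Interaction` records yields a per-session view of how each user fared, which is the perspective that aggregate server-side metrics cannot give.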
Part II: Details about the chase
Solving issues and improving the performance of an app isn’t about email triggers, or firing off an IRC (or Slack) notification when something is down. Associating failures with specific lines in the code shrinks the solution space to a trivial one. Engineers immediately gain the context needed to identify fixes. The impact of each error or bottleneck is also captured and displayed. This allows engineering teams to prioritize fixes.
Plumbr also helps the overall software engineering process mature. When attached to production, it brings a degree of transparency that helps plan and prioritize the development roadmap. Plumbr helps define thresholds for acceptable performance levels and ensures that future development adheres to these goals, preventing applications from getting sluggish and unresponsive. When attached to UAT environments, Plumbr can help keep bad artefacts from reaching production. We also envision that Plumbr can benefit the transactions between vendors, contractors, and outsourcing agencies by facilitating the creation of SLAs and SLOs. Plumbr also helps establish a positive feedback loop that acknowledges fixes and improvements to the quality of an application.
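As a rough illustration of prioritizing by impact, the sketch below ranks root causes by the number of distinct users they affected. This is an assumed, simplified notion of "impact" and not a description of how Plumbr computes it; the `(root_cause, user_id)` input shape and the example labels are hypothetical.

```python
from collections import defaultdict


def rank_by_impact(failures):
    """Rank root causes by how many distinct users each one affected.

    `failures` is an iterable of (root_cause, user_id) pairs, an assumed
    input shape. Returns (root_cause, affected_users) tuples, with the
    most impactful cause first and ties broken alphabetically.
    """
    affected = defaultdict(set)
    for root_cause, user_id in failures:
        affected[root_cause].add(user_id)
    ranked = [(cause, len(users)) for cause, users in affected.items()]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))
```

For example, given the hypothetical pairs `("CartService:88", "u1")`, `("CartService:88", "u2")`, and `("SearchIndex:12", "u1")`, the cart failure ranks first because it touched two users rather than one. Counting distinct users instead of raw error occurrences keeps one user retrying the same action from inflating the ranking.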
And finally, Happy Thanksgiving everyone! To all those who are building e-commerce apps – here’s to scaling gracefully, and your apps holding up well on Black Friday 🙂
Thank you, Gleb S and Nikita, for helping revise drafts of this.
Managing Incidents (Chapter 14), Site Reliability Engineering: How Google Runs Production Systems, O’Reilly Media, Incorporated, 2016, ISBN 9781491929124
The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win, 1st ed., IT Revolution Press, 2013