
Meaningful and Contextual Alerting for DevOps Teams

June 6, 2019 by Ram Iyengar Filed under: Blog DevOps Performance Plumbr

Paging an engineer rests on three principles:

  1. A page must be sent when there is a known/impending SLO violation.
  2. A page sent to an engineer must be responded to with utmost urgency.
  3. The information included in a page must provide a good starting point for troubleshooting.

Pages are sent to engineers when the system raises an alert that needs their attention. There are two perspectives on which these pages, and therefore alerts, can be based. We live in a client-server world, so incidents can originate on either the client side or the server side. Consequently, alerting is built on either a “client” perspective or a “server” perspective.

The ‘server’ perspective includes all the moving parts that power an application. Infrastructure, app servers, load balancers, the application, and all other elements that are under the direct control/purview of the engineer constitute the ‘server’ perspective. It helps define a set of known faults to which engineers can tie any deficiencies. Here is an example of a Nagios alert configured on a sample PostgreSQL database:

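# NRPE command definition on the monitored host (typically nrpe.cfg):
# warn at 5 zombie processes, go critical at 10
command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z

# Service definition on the Nagios server
define service {
    use                     generic-service
    host_name               postgres1
    service_description     Total Processes zombie
    check_command           check_nrpe!check_zombie_procs
}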

This Nagios check counts zombie processes and alerts engineers when there are too many of them on the system: it warns at 5 and goes critical at 10.

While it is good practice to keep the system clean of zombie processes, does this merit an alert? Further, does an engineer have to be paged to take care of these kinds of problems? Are these problems time-sensitive and do they need to be solved in an urgent manner? Let’s answer this after we look at the second perspective.

Server and client perspective alerts are two sides of the same coin.

A ‘client’ has the best perspective on the result of any web application. Whether the cause is network latency, protocol retries, broken connections, or caching delays, the client has the most visibility into the resulting experience. If an application aggregates data from various web services, only the client can report on the combined experience. Modern infrastructure is dynamic, so there is no way to build a one-to-one mapping between services and infrastructure. Only the client can show what kind of experience these web applications actually deliver.

If 100 users accessing a website over a 2G connection from South America are experiencing an error, is this a cause for concern for the product/development teams? The answer depends. Is the core product a “lite” version of the main website meant to work at low-speed connections? Does the number of users represent a major fraction of total users of the application?

To report the user experience accurately, the full context in which the client accessed the application (the ISP, the QoS, geography, etc.) must be made available. This information by itself cannot become a reason to page engineers.

Using data from actual user interactions allows engineers to couple application behaviour tightly with SLOs. This provides a very robust way of managing the team and keeps them focused. This will also allow alerting to emerge from the most objective source of truth.
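The question raised earlier, whether 100 failing users justify a page, can be sketched as a check on the affected share of users. This is a minimal illustration, not Plumbr's implementation: the `Interaction` record, its field names, and the 1% threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    """One real-user interaction (hypothetical record layout)."""
    user_id: str
    failed: bool
    region: str      # context only, e.g. "South America"
    connection: str  # context only, e.g. "2G"


def affected_fraction(interactions):
    """Fraction of distinct users who saw at least one failed interaction."""
    all_users = {i.user_id for i in interactions}
    failed_users = {i.user_id for i in interactions if i.failed}
    return len(failed_users) / len(all_users) if all_users else 0.0


def should_page(interactions, max_affected_fraction=0.01):
    """Page only when the affected share exceeds the assumed SLO threshold.

    Region and connection type are deliberately not part of the decision:
    they are context for the page, not a reason to send it.
    """
    return affected_fraction(interactions) > max_affected_fraction
```

With these assumptions, 100 failing users out of 1,000 would page, while the same 100 users out of 100,000 would not.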

By definition, alerting is the act of informing an engineer that some component is going to break soon. The follow-up activity for an engineer upon the receipt of an alert depends on the runbook set up by the team. Data from application monitoring and real user monitoring needs to aid engineers in building a system that can provide some form of proactive alerting.
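One common way to make alerting proactive is to track how fast an SLO's error budget is being spent. The sketch below uses that burn-rate idea as an illustration; it is not something the original text prescribes, and the 99.9% target, the 30-day window, and the function names are assumptions.

```python
def burn_rate(failed, total, slo_target=0.999):
    """How many times faster than allowed the error budget is being spent.

    A value of 1.0 means errors arrive at exactly the rate the SLO permits.
    """
    if total == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    return (failed / total) / allowed_error_rate


def hours_until_budget_exhausted(rate, budget_remaining=1.0, window_hours=30 * 24):
    """Projected hours until the budget runs out at the current burn rate.

    budget_remaining is the unspent share of the budget (1.0 = untouched).
    At a burn rate of 1.0, a full budget lasts exactly the SLO window.
    """
    if rate <= 0:
        return float("inf")
    return budget_remaining * window_hours / rate
```

A page can then be tied to the projection (for example, "the budget will be gone within a day") rather than to a single failing check.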

Designing alerts well is critical to the success of an on-call team. Poorly designed alerts lead to one of three problems:

1. Too many alerts

2. No alerts

3. Alert flapping

All of these problems result from engineers failing to pick the right signals on which to base monitoring. Often, alerts fire only on the presence of simple problems and ignore the impact on users.
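Alert flapping in particular is often reduced with hysteresis: separate thresholds for firing and clearing, plus a required number of consecutive samples before the state changes. The following is a minimal sketch of that general technique; the class name, thresholds, and sample count are hypothetical.

```python
class DebouncedAlert:
    """Alert state with hysteresis and a consecutive-sample requirement."""

    def __init__(self, fire_above, clear_below, min_consecutive=3):
        if clear_below > fire_above:
            raise ValueError("clear_below must not exceed fire_above")
        self.fire_above = fire_above
        self.clear_below = clear_below
        self.min_consecutive = min_consecutive
        self.firing = False
        self._streak = 0  # consecutive samples supporting a state change

    def observe(self, value):
        """Feed one sample and return whether the alert is currently firing."""
        if self.firing:
            supports_change = value < self.clear_below
        else:
            supports_change = value > self.fire_above
        self._streak = self._streak + 1 if supports_change else 0
        if self._streak >= self.min_consecutive:
            self.firing = not self.firing
            self._streak = 0
        return self.firing
```

A metric that bounces across a single threshold never accumulates enough consecutive samples to change state, so the engineer is not paged and un-paged repeatedly.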

Let us examine the three principles we started with, in light of what real user monitoring (RUM) and application monitoring (APM) can contribute.

| Alerting principle | RUM value | APM value |
| --- | --- | --- |
| A page must be sent when there is a known/impending SLO violation. | Information about a degradation in the experience for a significant fraction of users, extracted from data about interactions, is available. | Data about interactions mapped to individual services helps build an end-to-end trace, thereby tying into violations of constituent services. |
| A page sent to an engineer must be responded to with utmost urgency. | RUM data helps identify objectively if the incident is definitely affecting users (currently or impending). | Mean time to resolution (MTTR) is cut down by a significant fraction, thanks to the information gained from application monitoring. |
| Information included in a page must provide a good starting point for troubleshooting. | Real user monitoring can expose the context and impact of every incident that caused the alert. | Application monitoring can tie the incident down to where in the code the error originates. |
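To make the third principle concrete, here is a hedged sketch of a page that carries both kinds of information: user impact and context from the RUM side, and the failing service and code location from the APM side. The `Page` class and all of its field names are hypothetical and do not reflect any particular paging tool's format.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """A page that gives the on-call engineer a starting point (hypothetical)."""
    slo: str
    affected_users: int
    total_users: int
    context: dict = field(default_factory=dict)  # RUM side: ISP, geography, ...
    failing_service: str = "unknown"             # APM side
    code_location: str = "unknown"               # APM side

    def render(self):
        """Format the page as plain text, impact first, then where to look."""
        share = 100 * self.affected_users / self.total_users if self.total_users else 0.0
        lines = [
            f"SLO at risk: {self.slo}",
            f"Impact: {self.affected_users} of {self.total_users} users ({share:.1f}%)",
            "Context:",
        ]
        for key in sorted(self.context):
            lines.append(f"  {key}: {self.context[key]}")
        lines.append(f"Failing service: {self.failing_service}")
        lines.append(f"Code location: {self.code_location}")
        return "\n".join(lines)
```

Putting impact first lets the engineer judge urgency at a glance, while the service and code location give the first place to look.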

These are the founding principles upon which Plumbr was built. This is our pitch to every DevOps team designing alerts and formulating on-call procedures. Plumbr works at the intersection of application monitoring and real user monitoring.

Because Plumbr builds on data collected from real usage, it gives the most accurate reflection of what a user experiences when interacting with web applications. Plumbr monitoring works irrespective of which tech stack powers your application. Plumbr’s application monitoring capabilities are built by tracing each API call across distributed stacks.

Gain data on browser and device distributions

Full exposure of the availability and performance characteristics of digital services is available when you attach Plumbr. Visibility comes via summary dashboards, time-series visualizations, and pinpointing incidents to their origin in source code. As an engineer, you will spend less time on diagnostics and troubleshooting, and resolve incidents faster.

Plumbr comes with a free trial, which makes it easier than ever to develop meaningful and contextual alerts for your DevOps team. Try it today!