Product Update: Push Alerting Using Plumbr
It is vacation season now, and a lot of you are already holidaying – or have impending time off. Vacations are the perfect use case for alerting. We’re happy to announce the general availability of Push Alerting from within Plumbr. Just in time for you to be able to stay in control of your applications in production – wherever you are.
Imagine this: you're the owner of a web application used internally within your organization. You're currently out of office, but you would like to be notified when your users are facing a degraded experience. This degraded experience can come from failed interactions or slow responses from the server. None of the commonly used tools can easily provide this exact information. Log monitoring, infrastructure monitoring, and network monitoring all fall short of identifying degradations in user experience.
This is the problem that Plumbr already solves. We're now extending it to send alerts based on this degraded experience.
You can configure Plumbr to send you alerts when your applications fail frequently. You decide how many failures, out of how many recent interactions, should trigger the alert. For example, “Alert me when the Invoicing application has failed three times out of the last 100 interactions”. What does this mean?
Invoicing is a critical part of any business. If the service that is helping you create invoices fails frequently, it causes operations to halt. Halted operations can’t be a good thing for anyone! Therefore, you can configure sensitive services (or applications) appropriately.
Let’s explore a few design choices that we made when creating Push Alerts from within Plumbr.
- We did not choose simple error rates because they are unreliable as traffic varies. Under low traffic, a small number of failing users is enough to trigger the alert. Under high traffic, many users can face failures without the rate crossing the threshold, so poor user experience slips under the radar.
- The most obvious solution to this problem is to let users define date-time windows, each with its own granular threshold. This can get overwhelming quickly because of the many variables that an engineer would need to keep track of and manage.
- The best solution to this problem, therefore, is to use a sliding window. Error rates are calculated on a ‘rolling’ basis and this will keep engineers informed when error rates increase, relative to the total traffic on the application at any given time.
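The sliding-window idea above can be sketched in a few lines. This is a minimal illustration, not Plumbr's implementation: the class name `SlidingWindowAlert` and its parameters are hypothetical, and the window is assumed to be counted in interactions (as in the "three out of the last 100" example) rather than in time.

```python
from collections import deque


class SlidingWindowAlert:
    """Tracks the outcome of the last `window_size` interactions (hypothetical helper)."""

    def __init__(self, window_size: int = 100, max_failures: int = 3):
        self.max_failures = max_failures
        # A deque with maxlen drops the oldest outcome automatically on append.
        self.outcomes = deque(maxlen=window_size)
        self.failures = 0

    def record(self, failed: bool) -> bool:
        """Record one interaction; return True if the window is at or over the threshold."""
        if len(self.outcomes) == self.outcomes.maxlen and self.outcomes[0]:
            self.failures -= 1  # the outcome about to be evicted was a failure
        self.outcomes.append(failed)
        if failed:
            self.failures += 1
        return self.failures >= self.max_failures


# 50 successful interactions, then three failures interleaved with successes.
alert = SlidingWindowAlert(window_size=100, max_failures=3)
results = [alert.record(f) for f in [False] * 50 + [True, False, True, False, True]]
print(results[-1])  # True: the third failure within the window breaches the threshold
```

Because the count always covers the most recent interactions, the check stays meaningful at both low and high traffic without anyone maintaining per-time-of-day thresholds.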
When configuring an alert threshold, Plumbr offers a ‘playback’. Using this, you can see when alerts would have fired against your past monitoring data. Playback is helpful in calibrating the right numeric threshold for the alerts your application needs.
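To make the playback idea concrete, here is a rough sketch of how replaying past data against a candidate threshold could work. It is an assumption-laden illustration, not a description of Plumbr's internals: the function `playback`, the `(timestamp, failed)` history format, and the sample data are all hypothetical.

```python
from collections import deque
from typing import Iterable, List, Tuple


def playback(history: Iterable[Tuple[str, bool]], window_size: int, max_failures: int) -> List[str]:
    """Return the timestamps at which an alert would have fired on past data.

    `history` is a hypothetical sequence of (timestamp, failed) pairs, oldest first.
    An alert "fires" when the window moves from below the threshold to at or above it.
    """
    window = deque(maxlen=window_size)
    fired_at = []
    breached = False
    for timestamp, failed in history:
        window.append(failed)
        now_breached = sum(window) >= max_failures
        if now_breached and not breached:
            fired_at.append(timestamp)
        breached = now_breached
    return fired_at


# Made-up sample data, used only to compare two candidate thresholds.
history = [("09:00", False), ("09:01", True), ("09:02", True), ("09:03", False), ("09:04", True)]
for candidate in (2, 3):
    print(candidate, playback(history, window_size=4, max_failures=candidate))
```

Comparing the output for several candidate thresholds shows which setting would have been too noisy and which would have stayed silent during a real incident.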
Another design choice that we made when creating the alerting mechanism is to keep the recovery threshold 19% below the alerting threshold. This choice is to prevent alerts from flapping. It helps improve the signal-to-noise ratio for the alerts, making the alerts more useful and actionable.
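The effect of a separate recovery threshold can be shown with a small sketch. The class `HysteresisAlert` is hypothetical, and it assumes that "below" means a relative margin (the recovery threshold is the alerting threshold multiplied by one minus the margin); the text does not say whether the margin is relative or in absolute percentage points.

```python
from typing import Optional


class HysteresisAlert:
    """Alert state with a recovery threshold below the alerting threshold (hypothetical)."""

    def __init__(self, alert_threshold: float, recovery_margin: float):
        self.alert_threshold = alert_threshold
        # Assumption: the margin is relative to the alerting threshold.
        self.recovery_threshold = alert_threshold * (1 - recovery_margin)
        self.alerting = False

    def update(self, error_rate: float) -> Optional[str]:
        """Return "ALERT" or "RECOVERED" on a state change, otherwise None."""
        if not self.alerting and error_rate >= self.alert_threshold:
            self.alerting = True
            return "ALERT"
        if self.alerting and error_rate < self.recovery_threshold:
            self.alerting = False
            return "RECOVERED"
        return None


# Illustrative 5% alerting threshold with the margin quoted in the text above.
state = HysteresisAlert(alert_threshold=0.05, recovery_margin=0.19)
events = [state.update(rate) for rate in [0.04, 0.06, 0.045, 0.055, 0.03]]
print(events)  # [None, 'ALERT', None, None, 'RECOVERED']
```

With a single threshold, the error rate hovering around 5% in this example would have produced an alert, a recovery, and a second alert. With the gap between the two thresholds, it produces one alert and one recovery.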
We did not want alerting to be a system where engineers received dozens of beeps, notifications, and emails every day, which sent them chasing after false positives. If your monitoring system is causing you to context switch too often, you either have the wrong expectations set or are monitoring the wrong things. This can quickly lead to a buildup of Toil, which we’ve documented here in detail.
Currently, Plumbr can integrate directly via four channels – Email, PagerDuty, Slack, and Jira. Plumbr has always supported alerting use cases with its powerful API. If you would like to integrate your product with Plumbr, you’re welcome to use the Plumbr API to connect and make use of Plumbr in your toolchains.
Please get in touch with us at firstname.lastname@example.org for any help with alerting. We're happy to work with your teams to figure out how Plumbr can best help them.
Detailed documentation about setting up push alerts can be found here.