Plumbr now supports alerts on the throughput metric
From January 30, 2020, Plumbr users can configure alerts that trigger when the throughput of the monitored application is abnormally low or high. For example, the following alerts can be set up:
- Send an alert to the Slack channel #devops whenever the number of user interactions in the eshop-production application during the last 10 minutes is below 5.
- Send a page to PagerDuty whenever the usage of the Payment API exceeds 10,000 API calls/hour.
With low-throughput alerts you can stay on top of situations where the application becomes unavailable for end users. With high-throughput alerts you will be notified immediately of an abnormal spike in traffic, for example when you become the victim of a DoS attack.
Setting up the throughput alert
Throughput alerts can be set up by users with administrator privileges under the “Alerts” menu item of the application/API detail view.
An alert trigger consists of two parameters:
- Time range during which the throughput is measured.
- Number of user interactions or API calls during that time range that will trigger the alert.
Using these two parameters you can set up two seemingly similar alert triggers:
- Trigger an alert when there have been fewer than 10 API calls during the last 1 minute.
- Trigger an alert when there have been fewer than 100 API calls during the last 10 minutes.
Zoomed out, both triggers correspond to the same throughput of 10 API calls/minute; the difference is in the selected time range. The first alert would trigger as soon as Plumbr monitored only 9 API calls during a single minute. The second alert would ignore minute-by-minute fluctuations and would trigger only if 99 or fewer API calls were monitored during the whole 10-minute period.
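To make the difference concrete, here is a minimal sketch in Python. It is not Plumbr code: the function names and the traffic numbers are hypothetical, and it assumes the trigger is evaluated on the sum of calls over each full trailing window.

```python
def rolling_sums(per_minute_counts, window_minutes):
    """Total calls in each full trailing window of `window_minutes` minutes."""
    return [
        sum(per_minute_counts[i - window_minutes + 1 : i + 1])
        for i in range(window_minutes - 1, len(per_minute_counts))
    ]


def triggers(per_minute_counts, window_minutes, min_calls):
    """True if any full window saw fewer than `min_calls` calls."""
    totals = rolling_sums(per_minute_counts, window_minutes)
    return any(total < min_calls for total in totals)


# Hypothetical traffic: one quiet minute (9 calls) among busier minutes.
calls = [12, 11, 9, 13, 12, 10, 11, 12, 10, 11]

# "Fewer than 10 calls in 1 minute" reacts to the single quiet minute.
short_window = triggers(calls, window_minutes=1, min_calls=10)

# "Fewer than 100 calls in 10 minutes" looks at the total of 111 calls.
long_window = triggers(calls, window_minutes=10, min_calls=100)

print(short_window, long_window)
```

With this sample the 1-minute trigger fires and the 10-minute trigger does not, even though both are nominally set at 10 API calls/minute.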
Our recommendation is to analyse the historical behaviour of the API or application in the alert set-up screen to see when a particular combination would have triggered an alert, and to tune the two parameters until you have verified that the throughput alerts trigger only in situations where you need them.
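If you have the historical per-minute counts available outside Plumbr, the same kind of tuning can be sketched offline. The snippet below is an illustration only: `count_alert_episodes` and the sample history are hypothetical, and it assumes an alert episode starts whenever the windowed total drops below the threshold after having been at or above it.

```python
def count_alert_episodes(per_minute_counts, window_minutes, min_calls):
    """Count how many separate times a low-throughput alert would have started."""
    episodes = 0
    firing = False
    for end in range(window_minutes, len(per_minute_counts) + 1):
        total = sum(per_minute_counts[end - window_minutes:end])
        below = total < min_calls
        if below and not firing:
            episodes += 1  # a new alert episode begins here
        firing = below
    return episodes


# Hypothetical history with two short dips in traffic.
history = [50, 48, 52, 5, 4, 47, 51, 49, 3, 50, 52, 48]

# Try a few candidate thresholds for a 3-minute time range.
for threshold in (50, 100, 110):
    episodes = count_alert_episodes(history, window_minutes=3, min_calls=threshold)
    print(f"threshold={threshold}: {episodes} alert episode(s)")
```

Comparing the episode counts for different thresholds and time ranges against the incidents you actually remember gives a rough idea of which combination is too noisy and which is too quiet.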
Frequently asked questions
Q: I do not wish to receive alerts at night, on weekends or on public holidays. How can I accomplish this?
A: Awesome, it seems you have really thought through how on-call in your company should operate. Basing alerts and pages on the SLO/SLA, which is often tied to specific schedules, is a sign of this maturity. However, we believe that the role of any monitoring solution, Plumbr included, is to provide a signal and to stay away from scheduling and filtering. Filtering and routing alerts based on the signal is best left to dedicated solutions such as PagerDuty, which has extensive scheduling capabilities that we recommend taking a look at.
Q: What should the thresholds be? How many minutes should I wait before triggering an alert, and how low can the throughput of API calls/user interactions drop before the alert triggers?
A: It depends heavily on the service being monitored. For some heavily utilised API endpoints, a drop below 10,000 API calls/minute might indicate a problem. For some applications, just 100 user interactions/hour might be perfectly fine. So our advice is to look at the historical throughput and tweak the time range and the throughput number until you are satisfied with the alerts that would have been triggered.
Q: What about dynamic baselining? Can the alerts be based on dynamic thresholds?
A: We are not fans of dynamic baselining. Dynamic baselines tend to be complex to set up in the real world due to all the non-deterministic events that are bound to happen down the road. In our experience, 99.9% of our customers are better off with static alerts combined with schedules.