Means to improve the signal quality for alerts.
At Plumbr, we want our customers to stay on top of the performance and availability of the digital assets Plumbr monitors. A crucial part of achieving this is helping our users set up high-quality alerts: alerts that trigger if and only if performance or availability really needs your team's attention.
In this post I will walk you through four aspects you need to think through when setting up the alerts using the signals from any RUM or APM vendor.
Govern low throughput. Let us assume you have the following performance objectives set for the service under management:
In any given 24-hour period, the following must hold for 99% of the 10-minute windows:
- Median latency must be under 500ms
- 99th percentile of latency must be under 2,500ms
If the service under management has steady throughput, you are all set. More often than not, however, you are managing a service that has low utilization overnight, on weekends, or on public holidays:
As the example above shows, the application under monitoring is actively used from Monday to Friday, between 8 AM and 8 PM. At night and on weekends, usage drops tenfold compared to business hours.
As a side effect of the fluctuating throughput, your performance alerts can start triggering in unwanted situations. An expensive operation that is just a fraction of the traffic during the daytime can constitute 10%, 50%, or even 100% of the traffic in any 10-minute window during the wee hours of the night. As a result, you might get a wake-up call in the middle of the night only to discover that the lone user at 3 AM just opened the expensive monthly report, moving the median latency of the last 10 minutes way beyond the 500ms objective.
In such situations you effectively have two solutions:
- Alter the objectives so that the SLO covers only business hours. Looking at a throughput pattern similar to the one above, I am willing to bet that this is a back-office application, and that its being unavailable outside regular working hours is not considered a major incident.
- Use a rolling window based on request count instead of fixed time windows. Instead of checking the median or 99th percentile of the last 10-minute time window, check the latency distribution of the last 1,000, 10,000, or 100,000 API calls to iron out the outliers.
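To make the second option more concrete, here is a minimal sketch of a count-based rolling window. The class name `RollingLatencyWindow` and the sample numbers are assumptions made up for illustration; this is not how Plumbr or any particular vendor implements it.

```python
import math
from collections import deque
from statistics import median


class RollingLatencyWindow:
    """Hypothetical helper: keeps the latencies of the most recent `size` calls,
    no matter how long ago they happened."""

    def __init__(self, size):
        # A deque with maxlen silently drops the oldest sample once it is full.
        self.samples = deque(maxlen=size)

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def median(self):
        return median(self.samples)

    def percentile(self, p):
        # Nearest-rank percentile: the smallest sample that has at least
        # p percent of all samples at or below it.
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]


# Assumed traffic: 999 cheap daytime calls, then one expensive report at 3 AM.
window = RollingLatencyWindow(size=1000)
for _ in range(999):
    window.record(120)
window.record(9000)

# A fixed 10-minute window at 3 AM would contain only the report.
night_only = [9000]

print("fixed window median:  ", median(night_only))
print("rolling window median:", window.median())
print("rolling window p99:   ", window.percentile(99))
```

With these assumed numbers, the fixed night-time window reports a median of 9,000ms, while the rolling window over the last 1,000 calls still reports a median and 99th percentile of 120ms. The trade-off is that a count-based window reacts more slowly during quiet periods, since it takes longer to fill up with fresh samples.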
Beware of synthetic monitoring checks triggering false alarms. Quite often the real usage monitoring is complemented by synthetic checks. While seemingly not related to one another, you might encounter a situation where synthetic checks periodically accessing specific API endpoints will skew the monitoring signal representing the real usage:
The screenshot above is from one of the APIs monitored by Plumbr, showing the 99th percentile latency over a period of one week. As seen, the 99th percentile appears to peak daily, correlating with low usage at night.
Investigating the situation more closely, we found this to be caused by the combination of synthetic checks and low usage during the night. Although the checks themselves were rather consistent in terms of latency, returning the response in 350-450ms, they represented a larger proportion of the total throughput during low-usage periods and thus skewed the signal for those periods.
So – whenever you do employ synthetics, make sure to exclude these endpoints from the real usage monitoring to preserve the quality of the signal.
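As an illustration of the effect, the sketch below compares the 99th percentile with and without synthetic traffic. The `Request` record, its `synthetic` flag, and the traffic numbers are all assumptions; in practice you would tag synthetic requests by whatever marker your checker provides (a dedicated endpoint, header, or user agent).

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    endpoint: str
    latency_ms: float
    synthetic: bool  # hypothetical flag set when the request came from a synthetic check


def percentile(values, p):
    """Nearest-rank percentile; returns None when there are no samples."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def real_usage_latencies(requests):
    """Drop synthetic checks so only real usage feeds the alerting signal."""
    return [r.latency_ms for r in requests if not r.synthetic]


# Assumed night-time traffic: a handful of real calls plus periodic synthetic checks.
night = [Request("/api/orders", 80, synthetic=False) for _ in range(5)]
night += [Request("/api/health-report", 400, synthetic=True) for _ in range(20)]

print("p99 with synthetics:   ", percentile([r.latency_ms for r in night], 99))
print("p99 without synthetics:", percentile(real_usage_latencies(night), 99))
```

In this made-up sample the synthetic checks dominate the night-time traffic, so the unfiltered 99th percentile reflects the checks (400ms) rather than the real users (80ms).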
Batch jobs can affect the quality of the signal. A typical asset under monitoring is mostly busy with transactional work: responding to incoming requests. However, many such transactional applications host several housekeeping batch jobs as well. The batch jobs themselves vary, ranging from nightly backups to hourly payment processors, but their impact on monitoring signals is often the same. During a batch job run, the application behaves differently than it does outside the batch runtime.
When thinking about this it should feel obvious: the system which for 99% of the time is kept busy by responding to individual cheap read operations is now dealing with a single I/O or CPU heavy job. This can and will change the way the signals for alerts will behave, especially so for latency-related metrics.
In situations like this, the best option is to prevent the batch jobs from impacting the transactional systems. If this is impossible or impractical, you are left with mediocre options, such as excluding the batch periods from alerting or sending lower-severity alerts during such periods.
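The fallback of lowering alert severity during batch periods could look roughly like the sketch below. The schedule in `BATCH_WINDOWS`, the function names, and the severity labels are hypothetical; the 2,500ms threshold is the 99th percentile objective used earlier in this post.

```python
from datetime import datetime, time

# Hypothetical batch schedule as (start, end) pairs in local time.
BATCH_WINDOWS = [(time(2, 0), time(3, 30))]


def in_batch_window(moment, windows=BATCH_WINDOWS):
    """Return True if `moment` falls inside any configured batch window."""
    t = moment.time()
    for start, end in windows:
        if start <= end:
            if start <= t < end:
                return True
        elif t >= start or t < end:
            # Window crosses midnight, e.g. 23:00 to 01:00.
            return True
    return False


def alert_severity(moment, p99_ms, threshold_ms=2500):
    """Return None when the objective holds, otherwise 'low' during a batch
    window and 'high' at any other time."""
    if p99_ms < threshold_ms:
        return None
    return "low" if in_batch_window(moment) else "high"


print(alert_severity(datetime(2024, 1, 10, 2, 15), p99_ms=4000))  # during the batch run
print(alert_severity(datetime(2024, 1, 10, 14, 0), p99_ms=4000))  # regular business hours
```

Returning `None` for the batch window instead of `"low"` would turn this into the other fallback, excluding the batch periods from alerting altogether.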