APM myths busted – #1 Adding monitoring will make my app faster
This is the first post in a series in which I will debunk myths surrounding application performance monitoring (APM). Over the coming weeks I will tackle several other myths and urban legends from our domain:
- APM is valuable only for software developers
- APM provides value only during root cause detection
- APM's main value lies in QA processes, so it should be used in staging environments
- APM pricing makes it unusable in the era of cloud-based, dynamically provisioned microservices
- APM provides a single metric that can be used to measure the quality of all the services under monitoring
But for this week, let's continue with the first myth: the belief that adding an APM to your production monitoring will make your applications faster and more reliable.
It will not. It is like believing that buying a sport watch will make you lose weight. To extend the analogy: like a sport watch, an APM uses sensors to capture data from the underlying system and turns that data into information.
The information exposed by the sport watch must be understood and acted upon in order to gain anything. After all, if you look at the watch at the end of the day and see 3,000 steps, that alone does not make you slimmer or fitter. But if you know that 10,000 steps a day is a key to a healthy lifestyle and you are motivated to follow this advice, you actually stand a chance. Put on your running shoes, go outside for a ~45 minute walk, and you have actually improved your quality of life.
If we look at the example more closely, the key lies in understanding when, how and why to use the information provided. In the example above, it started with the person both believing that 10,000 steps a day is the key to a healthy life and having the motivation to chase the goal. With those two pillars in place, it all boils down to making sure the watch clocks 10,000 steps per day.
The same applies to an APM. Buying an APM will not by itself make the application faster or remove any availability issues. On the contrary, capturing this information tends to add some overhead (typically a low single-digit percentage increase in resource consumption).
To start reaping the benefits (strong-signal, low-noise alerts, explicit root causes through distributed tracing, etc.), the IT operations and development teams must understand how, when and why to use the information provided by the APM. Only when the team is actively managing the digital services can the benefits materialize.
For this to happen, the team responsible for a particular digital service needs to:
- Formalize the expected quality requirements (as an SLA or SLO) in terms of availability and performance. If you are wondering what these might look like, the following examples might give you an idea:
- The error rate of the API published must not exceed 0.2% for 99% of the time in any given month.
- Median response time of the API must be under 700ms for 99% of the time in any given month.
- 99th percentile of the response time of the API must be under 5,000ms for 90% of the time in any given month.
- Use the metrics captured by the APM as the source signal for alerting. Any good APM vendor integrates with the major alerting solutions, so all you need to do is connect the APM to PagerDuty, Slack or whatever other channel you might be using.
- Have the authority to respond when the objectives are not met, without needing approval from external parties.
By doing so, the motivation and goals are in place, along with a clear understanding of who needs to respond to a performance or availability issue, and when.
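To make the example objectives above more tangible, here is a minimal sketch of how they could be checked. It assumes the APM can export one aggregated record per time window (for example, per minute) and reads "X% of the time" as "X% of the windows". The names `Window`, `Slo`, `compliance` and `breached` are made up for this illustration and do not belong to any particular APM product.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Window:
    """Aggregated measurements for one time window (assumed: one minute)."""
    error_rate: float  # fraction of failed requests, 0.002 == 0.2%
    median_ms: float   # median response time within the window
    p99_ms: float      # 99th percentile response time within the window


@dataclass
class Slo:
    name: str
    is_good: Callable[[Window], bool]  # does a single window meet the threshold?
    target: float                      # required fraction of good windows


# The three example objectives from the list above.
SLOS = [
    Slo("error rate <= 0.2%", lambda w: w.error_rate <= 0.002, 0.99),
    Slo("median < 700ms", lambda w: w.median_ms < 700, 0.99),
    Slo("p99 < 5,000ms", lambda w: w.p99_ms < 5000, 0.90),
]


def compliance(slo: Slo, windows: List[Window]) -> float:
    """Fraction of windows in which the objective was met."""
    good = sum(1 for w in windows if slo.is_good(w))
    return good / len(windows)


def breached(windows: List[Window]) -> List[str]:
    """Names of the objectives that missed their target over the period."""
    return [s.name for s in SLOS if compliance(s, windows) < s.target]


# Hypothetical period: 97 healthy windows and 3 degraded ones.
period = [Window(0.001, 450, 2100)] * 97 + [Window(0.004, 900, 6200)] * 3
print(breached(period))  # ['error rate <= 0.2%', 'median < 700ms']
```

In this made-up period only 97% of the windows are healthy, which is enough for the 90% target on the 99th percentile but not for the two 99% targets. A list like the one returned by `breached` is the kind of signal you would route to your alerting channel.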
What can go wrong when implementing the approach?
So why do so many IT operations/DevOps teams out there not follow such simple practices? In my experience, the foundation for failure is laid by one of the following two problems:
Not understanding why the performance and availability of the service should be in focus. Indeed, if you have not figured out why a particular aspect of the service should have priority over other topics in the backlog, it will be hard to justify any action. More often than not this happens because nobody can quantify the impact a particular performance or availability degradation has on the business the company is in.
It is understandable that when competing tasks in the backlog all require the same limited engineering time, you need to decide based on the value of each task. And when a new product feature is blocking several sales deals from closing, or a marketing campaign is projected to bring in X new leads, it might be hard to get engineering to focus on bringing the median response time of the application down from 1,100ms to 700ms.
The ultimate knowledge in this field comes from understanding the correlation between performance metrics and business metrics. This is specific to the business the company is running. For example:
- For digital media, improving the median latency by 20% increases engagement with content by 6%
- For e-commerce, the conversion rate in a funnel step increases by 17% if the first contentful paint of the step is reduced from 2,500ms to 1,300ms
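As a starting point for this kind of analysis, the sketch below computes the Pearson correlation between a performance metric and a business metric. The daily numbers are invented purely for illustration, and `pearson` is a hand-written helper, not an API of any APM product. Keep in mind that a strong correlation only hints at a relationship and does not prove that latency causes the change.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Invented example data, one value per day (not real measurements).
median_latency_ms = [620, 700, 850, 1100, 1300, 900, 650]
conversion_rate = [0.034, 0.033, 0.030, 0.026, 0.024, 0.029, 0.034]

r = pearson(median_latency_ms, conversion_rate)
print(f"correlation between latency and conversion: {r:.2f}")
```

A value close to -1 would suggest that slower days tend to be worse days for the business metric, which is the kind of argument that helps when competing for backlog priority.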
Gaining this understanding for your particular services might sound complex. If this really is the case, then our recommendation is to follow your instincts initially. After all, you do not need a full-blown model to understand that median render times exceeding 5,000ms will result in a really frustrated user base, which in turn will have an impact on business success.
So as an initial implementation of the SLOs, we recommend using the baseline from the current metrics as the “normal” and agreeing that a degradation of more than 20% from this baseline is not acceptable and requires an engineering response.
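The baseline rule can be sketched in a few lines. The example assumes that response times are available as plain lists of milliseconds and that the median is the metric being compared; the function name `needs_response` and the sample values are hypothetical.

```python
from statistics import median
from typing import Sequence

MAX_DEGRADATION = 0.20  # the 20% tolerance suggested in the text


def needs_response(baseline_ms: Sequence[float],
                   current_ms: Sequence[float],
                   tolerance: float = MAX_DEGRADATION) -> bool:
    """True when the current median is more than `tolerance` above the baseline median."""
    baseline = median(baseline_ms)
    current = median(current_ms)
    return current > baseline * (1 + tolerance)


# Hypothetical samples: the baseline median is 1,100ms, so the limit is 1,320ms.
baseline_samples = [1000, 1100, 1200]
print(needs_response(baseline_samples, [1350, 1400, 1500]))  # True
print(needs_response(baseline_samples, [1250, 1300, 1310]))  # False
```

Once you understand the business impact better, the baseline and the tolerance can be replaced with explicit thresholds.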
Not being able to gain the authority to respond without involvement from the business / product owner. This one is a little trickier. Business / product owners are used to the status quo, in which the priorities of the backlog tasks are their responsibility. Agreeing that a particular aspect is outside of their control and responsibility might be tough.
In this situation we recommend focusing on the positive aspects of the approach:
- Once the product owner has agreed upon the SLO/SLA levels, there is little value for them to add when it comes to handling individual incidents. If the product owner is the one receiving the alert, there is one additional step in the incident response workflow, delaying the eventual mitigation. And every hour added to this process means that more and more users will experience the issue.
- Removing this aspect of the service from the product owner's tasks leaves them more time to deal with the aspects of the product where their expertise really adds value. More time for market research, designing product features or building new marketing campaigns means higher-quality decisions on those fronts.
Do not fall into the trap of believing that an APM will magically improve your application. That being said, there is a whole lot of value to be gained from APM adoption. Being aware of the typical pitfalls will set you up for success: faster and more reliable digital services can and will end up making your business a more successful one.