Fooled by monitoring
Monitoring is usually considered a mature domain. Every business keeps an eye on its IT assets, often with the help of several different tools. Over the years of building our own monitoring solution we have seen almost every tool in action.
That experience has been eye-opening in regard to how different organizations either fail to deploy the right tools for the job or deceive themselves with the information they have access to. Depending on the organization, this dysfunction can take different forms, for example:
- Living in ignorance. Not knowing how your end users use and perceive your product affects your bottom line to an extent you cannot even imagine.
- Misinterpreting the data. When an organization treats metrics such as “average response time” and “availability rate” as the only figures worth watching, the actual user experience is nowhere in sight.
- Limiting monitoring output to IT. The domain has its roots deep in IT operations and is thus treated as something that business and product owners do not benefit from directly.
In this article I am going to give you an overview of different monitoring solutions that businesses use to track their IT assets. I will explain when and how a particular tooling category is beneficial and give examples of how the output can be misinterpreted. The tools are divided into the following categories:
- Infrastructure monitoring
- Log monitoring
- Health checks
- Web analytics
- Application performance monitoring
If you bear with me through the following, you will be well equipped to get IT operations, engineering and business/product owners on the same page. At least when it comes to the monitoring domain.
Infrastructure monitoring
Monitoring system-level metrics in your infrastructure is the first type of monitoring deployed in almost any organization. Keeping an eye on CPU, memory, network and disk usage is easy to set up. The outcome also seems simple to interpret, but let us look at how you can accidentally draw the wrong conclusions.
Thresholds are defined for the monitored metrics, and when they are exceeded, alerts are triggered. Such alerts inform the system administrator that a certain machine is consuming more resources than expected.
These alerts can be a valuable source of information in situations where a particular disk is close to becoming full or where anomalies occur in resource consumption.
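The threshold-and-alert loop described above can be sketched in a few lines. The metric names and threshold values below are illustrative, not taken from any particular tool:

```python
# Minimal sketch of threshold-based alerting. Metrics are assumed to arrive
# as a dict of gauge values in percent; names and limits are made up.
THRESHOLDS = {
    "cpu_percent": 90.0,
    "memory_percent": 85.0,
    "disk_percent": 80.0,
}

def check_thresholds(metrics: dict) -> list:
    """Return an alert message for every metric exceeding its threshold."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name} at {value:.1f}% exceeds {limit:.0f}% threshold")
    return alerts

# Example: a machine running hot on CPU but otherwise fine.
print(check_thresholds({"cpu_percent": 97.2, "memory_percent": 40.0}))
```

Real systems add debouncing and notification routing on top, but the core logic is exactly this comparison against static limits, which is also why it says nothing about *why* a resource is busy.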
You should keep in mind that raw system metrics are not a good way of expressing the health of the application. When you interpret data such as the following:
- the CPU has been utilized 100% for the past 20 minutes,
- RAM usage has steadily climbed to 90% since yesterday,
- we are currently utilizing the network at 7 Gbit/s,
you cannot really tell if any of the facts above are symptoms of something good or bad. That depends on what the application is actually doing with all these resources. Maybe the application got stuck in an infinite loop because of a programming error, and is now eating 100% of CPU. Or maybe your latest marketing campaign is working great, and you just have 10 times as many users as before.
Another use case for system monitoring solutions is related to the cost of your infrastructure. Having access to the system monitoring telemetry allows you to make educated decisions regarding capacity planning and provisioning. Understanding whether or not your application belongs to the 90% of deployments that are more than 2x over-provisioned enables you to significantly reduce infrastructure costs.
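As a toy illustration of using utilization telemetry for provisioning decisions, the sketch below flags hosts whose peak usage suggests more than 2x headroom. The hostnames and numbers are invented:

```python
# Sketch: flag over-provisioned hosts from utilization telemetry.
# A host whose peak never exceeds half its capacity has >2x headroom.
def overprovisioned(peak_utilization: float, factor: float = 0.5) -> bool:
    """True when peak usage stays below `factor` of provisioned capacity."""
    return peak_utilization < factor

# Peak utilization (fraction of capacity) per host; values are made up.
hosts = {"web-1": 0.21, "web-2": 0.64, "db-1": 0.35}

# Candidates to downsize.
print([h for h, peak in hosts.items() if overprovisioned(peak)])
```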
DO: Deploy infrastructure monitoring to:
- Receive early alerts for resource exhaustion.
- Harvest information for capacity planning and provisioning.
DON’T: Infrastructure monitoring is not for:
- Understanding what your end users are actually experiencing.
Log monitoring
The second source of information is application logs. This approach is almost as common as system-level metrics monitoring. Consider this snippet of raw Tomcat startup output:
```
02-Feb-2016 13:54:24.989 INFO [main] org.apache.catalina.core.AprLifecycleListener.lifecycleEvent The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: /Library/Java/Extensions:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java:.
02-Feb-2016 13:54:25.295 INFO [main] org.apache.coyote.AbstractProtocol.init Initializing ProtocolHandler ["http-nio-8080"]
02-Feb-2016 13:54:25.421 INFO [main] org.apache.tomcat.util.net.NioSelectorPool.getSharedSelector Using a shared selector for servlet write/read
02-Feb-2016 13:54:25.430 INFO [main] org.apache.coyote.AbstractProtocol.init Initializing ProtocolHandler ["ajp-nio-8009"]
02-Feb-2016 13:54:25.433 INFO [main] org.apache.tomcat.util.net.NioSelectorPool.getSharedSelector Using a shared selector for servlet write/read
02-Feb-2016 13:54:25.433 INFO [main] org.apache.catalina.startup.Catalina.load Initialization processed in 1767 ms
02-Feb-2016 13:54:25.546 INFO [main] org.apache.catalina.core.StandardService.startInternal Starting service Catalina
```
Raw logs, as seen above, are not too easy to understand and analyze, especially when your deployments are spread across tens or hundreds of machines. In such cases, the logs tend to be streamed to centralized logging services for aggregation, analytics and alerting purposes.
Again, there is a multitude of vendors in this domain. One of the most frequently used solutions is the ELK stack, where logs are collected and processed with Logstash, stored and indexed in Elasticsearch, and visualized with Kibana.
The downside of the approach is that you are limited both by the information your logs actually contain and by the patterns you can define to extract that information from them.
To understand the limits, think of this: if the application does not log a particular event, you will never know it happened. Also, if a particular log record does indicate a problem, you will only find out about it if you are explicitly looking for it.
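This blind spot is easy to demonstrate: pattern-based log alerting only catches what you explicitly told it to look for. A minimal sketch, with invented patterns and log lines:

```python
import re

# Sketch of pattern-based log alerting: only events you explicitly look for
# are ever detected. Patterns and log lines below are made up.
ALERT_PATTERNS = [
    re.compile(r"\bERROR\b"),
    re.compile(r"OutOfMemoryError"),
]

def scan(log_lines):
    """Yield lines matching a known alert pattern; everything else passes silently."""
    for line in log_lines:
        if any(p.search(line) for p in ALERT_PATTERNS):
            yield line

logs = [
    "13:54:25 INFO  Starting service Catalina",
    "13:55:01 ERROR Payment gateway timeout",
    "13:55:30 WARN  Slow response: 8200 ms",  # a real problem, but no pattern matches it
]
print(list(scan(logs)))  # only the ERROR line is caught
```

Note how the slow-response line, arguably the most interesting one for user experience, sails through undetected because nobody wrote a pattern for it.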
As such, building your monitoring solution on the logging approach tends to be obtrusive – you end up coupling the logging calls tightly to your application code and polluting the business logic with statements that do not necessarily belong there, just for the sake of monitoring.
DO: Use logging to:
- Record debug information in business-critical parts of the application.
- Aggregate and store the logs to be the source of information during troubleshooting activities.
DON’T: The pitfalls of logging include:
- It does not offer a reliable way to detect poor user experience.
- It can pollute your code and business logic.
Health checks
The next type of monitoring solution periodically checks the availability of certain services. Such checks are implemented via synthetic calls to application endpoints, invoked periodically to verify that the endpoint responds. In addition, many such tools also measure the latency of these operations, making it possible to spot latency-related issues.
The most widely used tool in this category is Pingdom. A typical weekly Pingdom report for one of our own applications showed just two short downtimes, with the latency of the monitored endpoint consistently between 400 and 600 ms. Based on that data, the application seemed to be doing just fine. In reality, it suffered from several serious performance issues during that very week. Such issues simply cannot be seen with this type of monitoring.
The reason for this is that the amount of functionality covered with such health checks tends to be very limited. Often, only the front page is actually checked. Even on the rare occasions when all business-critical services are covered with checks, the input data used for the checks does not represent real usage. As such, you will still be left in the dark and won’t know when the application is unavailable or performs poorly for particular users with specific usage patterns or data sets.
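A basic synthetic check of the kind described above can be sketched as follows. The endpoint URL is a placeholder, and real tools add scheduling, multiple probe locations and alerting on top:

```python
import time
from urllib.request import urlopen
from urllib.error import URLError

# Sketch of a synthetic health check: call an endpoint once, record whether
# it answered and how long it took. The URL used is not a real service.
def health_check(url: str, timeout: float = 5.0, fetch=urlopen):
    start = time.monotonic()
    try:
        with fetch(url, timeout=timeout) as response:
            ok = 200 <= response.status < 300
    except (URLError, OSError):
        ok = False
    latency_ms = (time.monotonic() - start) * 1000
    return ok, latency_ms

# ok, latency = health_check("https://example.com/")
```

The `fetch` parameter exists only so the sketch can be exercised without network access; it also makes the core limitation obvious: whatever single request you script is all this check will ever exercise.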
DO: Deploy health checks to:
- Periodically ping your public services to see if they are available from different parts of the globe.
DON’T: Expect health checks to:
- Reflect actual user experience via synthetic tests.
Web analytics
A single player heavily dominates the fourth category of monitoring solutions. Google Analytics is the de facto standard for capturing information about how end users actually use your web-based applications.
It is a great source for understanding where the users come from, how they convert through the funnels and how they use your product. The changes you make in the application can be compared to previous versions and can expose the truth about whether or not these changes actually improved the experience.
The downside of this approach is the information Google Analytics does not expose. It does close to nothing to surface users who are unhappy with your service, whether due to bugs, the unavailability of a particular function or simply poor application performance.
Since Google Analytics is very good at the things that it is good for, its weaknesses are hidden so well that businesses often end up investing in the wrong places. A typical example would involve tweaking the landing page conversion rates to gain an extra 0.3% of users, while blissfully ignoring the 6% of disappointed users struggling with your application due to performance or availability issues.
To be fair, Google Analytics does expose some performance metrics, but its capabilities here are limited.
Pay attention to the fact that this data is captured from only a small sample of users. In addition, it is presented deceptively, exposing only averages and a comparison to a baseline. As you will see in the next section, averages are not a good cornerstone for understanding what is really happening to your users.
DO: Deploy web analytics to:
- Keep a close eye on conversion funnels and product usage patterns.
DON’T: Use web analytics as:
- The only source of information used by the business/product owners.
- A reliable source of data for understanding the performance of your application.
Application performance monitoring
The last category of monitoring solutions is called Application Performance Monitoring (APM). APM solutions measure actual user requests as they transit the system. Whenever a user interaction takes place, the event is captured along with its start and end time and the functionality consumed. Many vendors in this category (AppDynamics, New Relic, Dynatrace) already capture such interactions in the end user layer, be it a web browser or a mobile app.
This approach builds a strong foundation for understanding your end users’ experience. However, the information is often not presented in a way that actually benefits the interpreter. Instead of visualizing the actual dataset, only averages or medians are used as guidelines.
Trusting the averages
To understand the downside of trusting a single number, consider the infamous Anscombe’s quartet: four datasets with nearly identical summary statistics (the same means, variances and correlations) yet completely different shapes. The quartet makes it clear that if you do not visualize the dataset, you will not really understand the situation. Instead of looking at a mean/median digest, you should keep an eye on the latency distribution of your business operations over time.
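You can verify the quartet’s property yourself. The snippet below compares two of Anscombe’s four y-series, which share their summary statistics despite wildly different shapes (the first is roughly linear, the second traces a parabola):

```python
from statistics import mean, variance

# Two of Anscombe's four y-series (standard published values).
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

# Identical means and (to one decimal) identical sample variances,
# even though plotting the two series reveals completely different shapes.
print(round(mean(y1), 2), round(mean(y2), 2))
print(round(variance(y1), 1), round(variance(y2), 1))
```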
To see why, think about your own craft. I bet the application you are building, monitoring or designing exposes many different services. Logging in, adding an item to a shopping cart or extracting the history of purchases are just a few examples. Each operation can have different latency requirements, meaning you should actually slice your application into individual functions to understand the real user experience.
Trusting the latency distribution
Slicing your application and keeping an eye on the latency distribution is the next level of maturity organizations tend to reach. Slicing the dataset and exposing the 90th, 99th or 99.9th percentiles of application performance yields valuable information for improving the user experience.
You might then learn, for example, that yesterday 99% of the payments in your payment service completed in under 86 ms. But even here you may be deceiving yourself. Numbers without an explanation are still not very useful: unless you understand what the remaining 1% consists of, you are in for nasty surprises.
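Computing percentiles from raw latency samples rather than averages is straightforward. This sketch uses the nearest-rank method on synthetic data to show how a slow 1% tail hides behind a healthy-looking 99th percentile:

```python
# Sketch: latency percentiles from raw samples (data below is synthetic).
def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

# 100 synthetic payment latencies: 99 fast requests plus one pathological one.
latencies_ms = [80] * 99 + [9500]

print(percentile(latencies_ms, 50))   # median looks fine
print(percentile(latencies_ms, 99))   # 99th percentile still looks fine
print(percentile(latencies_ms, 100))  # the tail only shows up at the maximum
```

The 99th percentile here is a comfortable 80 ms, yet one user in a hundred waited almost ten seconds, which is exactly the kind of surprise the paragraph above warns about.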
Our own recent experience showed that the users who suffered the most and made up that last 1% were our best customers. As heavy users of our monitoring solution, they generated 10 to 100 times more data than average users, bringing our data access algorithms to their knees.
So besides tracking the services, you should also keep an eye on your users, preferably even named users if possible. Knowing how your priority pass holders or platinum members experience the service might make all the difference in the world.
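Tracking the experience of named users can start as simply as grouping latency events by user identifier; the events below are synthetic:

```python
from collections import defaultdict

# Sketch: group latency samples by named user so your heaviest (and often
# most valuable) customers' experience stays visible. Data is invented.
events = [
    ("alice", 80), ("bob", 75), ("alice", 95),
    ("platinum-corp", 4200), ("platinum-corp", 3900),
]

by_user = defaultdict(list)
for user, latency_ms in events:
    by_user[user].append(latency_ms)

# The user with the worst observed latency.
worst = max(by_user, key=lambda u: max(by_user[u]))
print(worst, max(by_user[worst]))
```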
DO: Use an APM solution to:
- Expose end user experience via the latency distribution per service.
DON’T: Deceive yourself by:
- Relying on single numbers representing averages or medians as key metrics.
- Trusting just the application performance metrics instead of monitoring for named user experience.
- Expecting (most) APM solutions to assist you during root cause detection.
If you got this far in the post, take some more time and think about how you monitor your own IT assets. Do you actually know the answers to these questions:
- How are the end users actually experiencing the application?
- Which users are suffering the most?
- Have I over-provisioned my infrastructure?
- How are my business-critical services performing?
If some of the questions are left unanswered, maybe it is time to do something about it. Having the answers to the questions above will make a world of difference to your bottom line.
The steps to monitoring epiphany might sound expensive, but the rewards are well worth it. So I can only encourage you to start introducing change by deploying the following:
- Infrastructure monitoring to detect both under- and over-provisioned resources.
- Log monitoring to harvest additional information during service failures.
- Health checks to receive early alerts on availability issues.
- Web analytics to understand how users convert through funnels and how they use your site.
- Application performance monitoring to expose the actual end user experience.
The order of these steps depends on your organization, but each category is designed to solve a different problem and exposes a different kind of information. The key here is to analyze the level of maturity you currently have and couple it with the biggest challenges you face business-wise.
If the post was interesting to you, I bet you are also going to enjoy the follow-up where I cover a way to link business metrics to performance metrics. To stay tuned, subscribe to our Twitter or RSS feeds.