
Improving Engineering Decision Making

February 26, 2019 by Ivo Mägi Filed under: Blog Monitoring

Plumbr collects data about user interactions and application performance. In this post we present the results of some data analysis experiments we conducted on the monitoring data collected by Plumbr. Our team drew the following conclusions:

  1. No matter what your application is, it is bound to contain errors.
  2. Not all errors that are occurring need to be fixed.
  3. Fixing just a few select errors can bring major improvements.

We analyzed over 400 different applications monitored by Plumbr over a six-month period, from July 2018 to January 2019. Here is a description of the data set:

  • 400 different applications monitored by our application monitoring products
  • 19 billion calls served
  • 100 million failed calls over the period
  • 4,500 distinct errors causing these failures

This is the distribution of errors that we observed:

Distribution of errors in applications

Some statistics:

  • Applications with no errors = 0
  • Minimum errors in an application = 1
  • Median number of errors per application = 13
  • Average number of errors per application = 72
  • Maximum errors in an application = 1931
  • Number of applications with >200 errors = 35
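The gap between the median (13) and the average (72) points to a heavily skewed distribution: a handful of applications with hundreds or thousands of errors pull the mean far above the typical case. A toy illustration of the effect, using synthetic error counts rather than the actual data set:

```python
import statistics

# Synthetic error counts per application: most are small, a few are huge.
# (Illustrative numbers only, not the real Plumbr data.)
errors_per_app = [1, 3, 5, 8, 13, 13, 20, 45, 250, 1931]

median = statistics.median(errors_per_app)  # dominated by the middle values
mean = statistics.mean(errors_per_app)      # dominated by the outliers

print(f"median = {median}, mean = {mean:.1f}")  # median = 13, mean = 228.9
```

This is why the median, not the average, is the better summary of how many errors a "typical" application carries.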

Every application, therefore, has some errors. The good news is that not every error has to be fixed: for many of the errors that occur, you can choose not to fix them. Here are a couple of examples:

  • An integration built with a particular delivery vendor is failing: about 2% of the users who choose this delivery option experience failing interactions. The business has already decided to sever ties with the vendor due to logistics and operational issues, which will make the integration obsolete. It will be removed from the list of options, so users will not be able to choose it at all. Given that the integration will only be available for a few more weeks, and that other options are available to users, is it worth fixing now?
  • Data about visitors reveals that a fraction of users access your website with Internet Explorer 9. Due to compatibility issues, they face several errors, a large fraction of which are caused by the JavaScript used. Resolving the problem would require a version of the website specific to this client. Would the return on investment be justified?

Engineering teams face such predicaments every day and must decide on the right tradeoffs. Since engineering resources are limited and there is always a long list of competing features your business/product owners wish to deliver, compromises must be made. Accepting that all systems will contain availability issues that will never get fixed is a step towards improvement. For some errors, “Won’t fix!” is the only reasonable decision to make.

Our experience working with various software systems over the years has led us to make this bold claim:

Fixing errors with the biggest impact would reduce the error rate significantly.

For every call made, the HTTP response code determines the outcome: calls with a 4xx- or 5xx-series response code are classified as “Failed”, while the rest are classified as successes. For every application, each error type was ranked by the number of failed calls it spawned. The ‘relative impact’ of each error was then calculated as its share of the total number of failures, expressed as a percentage. Averaging the relative impact of the top three errors across all the applications yields the following distribution:

  • Error #1 accounts for 32% of total failures
  • Error #2 accounts for 15% of total failures
  • Error #3 accounts for 9% of total failures
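The classification and ranking described above can be sketched as follows. This is a minimal illustration with made-up call records and error names, not Plumbr's actual pipeline:

```python
from collections import Counter

def is_failure(status_code):
    """A call is 'Failed' when the HTTP response code is 4xx or 5xx."""
    return 400 <= status_code < 600

# Hypothetical call records: (error_type, http_status); error_type is None on success.
calls = [
    (None, 200), ("NullPointerException", 500), ("TimeoutError", 504),
    ("NullPointerException", 500), (None, 200), ("ValidationError", 400),
    ("NullPointerException", 500), (None, 302),
]

failures = [err for err, code in calls if is_failure(code)]
total_failures = len(failures)

# Rank error types by the number of failed calls they spawned.
ranked = Counter(failures).most_common()

for rank, (error, count) in enumerate(ranked, start=1):
    relative_impact = 100 * count / total_failures
    print(f"Error #{rank}: {error} -> {relative_impact:.0f}% of failures")
```

Run against the sample records, the single NullPointerException error type accounts for 3 of the 5 failures, i.e. 60% of the total, mirroring how a single error can dominate an application's failure count.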

This may look like it proves the claim. However, since averages are notoriously misleading when making predictions for a specific situation, we sliced and diced the data further to verify it. For each application, we divided the data into one-week stretches, isolated the top errors in each week, and calculated their impact. This gave rise to the following observations:

  • 50% of the time, the single most impactful error causes at least 25% of all failures
  • 50% of the time, the three most impactful errors cause at least 54% of all failures
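The weekly slicing can be sketched like this: compute the share of each week's failures caused by the top error(s), then take the median across weeks. The statement "50% of the time, the most impactful error causes at least X% of failures" is exactly the median of those weekly shares. The week data below is hypothetical:

```python
import statistics
from collections import Counter

def top_error_impact(weekly_failures, k=1):
    """Share (%) of a week's failures caused by the k most impactful errors."""
    top_k = Counter(weekly_failures).most_common(k)
    return 100 * sum(count for _, count in top_k) / len(weekly_failures)

# Hypothetical week-by-week failure logs (error type per failed call).
weeks = [
    ["A"] * 30 + ["B"] * 10 + ["C"] * 60,               # top error: C, 60%
    ["A"] * 50 + ["B"] * 40 + ["C"] * 10,               # top error: A, 50%
    ["A"] * 20 + ["B"] * 20 + ["C"] * 20 + ["D"] * 40,  # top error: D, 40%
    ["A"] * 10 + ["B"] * 5 + ["C"] * 85,                # top error: C, 85%
]

weekly_top1 = [top_error_impact(week, k=1) for week in weeks]
print(f"median top-error impact: {statistics.median(weekly_top1):.0f}%")
```

With these sample weeks the median top-error impact is 55%, meaning that in half the weeks a single error caused at least 55% of all failures. The same computation with k=3 yields the top-three figures quoted above.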

Here are a couple of additional insights:

  • In 1 out of every 4 weeks, the total impact of the top-three errors is above 80%
  • In 1 out of every 10 weeks, the total impact of the top-three errors falls below 20%

So, there you have it: a claim about how a small number of errors affects web applications, quantified and verified by data. These claims should hold true for your applications as well. Use your Plumbr agents to verify the claim and start making improvements.

Expand your use of Plumbr. Incorporate it in planning meetings. Use it to determine the priority of tasks that your engineering team will tackle this sprint. Talk to us to find out how many different ways it can fit into your current workflows. Incident management, troubleshooting, and reporting are some of the areas we can help improve.

Write to csm@plumbr.io to discuss how we can help you set this up. We have a shared mission to give your users a faster and more reliable digital experience. Your success is our success!