Prioritizing Availability Issues
Deciding the priority of errors can lead to confusion and, often, conflict within teams. Before starting to prioritize, engineers need to accept that not all errors need to be fixed. Let us begin with that very question: “Do we need to fix all the errors users are experiencing?”
Availability issues are perennial. The non-deterministic nature of the real world forces these issues upon us, and we must accept that. Picking the right issues to work on is a difficult task for engineers who are proud of their work. Here are a few examples of failures that originate from behavior we cannot predict or control:
- An integration built with a particular delivery vendor is failing. About 2% of users, who are choosing this delivery option, are facing failing interactions. The business has decided to sever ties with the vendor due to logistics and operational issues, which will make this integration obsolete. It will be removed from the list, so users will no longer be able to opt for it at all. Given that the integration will be available for only a few more weeks, and that other options are available to users, is it worth fixing now?
- Some availability issues surfaced on the single AWS availability zone that services were deployed to. One of the options to alleviate this is to invest in a multi-zone deployment. Considering that the next issue is only likely to occur after a long time, is the migration necessary? Is it worth the associated costs – time, effort, and money?
This is by no means an exhaustive list of the possibilities. Engineering teams face such predicaments every day and must decide on the right tradeoffs. Since engineering resources are limited and there is always a long list of competing new features that business and product owners wish to deliver, compromises must be made. Accepting that all systems will contain availability issues that never get fixed is a step towards improvement. For some errors, “Won’t fix!” is the only reasonable decision to make.
The next ‘phase’ in the engineering lifecycle when it comes to addressing failures is to validate a list of failures, order them by priority, and pick one of three outcomes – “Will fix now”, “Will fix later”, or “Will not fix”. Let us build on this theory with a hypothetical example. Assume that this is the data provided to us from our real-user monitoring system that is attached to a web application in production:
| List of Failures | Impacted users |
Given this information, the choice seems easy: since ‘Failure C’ impacts the largest number of users, fix that first and proceed in decreasing order of impact. However, this data is missing some key components required to make an informed decision about the order in which to proceed.
Mitigating any failure requires some effort from engineers, which means that each fix costs the organization a finite amount of effort, time, and money. Without tabulating these estimates, no conclusion can be drawn about what the priorities are. With that information added, the table reads as follows:
| List of Failures | Impacted users | Cost to mitigate | Efficiency of fix |
Adding these dimensions reveals the efficiency of the effort spent mitigating each failure. Here is a breakdown of the tabulated information:
- Fixing Failure A can happen in one day.
- As a result, two users per week are no longer experiencing errors.
- The efficiency of fixing Failure A is therefore two users no longer facing errors per engineering day spent.
- For Failure C, eight full calendar weeks of work are required. As a result, each engineering day spent yields only one user who is no longer experiencing errors.
Now, this kind of analysis provides a completely different vantage point from which to view the priorities of the engineering team. Each team needs to ascertain the best method of arriving at the efficiency of a fix and then proceed to work on them.
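The efficiency calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Failure` class and `prioritize` function are hypothetical names, a calendar week is assumed to contain five engineering days, and the impacted-user figure for Failure C is back-calculated from the stated efficiency rather than taken from the original table.

```python
from dataclasses import dataclass

# Assumption: one engineer working five days per calendar week.
WORKDAYS_PER_WEEK = 5


@dataclass
class Failure:
    name: str
    impacted_users_per_week: int
    cost_engineering_days: float

    @property
    def efficiency(self) -> float:
        """Users freed from errors per engineering day spent on the fix."""
        return self.impacted_users_per_week / self.cost_engineering_days


def prioritize(failures):
    """Order failures by efficiency of the fix, most efficient first."""
    return sorted(failures, key=lambda f: f.efficiency, reverse=True)


failures = [
    # From the breakdown above: one day of work, two users per week.
    Failure("Failure A", impacted_users_per_week=2, cost_engineering_days=1),
    # Eight calendar weeks of work. The 40 users are an assumed value,
    # derived from the stated efficiency of one user per engineering day.
    Failure("Failure C", impacted_users_per_week=40,
            cost_engineering_days=8 * WORKDAYS_PER_WEEK),
]

for failure in prioritize(failures):
    print(f"{failure.name}: {failure.efficiency:.1f} users per engineering day")
```

Sorted this way, Failure A comes ahead of Failure C even though Failure C affects more users, which is the change of vantage point the breakdown describes.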
We have seen that failures are perpetual and that the list of potential failures keeps growing. How can teams know when to stop fixing failures? Engineering teams are tasked not just with fixing failures, but also with building new features and adding to the product. The best way to balance this is to calculate the impact on business metrics and ascertain the ROI per fix.
However, even this might not be very pragmatic. It may not be possible to arrive at these data points for all the errors you will be encountering. Two simple rules will help you create a basis for knowing what to fix and when to stop:
1. Use a metric that gives you efficiency against effort to prioritize your list of errors.
Engineering teams should not be making a decision based on the list of errors. They should be looking at the aggregate impact to make sure the service is operating with good enough quality for now. Create an agreement with the business that the expected quality of the service is represented in the number of users who are not experiencing errors in any given time frame. Operate your failure-mitigation efforts based on these benchmarks.
2. Know that error lists are top-heavy, where large gains come from small efforts.
Every real-world list of errors is top-heavy. Our experience shows that for most applications, the top 3 errors by impact account for a large fraction of the total impact. There will always be a long tail of errors impacting just a few users, most of which can be safely ignored.
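Both rules can be checked against monitoring data with very little code. The sketch below is a minimal illustration, not a prescribed method: the function names, the error names, the user counts, and the 98% target are all hypothetical, chosen only to show the shape of the calculation.

```python
def top_n_share(impact_by_error, n=3):
    """Fraction of total impact caused by the n most impactful errors."""
    total = sum(impact_by_error.values())
    if total == 0:
        return 0.0
    top = sorted(impact_by_error.values(), reverse=True)[:n]
    return sum(top) / total


def meets_quality_target(total_users, users_with_errors, target):
    """True if the share of users who saw no errors reaches the agreed target."""
    error_free_share = (total_users - users_with_errors) / total_users
    return error_free_share >= target


# Hypothetical impacted-user counts per error for one time frame.
impact_by_error = {
    "checkout-timeout": 500,
    "login-500": 300,
    "search-slow": 100,
    "avatar-upload": 40,
    "export-csv": 30,
    "locale-switch": 20,
    "legacy-redirect": 10,
}

print(f"Top 3 share of impact: {top_n_share(impact_by_error):.0%}")
print("Target met:", meets_quality_target(10_000, 150, target=0.98))
```

If the top few errors account for most of the impact and the agreed target is already met, the remaining long tail is a reasonable candidate for “Will not fix”.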
Plumbr was built with the goal of helping engineers work with objective evidence about availability issues on their production applications. This helps remove guesswork in how applications in production are managed. Equipped with data from actual usage, engineering teams can prioritize their efforts correctly and apply them to the right areas.