One of the primary metrics that engineers responsible for web applications like to track is ‘availability’. Let’s examine the different approaches to quantifying availability, along with the advantages and downsides of each. This will provide a good baseline for constructing a custom approach if required.
To illustrate the different approaches, let us assume a scenario that mimics real-life outcomes for an engineering team developing and maintaining an e-commerce web application. Our goal is to derive a meaningful measurement of ‘availability’ for this e-commerce system. We will use the following failures, which occur during the checkout process for users of the application:
- Special characters failure: Users with regular ASCII characters in their names are experiencing a smooth checkout. Users with non-ASCII letters in their names are stuck with validation errors when filling in their contact details and cannot complete the checkout.
- Geography failure: The checkout process is available for anyone accessing the shop from Europe or Asia. However, the users from North America are facing errors due to an outage in the CDN servicing users from this region.
Measured by time
Historically, availability has been a measure of the fraction of time a system is operational, out of the total time it is expected to be operational. By definition, this is:
Availability (%) = Uptime ÷ (Uptime + Downtime) × 100
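As a minimal sketch of the formula above, the following Python snippet computes time-based availability and the downtime budget implied by a target percentage. The function names and the 45 minutes of downtime are assumptions made up for illustration; they do not come from any real system.

```python
def availability_percent(uptime_minutes: float, downtime_minutes: float) -> float:
    """Time-based availability: uptime as a percentage of total observed time."""
    total = uptime_minutes + downtime_minutes
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return 100.0 * uptime_minutes / total


def allowed_downtime_minutes(target_percent: float, period_minutes: float) -> float:
    """Downtime budget implied by an availability target over a period."""
    return period_minutes * (1.0 - target_percent / 100.0)


MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43,200 minutes

# Hypothetical month with 45 minutes of recorded downtime.
print(round(availability_percent(MINUTES_IN_30_DAYS - 45, 45), 3))

# A 99.9% target leaves roughly 43 minutes of downtime in a 30-day month.
print(round(allowed_downtime_minutes(99.9, MINUTES_IN_30_DAYS), 1))
```

Note that both functions only see minutes of uptime and downtime. Nothing in them can express a system that is up for some users and failing for others.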
Discussions commonly center on this exact percentage (e.g., 99.9%, 99.95%). While this approach appears straightforward, it has too many downsides to be pragmatic. When a web application is composed of many services, it is very rare for the whole application to be down for all its users. It is equally rare for every user to enjoy a successful outcome. The typical case is one where the services are partially operational, serving a significant subset of the users just fine while failing miserably for others.
When we apply this understanding of availability to our scenario, the service is characterised as 100% available, since there is no downtime. However, this figure does not account for the users experiencing the example failures. Therefore, simply using time to characterize the availability of a web application is not a reasonable approach.
How could we measure service availability in a way that is better aligned with our business goals? In the case of an e-commerce site, the business goal is to sell goods and make money. Perhaps we could try to measure the monetary impact that our hypothetical availability issues have on our business? Let us find out next.
Measured in monetary terms
When users cannot complete a checkout, the business loses revenue in addition to the damage to user experience. Assuming an average shopping cart size of $40, the failures translate into monetary loss as follows:
| Failure | Nr of failed checkouts | Monetary loss |
|---|---|---|
| Special characters | 3 | $120 |
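The arithmetic behind the table can be sketched in a few lines of Python. The $40 average cart size and the three failed checkouts come from the text; the function and variable names are hypothetical.

```python
AVERAGE_CART_VALUE_USD = 40  # average shopping cart size assumed in the text


def estimated_loss_usd(failed_checkouts: int,
                       average_cart_value: float = AVERAGE_CART_VALUE_USD) -> float:
    """Naive estimate: every failed checkout is counted as a lost average cart."""
    return failed_checkouts * average_cart_value


# Failed checkout counts per failure type, taken from the table.
failed_checkouts_by_failure = {"Special characters": 3}

losses = {
    failure: estimated_loss_usd(count)
    for failure, count in failed_checkouts_by_failure.items()
}
print(losses)
```

The estimate treats every failed checkout as a sale that would otherwise have happened, which is a strong assumption.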
In reality, however, it is quite rare to be able to pin exact revenue figures to user interactions. If one were to drill down further into the checkout process, there are measures such as conversion rates, usability heuristics, and other factors that would distort the picture. Can we be sure that the IE11 users who experienced errors did not actually make a purchase? Perhaps they used a different browser and completed the purchase? Maybe they never intended to buy anything?
The further you move away from parts of the application that deal with revenue, the more complex it becomes to correlate revenue with failures experienced. For example, how would “Failure to load a live-chat widget” correspond to lost revenue? When using internal-productivity applications, where revenue isn’t even a factor, this approach would be wholly inappropriate.
Measured as a proxy for money
We have seen some of the downsides of characterizing availability by time or by revenue. Is there a way to normalize the impact of these failures against a factor other than these two? Here are a few approaches you can use, depending on the business you are in:
- Number of failed API calls.
  - Gives a consolidated measure of the health of the application.
  - Easy to capture from logs or from an APM deployed to the API on the server side.
- Number of users experiencing failures.
  - Gives an objective measure of who is experiencing failures.
  - A DIY approach using cookies can generate this data.
  - Requires a link to be established between users and failures.
  - Many RUM vendors offer this as a product feature.
- Number of user interactions failing.
  - Lets you build a link between a failure and a specific activity (e.g., a click on “Checkout” on /cart).
  - Good RUM vendors will equip you with this information.
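To make the difference between these three measures concrete, here is a minimal Python sketch that derives all of them from the same list of recorded API calls. The `ApiCall` record, its fields, and the sample data are hypothetical and invented for illustration; a real APM or RUM product will expose this data in its own format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiCall:
    user_id: str        # hypothetical identifier, e.g. derived from a cookie
    interaction: str    # the user activity that triggered the call
    succeeded: bool


def failed_api_calls(calls):
    """Number of failed API calls."""
    return sum(1 for call in calls if not call.succeeded)


def users_with_failures(calls):
    """Distinct users who experienced at least one failure."""
    return {call.user_id for call in calls if not call.succeeded}


def failing_interactions(calls):
    """Failure count per user interaction."""
    return Counter(call.interaction for call in calls if not call.succeeded)


# Made-up sample data, not taken from any real system.
SAMPLE_CALLS = [
    ApiCall("user-1", "Click 'Checkout' on /cart", True),
    ApiCall("user-2", "Click 'Checkout' on /cart", False),
    ApiCall("user-2", "Click 'Checkout' on /cart", False),  # same user retrying
    ApiCall("user-3", "Submit contact details", False),
]

print(failed_api_calls(SAMPLE_CALLS))
print(len(users_with_failures(SAMPLE_CALLS)))
print(failing_interactions(SAMPLE_CALLS))
```

In this sample a single user retrying the checkout inflates the count of failed calls but not the count of affected users, which is one reason the measures can tell different stories about the same incident.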
Here is one way to view the example failures when this understanding is applied to them:
| Failure | Nr of users affected |
|---|---|
With Plumbr, our goal is to equip engineers and engineering teams with data that makes it easy for them to quantify the impact of failures. Each engineer should have the necessary transparency into what they consider a good measure of availability. Equipped with this information, they should be able to gain control of user experience, be able to fix errors faster, and ultimately provide users with a reliable digital experience.