Distributed tracing in practice
Incident management: an example
Let us start with a hypothetical support ticket landing on your desk:
From: John
To: firstname.lastname@example.org
Subject: Cannot complete checkout

I just tried to complete the order #32828, but was unable to finish the checkout. The UI stalled for 20 seconds and then gave me an error.
Eager to help the customer, you dive into troubleshooting. Fast-forward two weeks, and you find yourself receiving the seventh email in the thread, similar to the one below:
From: John
To: email@example.com
Subject: Re:Re:Re:Re:Re:Cannot complete the checkout

I finally managed to capture the HAR file from my browser using the modified instructions. However, it is too big to be sent as an email attachment. Please advise.
Apparently, the problem is still out there, and a solution is nowhere in sight. What on earth happened during these two weeks?
What we typically see happening in such situations is the following process unfolding:
- Attempts are made to reproduce the issue in test/dev environments. More often than not these attempts fail, due to discrepancies in setup, data, or load between such environments and production.
- Next, the focus switches to capturing evidence from production. The tools used vary based on the prior experience of the engineer tasked with the job, but our studies reveal that, on average, six different data sources are consulted to make sense of the situation. What is often missing is a systematic approach: more often than not, the work boils down to trial and error in the hope that some clues will eventually surface.
- In parallel, the customer is asked for additional evidence and more people are summoned to find the culprit. Eventually, some heroic and clever insight leads to resolving the issue.
The problem is obvious: the time and effort spent on resolving such a case are high, leading to both reduced customer satisfaction and increased costs.
Let’s see how this could be different when distributed tracing is in place and being used.
Incident management using distributed tracing
Let’s assume that a distributed tracing solution, such as Plumbr, had been adopted before the support ticket landed. The very same process would now look quite different.
The first step would still involve verifying the complaint:
Traces equipped with tags, such as the user identity (an email address in this case), allow you to inspect a particular user’s experience. The traces themselves will include the evidence for the failure, if there indeed was one, similar to the example below:
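To make the idea concrete, here is a minimal sketch of tag-based trace lookup. The trace records, tag names (`user.email`), and statuses below are hypothetical simplifications, not Plumbr's actual data model; real tracing backends expose similar filtering through their query UIs and APIs.

```python
# Hypothetical, simplified trace records as a tracing backend might store them.
traces = [
    {"trace_id": "t-1", "operation": "submitOrder", "status": "ERROR",
     "tags": {"user.email": "john@example.org"}},
    {"trace_id": "t-2", "operation": "submitOrder", "status": "OK",
     "tags": {"user.email": "jane@example.org"}},
]

def traces_for_user(traces, email):
    """Return all traces tagged with the given user identity."""
    return [t for t in traces if t["tags"].get("user.email") == email]

# Verify the complaint: did this particular user's checkout really fail?
johns_traces = traces_for_user(traces, "john@example.org")
failed = [t for t in johns_traces if t["status"] == "ERROR"]
```

The point is that the user identity captured as a tag turns a vague complaint into a direct query over the recorded traces.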
Now it can be verified both that submitting the order indeed failed for this particular user and that the root cause originated from the credit card charge attempt.
After the claim has been verified, the next step is to prioritize the response. In practice, this particular incident would not be the only issue the production deployment is facing. So how do you understand whether or not the underlying error is a high-priority one?
Again, information captured by the distributed traces would give you the answer:
As seen from the above, the chargeCreditCard span exposes the error thrown from the underlying runtime while attempting to execute the operation. The error has also been mapped to all the traces impacted by it, and in this specific situation it seems that this user is the only one suffering from its impact. As a result, the response can be objectively prioritized. Notice that this was possible because the distributed tracing solution mapped the errors to the traces and aggregated the impact by counting the distinct traces failing due to the very same error.
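The aggregation just described can be sketched in a few lines. The error fingerprints and trace records below are illustrative assumptions; the idea is simply to count, per distinct error, how many traces failed because of it.

```python
from collections import Counter

# Hypothetical flattened view: each failed trace carries the fingerprint
# of the error (e.g. exception class + location) that caused it.
failed_traces = [
    {"trace_id": "t-1", "error": "CardDeclinedException@chargeCreditCard"},
    {"trace_id": "t-2", "error": "NullPointerException@checkCredit"},
    {"trace_id": "t-3", "error": "NullPointerException@checkCredit"},
]

# Impact of each distinct error = number of different traces failing with it.
impact = Counter(t["error"] for t in failed_traces)

# Errors sorted by impact, highest first, ready for prioritization.
prioritized = impact.most_common()
```

An error seen in one trace and an error seen in hundreds of traces get very different priorities, even if the stack traces look equally alarming.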
So, is that it? We have verified that the error is really present and that the impact is low enough to avoid escalation. Can we move on? Before doing just that, let us expand the example and start tracing already in the browser used by the actual end user (as opposed to the server-side traces shown before):
A different picture is revealed: it is likely not chargeCreditCard that is the real culprit here. The checkCredit microservice also failed during the very same user interaction, and there is reason to suspect that the credit card charges subsequently fail because of this.
Expanding the traces to see the errors with their impact gives us additional information:
We can now see that 150 traces have failed due to the bug in checkCredit, changing the priority of the issue significantly. It is also likely that chargeCreditCard has just suffered collateral damage and can be ignored for the time being.
Before calling it a day, let us see how the situation can be improved even further. With tracing in place, you really shouldn’t wait until end users start complaining. Instead, you can set up alerts on the metrics extracted from the distributed traces, so you become aware of availability (and performance) issues as soon as the impact starts to manifest.
With quality tools such as Plumbr, it is as easy as setting up alerting on the metrics derived from traces. In this case, an error-rate alert would have given us the signal:
The time-series chart above exposes the throughput of the monitored application over time, as well as its error rate. An alert is also set up to notify the on-call team via PagerDuty whenever the error rate exceeds 1.3%.
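The alert logic itself is simple. The sketch below assumes a hypothetical `notify` callback and per-window trace counts; the 1.3% threshold matches the example above, while the window figures are made up for illustration.

```python
# A minimal sketch of an error-rate alert over a time window. The window
# shape and the notify() hook are assumptions, not Plumbr's actual API.
ERROR_RATE_THRESHOLD = 0.013  # 1.3%, as in the example above

def error_rate(window):
    """Error rate of a time bucket: failed traces / all traces."""
    total = window["ok"] + window["failed"]
    return window["failed"] / total if total else 0.0

def check_alert(window, notify):
    """Fire the notification channel when the threshold is breached."""
    rate = error_rate(window)
    if rate > ERROR_RATE_THRESHOLD:
        notify(f"Error rate {rate:.1%} exceeds threshold")

# Example window: 150 failed traces out of 5,000 -> 3.0% -> alert fires.
alerts = []
check_alert({"ok": 4850, "failed": 150}, alerts.append)
```

In production the `notify` callback would forward to PagerDuty, Slack, or whichever channel the on-call team uses.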
With the alerting in place, you will be notified via your preferred channel (PagerDuty, Slack, email, JIRA, …) whenever users start experiencing availability issues (or poor performance, in which case latency alerts should be set up).
And when responding to the alert, the on-call team would also have exposure to the list of errors and bottlenecks impacting the users the most, similar to the following:
If an alert triggered, then looking at the list above, it would be obvious that:
- Fixing the three errors with the biggest impact would mitigate 99% of the impact
- Fixing just the single error with the biggest impact would already yield 66% of the gains.
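Percentages like these fall straight out of the per-error trace counts. The figures below are invented to match the 66% and 99% quoted above; any real numbers from a trace backend plug in the same way.

```python
# Illustrative impacted-trace counts per error, chosen to match the
# percentages above; these are assumptions, not real data.
error_impact = {"err-A": 660, "err-B": 200, "err-C": 130, "err-D": 10}

total = sum(error_impact.values())           # 1,000 impacted traces
counts = sorted(error_impact.values(), reverse=True)

top1_share = counts[0] / total               # share mitigated by fixing one
top3_share = sum(counts[:3]) / total         # share mitigated by fixing three
```

Because impact is expressed in the same unit everywhere (failed traces), the prioritization is arithmetic rather than guesswork.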
Through this example, we have seen how distributed tracing helps on-call teams follow a simple process:
- Set up alerts triggering on the impact
- Send the alerts to the channels used by the on-call team
- Respond to incidents using the root causes revealed.
Note that this became possible thanks to distributed tracing being adopted. The benefits materialized because good tracing solutions:
- tag the traces, supporting multi-dimensional querying
- link errors and bottlenecks to the traces, allowing different bugs to be prioritized
- avoid head-based sampling, instead capturing all traces and using tail-based sampling to make sure every trace exposing an issue is retained.
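The last point deserves a sketch. With head-based sampling, the keep/drop decision is made when a trace starts, so a failing trace can be dropped before anyone knows it will fail. Tail-based sampling decides after the trace has completed. The latency threshold and trace shape below are illustrative assumptions:

```python
# A sketch of a tail-based sampling decision: the keep/drop choice is made
# only after the whole trace has completed, so every trace exposing an
# issue can be retained. The latency budget is an illustrative value.
SLOW_TRACE_MS = 2000

def keep_trace(trace):
    """Keep traces that contain an error or exceed the latency budget;
    everything else is a candidate for sampling down."""
    has_error = any(span.get("error") for span in trace["spans"])
    too_slow = trace["duration_ms"] > SLOW_TRACE_MS
    return has_error or too_slow

healthy = {"duration_ms": 120,
           "spans": [{"name": "submitOrder"}]}
failing = {"duration_ms": 180,
           "spans": [{"name": "chargeCreditCard", "error": True}]}
```

A head-based sampler given a 1% sampling rate would have discarded roughly 99% of the failing checkouts in our example; the tail-based rule above keeps all of them.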
If you made it this far in the post, I can only recommend that you take a look at how easy it is to set up distributed tracing using Plumbr.