Distributed Request Tracing With Plumbr
The dashboard above is an example of how many services come together to provide a summary view of business operations. Analysts craft such a view around a team's requirements, and it draws on several systems running separately; some, such as identity providers, are entirely independent. The downside: when the data fails to display for a user, where does troubleshooting begin? Are UI/UX teams paged? Are frontend engineers expected to pick up the investigation? Do you page the team responsible for building the REST APIs being called?
The most common answer we’ve heard is “it depends”. It’s almost impossible to sustain “it depends” as a long-term answer, which is why Plumbr puts forward a different one. Our solution identifies the right team and eliminates manual troubleshooting, and distributed request tracing plays a big role in this.
Generally speaking, distributed traces benefit engineering teams because they provide critical visibility: they make a transaction visible beyond the bounds of any single subsystem or process. This is achieved by instrumenting code so that a trace is produced as each request interacts with the services that make up an application. Every team that contributes code to the application is also required to build the infrastructure that supplies the trace.
Typically, distributed tracing systems consist of three main components:
- An underlying data model that describes the trace. This takes the form of a unique id that is propagated through each call spawned by a request.
- A central server that collects data from production software and assembles traces.
- Agents that are attached to the nodes and clusters running in production. These agents create the trace data as each request is processed.
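To make the first component concrete: the heart of the data model is a trace ID minted at the edge and propagated through every downstream call. A minimal sketch using only the standard library (the header names and functions here are hypothetical, not any particular tracing system's wire format):

```python
import uuid

def handle_incoming_request(headers: dict) -> dict:
    """Reuse the caller's trace ID if present; otherwise start a new trace."""
    trace_id = headers.get("X-Trace-Id") or uuid.uuid4().hex
    span_id = uuid.uuid4().hex  # unique ID for this hop in the call tree
    return {"trace_id": trace_id, "span_id": span_id}

def outgoing_headers(ctx: dict) -> dict:
    """Propagate the trace ID (and this hop's span ID as the parent) downstream."""
    return {"X-Trace-Id": ctx["trace_id"], "X-Parent-Span-Id": ctx["span_id"]}

# A request arrives with no trace context, so a new trace begins here.
ctx = handle_incoming_request({})
# Every call this service spawns carries the same trace ID onward,
# which is what lets the central server stitch the hops into one trace.
downstream = handle_incoming_request(outgoing_headers(ctx))
assert downstream["trace_id"] == ctx["trace_id"]
```

Because every hop echoes the same trace ID, the central server can later group spans from independent services into a single end-to-end trace.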
Engineering teams will have to architect these three components for each part of their infrastructure that they would like to trace.
An alternative way to gain traces is to use one of the many libraries that add tracing capabilities to code. Zipkin and Jaeger lead the pack here: they are the most popular ways teams build tracing, they adhere to open standards, and both have fairly large communities contributing actively to the core.
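For a flavor of what these libraries record, a Zipkin v2 span is a small JSON object with hex IDs and microsecond timestamps. The sketch below builds two such spans by hand; the service names and timings are invented for illustration:

```python
import json
import time
import uuid

def make_span(trace_id, parent_id, name, service, start_us, duration_us):
    """Build a dict shaped like a Zipkin v2 span (IDs are hex strings,
    timestamps and durations are in microseconds)."""
    span = {
        "traceId": trace_id,
        "id": uuid.uuid4().hex[:16],
        "name": name,
        "timestamp": start_us,
        "duration": duration_us,
        "localEndpoint": {"serviceName": service},
    }
    if parent_id:
        span["parentId"] = parent_id  # links this hop to its caller
    return span

trace_id = uuid.uuid4().hex
now_us = int(time.time() * 1e6)
root = make_span(trace_id, None, "get /dashboard", "frontend", now_us, 120_000)
child = make_span(trace_id, root["id"], "select orders", "orders-db",
                  now_us + 5_000, 90_000)
# Reporters typically POST a JSON array of spans to the collector.
payload = json.dumps([root, child])
```

The `parentId` link is what lets the collector reconstruct the call tree from spans that arrive out of order and from different services.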
A caveat, however – implementing these is neither simple nor straightforward. There are several layers of complexity that engineering teams must peel back to make them functional. In particular, neither Zipkin nor Jaeger offers a SaaS version.
Perhaps the simplest way to get tracing working is to buy from one of the several vendors who offer hosted servers that assemble and help visualize tracing data. Every APM vendor offers a version that engineers can use.
Plumbr APM offers tracing capabilities too. However, we have made a few changes to the way tracing is traditionally done.
- Our data model supports root causes. Tracing models typically include only outcomes and durations; Plumbr additionally captures root cause information and stack traces along with each error or bottleneck.
- Our agents are built at the lowest level possible. This allows us to assemble traces that are language-agnostic.
- Plumbr traces begin from the browser. By including information about user interactions, the Plumbr dataset exposes a large amount of detail that is skipped by others. This provides complete feedback with rich and granular information about what a user was doing when encountering errors and bottlenecks.
- Our overheads are minimal. It would be pretty ironic if a performance management tool degraded the performance of the software it is meant to monitor. We’ve blogged about this in the past if you’re interested in the details.
- Data from Plumbr can be reconstituted to develop service topology, product analytics, flame graphs, and several relevant histograms.
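To illustrate the idea behind the first point, capturing a root cause alongside an outcome can be as simple as attaching the stack trace to the span record when an exception escapes. This is a hand-rolled sketch of the concept, not Plumbr's actual agent code:

```python
import traceback

def traced_call(span: dict, fn, *args):
    """Run fn, recording the outcome on the span; on failure,
    attach the full stack trace as the root cause before re-raising."""
    try:
        result = fn(*args)
        span["outcome"] = "success"
        return result
    except Exception:
        span["outcome"] = "error"
        span["root_cause"] = traceback.format_exc()  # stack trace for triage
        raise

span = {"name": "load-widget"}
try:
    traced_call(span, lambda: 1 / 0)
except ZeroDivisionError:
    pass
# The span now records not just that the call failed, but why:
# the stack trace points straight at the offending line of code.
```

With the root cause stored in the trace itself, the on-call engineer does not have to reproduce the failure to find the responsible code path.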
Incidents do not happen in isolation. They tend to affect all the services upstream and downstream from them. If a web application responds slowly, it could be that one of the services involved had a very slow response, or that each hop through the call stack contributed to the overall slowness. It is difficult to pinpoint this behavior when relying on aggregate metrics alone, especially with distributed infrastructure and a microservices architecture. Unless each request is broken down by the constituent services it interacts with, it is difficult to say with certainty where bottlenecks originate.
In this example, you can see how a dashboard is affected by the services downstream from it. For a user interacting with the dashboard, most of the data is displayed quickly, but one data point is unavailable and another takes a long time to load and display. With distributed request tracing, it is easy to break the request down into which services are failing, which are underperforming, and what critical path the request takes. Here, a processing delay in the data store and an availability issue in the backend leave the dashboard incomplete for that user.