Time in distributed systems
During the past months a lot of my time has been spent in building the support for distributed traces in Plumbr. Among everything else I found myself dealing with different issues related to handling the time in distributed systems.
Apparently time in such systems is a whole lot different concept than in non-distributed systems. Consider for example a deployment where calls to the system are intercepted at node A. Node A makes a call to node B which in turn calls node C. The calls are synchronous and until the response is returned, the caller thread is blocking.
Now if the these three nodes would have slightly different system clocks, we could easily end up with distributed traces like seen below:
The trace above does not make sense in several aspects, for example:
- Node B claims to have received the request 5 seconds before node A sent the request to it
- Node C says it responded to B 10 seconds later than B seems to have received the response.
So we had to find a way how to solve this issue and find a way to handle the time when capturing distributed traces. But before jumping to our solution, lets see how the system clocks can end up drifting apart in the first place.
How can time drift happen?
Every software deployment is relying upon the presence of a system clock. Without going too deep into hardware world – it is a small quartz crystal located on the motherboard oscillating at (seemingly) predictable rate. This rate can be used to monitor how many time units have passed by since the crystal started oscillating.
Apparently there are two fundamental issues with this approach
- These crystals do NOT oscillate at truly predictable and unified manner. Tiny differences in their composition result in what is known as time drift. If you unbox two computers manufactured at the same time in the same factory and plug them on at the same time, you will still end up with different system clocks on the machines after a while.
- There is no way to say to the crystal to know what is the current time. All it knows is to oscillate and count ticks. But whether it started counting on 1st of January 2000 or 31st of December 2016 it has no idea. This is why every system setup requires you to input the current date & time. As this includes human interaction along with imprecise clocks used as reference, you are guaranteed to have different times set on different nodes.
To compensate for both of the fundamental issues, a solution called NTP is built. It builds upon synchronizing the system clocks of computers against a centralized reference clock. Having all the nodes periodically check the system clock against such a reference clock would then limit the time drift in such a system.
Unfortunately this is exactly the case with NTP. It does not eliminate the time drift, it only reduces the drift.
To make things from bad to worse – you are only rarely in control of all the nodes in your system. This means you cannot synchronize them against the same NTP service. In case you doubt this – what about the end user devices interacting with the webapp you do control in your server room? There is no way to enforce the same NTP service on all the end users, so by definition, you are guaranteed to have nodes where the current time is set to some completely arbitrary moment in the timeline
I will stop here without going deeper into the issues why the system clocks can and will drift apart from one another. What you can count upon is the fact that you are guaranteed to have different times in different nodes in your distributed systems. From what we have seen the time drifts can range from just milliseconds to decades. Yes, DECADES, some system clocks out there are reporting themselves to be counting time towards the end of 1996.
Solving the time drift for our use case
As we saw, the time drifting cannot be eliminated for distributed systems. So we had to come up with a custom solution taking the presence of the drift into account. It builds upon the following requirements:
- The time drift in nodes participating in the distributed transaction must be limited
- Order of events must be correct
- Alignment of the events must make sense
Lets now see how we fulfilled the requirements.
Minimizing the time drift
Limiting the time drift was the easy part of the solution. It used the following approach, relying upon the Agent – Server deployment model of Plumbr:
- Agent initiates handshake with the Server. The handshake has a timeout of 60 seconds, after which the Agent assumes the Server is not available and drops the connection attempt to retry later.
- Server accepts handshake and initializes certain data structures needed for the new connection. Then, right before sending response to the handshake, Server injects its current timestamp into response.
- Agent receives response, grabs the timestamp from the response and adjusts its time to be equal to this timestamp received
This way we can guarantee that the Agent time is not different from Server by more than 60 seconds. It can occasionally even increase the drift for some Agents, but it builds us a safety net of never dealing with more than 60 seconds of drift.
As an example, let’s check the following situation:
- Agent sends a handshake to Server at 00:30 according to the system clock in Agent machine.
- Server receives the request at 04:55 according to Server time.
- Server injects the timestamp to the response at 05:00 according to Server time.
- Agent receives the response to the handshake at 01:00 according to Agent time.
- Now the Agent adjusts its internal clock (we are never touching system clocks) by four minutes forward (the difference between Agent and Server clocks) on each timestamp it generates.
This way we have effectively introduced our own custom-built time synchronization against which the nodes are syncing the clock each time the node is (re)connecting to our monitoring server. Now let’s move on to the more interesting part and see how we can order the events in the distributed system in a way the order will make sense.
Aligning the events
Our solution for this was to align the spans on timeline after we have received them. This way we are not relying upon any kind of distributed clock. Let us proceed with the algorithm description:
Consider we have a distributed transaction involving two nodes: A and B:
Node A starts the transaction, performs some work, sends a request to node B, waits for response, processes response, does some more work and finishes
This transaction now contains two spans: a and b. Each span has the following attributes
- parentId (nullable)
- Four timestamps:
Note that b.requestSent and b.requestReceived are captured in different nodes likely deployed on different physical machines. That’s why we can’t use b.requestReceived – b.requestSent to calculate network latency between JVMs A and B.
Now that we have the model, we can start describing the alignment algorithm using the same example of a distributed transaction as in the beginning of the post:
The transaction above took 1m 30 seconds to complete and consisted of three spans, each originating from a different node. The picture above does not make any sense, as:
- Node B claims to have received the request 5 seconds before node A sent the request to it
- Node C seems to have responded to B 10 seconds later than B claims it received the response.:
Now our algorithm would start by locking the time in one of the JVMs and aligning all other spans according to it. We start with locking the time in JVM A in place. This means that we next need to shift time in JVM B so that the span alignment would start making sense.
Apparently the correction can be anything in between 5 and 25 seconds forward shift:
- A sent out the request to B at 00:40 and B claims to have received it 00:35. Having locked the time in node A and assuming time travel is not possible, it becomes clear that we need to shift the time in node B at least 5 seconds forward.
- B sent the response back to A at 01:30. A claims to have received it at 01:55. From this we can deduce that the shift cannot be greater than 25 seconds, otherwise we would again have reinvented the “time travel” which we agreed is impossible.
Now the algorithm could pick any forward shift in JVM B between 5 and 25 seconds. Instead of doing a random pick from this range, the algorithm goes one step further and picks the average of the endpoints, 15 seconds in this case. Why so?
As explained earlier, our model is imperfect in regards of not knowing the exact network latency for each request or response. It does however know the sum of the latency added by each request-response pair:
- Node A sent out the request at 00:40 and received it at 01:55. Whether or not the timestamps are correct is not relevant, but we can be certain that the delta of the timestamps is correct and we can claim that the node A was blocking for one minute and 15 seconds while waiting for the Node B to respond.
- Node B in turn knows that it took him 55 seconds to respond to the call received (claiming it received the request at 00:35 and responded to it at 01:30)
Using this information we can see that the total latency for this request-response round trip was 20 seconds. Without knowing how much of this 20 seconds was added during request and how much during response, the algorithm treats these as equal and makes sure there is 10 seconds network latency gap both before the start and after the end of the span from node B:
The algorithm would now attempt to repeat the same step for node C, but it turns out that span c is already perfectly aligned. So we have a consistent transaction aligned relative to JVM A with the start time is 00:30.
If we could trust A to have correct time, we could stop here. Unfortunately we can’t so we go ahead and repeat the process and lock other spans one by one. Doing so gives us different transaction start timestamps based on the time in each node.
For example, repeating the procedure and aligning spans relative to JVM B would result in a following transaction being assembled:
Aligning spans according to Node B, the transaction start drifts to 00:15. Repeating the process again and locking the Node C, the start time of the transaction would be 00:30. So we now have three opinions about when the transaction could have started:
- Node A thinks it started 00:30
- Node B is sure it started 00:15
- Node C is convinced the start was at 00:30
Having multiple opinions now allows us to use the wisdom of the crowd and pick the median of the start times to be used as the alignment base. Median of the dataset above happens to be 00:30 which corresponds to the alignment relative to JVMs A and C. Now our algorithm is finished and will use the alignment around JVM C as the correct version.
So, apparently when it comes to the concept of time in distributed system then the primary goal is to achieve correct order of the events. Only when this is achieved, positioning such events on timeline will start making sense.
We have achieved just that and are able to serve you the distributed traces from your systems by limiting the time drift in all participating nodes and realigning the spans on timeline based on what the majority of the nodes think the correct time is.
By doing so, we admit that the outcome is not precisely aligned on the timeline. But for practical purposes, we have minimized the drift and achieved a way to compose a distributed trace in a way which makes sense for our users. So in case you are interested in seeing the monitoring solution described in this post in action, just go ahead and grab our free trial to check it out.