Discovering the systems your application is integrated with

March 29, 2017 by Ivo Mägi Filed under: Monitoring

This post covers the experience I gained while implementing a particular Plumbr feature. The feature is part of our Runtime Application Architecture discovery and visualization, which gives you an overview of how your end users access the different services in your infrastructure. The outcome looks similar to the example below:

[Figure: real-world example of a discovered application architecture]

The feature in question is responsible for discovering all the “downstream” integrations a particular Java Virtual Machine connects to. These integrations are then visualized on the runtime application architecture, like the yellow nodes in the example above.

The key constraint was that the downstream nodes had nothing from Plumbr installed on them. All we had was the hope that we could somehow find out which nodes the JVMs connect to, using only the information available on the nodes where our Java agents reside.

As it turned out, we were able to do just that. The solution we picked relies on the fact that almost every outbound connection from a JVM uses either the java.net.Socket or the java.nio.channels.SocketChannel API.

So conceptually it was as easy as instrumenting the corresponding Socket.connect() and SocketChannel.connect() implementations to be notified when and where a connection was established. The destination of a particular Socket or SocketChannel is conveniently available in the arguments of the corresponding connect() method. Capturing these arguments gave us the host and port the connection was made to.
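To make the idea concrete, here is a minimal sketch of what an instrumented connect() could hand over to a monitoring agent. It is not Plumbr's actual agent code: the ConnectHook class, its Endpoint type and the fromConnectArgument() method are hypothetical names chosen for illustration, and the bytecode instrumentation that would call them is left out. Only the java.net calls are real JDK API.

```java
import java.net.InetSocketAddress;
import java.net.SocketAddress;

/** Hypothetical hook that an instrumented connect() would call. */
final class ConnectHook {

    /** Host and port of a downstream endpoint. */
    static final class Endpoint {
        final String host;
        final int port;

        Endpoint(String host, int port) {
            this.host = host;
            this.port = port;
        }

        @Override
        public String toString() {
            return host + ":" + port;
        }
    }

    /**
     * Extracts the destination from the argument passed to connect().
     * Returns null for address types other than InetSocketAddress,
     * which carry no host and port.
     */
    static Endpoint fromConnectArgument(SocketAddress remote) {
        if (!(remote instanceof InetSocketAddress)) {
            return null;
        }
        InetSocketAddress inet = (InetSocketAddress) remote;
        // getHostString() returns the hostname, or the literal IP if no
        // hostname is known, without attempting a reverse DNS lookup.
        return new Endpoint(inet.getHostString(), inet.getPort());
    }
}
```

For an address created with InetSocketAddress.createUnresolved("db.example.internal", 5432), where the hostname is made up, the sketch yields an endpoint that prints as db.example.internal:5432.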

As we needed to embed the feature in the context of the runtime application architecture, we also needed to know when a connection was no longer in use. This information became available when we started instrumenting the corresponding close() invocations on the Socket and SocketChannel classes. As a fallback, we also keep track of the connection references, so a connection becoming eligible for GC also signals that it is no longer active.
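The combination of explicit close() tracking and the GC fallback can be sketched with weak references and a reference queue. This is an illustration under assumptions, not our production code: ConnectionTracker and its method names are hypothetical, the endpoint is passed in as a plain string, and the connection is typed as Object so that the same registry can hold both Socket and SocketChannel instances.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry of live outbound connections. */
final class ConnectionTracker {

    /** Weak, identity-based map key, so tracking never keeps a connection alive. */
    private static final class Key extends WeakReference<Object> {
        private final int hash;

        Key(Object connection, ReferenceQueue<Object> queue) {
            super(connection, queue);
            this.hash = System.identityHashCode(connection);
        }

        @Override
        public int hashCode() {
            return hash;
        }

        @Override
        public boolean equals(Object other) {
            if (this == other) {
                return true;
            }
            if (!(other instanceof Key)) {
                return false;
            }
            Object mine = get();
            return mine != null && mine == ((Key) other).get();
        }
    }

    private final ReferenceQueue<Object> collected = new ReferenceQueue<>();
    private final Map<Key, String> active = new ConcurrentHashMap<>();

    /** Called from the instrumented connect(). */
    void onConnect(Object connection, String endpoint) {
        active.put(new Key(connection, collected), endpoint);
    }

    /** Called from the instrumented close(); returns the endpoint, or null if unknown. */
    String onClose(Object connection) {
        return active.remove(new Key(connection, null));
    }

    /** Fallback: reports endpoints whose connections were garbage collected unclosed. */
    List<String> drainCollected() {
        List<String> endpoints = new ArrayList<>();
        Reference<?> ref;
        while ((ref = collected.poll()) != null) {
            String endpoint = active.remove(ref);
            if (endpoint != null) {
                endpoints.add(endpoint);
            }
        }
        return endpoints;
    }

    int activeCount() {
        return active.size();
    }
}
```

In this sketch drainCollected() would have to be called periodically, for example from a background thread. When the garbage collector notices an unreachable connection is not deterministic, so the fallback signal can arrive well after the connection was actually abandoned.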

If this seems simple and straightforward, I have bad news for you. If you wish to put the idea into practical use, be warned that the exercise proved more complex than originally expected. Numerous issues reared their ugly heads as soon as I started to dig into the implementation details:

  • Naively tracking outbound sockets quickly became a performance issue due to the sheer amount of traffic sent to the downstream nodes. Tracking only the connection creation and destruction events relieves most of the pain. As creating a TCP connection is an expensive operation, most such connections are pooled and reused, so the instrumentation burden is not that high in practice.
  • Even when tracking only connection creation and closing events, the number of endpoints some applications connect to is staggering. This complexity became a real issue when trying to visualize the detected integrations. To keep this post focused, I will cover how we handled this issue in a separate blog post.
  • The captured traffic is useful only when certain meta-information can be extracted from it. There are risks involved even when collecting just the hostname, IP and protocol information for an opened socket. For instance, java.net.InetAddress.getHostName() can trigger a reverse DNS lookup if the corresponding socket address was created from an IP address rather than a hostname. Triggering additional synchronous network calls is not a wise thing to do from a performance monitoring perspective.
  • There are several implementations of Socket and SocketChannel. The instrumentation logic must be generic enough not to miss any of the relevant implementations and specific enough not to track irrelevant implementations in the class hierarchy.
  • The captured connections had to be timestamped so that the information could be used to visualize the architecture of a particular application at a particular moment in time.
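The timestamping requirement from the last bullet can be illustrated with a small sketch that stores each connection as an open/close interval and answers the question "which endpoints was this JVM connected to at time T?". The ConnectionTimeline class and its methods are hypothetical names for illustration only, and a real agent would need a more compact and bounded data structure than a plain list.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical store of timestamped connection lifetimes. */
final class ConnectionTimeline {

    /** One connection: when it was opened and, once known, when it was closed. */
    static final class Interval {
        final String endpoint;
        final Instant openedAt;
        Instant closedAt; // null while the connection is still open

        Interval(String endpoint, Instant openedAt) {
            this.endpoint = endpoint;
            this.openedAt = openedAt;
        }
    }

    private final List<Interval> intervals = new ArrayList<>();

    /** Records a connection creation event. */
    synchronized Interval opened(String endpoint, Instant at) {
        Interval interval = new Interval(endpoint, at);
        intervals.add(interval);
        return interval;
    }

    /** Records a connection close (or GC) event. */
    synchronized void closed(Interval interval, Instant at) {
        interval.closedAt = at;
    }

    /** Endpoints with at least one connection open at the given moment. */
    synchronized Set<String> endpointsActiveAt(Instant moment) {
        Set<String> result = new TreeSet<>();
        for (Interval i : intervals) {
            boolean started = !i.openedAt.isAfter(moment);
            boolean notEnded = i.closedAt == null || i.closedAt.isAfter(moment);
            if (started && notEnded) {
                result.add(i.endpoint);
            }
        }
        return result;
    }
}
```

Taking the timestamps as parameters instead of calling Instant.now() inside the class keeps the sketch easy to test with fixed points in time.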

We continue to polish the solution based on the initial feedback, but you can already benefit greatly just from understanding how your applications are actually integrated in production. Spotting unnecessary coupling between nodes, detecting cyclical dependencies and catching outright security issues, such as test environments tapping into production backends, have been the first benefits for the customers who had early access to the solution. So if you have not yet seen what your system architecture looks like in a real production deployment, go ahead, grab our trial and find out.