Making Memory Usage Transparent
Knowing how your application actually behaves in production is one of the key ingredients for running a successful digital business.
An application deployed on the Java Virtual Machine can face performance issues for a variety of reasons. There is a long list of threats which, if not monitored and taken care of, will result in poor user experience in terms of performance.
Without the evidence and root cause available to the engineering team, the quality of the service will eventually suffer.
There are plenty of solutions out there that surface this information from production deployments to business people, operations and developers. Each of them does it a bit differently and has its own strengths and weaknesses.
One such tool is our very own Plumbr, which is now equipped with a truly interesting way to get actionable insights from your production deployments with regard to memory consumption. If you can bear with me for the next 10 minutes, you will learn an interesting way to make your data structures transparent.
Why should you care?
One category of performance issues is memory-related problems. The impact of memory-related performance issues varies, for example:
- Capacity-related issues. Typically these take the form of “can we reduce our infrastructure costs”. As a simple example – if your deployments currently require Amazon EC2 t2.medium instances with 4G of memory, you are paying 2x more than if t2.small nodes with 2G of memory could do the job. If you could reduce the size of your data structures, you could end up with 2x cost savings.
- Throughput-related issues. As the JVM runs a background garbage collection process that occasionally stops the application threads for housekeeping purposes, these pauses reduce the amount of useful work the application can actually carry out. As an example – when your JVM has to do a lot of garbage collection due to high memory usage, the total duration of a batch job can start taking 20-30% longer because of GC pauses.
- Latency-related issues. The very same GC pauses directly impact the end user by increasing the duration of any operation running during the pause. When your application regularly faces GC pauses of 5 seconds, every user interacting with the application during such a pause experiences an additional 5-second delay. This hurts end-user satisfaction, resulting in lost business.
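To see how large the throughput and latency tax actually is on your own JVM, you can ask the JVM itself: the standard management beans report the cumulative time spent in garbage collection. A minimal sketch (the class name is mine, the APIs are standard java.lang.management):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// A minimal sketch: sums the cumulative GC time reported by the JVM,
// one way to estimate the throughput lost to collection pauses.
public class GcOverhead {
    // Total milliseconds spent in GC since JVM start, across all collectors.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime(); // -1 if undefined for this collector
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long gcTime = totalGcMillis();
        // Fraction of wall-clock time the JVM has spent in GC so far.
        System.out.printf("GC overhead: %.2f%%%n", 100.0 * gcTime / uptime);
    }
}
```

If that percentage is in the double digits, the throughput and latency issues described above are very likely already visible to your users.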
All of these problems originate from the poor transparency of the data structures loaded into memory during runtime. And you can reduce the infrastructure costs, increase throughput and reduce latency by optimizing the memory usage of your application.
The current approach
The way insights into the data structures in memory are usually gained is via memory dump analysis. The most commonly used tool for the job has been the Eclipse Memory Analyzer (MAT), which can analyze a heap dump taken from the JVM.
This approach poses several problems:
- The heap dump is what the name implies – a dump of everything that resided in memory at the time it was taken. More often than not, a dump from production contains sensitive information – be it credit card numbers or personal medical records – so many applications restrict access to this data. As a result, the engineers who could use the information to optimize the application simply cannot get access to the dumps, rendering the approach unusable.
- Whenever a heap dump is acquired, the application threads are stopped. This results in a lengthy pause during the application runtime, with a significant impact on latency. For large heaps (16G+), the time it takes to acquire the dump is measured in minutes.
- Taking a heap dump manually captures an effectively random point in the application's runtime. As a result, the content of the memory dump might not accurately reflect the underlying problem.
- Dominator trees and retained & shallow sizes are just a couple of the concepts you need to be familiar with in order to understand the information exposed by the Eclipse MAT. More often than not, the person responsible for the optimization task at hand does not do this regularly, so interpreting the information takes a long time.
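For reference, the heap dumps discussed above are usually taken with `jmap` or `jcmd`, but they can also be triggered programmatically through the HotSpot-specific diagnostic MBean. A sketch (HotSpot JVMs only; the `com.sun.management` API is not part of standard Java SE, and the file path here is arbitrary):

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import com.sun.management.HotSpotDiagnosticMXBean;

// Triggering a heap dump from inside the application (HotSpot-specific).
public class HeapDumper {
    public static void dumpHeap(String filePath, boolean liveObjectsOnly) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                server, "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // This call stops all application threads for the duration of the
        // dump -- the very pause the second bullet point above warns about.
        bean.dumpHeap(filePath, liveObjectsOnly);
    }

    public static void main(String[] args) throws Exception {
        dumpHeap("heap.hprof", true);
    }
}
```

Note that triggering the dump is the easy part; all four problems listed above apply regardless of how the dump was obtained.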
As a result, we are confident in claiming that the current approach is, depending on the situation, either inconvenient or outright broken.
The Plumbr way
Plumbr has long been known for its memory leak detection capabilities. As memory leaks are only a small subset of memory-related issues, we have expanded our offering and built a solution that gives our users memory transparency during runtime.
The solution is built upon taking heap snapshots after certain GC pauses. The exposed memory content looks similar to the one in the following screenshot.
As seen from the above, the top five memory consumers exposed by Plumbr occupy 8G out of the 13G of total used heap. The screenshot contains explicit details about the biggest consumer, where apparently more than 2G is being logged through Log4j appenders.
Equipped with information like the above, you can start reducing the size of the most memory-hungry data structures right away. Looking at the example, it is hard to imagine that logging 2G of data through a single logging call was intentional. Just changing the way this particular application handles logging could reduce its capacity requirements by 25%.
Based on the data captured from Plumbr deployments, 40% of the operations suffering from poor performance due to GC pauses came from JVMs suffering from memory bloat. More interestingly, the GC pauses in such JVMs take on average 30% longer to complete.
As a result, you can both reduce infrastructure costs and reduce perceived latency: thanks to the reduced memory pressure, GC pauses will be shorter and less frequent.
In addition, as you can see, Plumbr does not capture any sensitive information from the JVM. The actual contents of the heap never leave your servers, as Plumbr only captures the meta-information of the object graph. For example, we would only record that the number field of a CreditCard object occupied 16 bytes, instead of actually storing the credit card number.
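To make the idea concrete, here is an illustration of recording only meta-information – class name, field names and estimated field sizes – without ever reading the field values. This is my own sketch of the concept, not Plumbr's actual implementation, and the per-field size estimates are deliberately rough:

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Illustration only: describe an object's layout without touching its values.
public class MetaRecorder {
    // Rough per-field size estimates for a 64-bit JVM; hypothetical numbers,
    // good enough to illustrate the concept.
    static int estimatedSize(Class<?> type) {
        if (type == long.class || type == double.class) return 8;
        if (type == int.class || type == float.class) return 4;
        if (type == short.class || type == char.class) return 2;
        if (type == byte.class || type == boolean.class) return 1;
        return 8; // object reference on a 64-bit JVM without compressed oops
    }

    // Produces lines like "CreditCard.number: 8 bytes" -- the values of the
    // fields are never read, only their declared types are inspected.
    static List<String> describe(Class<?> clazz) {
        List<String> out = new ArrayList<>();
        for (Field f : clazz.getDeclaredFields()) {
            out.add(clazz.getSimpleName() + "." + f.getName()
                    + ": " + estimatedSize(f.getType()) + " bytes");
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical domain class used only for this example.
        class CreditCard { long number; int expiryMonth; }
        describe(CreditCard.class).forEach(System.out::println);
    }
}
```

The key point is that only the shape of the data leaves the JVM, never its content – which is what makes the snapshots safe to share with the engineers who need them.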
The following is another example of a situation where Plumbr exposed the data structures in memory during runtime, making it transparent that the application was heading towards an imminent crash.
The first snapshot exposed was taken three hours before the application died with the OutOfMemoryError:
As seen, the java.util.HashMap contains 70MB of data, allocated in the buildMessage() method of the com.example.DataActualizer class.
Over the following three hours, we captured two more snapshots similar to the one above, with the exception of the size of the said HashMap, which kept growing over the period. And just after three hours, the inevitable happened:
In addition to capturing snapshots on certain GC pauses, Plumbr also captures a heap snapshot on OutOfMemoryErrors, exposing the memory contents at the time of death. And as seen from the above, the same offending HashMap has now grown to 236MB, effectively consuming 98% of the available heap and killing the JVM.
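The leak pattern behind such a death is usually mundane: a map that is only ever added to. A hypothetical sketch of the pattern (the class and method names here are made up for illustration and are not the code from the snapshots above):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the leak pattern: a HashMap that is only ever
// added to, so each call with a new key grows the retained data until
// the heap is exhausted.
public class LeakyActualizer {
    // Entries accumulate here across calls and are never evicted.
    private final Map<String, String> seenMessages = new HashMap<>();

    String buildMessage(String key, String payload) {
        // Bug: put without any eviction policy -- the map can only grow.
        seenMessages.putIfAbsent(key, payload);
        return "processed " + key;
    }

    int retainedEntries() {
        return seenMessages.size();
    }

    public static void main(String[] args) {
        LeakyActualizer actualizer = new LeakyActualizer();
        for (int i = 0; i < 100_000; i++) {
            actualizer.buildMessage("msg-" + i, "payload-" + i);
        }
        // With unique keys the map never shrinks; given enough iterations
        // and a bounded heap, this ends in an OutOfMemoryError.
        System.out.println(actualizer.retainedEntries() + " entries retained");
    }
}
```

In a heap dump taken only at the time of death, this map just looks like the biggest consumer; a sequence of snapshots over time is what reveals that it is growing without bound.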
In the last month alone, we have had 104 application runs die with an OutOfMemoryError: Java heap space. For 97 of these, we exposed the same major memory consumers while the application was still alive. So besides equipping you with interesting information, this also gives you an early warning system for OutOfMemoryErrors.
If the approach described above seems interesting, I can only recommend taking Plumbr out for a test drive and seeing the results for yourself. While doing so, you might benefit from a more detailed description of the approach we have taken:
- By default, the GC pauses after which Plumbr harvests this information are the Major GC pauses after which the Old Generation consumption exceeds 50% of its maximum size. In addition, capturing is limited to at most once per two hours. Both of these parameters can be adjusted in the Threshold Settings menu under Heap Snapshot.
- When adjusting these, bear in mind that the limitations are in place to keep the performance overhead at bay. Extracting the major consumer data can extend the duration of the GC pause by several seconds (and in corner cases by several dozen). In 90% of the cases we see the added stop-the-world pause staying under 10 seconds.
- The duration of taking a snapshot is proportional to the number of references in the JVM, so your mileage may vary. We have seen JVMs with 20+ gigabytes of heap where taking a snapshot took just one second. These additional pauses are made transparent to you via a specific root cause type, “Collecting Memory Snapshot”, which you can use to verify whether the impact is too high and adjust the thresholds accordingly.
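If you want to reason about the 50% Old Generation threshold mentioned above against your own JVM, the standard memory pool beans expose the occupancy directly. A sketch (pool names differ between collectors – "PS Old Gen", "G1 Old Gen", "Tenured Gen" – hence the substring match; some collectors report no maximum at all):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Reads the current Old Generation occupancy as a fraction of its maximum.
public class OldGenCheck {
    // Returns occupancy in [0, 1], or -1 if no old-generation pool with a
    // defined maximum could be found on this JVM.
    static double oldGenOccupancy() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (name.contains("Old Gen") || name.contains("Tenured")) {
                MemoryUsage usage = pool.getUsage();
                if (usage.getMax() > 0) {
                    return (double) usage.getUsed() / usage.getMax();
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        double occupancy = oldGenOccupancy();
        if (occupancy >= 0) {
            System.out.printf("Old gen occupancy: %.1f%%%n", occupancy * 100);
        } else {
            System.out.println("No old-generation pool with a defined maximum found");
        }
    }
}
```

A JVM that routinely sits above the 50% mark after Major GC pauses is exactly the kind of deployment where the snapshots described in this post tend to surface something worth fixing.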