
How Plumbr will help you
First of all, Plumbr helps by clarifying whether there is a leak in the first place. It is quite common to misattribute an observed issue to a memory leak. A single OutOfMemoryError is not sufficient evidence, since it may have another cause, such as a sudden huge allocation. Excessive GC activity can likewise be caused by many things other than a memory leak. The most reliable indicator is Old Generation usage that keeps growing while Major GC fails to reclaim it. Plumbr exposes this on the heap usage chart of the technical details page:
In this picture we see that the application’s heap usage has been growing at an average rate of over 1 GiB per hour. The more live objects there are in the Java heap, the longer and more frequent garbage collection becomes. So, once the leak had matured, the application was brought to its knees by a series of long garbage collection pauses:
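For readers who want to check the same indicator by hand, the JVM's standard management API reports how full a memory pool was after its most recent collection. The sketch below is only an illustration and not how Plumbr gathers its data: the class and method names are made up for this example, and matching pools by the substrings "Old Gen" and "Tenured" is an assumption that fits the common HotSpot collectors but not every JVM.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper: prints how full the old generation was after its last collection.
class OldGenAfterGc {

    // Assumption: HotSpot old generation pools are named like "G1 Old Gen",
    // "PS Old Gen" or "Tenured Gen". Other JVMs and collectors may differ.
    static boolean looksLikeOldGen(String poolName) {
        return poolName.contains("Old Gen") || poolName.contains("Tenured");
    }

    // Fraction of the pool occupied after the last collection,
    // or -1 if the JVM reports no such figure or the pool has no defined maximum.
    static double occupancyAfterGc(MemoryUsage afterGc) {
        if (afterGc == null || afterGc.getMax() <= 0) {
            return -1;
        }
        return (double) afterGc.getUsed() / afterGc.getMax();
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP || !looksLikeOldGen(pool.getName())) {
                continue;
            }
            // getCollectionUsage() describes the pool right after its most recent GC.
            double occupancy = occupancyAfterGc(pool.getCollectionUsage());
            if (occupancy < 0) {
                System.out.println(pool.getName() + ": usage after GC not available");
            } else {
                System.out.printf("%s: %.1f%% used after last GC%n",
                        pool.getName(), 100 * occupancy);
            }
        }
    }
}
```

A single reading says little. It is the trend that matters: if this figure keeps climbing from one collection to the next, a leak becomes a plausible explanation.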
To help figure out why the leaked objects are being retained and not cleared, Plumbr captures memory snapshots of unhealthy applications. By default, a snapshot is captured if more than 50% of the Old Generation is in use after a Major GC pause, but no more often than once every two hours. Additionally, snapshots are captured post-mortem if an OutOfMemoryError is thrown. In this case, we captured a snapshot just before the application really started suffering:
It looks like the worker thread is failing to drain the queue. The natural outcome of this situation is an OutOfMemoryError, which indeed occurred about 15 minutes later.
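The default capture rule mentioned above (more than 50% of the Old Generation in use after a Major GC, at most one snapshot every two hours) is easy to express as a small policy object. The sketch below is not Plumbr's implementation. The class name, method name and parameters are hypothetical, and it assumes the caller supplies the post-GC usage figures and the current time.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical policy object mirroring the capture rule described in the text.
class SnapshotTrigger {
    private static final double OCCUPANCY_THRESHOLD = 0.5;            // "more than 50%"
    private static final Duration MIN_INTERVAL = Duration.ofHours(2); // "once every two hours"

    private Instant lastSnapshot; // null until the first snapshot is taken

    // Intended to be called after each Major GC with the old generation figures.
    // Returns true if a snapshot should be captured now, and records the capture time.
    boolean shouldCapture(long usedAfterGc, long capacity, Instant now) {
        if (capacity <= 0) {
            return false; // no meaningful occupancy can be computed
        }
        boolean overThreshold = (double) usedAfterGc / capacity > OCCUPANCY_THRESHOLD;
        boolean rateLimited = lastSnapshot != null
                && Duration.between(lastSnapshot, now).compareTo(MIN_INTERVAL) < 0;
        if (overThreshold && !rateLimited) {
            lastSnapshot = now;
            return true;
        }
        return false;
    }
}
```

Passing the clock reading in as a parameter keeps the rule easy to test without waiting two real hours.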
The Solution
In the situation above, fixing the issue meant making sure that the worker threads were resilient to errors. A queue consumer that stops consuming is a very common kind of issue. Generally, a snapshot that exposes the path to GC roots of the major memory consumers gives an immediate understanding of what the leak consists of. From there, preventing the leak is a very application-specific process. Here are just some typical actions that may be involved:
- Put a cap on cache or queue size
- Remove the reference to an object from a static field
- Unregister a listener on termination
- Stop a timer thread
- Explicitly release a native reference to an object
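To make the first action and the worker fix from our case more concrete, here is a minimal sketch of a queue consumer that survives failing tasks and works on a capped queue. It is not the code of the application in question. The class name ResilientWorker and its methods are invented for this example, and catching only RuntimeException (rather than every Throwable) is a deliberate assumption that you may want to revisit for your own application.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical worker: a bounded queue plus a consumer loop that outlives bad tasks.
class ResilientWorker<T> implements Runnable {
    private final BlockingQueue<T> queue;
    private final Consumer<T> handler;

    ResilientWorker(int capacity, Consumer<T> handler) {
        // The cap keeps the queue from growing without bound if consumption stalls.
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.handler = handler;
    }

    // Returns false instead of queueing when the cap is reached,
    // so the producer has to deal with the overload explicitly.
    boolean submit(T task) {
        return queue.offer(task);
    }

    int pending() {
        return queue.size();
    }

    // Runs one task; a failing task is reported but never propagates to the loop.
    void handleSafely(T task) {
        try {
            handler.accept(task);
        } catch (RuntimeException e) {
            System.err.println("Task failed, worker continues: " + e);
        }
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                handleSafely(queue.take());
            } catch (InterruptedException e) {
                // Restore the flag and exit: interruption is the shutdown signal.
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

A worker like this would typically be started with `new Thread(worker).start()` and stopped by interrupting that thread. Without the try/catch around the handler, the first unexpected exception would end the consumer thread while producers keep filling the queue, which matches the pattern seen in the snapshot above.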