Creative way to handle OutOfMemoryError
This can’t be happening. This was pretty much the only thought in my head when staring at the log files. The JVM generating those logs was getting SIGTERM signals out of nowhere and disappearing without a trace.
Let me repeat the last sentence – several times a week someone was deliberately killing an otherwise perfectly decent Java batch job.
The mystery started to unravel when I managed to get access to the rest of the data on this machine.
Apparently this Java process was not the only inhabitant on this machine. I discovered another JVM running on the very same box, hosting a small webapp deployed to a Tomcat application server. What immediately caught my eye was the correlation in the availability of those applications: whenever the problematic batch job died, Tomcat was also facing some sort of outage. But Tomcat recovered shortly after, while the batch job remained dead.
An hour later I had gone through the Tomcat log files and found another interesting pattern. Right before each Tomcat restart, the all-too-familiar java.lang.OutOfMemoryError: Java heap space was staring right into my face. So apparently Tomcat was dying due to lack of memory. But that still did not explain why the batch job was behaving the way it was.
And then I found myself staring at the following parameter in the JAVA_OPTS used to launch Tomcat:
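The exact snippet is not preserved here, but based on the behaviour described below it would have looked something along these lines (the killall command is an assumption inferred from the SIGTERMs observed, not a quote from the original configuration):

```shell
# Hypothetical reconstruction of the offending setting.
# -XX:OnOutOfMemoryError runs the given command when the JVM throws
# an OutOfMemoryError. "killall java" sends SIGTERM to EVERY java
# process on the machine - not just the one that ran out of memory.
JAVA_OPTS="-XX:OnOutOfMemoryError=\"killall java\""
```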
I was not even aware there was such an option available. But apparently you can indeed register a shell command to be executed when your JVM has run out of memory. OutOfMemoryErrors must indeed be a pain point if they have earned a dedicated flag in the JVM.
But jokes aside, the author of this solution did not take into account the possibility of several Java processes running on the same machine. And so he fired a SIGTERM at every Java process on the box. Mystery solved.
Moral of the story? If there were a world cup for engineers who hide the symptoms instead of solving the underlying problem, this one would have made it to the semi-finals. Why on earth would you think it is a good idea to deal with a memory shortage and/or memory leak in such a peculiar way?
If you belong to the ranks of engineers who always try to get down to the root cause, then subscribe to our Twitter feed for performance tuning advice.
The approach is not 100% incorrect. Yes, the application should be restarted, because after an OOM you cannot trust the data in memory – who knows which allocation failed? Not only that, but if you are dealing with a leak that gulped all the memory, the application is already performing terribly, spending most of its time in GC, so there is no reason to hold on to it. And yes, for high availability I would expect a new instance to be started once the old one is dead.
BUT the actual solution is funny: killing all Java processes on that machine 🙂 Plus, if the JVM was not configured to dump the heap on OOM, the programmer really didn’t give a damn about the forensic data that could have helped identify the problem and lead to solving the actual issue.
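For reference, preserving that forensic data is a one-flag change. A minimal sketch of the relevant JVM options (the dump path is an assumption – any writable directory works):

```shell
# -XX:+HeapDumpOnOutOfMemoryError writes a heap dump (.hprof file)
# when an OutOfMemoryError is thrown; -XX:HeapDumpPath controls where
# the dump lands. The dump can later be analysed to find the leak.
JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/tomcat/dumps"
```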
My, maybe naive, understanding is that on encountering an OOM, Tomcat should go offline without external help. If this does not happen, I would file a bug against Tomcat.
And your comment about the absence of a heap dump is totally correct.
Well, you’re totally right: any app encountering an OOM, or any other Error for that matter, should die on its own, since Errors are fatal failures in the virtual machine, mostly unrecoverable. That is, unless some programmer catches them and allows the application to continue in an unstable state (I have seen a lot of that). Even if that wasn’t the case, maybe the one who wrote the mega-kill just wanted to make sure the old app was really dead before restarting it, perhaps to make sure the ports were unbound. He/she should have just found the correct PID based on the program name and killed only that one. Maybe also collected a class histogram, which could offer a quick glimpse of the cause of the failure.
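The targeted approach the commenter describes could be sketched roughly as follows (the Tomcat main-class pattern and the output path are assumptions; it requires the JDK tools on the PATH and only makes sense against a live JVM):

```shell
#!/bin/sh
# Kill only the JVM we actually mean to kill, and keep evidence.
# Find the PID of the Tomcat JVM by matching its main class.
PID=$(pgrep -f 'org.apache.catalina.startup.Bootstrap')
if [ -n "$PID" ]; then
    # Capture a class histogram first - a quick glimpse at what filled the heap.
    jmap -histo "$PID" > "/tmp/histogram-$PID.txt"
    # SIGTERM only this process, leaving the batch job untouched.
    kill "$PID"
fi
```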