OOMkillers in Docker are more complex than you thought
One of the customers using Plumbr APM recently faced a peculiar issue in their production environment: one of the Docker containers exited with exit code 137. The setup of the environment was fairly simple and Matryoshka-like, nesting different containers and VMs:
- self-hosted hardware running Ubuntu;
- multiple Docker containers, also running Ubuntu, on that machine;
- Java Virtual Machines running inside the Docker containers.
Investigating the issue led us to the Docker documentation on the topic. Reading it made it clear that the cause of this exit code is either a manual docker stop command or an out-of-memory condition coupled with the subsequent kill signal sent by the kernel's oom-killer.
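The exit code itself already hints at the cause: Docker reports 128 plus the number of the signal that terminated the process. A quick sketch (the container name in the comment is hypothetical):

```shell
# Exit code 137 decodes as 128 + signal number; signal 9 is SIGKILL,
# which is what the oom-killer sends. A graceful `docker stop` sends
# SIGTERM first, which would show up as 143 (128 + 15) instead.
sig=$((137 - 128))
echo "killed by signal $sig (SIGKILL)"

# To check whether Docker itself recorded an OOM kill:
#   docker inspect --format '{{.State.OOMKilled}}' my-container
```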
Grepping through the syslog confirmed that the oom-killer of the host OS kernel was indeed being triggered:
[138805.608851] java invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[138805.608887] [<ffffffff8116d20e>] oom_kill_process+0x24e/0x3b0
[138805.608902] [<ffffffff8116da84>] pagefault_out_of_memory+0x14/0x90
[138805.608916] Task in /docker/264b771811d88f4dbd3249a54261f224a69ebffed6c7f76c7aa3bc83b3aaaf71 killed as a result of limit of /docker/264b771811d88f4dbd3249a54261f224a69ebffed6c7f76c7aa3bc83b3aaaf71
[138805.608918] memory: usage 3140120kB, limit 3145728kB, failcnt 616038
[138805.608940] memory+swap: usage 6291456kB, limit 6291456kB, failcnt 2837
[138805.609043] Memory cgroup out of memory: Kill process 20611 (java) score 1068 or sacrifice child
As can be seen from this log excerpt, the Java process was approaching a limit of 3145728kB (around 3GB), triggering the termination of the Docker container. This was peculiar, because the container itself had been launched with a 4GB limit in the docker-compose file.
As you likely know, the JVM also limits its memory usage. While the Docker container was configured with a 4GB limit in the docker-compose file, the JVM was started with -Xmx3g. This might create additional confusion, but note that the memory usage of the JVM at hand can exceed the limit specified by -Xmx, as described in one of our previous posts analyzing JVM memory usage.
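The reason is that -Xmx only caps the heap; metaspace, thread stacks, GC and JIT structures, and other native allocations come on top. A rough back-of-the-envelope sketch (the 25% headroom figure is an assumption for illustration, not a documented rule):

```shell
# Total JVM footprint ≈ heap (-Xmx) + metaspace + thread stacks + GC/JIT
# structures + other native memory. Budgeting ~25% headroom above -Xmx
# is an illustrative rule of thumb, not a guarantee.
xmx_mb=3072                       # -Xmx3g, as in this setup
overhead_mb=$((xmx_mb / 4))       # assumed 25% non-heap overhead
budget_mb=$((xmx_mb + overhead_mb))
echo "budget at least ${budget_mb} MB of container memory for this JVM"
```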
Understanding all this still left us confused. The Docker container should be allowed 4GB of memory, so why on earth was it OOM-killed already at around 3GB? Some more googling revealed that there is one more memory limit involved, enforced by the OS deployed directly on the hardware.
Say hello to cgroups. cgroups (a.k.a. control groups) is a Linux kernel feature for limiting, policing and accounting the resource usage of a set of processes. Compared to other approaches (the 'nice' command or /etc/security/limits.conf), cgroups offer more flexibility, as they can operate on (sub)sets of processes.
In our situation, cgroups limited the memory usage (via memory.limit_in_bytes) to 3GB. Now we were getting somewhere!
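You can inspect this limit yourself. On a cgroup v1 host (which is what the kernel log above suggests), Docker places each container under its own memory cgroup, and the conversion from the log's kB figure confirms it matches:

```shell
# On a cgroup v1 host, a container's limit lives at (path is the v1 layout;
# the container id placeholder must be filled in):
#   cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes

# The limit seen in the oom-killer log, converted from kB:
limit_kb=3145728
limit_gb=$((limit_kb / 1024 / 1024))
echo "cgroup memory limit: ${limit_gb} GB"
```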
Inspecting memory utilization and GC events with Plumbr revealed that most of the time the memory utilization of the JVM running inside the Docker container was around 700MB. The only exceptions occurred just before the termination events, when memory allocation spiked. The spike was followed by a lengthy GC pause. So what seemed to be going on was:
- Java code running inside the JVM was trying to allocate a lot of memory.
- The JVM, after checking that there was plenty of memory to use below the 3GB -Xmx limit, asked for more memory from the operating system.
- Docker verified that there was plenty of memory to use below its 4GB limit and did not enforce any limits.
- The OS kernel detected that the cgroup memory limit of 3GB was being approached and killed the Docker container.
- The JVM process was killed along with the Docker container before it could perform its own OutOfMemoryError procedures.
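The sequence above can be reduced to simple arithmetic. The figures below are illustrative, not measured values from this incident; the point is that the JVM's total footprint can slip between the -Xmx and Docker limits while still tripping the cgroup limit:

```shell
# All sizes in MB; heap stays under -Xmx, but native memory pushes the
# total past the cgroup limit while remaining under the Docker limit.
heap=2800; native=400
xmx=3072; docker_limit=4096; cgroup_limit=3072

total=$((heap + native))
[ "$heap"  -le "$xmx" ]          && echo "JVM: heap below -Xmx, no OutOfMemoryError thrown"
[ "$total" -le "$docker_limit" ] && echo "Docker: total usage below its 4GB limit"
[ "$total" -gt "$cgroup_limit" ] && echo "cgroup: 3GB limit exceeded -> oom-killer sends SIGKILL"
```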
Understanding this, we tuned the memory limits for all the involved components: 2.5GB for Docker and 1.5GB for the JVM. After this was done, the JVM was able to conduct its OutOfMemoryError operations and throw its OutOfMemoryError. This enabled Plumbr to finally do its magic: capture the memory snapshot with the relevant stack dumps and promptly expose that there was a single database query that, in a certain situation, tried to download almost the whole database.
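The key property of the tuned configuration is the generous gap between the heap limit and the container limit, so the JVM hits its own -Xmx wall (and throws OutOfMemoryError) long before the kernel steps in. A sketch of the aligned limits (the docker run line is a hypothetical equivalent; the fix in this case was done in docker-compose):

```shell
# The tuned limits, leaving the JVM room to fail gracefully:
docker_mb=2560   # 2.5GB Docker limit
xmx_mb=1536      # 1.5GB JVM heap (-Xmx1536m)
headroom=$((docker_mb - xmx_mb))
echo "headroom above the heap: ${headroom} MB"

# Hypothetical equivalent invocation:
#   docker run --memory=2560m my-java-app java -Xmx1536m -jar app.jar
```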
Even in such a simple deployment, three different memory limits were involved:
- the JVM, via the -Xmx parameter;
- Docker, via the docker-compose parameter;
- the OS, via the memory.limit_in_bytes cgroups parameter.
So, whenever you face a process getting killed by the OOM killer, you need to pay attention to the memory limits of all the involved controls.
Another takeaway is for Docker developers. It does not seem to make sense to allow such "Matryoshkas" to be launched where the nested Docker container's memory limit is set higher than the cgroups memory limit. A simple if statement checking this on container startup, coupled with an appropriate warning message, could save hundreds of debugging hours for your users.
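Such a check could be as simple as the sketch below. The values are hardcoded for illustration; a real implementation would read the effective limit from the host's memory.limit_in_bytes:

```shell
# Sketch of the suggested startup sanity check (values hardcoded here;
# on a real host cgroup_mb would come from memory.limit_in_bytes).
requested_mb=4096   # what the user asked for in docker-compose
cgroup_mb=3072      # what the host cgroup will actually enforce

if [ "$requested_mb" -gt "$cgroup_mb" ]; then
  warning="WARNING: requested ${requested_mb}MB exceeds host cgroup limit of ${cgroup_mb}MB"
  echo "$warning"
fi
```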