Garbage Collection: increasing the throughput
The inspiration for this post came after stumbling upon the “Pig in the Python” definition in the memory management glossary. Apparently, the term is used to describe the situation where the GC repeatedly promotes large objects from generation to generation. The effect of doing so is supposedly similar to that of a python swallowing its prey whole, only to become immobilised during digestion.
For the next 24 hours I just could not get the picture of choking pythons out of my head. As the psychiatrists say, the best way to let go of your fears is to speak about them. So here we go. But instead of the pythons, the rest of the story will be about garbage collection tuning. I promise.
Garbage Collection pauses are well known for their potential to become a performance bottleneck. Modern JVMs do ship with advanced garbage collectors, but as I have experienced, finding the optimal configuration for a particular application is still darn difficult. To even stand a chance of tackling the issue manually, one needs to understand the exact mechanics of the garbage collection algorithms. This post might help you in that regard, as I am going to use an example to demonstrate how small changes in JVM configuration can affect the throughput of your application.
Example
The application we use to demonstrate the GC impact on throughput is a simple one. It consists of just two threads:
- PigEater – simulating a situation where the python keeps eating one pig after another. The code achieves this by adding 32MB worth of bytes into a java.util.List and sleeping 100ms after each attempt.
- PigDigester – simulating an asynchronous digesting process. The code implements digestion by simply dropping the reference to that list of pigs. As this is a rather tiring process, this thread sleeps for 2000ms after each reference cleaning.
Both threads will run in a while loop, continuing to eat and digest until the snake is full. This happens at around 5,000 pigs eaten.
package eu.plumbr.demo;

import java.util.ArrayList;
import java.util.List;

public class PigInThePython {
  static volatile List<byte[]> pigs = new ArrayList<>();
  static volatile int pigsEaten = 0;
  static final int ENOUGH_PIGS = 5000;

  public static void main(String[] args) throws InterruptedException {
    new PigEater().start();
    new PigDigester().start();
  }

  static class PigEater extends Thread {
    @Override
    public void run() {
      while (true) {
        pigs.add(new byte[32 * 1024 * 1024]); // 32MB per pig
        if (pigsEaten > ENOUGH_PIGS) return;
        takeANap(100);
      }
    }
  }

  static class PigDigester extends Thread {
    @Override
    public void run() {
      long start = System.currentTimeMillis();
      while (true) {
        takeANap(2000); // digesting is tiring, so rest between rounds
        pigsEaten += pigs.size();
        pigs = new ArrayList<>(); // drop the references, turning the eaten pigs into garbage
        if (pigsEaten > ENOUGH_PIGS) {
          System.out.format("Digested %d pigs in %d ms.%n", pigsEaten, System.currentTimeMillis() - start);
          return;
        }
      }
    }
  }

  static void takeANap(int ms) {
    try {
      Thread.sleep(ms);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
Now let's define the throughput of this system as the “number of pigs digested per second”. Taking into account that a pig is stuffed into the python every 100ms, we see that the theoretical maximum throughput this system can reach is 10 pigs/second.
Configuring the GC example
Let's see how the system behaves using two different configurations. In both cases, the application was run on a dual-core Mac (OS X 10.9.3) with 8GB of physical memory.
First configuration:
- 4G of heap (-Xms4g -Xmx4g)
- Using CMS to clean the old generation (-XX:+UseConcMarkSweepGC) and the parallel collector to clean the young generation (-XX:+UseParNewGC)
- Has allocated 12.5% of the heap (-Xmn512m) to the young generation, further restricting the Eden and Survivor spaces to be equally sized.
The second configuration is a bit different:
- 2G of heap (-Xms2g -Xmx2g)
- Using the Parallel GC to conduct garbage collection in both the young and tenured generations (-XX:+UseParallelGC)
- Has allocated 75% of the heap to young generation (-Xmn1536m)
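For completeness, here is roughly what the full launch commands would look like, assuming the class above is compiled and on the classpath. Note that the original flag used to make Eden and the Survivor spaces equally sized is not shown; -XX:SurvivorRatio=1 is my guess, as it would produce the layout visible in the jstat output further below.

java -Xms4g -Xmx4g -Xmn512m -XX:SurvivorRatio=1 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC eu.plumbr.demo.PigInThePython

java -Xms2g -Xmx2g -Xmn1536m -XX:+UseParallelGC eu.plumbr.demo.PigInThePython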
Now it is time to place your bets: which of the configurations performed better in terms of throughput (pigs digested per second, remember)? Those of you putting your money on the first configuration, I must disappoint you. The results are exactly reversed:
- First configuration (large heap, large old space, CMS GC) is capable of digesting 8.2 pigs/second
- Second configuration (2x smaller heap, large young space, Parallel GC) is capable of digesting 9.2 pigs/second
Now, let me put the results in perspective. By allocating 2x fewer resources (memory-wise) we achieved 12% better throughput. This is so contrary to common knowledge that it warrants further clarification of what was actually happening.
Interpreting the GC results
The reason for what you see here is not too complex, and the answer is staring right at you once you take a closer look at what the GC is doing during the test run. For this you can use the tool of your choice; I peeked under the hood with the help of jstat, running it similar to the following:
jstat -gc -t -h20 PID 1s
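As a reminder, the columns that matter here are the cumulative ones: YGC and FGC count the young and full collections so far, YGCT and FGCT hold the time spent in them, and GCT is the total garbage collection time in seconds (the size and usage columns are in KB). If you prefer full GC logs over periodic sampling, adding flags along the lines of -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log (valid for the JDK 7/8 era JVMs used in this post) to the launch command would give the same information in more detail.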
Looking at the data, I noticed that the first configuration went through 1,129 garbage collection cycles (YGC+FGC), which in total took 63.723 seconds:
Timestamp S0C S1C S0U S1U EC EU OC OU PC PU YGC YGCT FGC FGCT GCT
594.0 174720.0 174720.0 163844.1 0.0 174848.0 131074.1 3670016.0 2621693.5 21248.0 2580.9 1006 63.182 116 0.236 63.419
595.0 174720.0 174720.0 163842.1 0.0 174848.0 65538.0 3670016.0 3047677.9 21248.0 2580.9 1008 63.310 117 0.236 63.546
596.1 174720.0 174720.0 98308.0 163842.1 174848.0 163844.2 3670016.0 491772.9 21248.0 2580.9 1010 63.354 118 0.240 63.595
597.0 174720.0 174720.0 0.0 163840.1 174848.0 131074.1 3670016.0 688380.1 21248.0 2580.9 1011 63.482 118 0.240 63.723
The second configuration paused a total of 168 times (YGC+FGC) for just 11.409 seconds:
Timestamp S0C S1C S0U S1U EC EU OC OU PC PU YGC YGCT FGC FGCT GCT
539.3 164352.0 164352.0 0.0 0.0 1211904.0 98306.0 524288.0 164352.2 21504.0 2579.2 27 2.969 141 8.441 11.409
540.3 164352.0 164352.0 0.0 0.0 1211904.0 425986.2 524288.0 164352.2 21504.0 2579.2 27 2.969 141 8.441 11.409
541.4 164352.0 164352.0 0.0 0.0 1211904.0 720900.4 524288.0 164352.2 21504.0 2579.2 27 2.969 141 8.441 11.409
542.3 164352.0 164352.0 0.0 0.0 1211904.0 1015812.6 524288.0 164352.2 21504.0 2579.2 27 2.969 141 8.441 11.409
Consider that the work to be carried out was equivalent in both cases: with no long-living objects in sight, the duty of the GC in this pig-eating exercise is just to get rid of everything as fast as possible. Yet under the first configuration, the GC is forced to run ~6.7x more often, resulting in ~5.6x longer total pause times.
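A rough back-of-the-envelope calculation (my own, not from the original post) illustrates the difference: PigEater allocates a 32MB array roughly every 100ms, i.e. about 320MB per second. In the first configuration Eden is only about 170MB (see the EC column above), so it fills up about every half a second, which lines up nicely with the roughly 1,000 young collections logged over the ~600-second run. The second configuration's ~1.2GB Eden simply swallows many more pigs between collections.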
So the story fulfilled two purposes. First and foremost, I hope I got the picture of a choking python out of my head. Another, more significant take-away is that tuning GC is a tricky exercise at best, requiring a deep understanding of several underlying concepts. Even with the truly trivial application used in this blog post, the results can have a significant impact on your throughput and capacity planning. In real-world applications the differences are even more staggering. So the choice is yours: you can either master the concepts, or focus on your day-to-day work and let Plumbr find a suitable GC configuration according to your needs.
Anyhow, hopefully I managed to simplify some complex concepts via the example. If you wish to stay tuned for future posts, subscribe to our Twitter feed and get our content about Java performance tuning.
Comments
Could you write more about why it behaves like this? Is it due to the small young generation allocated in the first example, so that regardless of its algorithm the GC basically works all the time?
or… you can use G1 and let it tune itself….
With the parallel collector settings you had above, I got:
Digested 5001 pigs in 538420 ms. – 9.3 per second for parallel collector
with java 1.8_20 and the very minimal G1 settings of:
-Xms4G -Xmx4G -XX:+UseG1GC
Digested 5019 pigs in 526185 ms. = 9.53 pigs per second
Truth be told, we once reached out to the community and asked if somebody had good experience with the G1 collector in a performance-critical setting. The answers were… discouraging.
Perhaps you should have reached out to the actual GC tuning users list?
I personally have tuned 6 cache servers with 96GB heaps that have been running for over a year in production with zero full GCs, 99.97% throughput, max pause time under 300ms, and 99% of pauses under 150ms.
G1 is an absolute dream. Once you tune it, it will intelligently adapt to different allocation rates, something that CMS can never do.
Monica Beckwith has several excellent articles about G1 tuning.