Should you trust the default settings in the JVM?
JVMs are considered smart nowadays. Not much configuration is expected – just set the maximum heap to use in the startup scripts and you should be good to go. All other default settings are just fine. Or so some of us mistakenly thought. Actually there is a lot going on during runtime which cannot be automatically adjusted for performance, so I am going to walk you through what to tweak and when, using a case study I recently faced.
But before jumping to the case itself, some background about the JVM internals being covered. Everything that follows applies to Oracle Hotspot 7. Other vendors or older releases of the Hotspot JVM most likely ship with different defaults.
JVM default options
First stop: the JVM tries to determine whether it is running in a server or a client environment. It does so by looking at the architecture and OS combination. In short:
|Architecture||CPU and RAM||OS||Default VM|
|32-bit SPARC||2+ cores & > 2GB RAM||Solaris||Server|
|32-bit SPARC||1 core or < 2GB RAM||Solaris||Client|
|i586||2+ cores & > 2GB RAM||Linux or Solaris||Server|
|i586||1 core or < 2GB RAM||Linux or Solaris||Client|
As an example – if you are running on an Amazon EC2 m1.medium instance on 32-bit Linux, you would be considered to be running on a client machine by default.
This is important because the JVM optimizes completely differently on client and on server – on client machines it tries to reduce startup time and skips some optimizations during startup. In server environments some startup time is sacrificed to achieve higher throughput later.
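The rule in the table can be sketched in a few lines of Java. This is only an illustration of the cores-and-memory check described above, not the JVM's actual detection code; the class and method names are made up for the example. The `java.vm.name` system property shows which VM you really ended up with.

```java
// Illustrative sketch of the "server-class" rule from the table above.
// ServerClassCheck and looksServerClass are hypothetical names.
class ServerClassCheck {
    static final long TWO_GB = 2L * 1024 * 1024 * 1024;

    /** True when the machine has 2+ cores and more than 2GB of RAM. */
    static boolean looksServerClass(int cores, long ramBytes) {
        return cores >= 2 && ramBytes > TWO_GB;
    }

    public static void main(String[] args) {
        // m1.medium: one virtual core, 3.75GB of memory
        long m1MediumRam = (long) (3.75 * 1024 * 1024 * 1024);
        System.out.println("m1.medium server-class? "
                + looksServerClass(1, m1MediumRam));

        // What the JVM we are running in actually sees and chose
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("Cores visible to this JVM: " + cores);
        System.out.println("VM in use: " + System.getProperty("java.vm.name"));
    }
}
```

On a Hotspot JVM the last line typically contains either "Server VM" or "Client VM", which is a quick way to confirm what the defaults picked.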
Second set of defaults: heap sizing. If your environment is considered to be a server according to the previous guidelines, the initial heap allocated will be 1/64 of the memory available on the machine. On a 4GB machine, this means that your initial heap size will be 64MB. If running on extremely low memory conditions (<1GB) it can be smaller, but in this case I would seriously doubt you are doing anything reasonable. Have not seen a server in this millennium with less than a gig of memory. And if you have, I’ll remind you that a GB of DDR costs less than $20 nowadays …
But this will be the initial heap size. The maximum heap size will be the smaller of ¼ of your total memory available or 1GB. So on a 1.7GB Amazon EC2 m1.small instance the maximum heap size available for the JVM would be approximately 435MB.
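To make the arithmetic concrete, here is a small sketch of the two sizing rules as stated above (1/64 of memory for the initial heap, the smaller of ¼ of memory or 1GB for the maximum). The class and method names are hypothetical, and 1.7GB is rounded to 1740MB for the example.

```java
// Sketch of the default heap sizing rules described in the post.
// DefaultHeapSizes and its methods are illustrative names only.
class DefaultHeapSizes {
    /** Initial heap: 1/64 of physical memory. */
    static long initialHeapMb(long physicalMb) {
        return physicalMb / 64;
    }

    /** Maximum heap: the smaller of 1/4 of physical memory or 1GB. */
    static long maxHeapMb(long physicalMb) {
        return Math.min(physicalMb / 4, 1024);
    }

    public static void main(String[] args) {
        System.out.println("4GB machine, initial heap: " + initialHeapMb(4096) + "MB");
        System.out.println("m1.small (about 1740MB), max heap: " + maxHeapMb(1740) + "MB");

        // The limit the running JVM actually applied, for comparison
        long actualMaxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap of this JVM: " + actualMaxMb + "MB");
    }
}
```

Comparing the computed value with `Runtime.getRuntime().maxMemory()` on your own machine is an easy way to see whether the rule holds for your JVM version.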
Next along the line: default garbage collector used. If you are considered to be running on a client JVM, the default applied by the JVM would be Serial GC (-XX:+UseSerialGC). On server-class machines (again, see the first section) the default would be Parallel GC (-XX:+UseParallelGC).
There is a lot more going on with the defaults, such as PermGen sizing, different generation tweaks, GC pause limits, etc. But in order to keep the size of the post under control, let’s just stick with the aforementioned configurations. For the curious ones – you can read further about the defaults from the following materials:
Now let’s see how our case study behaves, and whether we should trust the JVM with the decisions or jump in ourselves.
Our application at hand was an issue tracker. Namely JIRA. Which is a web application with a relational database in the back-end. Deployed on Tomcat. Behaving badly in one of our client environments. And not because of any leaks but due to various configuration issues in the deployment. This misbehaving configuration resulted in significantly worse throughput and latency due to extremely long-running GC pauses. We managed to help out the customer, but for privacy’s sake we are not going to cover the exact details here. But the case was good, so we went ahead and downloaded JIRA ourselves to demonstrate some of the concepts we discovered from this real-world case study.
What is extremely nice from Atlassian is that the guys have got some nicely packaged load tests shipping with it. So we had a benchmark to use for our configuration.
We carefully unboxed our newly acquired JIRA and installed it on a 64-bit Linux Amazon EC2 m1.medium instance. And ran the bundled tests. Without changing anything in the defaults. Which were set by the Atlassian team to -Xms256m -Xmx768m -XX:MaxPermSize=256m
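Before measuring anything it is worth confirming which options the JVM was really started with, since startup scripts can be layered. A minimal sketch using the standard `RuntimeMXBean`; the `findFlag` helper and the class name are made up for this example.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Sketch: list the JVM options this process was launched with.
// StartupFlags and findFlag are hypothetical names.
class StartupFlags {
    /** Returns the first argument starting with the prefix, or null. */
    static String findFlag(List<String> jvmArgs, String prefix) {
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                return arg;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("All JVM arguments: " + jvmArgs);
        System.out.println("Initial heap flag: " + findFlag(jvmArgs, "-Xms"));
        System.out.println("Max heap flag:     " + findFlag(jvmArgs, "-Xmx"));
        System.out.println("MaxPermSize flag:  " + findFlag(jvmArgs, "-XX:MaxPermSize"));
    }
}
```

A `null` result simply means the flag was not passed explicitly and the JVM default applies.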
During each run we collected GC logs using -XX:+PrintGCTimeStamps -Xloggc:/tmp/gc.log -XX:+PrintGCDetails and analyzed these statistics with the help of GCViewer.
The results were not too bad actually. We ran the tests for an hour, and out of this we lost just 151 seconds to garbage collection pauses. Or 4.2% of the total runtime. And the single worst-case GC pause was 2 seconds. So GC pauses were affecting both throughput and latency of this particular application. But not too much. But enough to serve as the baseline for this case study – at our real-world customer the GC pauses were spanning up to 25 seconds.
Digging into the GC logs surfaced an immediate problem. Most of the Full GC runs were caused by the PermGen size expanding over time. Logs demonstrated that in total around 155MB of PermGen was used during the tests. So we increased the initial size of the PermGen to a bit more than actually used by adding -XX:PermSize=170m to the startup scripts. This decreased the total accumulated pauses from 151 seconds to 134 seconds. And decreased the maximal latency from 2,000ms to 1,300ms.
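The 4.2% figure is simply pause time divided by total runtime. The sketch below reproduces that calculation and, as a rough in-process alternative to log analysis, sums the collection time reported by the standard `GarbageCollectorMXBean`s. Note the assumption: the MXBean numbers are approximate accumulated collection time and will not necessarily match pause totals derived from GC logs. Class and method names are made up.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: GC overhead as a percentage of runtime.
// GcOverhead and overheadPercent are hypothetical names.
class GcOverhead {
    static double overheadPercent(long gcMillis, long totalMillis) {
        return 100.0 * gcMillis / totalMillis;
    }

    public static void main(String[] args) {
        // Baseline from the post: 151 seconds of pauses in a one-hour run
        System.out.printf("Baseline overhead: %.1f%%%n",
                overheadPercent(151_000, 3_600_000));

        // Rough equivalent for the JVM this code runs in
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                gcMillis += t;
            }
        }
        long uptime = Math.max(1, ManagementFactory.getRuntimeMXBean().getUptime());
        System.out.printf("This JVM so far: %dms GC in %dms uptime (%.2f%%)%n",
                gcMillis, uptime, overheadPercent(gcMillis, uptime));
    }
}
```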
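If you want to see how much PermGen your own application uses without reading GC logs, the memory pool MXBeans expose it. This is a sketch under one assumption worth stating: the pool names matched here ("Perm Gen" on Hotspot 7, "Metaspace" on newer releases where PermGen no longer exists) are what Hotspot reports, and other JVMs may use different names. The class and helper names are made up.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Sketch: report peak usage of the class metadata pool.
// PermGenUsage and isClassMetadataPool are hypothetical names.
class PermGenUsage {
    /** Assumes Hotspot pool naming: "... Perm Gen" or "Metaspace". */
    static boolean isClassMetadataPool(String poolName) {
        return poolName.contains("Perm Gen") || poolName.contains("Metaspace");
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (!isClassMetadataPool(pool.getName())) {
                continue;
            }
            MemoryUsage peak = pool.getPeakUsage();
            if (peak == null) {
                continue; // pool is no longer valid
            }
            System.out.printf("%s: peak used %dMB, committed %dMB%n",
                    pool.getName(), peak.getUsed() >> 20, peak.getCommitted() >> 20);
        }
    }
}
```

The peak value after a representative load test is a reasonable starting point for -XX:PermSize, with some headroom added as we did above.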
Then we discovered something completely unexpected. The GC used by our JVM was in fact Serial GC. Which, if you have carefully followed our post, should not be the case – 64-bit Linux machines should always be considered server-class machines and the GC used should be Parallel GC. But apparently this is not the case. Our best guess at this point was that even though the JVM launches in server mode, it still selects the GC based on the memory and cores available. And as this m1.medium instance has 3.75GB of memory but only one virtual core, the GC chosen is still serial. But if any of you have more insights on the topic, we are eager to find out more.
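A quick way to check which collector a running JVM actually picked is to look at the names of its garbage collector MXBeans. The mapping below relies on the bean names Hotspot uses (for example "Copy" and "MarkSweepCompact" for Serial GC, "PS Scavenge" and "PS MarkSweep" for Parallel GC); treat it as an assumption that may not hold on other JVMs. The class and method names are made up.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Sketch: guess the active collector from Hotspot's GC bean names.
// DetectCollector and describe are hypothetical names.
class DetectCollector {
    static String describe(List<String> beanNames) {
        if (beanNames.contains("ConcurrentMarkSweep")) {
            return "CMS";
        }
        for (String name : beanNames) {
            if (name.startsWith("G1")) {
                return "G1";
            }
        }
        for (String name : beanNames) {
            if (name.startsWith("PS ")) {
                return "Parallel";
            }
        }
        if (beanNames.contains("Copy") || beanNames.contains("MarkSweepCompact")) {
            return "Serial";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<String>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        System.out.println("GC beans: " + names);
        System.out.println("Collector in use: " + describe(names));
    }
}
```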
Nevertheless we changed the algorithm to -XX:+UseParallelGC and re-ran the tests. Results – accumulated pauses decreased further to 92 seconds. Worst-case latency was also reduced to 1,200ms.
For the final test we tried out Concurrent Mark and Sweep mode. But this algorithm failed completely on us – accumulated pauses increased to 300 seconds and worst-case latency to more than 5,000ms. Here we gave up and decided to call it a night.
So just by playing with two JVM startup parameters and spending a few hours on configuration and interpretation of the results, we had improved both the throughput and the latency of the application. The absolute numbers might not sound too impressive – GC pauses reducing from 151 seconds to 92 seconds and worst-case latency from 2,000ms to 1,200ms – but let’s bear in mind this was just a small test with only two configuration settings. And looking from the % point of view – hey, we have reduced both the accumulated GC pause time and the worst-case latency by roughly 40%!
In any case – we now have one more case to show you that performance tuning is all about setting the goals, measuring, tuning and measuring again. And maybe you are just as lucky as we were and can make your users 40% happier by just changing two configuration options …
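For the record, the percentages can be checked with a couple of lines. The class and method names are made up for the example; the inputs are the measurements quoted above.

```java
// Sketch: relative reduction between a baseline and a tuned measurement.
// Improvement and reductionPercent are hypothetical names.
class Improvement {
    static double reductionPercent(double before, double after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        // Accumulated GC pauses: 151s down to 92s
        System.out.printf("Pause time reduced by %.1f%%%n", reductionPercent(151, 92));
        // Worst-case pause: 2,000ms down to 1,200ms
        System.out.printf("Worst-case latency reduced by %.1f%%%n", reductionPercent(2000, 1200));
    }
}
```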
Nice article. Just one more tip: by default ParallelGC is parallel on the young gen and serial on the old gen. You should take a look at the -XX:+UseParallelOldGC option. On machines with more than one CPU you would experience a boost in performance.
If latency is your goal, the Parallel collector is NOT a suitable collector.
CMS (with some tuning) or G1 are much better fits.
CMS is a real PITA to tune – but for starters, you could try:
-Xmx2g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=65 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSWaitDuration=300000 -XX:+CMSScavengeBeforeRemark
(basically a bunch of dark arts settings to try to stack the deck in favor of CMS being able to collect all the old gen stuff in the faster parts of the concurrent phase so your app never pauses… but in your case of the single core the concurrent phase will impact your performance since it can’t run on a different core than the rest of the stuff…)
Running any GC (even parallel) on a single core machine is going to be painful. CMS really does well when you have multiple cores to use because it makes use of them for the concurrent phase.
Also, when you go to CMS you ABSOLUTELY MUST give it more heap space – I’d up the Xmx to 2G if you can because CMS will behave very poorly if you don’t have enough overhead and it gets fragmented – and then has to do a stop-the-world compacting collection.
I’d be very surprised if you couldn’t get the worst-case pause time down below 100ms with CMS (although it probably depends upon the Amazon instance you are on, etc).
I absolutely agree with your comment. CMS was left out of this article on purpose, to keep its size reasonable. I made a couple of quick tests with CMS on a c1.medium instance, but didn’t get enough time to fine-tune it. CMS without tuning gave a lot of promotion failures and terrible performance as a result. And when I tried to tweak just OccupancyFraction, young GC became too frequent. At that point I had to stop my experiments, as the client was satisfied and my time was over. But I still plan to do that.
There is some misinformation in this article. The “default” on Windows for a 64-bit JVM is “server”. In fact, on Windows, the only way you can get a “client” JVM is if you install a 32-bit JRE. I believe even the 32-bit JDK will default to running as “server”.
Thanks for notifying, fixed