Common benchmarking pitfalls
Every once in awhile each and every one of us is trying to measure performance of certain parts of the code. It can be either a fully functional application measured against specific number of user transactions per minute or a microbenchmark. Which you have written to prove that your fellow developer is wrong.
Especially in the latter case it is beneficial to be aware of the pitfalls skewing your benchmark results. Looking into the benchmarks I have created ~seven years ago, I have to admit I have been proving my point on several cases I was actually incorrect. My apologies, dear friends. But I have learned something along the way and think it is worth sharing. So that you would not end up in the same stupid situation.
Pitfall 1: noisy neighbours. When running a benchmark in your own PC, you would not launch a DVD-RIP next to it. But what about the torrents you forgot in the background? Or Windows pulling the next update? Problems like this have skewed my test results several times.
So one should conduct benchmarking only on server. But be aware that it can become even more tricky with all the virtualized platforms out there. When you launch your next virtual instance for a test drive, there is a chance that you are sharing the same physical hardware with somebody else. And now if the resources are either tough to virtualize (such as I/O) or you have elasticity issues, such as short intense bursts in CPU followed by hypervisor-enforced throttling – you have a problem. At least when trying to achieve predictable outcome on your benchmark.
Pitfall 2: garbage collection pauses. Measuring the performance on a garbage collected environment has to take the garbage collection pauses into account as well. So – whenever you are running a benchmark, it is recommended that you turn on the garbage collector logging via -XX:+PrintGCDetails. Now when monitoring the logs:
0.415: [GC 0.415: [ParNew: 19136K->2112K(19136K), 0.0255607 secs] 100164K->100154K(118108K), 0.0256139 secs] [Times: user=0.04 sys=0.01, real=0.03 secs] 0.451: [GC 0.451: [ParNew: 19136K->19136K(19136K), 0.0000286 secs]0.451: [CMS0.461: [CMS-concurrent-mark: 0.033/0.099 secs] [Times: user=0.12 sys=0.04, real=0.10 secs] 0.580: [Full GC 0.580: [CMS: 107774K->107774K(107776K), 0.0553739 secs] 126910K->126910K(126912K), [CMS Perm : 4620K->4620K(21248K)], 0.0554400 secs] [Times: user=0.05 sys=0.00, real=0.05 secs] 0.635: [Full GC 0.635: [CMS: 107774K->107775K(107776K), 0.0919800 secs] 126910K->126872K(126912K), [CMS Perm : 4620K->4609K(21248K)], 0.0920382 secs] [Times: user=0.09 sys=0.00, real=0.09 secs]
you can estimate the impact on your benchmark. As we can see from the sample above, the program has been running just for 635ms and out of this time we have spent 270ms (42.5%) waiting for the GC to do its job. Now, I am not saying here that your tests should be configured to avoid GC pauses. I am just warning that in cases where you are testing for example the behaviour of large data structures you should watch out that your tests will not become GC-bound at some point.
Pitfall 3: Just In Time compilation. Modern JVM’s are amazing in the sense that the JVM optimizes the code during runtime. So in most cases you would not like to jump into performance-related conclusions before JIT compilation has finished its duties. To be aware what JIT is doing during your benchmark you would need to add -XX:+PrintCompilation to your JVM arguments and monitor the logs:
240 org.codehaus.groovy.reflection.CachedMethod::compareToMethod (125 bytes) 241 java.util.zip.ZipFile::access$300 (5 bytes) 242 java.util.ArrayList::indexOf (67 bytes) 243 java.util.Arrays::fill (28 bytes) 72 made zombie sun.misc.URLClassPath::getLoader (136 bytes)
On most cases, the trick here is to wait until the JIT has finished it’s optimization tasks and switch on the monitoring from that point. For example, as seen from our last blog post, the results can vary up to 100x performance-wise.
Pitfall 4: Classloaders. Be aware of the classloading effects. JVM is a lazy bastard and will not load classes before he really needs them. For the benchmarks it means that you should not turn on your measurements before the classloading has finished. For microbenchmarks it is oftentimes not too difficult – having a warmup phase is all you are going to need. But when benchmarking things in real world you might wish to switch on -verbose:class flag to check whether all classes were loaded before your measurements were turned on.
Pitfall 5: Machine Architecture. If you are benchmarking something you are planning to use in production code you should run the test on the same architecture your production environment is using. Otherwise you could get completely skewed results – either due to the 32/64 bit difference, GC algorithm difference, heap region configurations, number of cores, etc.
Pitfall 6: Measuring. First is the unit of measurement. If your results are measured in nanoseconds then chances are you are in the wrong territory. Try to create more coarsely grained tests so that you can go into milliseconds at least. Better yet – in tens of seconds. Measuring really small units of work can again skew the results – measuring itself might change the outcome too much. Second – measure several times. Actually, tens of times if you can. And I do not mean that you iterate 30x over the benchmark in your code. I mean you launch the whole benchmark 30 times and aggregate the results.
Now if everything here scared you away from writing microbenchmarks – good. Because most often your microbenchmarks do not help you out with actual performance issues in your application. Just write simple, clear code that avoids clever optimizations. This is the code that JITs of the present and future are likely to optimize themselves. And that’s a job which truly should be theirs, not yours.