GC impact on throughput and latency
One class of problems that nearly every Java application out there has to wrestle with is garbage collection. When the garbage collector works well, it is a wonderful invention. When it does not, or when the way GC does its housekeeping becomes unpredictable, you have a friend who has turned into a foe.
This post is about garbage collection pause times. Or, more precisely, why you should care about the pauses. Consider a factory analogy:
- We have a factory line producing one iPad per second. Each second, every second. So the throughput of the line is 86,400 iPads per day.
- It takes four hours to complete an iPad from the start where the casing is molded to the finish when the acceptance tests on the iPad have been concluded. So the latency of the line is four hours.
The system above and the calculations are based on the assumption that the factory line is operational 24 hours a day, every day. But all factory lines tend to need maintenance, which is the equivalent of garbage collection running inside the JVM.
As an example, let's take small maintenance tasks, which can be handled without much interruption. Examples could involve adding oil to the machinery or picking up excess trash from the floor next to the molding equipment. Those operations are similar to minor GCs within the JVM: it is maintenance you have to deal with, but the implementation is so clever that the performance of the system is barely affected.
But in the very same factory Mr. Tim Cook is going to face long-lasting maintenance tasks as well. Those tasks involve stopping the whole production line and are equivalent to Full GC runs, where the JVM needs to stop servicing the application threads in order to do some important housekeeping.
Now, let's assume that after months of uninterrupted service, our hypothetical factory line gets jammed and the tech team takes four hours to resolve the issue. During this period the line is stopped. How do we measure the effect? As always, the impact can be measured in two different ways:
- Impact on throughput. The four-hour stop means we have 14,400 seconds during which no iPads are completed. Throughput-wise it means we have reduced the system's capacity on this particular day from 86,400 to 72,000 iPads, a loss of approximately 16.7% in throughput.
- Impact on latency. Now, if we took an iPad which was still on the line when the interruption occurred, it took not four but eight hours to complete. This represents a 100% increase in worst-case latency.
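The arithmetic behind the two bullets above can be checked with a minimal Java sketch. The class and method names are made up for illustration; the only inputs are the figures from the example (one iPad per second, a four-hour base latency, a four-hour stop).

```java
public class FactoryLineImpact {
    // One unit per second, so a full day yields 86,400 units.
    static final long SECONDS_PER_DAY = 24L * 60 * 60;

    /** Units completed in one day when the line is stopped for the given number of seconds. */
    static long unitsCompleted(long stoppedSeconds) {
        return SECONDS_PER_DAY - stoppedSeconds;
    }

    /** Share of the daily throughput lost to the stop, as a percentage. */
    static double throughputLossPercent(long stoppedSeconds) {
        return 100.0 * stoppedSeconds / SECONDS_PER_DAY;
    }

    /** Worst-case latency increase for a unit that sits on the line during the whole stop. */
    static double latencyIncreasePercent(long baseLatencySeconds, long stoppedSeconds) {
        return 100.0 * stoppedSeconds / baseLatencySeconds;
    }

    public static void main(String[] args) {
        long stop = 4 * 3600;        // four-hour jam
        long baseLatency = 4 * 3600; // four hours from molding to acceptance tests

        System.out.println("Units completed: " + unitsCompleted(stop));
        System.out.println("Throughput loss (%): " + throughputLossPercent(stop));
        System.out.println("Worst-case latency increase (%): "
                + latencyIncreasePercent(baseLatency, stop));
    }
}
```

The same stop shows up as a loss of about one sixth on the throughput side and as a doubling on the latency side, which is why the two metrics need to be tracked separately.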
Mr. Cook does not care much about latency. What is important for him is the overall throughput over a longer period, so he would optimize his processes in a way that minimizes the impact on throughput.
Similar decisions need to be made in software development as well. If you have a Java EE application responsible for order processing, then a GC pause spanning four seconds would definitely reduce the throughput of your system. But for most of us that will not be a major issue. On the other hand, the users who were trying to accomplish things during the four-second stop-the-world-i-have-cleaning-to-do pause would get the impression that our systems are sluggish. And operating a service which users perceive as sluggish is a darn good way to go out of business.
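To put rough numbers on that contrast, here is a small sketch. Only the four-second pause comes from the text; the one-hour observation window and the 200 ms normal response time are hypothetical values chosen for illustration, as are the class and method names.

```java
import java.time.Duration;

public class PauseImpact {
    /** Share of the observation window during which no requests were served, as a percentage. */
    static double throughputLossPercent(Duration pause, Duration window) {
        return 100.0 * pause.toMillis() / window.toMillis();
    }

    /** Latency seen by a request that arrives just as the pause begins. */
    static Duration worstCaseLatency(Duration normalLatency, Duration pause) {
        return normalLatency.plus(pause);
    }

    public static void main(String[] args) {
        Duration pause = Duration.ofSeconds(4);          // from the example in the text
        Duration window = Duration.ofHours(1);           // hypothetical observation window
        Duration normalLatency = Duration.ofMillis(200); // hypothetical typical response time

        System.out.println("Throughput loss (%): " + throughputLossPercent(pause, window));
        System.out.println("Worst-case latency (ms): "
                + worstCaseLatency(normalLatency, pause).toMillis());
    }
}
```

Under these assumptions the pause costs a fraction of a percent of hourly throughput, while an unlucky request takes more than four seconds instead of a fifth of a second.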
The moral of the story? Pick your goals wisely and make sure you do not confuse throughput with latency. Then make sure you understand how GC can affect either of those by monitoring your GC logs, looking for unexpected Full GCs and tuning your application and/or GC to minimize their impact.
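As a starting point for that monitoring, the JVM exposes basic collector statistics through the standard `java.lang.management` API. The sketch below (class and method names are illustrative) prints the collection count and accumulated collection time per collector. Note that the reported time is an approximate elapsed time per collector and, for concurrent collectors, is not necessarily the same as stop-the-world pause time, so treat it as a coarse signal and turn to the GC logs (for example `-Xlog:gc*` on JDK 9 and later) for individual pauses.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    /** One line per collector: name, number of collections, accumulated time in ms. */
    static String summarize() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(": ").append(gc.getCollectionCount()).append(" collections, ")
              .append(gc.getCollectionTime()).append(" ms total")
              .append(System.lineSeparator());
        }
        return sb.toString();
    }

    /** Accumulated collection time across all collectors; undefined (-1) values are skipped. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.print(summarize());
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime(); // JVM uptime in ms
        System.out.println("Total GC time: " + totalGcMillis() + " ms of " + uptime + " ms uptime");
    }
}
```

Sampling these counters periodically and comparing the deltas gives a rough view of how much time goes to GC, which relates to throughput. Latency questions still require looking at the duration of individual pauses.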