To blog Previous post | Next post
Underutilized Collections wasting memory
Plumbr has been recognized as memory leak hunters. But our goal has never been to stick with just the leak hunting. Your feedback about how easy it is to solve memory leaks with Plumbr made us wonder what if we could take the same approach we used to solve memory leaks and expand to other similar problem domains?
And indeed we have progressed in several other areas. In this article we can open up one of the research domains we have been working. Namely – Collection API overhead. As one picture is worth more than the words describing it, let us start with a graph illustrating one aspect of the problem:
This diagram is plotted using data from more than 500 different applications, all of which use thousands of Collection API class instances under the hood. What we have outlined in the diagram above is the collection utilization ratio. You see for example that more than 35% of the collections created are completely empty. And that approximately 9% of the collections are half-empty. Or half full if you are one of the rare optimists out there.
More interesting is the aggregate of the facts:
- 72% of the collections are utilized to 50% or less
- 90% of the collections are utilized to 75% or less
- And just 5% of the collections are fully utilized to the full capacity
But this data would just serve as a fun trivia without digging further and trying to understand the reasons and impact beyond the facts.
First lets look at the reasons. If you recall our recent post about Collection overhead, we analyzed the memory consumption of an ArrayList where we covered the internals of the ArrayList:
- When you initialize an ArrayList via its default new ArrayList() constructor, you have created a list with capacity to accommodate 10 elements. The size of the list is zero, as no elements have been yet added.
- When adding elements to the list, the capacity of underlying array is dynamically increased at certain thresholds.
Lets see how the increasing occurs by extracting the relevant parts from the source code of the ArrayList:
public boolean add(E e) {
ensureCapacityInternal(size + 1); // Increments modCount!!
elementData[size++] = e;
return true;
}
private void ensureCapacityInternal(int minCapacity) {
modCount++;
if (minCapacity - elementData.length > 0)
grow(minCapacity);
}
private void grow(int minCapacity) {
int oldCapacity = elementData.length;
int newCapacity = oldCapacity + (oldCapacity >> 1);
//cut for simplicity
elementData = Arrays.copyOf(elementData, newCapacity);
}
As seen from the JDK API above, the list capacity is being checked when new elements are added. If the size of the list is equal to the capacity, the underlying array is being copied into a new, larger one. This increase happens at 50% of the previous capacity. So for example when our original list has grown to 10 elements, adding a single new object grows the list to a capacity of 15. And when our list capacity is 1,000,000 and a single new object is added, we would have a list with a capacity of 1,500,000
So here we have the reason for the Collections being underutilized staring right into our face in the form of the JDK source code. But how severe is the impact of this behaviour?
Did you know that 20% of Java applications have memory leaks? Don’t kill your application – instead find and fix leaks with Plumbr in minutes.
In order to estimate this impact, we need to turn to the basic math of calculating how much space do those empty references require. It can be platform dependent and you can read more on how to estimate and/or calculate this from this post, but more often than not 4 bytes is required for each of those references. So applying this math to the case where you grew your ArrayList from 1,000,000 to 1,500,000 you have just wasted nearly 500,000 times 4 = 2,000,000 bytes to empty references.
Getting rid of this excess heap consumption is trivial – either initialize your collections to the expected sizes by using the ArrayList(int initialCapacity) instead of the default or invoke trimToSize() when you have populated the list you need.
This ~2MB might not seem much though – at least when trying to recall the last time when you actually created a collection accommodating more than 1,000,000 elements.
On the other hand – have you given a thought about
- How many smaller collections you have in place in your application?
- How many collections do your 3rd party libraries use under the hood?
- What would be the aggregate amount of memory lost to this excess reserve?
- And what if you could just discover and remove this waste easily?
If you are eager to be among the first ones to get your hands on to the solution, stay tuned – we are working towards making this data transparent. Equipped with this knowledge, we bet you are the best to decide whether you need to take action or not.
Comments
0.75 is default load factor for hashmap, which also backs lots of other collections. Which explains little to no utilization between 75-100%.
Would be cool to see statistics for lists and maps independently 🙂
We are currently working on extracting more useful data in these parts. Lots of surprises ahead, I am sure 🙂