Squeezing data into the data structure
This story is about a capacity optimization task that we recently carried out at Plumbr. It all started with an innocent-looking requirement being added to the existing mix.
As you might know, Plumbr monitoring solution is distributed as a Java Agent which connects to a Server. The small addition required to keep track of all the connected Agents over time so that questions such as following could be answered in real time:
- For how long have we not heard from this particular JVM?
- What was the last known downtime of that other JVM?
As each of the Agents is sending out a heartbeat every second, all that we need to do in the server-side is to keep track of all the heartbeats. As each heartbeat has a unique timestamp attached, the naive solution would be as easy as tossing all the heartbeats in a Set or a Map. So – easy, done, next, please?
However, some quick back-of-the-envelope math demonstrated that the initial idea just might not work. Taking into account that:
- a timestamp is of type long and requires 8 bytes to accommodate itself
- in a year there are 365 x 24 x 60 x 60 = 31,536,000 seconds
we can quickly do the math and see that the raw data alone for a single JVM for one year would require 240MB. The size of the raw data alone was scary enough, but when packaged to a HashSet the retained size of the structure exploded to about 2GB with all the overhead java.util.Collection API implementations hide in their belly.
The naive solution was off the table and we needed an alternative. We did not have to look very far initially, as in the very same java.util package a surprise called java.util.BitSet was waiting to be discovered. According to the javadoc of the class:
The BitSet class implements a vector of bits that grows as needed. Each component of the bit set has a boolean value. The bits of a BitSet are indexed by nonnegative integers. Individual indexed bits can be examined, set, or cleared.
So what if we stored the heartbeat acquired from the Agent as boolean values indexed by the heartbeat’s timestamp? Timestamps in Java are represented as the difference in milliseconds between the current time and midnight, January 1, 1970 UTC. Knowing this, we can represent the 1st of September, 2015, 12:00 UTC as the number 1441108800. So what if when we see an Agent sending us a heartbeat at timestamp 1441108800 we would set the bit with the index 1441108800 to true, otherwise being left as the default false?
The issue with the solution is hidden in the fact that bits in a BitSet are indexed by integer instead of long. To proceed with this solution, we would thus need a way to map the integers to long without losing any information. If it seems impossible then let’s look back at the fact that the precision of a second instead of a millisecond was needed. Knowing this, we can shrink the index 1,000x and stamp the time with the precision of a second instead of a millisecond.
But how many seconds can be represented using just Integers? Apparently Integer.MAX_VALUE is big enough to represent each second from 01.01.1970 to until 19.01.2038. Besides creating a year-2038 problem, it should be good enough, right?
Unfortunately, as our back-of the-napkin calculations show, one year worth of data would still require around 800MB of heap. This is a small step in the right direction from the original 2GB of the HashSet, but still way too much for practical use.
To overcome the problem, one might need to re-read/re-think the part that said “enough to represent each second from 01.01.1970”. (Un)fortunately mr. Gosling did not invent the Java Virtual Machine until 1995. And Plumbr itself saw the light 18 more years later. Consequently, we do not need to rewind the history until 1970 and have a bunch of zeroes padding each integer. Instead of starting from 01.01.1970, we can start with 01.01.2013 and have a bit with index 0 to correspond to 01.01.2013 00:00 (UTC).
Redoing our back-of the napkin math and checking the results in practice gave us a winner. Now a year’s worth of data could be stored in only 20MB. Comparing this with the original 2GB we have shrunk the needed capacity by 100x. This was already in the comfort zone as the existing infrastructure was able to cope with it, so we did not go further down the optimization path.
Moral of the story? When you have a requirement in your hands, find out what it might mean in terms of your application’s performance. And I mean all aspects of performance as there is more than just latency and throughput, one should not forget about capacity. And – know your domain. Without it, you cannot make decisions which, if just equipped with book-smarts about data structures, seem insecure and dangerous.