Handling a major release: our experience
Anyone who has spent time in software development can bear witness to a situation where some major version upgrade turned into a nightmare for the stakeholders. Large releases are often postponed, or released sans some important features; and newly released software is often riddled with bugs.
In this article I will describe two techniques we used at Plumbr to successfully release a major upgrade to our Plumbr Java Performance Monitoring solution, without getting burned by the usual fires.
The foundation for success was built on two main pillars:
- Continuous Delivery: The major release was not living in some separate code branch for all of the months it took to build it. Instead, our next big change was constantly merged to the Mercurial default branch and released to production daily.
- Eating our own dog food: We used Plumbr to monitor our own Plumbr release. Monitoring our own Java Virtual Machines with continuously deployed changes in production gave us rapid feedback flow throughout the development cycle.
We will now discuss those two pillars.
The new version of the application, even though constantly deployed in production, remained hidden from our customers for several months. It was deployed on the same VM, embedded in the same WAR file and, to a large extent, even accessed the same data structures as the customer-facing version of the application.
The decision of which version of Plumbr to display to the user was based on a check of whether one specific role was present for the user or not. If the role was present, the user got access to the new version of Plumbr. If not, they were directed to the then-current version.
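A minimal sketch of such a role check might look like the following. The role name `NEW_UI_PREVIEW`, the `AppVersion` enum, and the `versionFor` method are hypothetical names chosen for illustration; the article does not describe the actual implementation.

```java
import java.util.Set;

// Illustrative sketch only: all names here are assumptions, not Plumbr's code.
final class VersionRouter {
    // Assumed name of the single role that unlocks the new version.
    static final String NEW_VERSION_ROLE = "NEW_UI_PREVIEW";

    enum AppVersion { CURRENT, NEXT }

    // Pick the version to display based on the roles granted to the user.
    static AppVersion versionFor(Set<String> userRoles) {
        return userRoles.contains(NEW_VERSION_ROLE) ? AppVersion.NEXT : AppVersion.CURRENT;
    }

    public static void main(String[] args) {
        System.out.println(versionFor(Set.of("USER")));                   // CURRENT
        System.out.println(versionFor(Set.of("USER", NEW_VERSION_ROLE))); // NEXT
    }
}
```

Keeping the decision in one place means that opening the new version to an internal user or a pilot customer only requires granting a role, not a redeploy.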
Routing users was possibly the easiest part to implement; much trickier was figuring out how to support two different sets of requirements in the data structure. In our case, the difference in requirements was significant enough to (seemingly) dictate the need for a different type of storage for certain data structures.
In retrospect, the solution to the problem proved to be truly elegant. After designing the new data structures, we migrated the existing data to them. After the migration, we proxied all write operations to insert data into both the old and the new data structures. This resulted in two datasets, one supporting the previous version of the application, the other accepting reads from the new solution.
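The migrate-then-dual-write idea can be sketched roughly as follows. This is an in-memory illustration under stated assumptions: `RecordStore`, `InMemoryStore`, `DualWriteProxy`, and `migrate` are hypothetical names, and the real stores were of course databases rather than lists.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: every type and method name here is an assumption.
final class DualWriteSketch {
    interface RecordStore {
        void write(String record);
        List<String> readAll();
    }

    // Stand-in for a real database, so the example is self-contained.
    static final class InMemoryStore implements RecordStore {
        private final List<String> records = new ArrayList<>();
        @Override public void write(String record) { records.add(record); }
        @Override public List<String> readAll() { return new ArrayList<>(records); }
    }

    // Fans every write out to all configured stores; reads stay with each application version.
    static final class DualWriteProxy {
        private final List<RecordStore> targets;
        DualWriteProxy(RecordStore... targets) { this.targets = List.of(targets); }
        void write(String record) {
            for (RecordStore target : targets) {
                target.write(record);
            }
        }
    }

    // One-off copy of the existing data into the new structure.
    static void migrate(RecordStore from, RecordStore to) {
        for (String record : from.readAll()) {
            to.write(record);
        }
    }

    public static void main(String[] args) {
        RecordStore oldStore = new InMemoryStore();
        RecordStore newStore = new InMemoryStore();
        oldStore.write("tx-1");                      // data that existed before the migration
        migrate(oldStore, newStore);
        DualWriteProxy proxy = new DualWriteProxy(oldStore, newStore);
        proxy.write("tx-2");                         // from now on, every write lands in both stores
        System.out.println(oldStore.readAll());      // [tx-1, tx-2]
        System.out.println(newStore.readAll());      // [tx-1, tx-2]
    }
}
```

The sketch ignores the hard parts of a production setup, such as writes arriving while the migration is still running and a write that succeeds in one store but fails in the other.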
Next in line was something near and dear to the heart of every developer, namely API design. As we already had a clear contract between the front and back ends via our API, we just needed to provide a version 2 of the API. The API for the new solution was completely decoupled from the previous version, essentially providing two different views of the data. This approach allowed us to build new features one by one, while supporting the previous version via the soon-to-be-deprecated API.
From this point on, it was all downhill. All we needed was to deliver implementations for the API one by one, releasing the new features to production day after day.
The product approached “feature complete” fast, and a full month before the release we were already able to use it internally and even allow access to a subset of customers. So besides being technically satisfying, the approach also allowed smooth validation of the solution along the way.
The second pillar of a successful release was made possible by what we call the “meta-solution” situation, an Alice in Wonderland kind of paradox that occurs when you build monitoring solutions that you can use to monitor your own services.
To give you an idea of how this was beneficial to us, let me describe the solution we were building in a few words. Plumbr is designed to detect slow and failing user transactions in an application, and automatically link such transactions to the root cause in the source code. Building such a solution meant that the task of testing, and especially performance testing, new code was reduced to processing the alerts triggered by Plumbr (the instance that was monitoring Plumbr) and fixing the exposed root causes as they appeared during the development process.
To give you an idea about the solution, here is a screenshot from the Plumbr production deployment monitoring our own services. The screenshot is taken after we received an alert about performance degradation in production:
We immediately noticed the unusually high number of transactions (1,836 to be precise) being slower than determined by the SLA. Instead of the usual situation where just a handful of transactions would have been flagged as slow, we were facing close to 1% of the transactions completing slower than tolerated based on the SLA.
After being equipped with the macro-level impact, we needed to determine the functionality in our application that was suffering. Plumbr groups transactions by the endpoint (called “services”) that the transactions are consuming. For example, we know that some of our application transactions access the URL /account/12jk112/services and /account/92as982/services. These transactions are accessing the same functionality and are automatically grouped together under the same service. By checking the list of services, we immediately exposed the fact that a vast majority of the 1,836 slow transactions were consuming a single service:
Out of the 182 services monitored by Plumbr during the period, ServiceController.getAccountServices() was the one causing 93% of the slow transactions. Out of the 1,836 slow transactions, 1,710 were consuming exactly this service, providing the content of the list seen in the screen above. So here we have our example of software monitoring itself and detecting bugs in itself …
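To make the grouping and the 93% figure concrete, here is a rough sketch. It groups by a URL pattern, which is only an approximation for illustration: how Plumbr actually maps transactions to services is not described here, and the `ACCOUNT_ID` pattern, `serviceOf`, `countByService`, and `formatShare` are all hypothetical names. The third URL in the demo is likewise made up.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Illustrative sketch only: not Plumbr's grouping logic.
final class ServiceGrouping {
    // Assumption: the account ID is the path segment right after /account/.
    private static final Pattern ACCOUNT_ID = Pattern.compile("^/account/[^/]+/");

    // Collapse the variable account ID so URLs hitting the same functionality share one key.
    static String serviceOf(String url) {
        return ACCOUNT_ID.matcher(url).replaceFirst("/account/{id}/");
    }

    // Count transactions per service; TreeMap keeps the output order stable.
    static Map<String, Integer> countByService(List<String> urls) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String url : urls) {
            counts.merge(serviceOf(url), 1, Integer::sum);
        }
        return counts;
    }

    // Share of `part` in `total`, formatted as a percentage with one decimal.
    static String formatShare(long part, long total) {
        return String.format(Locale.ROOT, "%.1f%%", 100.0 * part / total);
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "/account/12jk112/services",
                "/account/92as982/services",
                "/account/12jk112/settings"); // made-up URL for contrast
        System.out.println(countByService(urls));
        // {/account/{id}/services=2, /account/{id}/settings=1}
        System.out.println(formatShare(1710, 1836)); // 93.1%
    }
}
```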
But the sweetest spot was still a couple more clicks away. When we opened the service detail view, it was again immediately apparent that all those 1,710 transactions were slow thanks to a single expensive JDBC operation monitored and exposed by Plumbr:
From the above we could see that the particular service had indeed been too slow 1,710 times because of a particular JDBC query being poorly constructed. From that point on it was easy – having both the offending query and the line in source code where the query was constructed, we patched the problem immediately.
Combing through our release backlog, I can currently count no fewer than 57 tickets (out of a total of 480) that were created and resolved because of monitoring Plumbr with Plumbr. Our software monitors JVMs for a variety of performance issues, and we found nearly everything, including numerous expensive JDBC operations, a handful of lock contention issues, and loads of mishandled exceptions and GC pauses. We even ran into an OutOfMemoryError once.
Besides ironing out bugs, eating our own dog food over three months also gave us an extremely tight product feedback loop, which in turn led to several major improvements to the product itself. For example, a new Plumbr feature provides the ability to equip prepared JDBC statements like the one above with the exact query parameters that resulted in poor performance. If you check the numbers, 95.2% of the transactions consuming this service were still performing nicely, meaning the root cause exposed itself only when certain conditions were fulfilled.
Everything is Slow
We were moving ahead at an almost unbelievable rate for a while. This was too good to be true, and there indeed was a problem down the road, just waiting to break our necks. If you recall, the requirements for the new version seemed to dictate the need for a completely different type of storage. That was awesome news for us techies because we got to play with shiny new things.
This time the shiny new thing was a particular NoSQL database. After several weeks of proxied writes to the new data structures, the performance of the production application started to degrade. It became steadily worse each day, in spite of our constant optimization attempts.
Seven weeks before the release it became obvious that continuing with that technology would put the entire release at risk, so we formed a SWAT team to build another back-end, this time using a good old relational storage solution. This meant adding yet another storage technology behind the proxy, after which writes were directed to three destinations: the current storage used by end users, the NoSQL storage that we were still trying to optimize, and the fallback solution that the SWAT team was rapidly building.
After a calendar week we stood at a decision point: NoSQL or relational? In comparing the performance of the NoSQL solution, built and optimized for several months, with the quick-and-dirty relational backend, it was clear that even in its immature and un-optimized state, the relational storage was here to stay.
Within one more calendar week, the last remnants of our foray into NoSQL land were gone from the code base and the project was back on track.
The combination of these two approaches contributed to a situation I had never experienced in my 15-year career in the software industry. A whole ten hours before the biggest release of my life I was staring at the JIRA backlog and not believing my eyes. The only task remaining in the backlog was named “Release”.
No bugs, no pending features, no developers craving to get some sleep in the office, just me sitting and refreshing the backlog, still not believing my eyes.
Apparently this situation is so uncommon that JIRA, where we keep our backlog, decided to enter the “hey, this user has to be a newbie as he has only one ticket left in his backlog” mode and began offering me all kinds of extra tooltips and help banners:
This post was originally published by InfoQ in the InfoQ Articles section.