How it is made: Plumbr Edition
“If you like laws and sausages, you should never watch either one being made.”
There are many domains where this quote, attributed to Otto von Bismarck, holds. I do hope it is not true for software, because in this post I am going to share how we build and release our performance monitoring solution, Plumbr.
In the following chapters we walk through how our automated tests, continuous builds and continuous delivery help us automate the following process:
For readers not familiar with the Plumbr architecture, some background is needed. From the build & release standpoint, Plumbr consists of several software components. This article covers two of them:
- Plumbr Agent. A downloadable agent attached to the JVM. Agents collect information from within the JVM and send it to the Server.
- Plumbr Server. Central data collector and the user interface for the digested data. The UI is served via a web-based front-end.
Plumbr as a company has no dedicated QA team. Our philosophy from the very beginning has been that all essential testing must be automated. Every developer should be able to run all tests on their local machine without additional setup. As a result, the deliverables are ready to be released as soon as all automated tests pass.
The philosophy also extends to deciding what needs to be covered by tests, when, and how. In most cases the developer’s judgement while building the feature is enough, but two additional safety nets compensate for potential misjudgements:
- All our branches go through a mandatory code review. In around 5% of cases, the reviewer requests that additional automated verification be added.
- In case of regressions slipping through, the feedback from production dictates the demand for additional tests.
For unit testing we use the wonderful Spock library. Several developers practise some form of TDD, but this is not a requirement, so those who prefer to write tests after the code are free to do so. These tests cover the majority of our code: at the time of writing, our last release reported 974 tests run against our Agent with 83% coverage, and 1,016 tests with 76% instruction coverage for the Server component.
Apart from the usual unit tests we have a separate layer of acceptance/integration tests. Our web applications have a suite of browser-based tests written with the Selenide library. They verify all the key use cases by simulating end-user interaction with our web site.
Acceptance tests for the Agent are different and, in a way, more complicated. Agents are meant to run in whatever environments our customers have chosen for their infrastructure. As a result, we need to support different platforms:
- Operating systems. Different versions and flavours of Linux, Windows and Mac are covered.
- Java versions. Many different Java 6, 7 and 8 versions are tested.
- Java vendors. We test on Oracle HotSpot, IBM J9 and OpenJDK.
- Application servers. We also test compatibility with different application servers and runtime environments, such as Tomcat, WildFly, IBM WAS and the Play runtime.
All in all we test the Agent on approximately 500 different combinations of the above. The number of combinations is growing quickly: just two years ago we were testing on a “mere” 120 combinations.
We have a separate job in Jenkins, our continuous integration server, which constantly polls Bitbucket. Whenever a pull request is created, the job checks out that branch from Bitbucket and runs “gradlew build” to execute the Gradle build. At this stage all our unit tests, a subset of the Selenide tests and the simplest Agent acceptance test are run. This serves as a quick check that the branch is mostly in good shape. If everything is OK, the Jenkins job approves the pull request.
When a pull request is finally merged into the trunk, Jenkins notices it and starts the job specific to the module the pull request affected. This job runs on the latest version of the trunk: it compiles all sources, runs the unit tests and packages the distribution file. The distribution unit is a war file for the web applications and a zip file for the Agent. The job runs the same build script that developers run on their local machines. As a result, failures in this job are rare, and every failure attracts immediate attention from the development team. At the end of the build the assembled distribution file is published to the Artifactory binary repository.
The build steps in the previous chapter are common to all our deployment modules. The delivery steps to production differ for the Agent and the Server, as described in the next chapters.
Plumbr Server is distributed as a single war file with a single external dependency on a MySQL database. Customers who cannot use our SaaS offering can download this bundle and deploy it into their own environment. Again, because our customers have different infrastructure preferences, the released module has to be compatible with different versions of the MySQL database. Verifying this compatibility is the responsibility of the Jenkins job that prepares the release.
The Plumbr Server release job runs automatically every morning and starts by downloading the pre-assembled distribution war from the Artifactory repository. This distribution is then deployed in parallel into different environments backed by different MySQL versions.
Next, a Gradle script is run against every environment, executing our suite of Selenide tests. If all tests pass in all environments, the version qualifies for production and is automatically deployed to our own SaaS production machines. In addition, the release is made available via our Download Center to users who prefer to install the Server themselves. As a final step, the released bundle is moved to a separate release repository in Artifactory.
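The gate described above ("deploy everywhere in parallel, release only if every environment passes") can be sketched roughly as follows. This is a minimal illustration, not our actual Jenkins job: the function name `qualifies_for_production`, the `run_suite` callable and the MySQL version labels are all hypothetical stand-ins for the real deployment and Selenide test run.

```python
from concurrent.futures import ThreadPoolExecutor


def qualifies_for_production(environments, run_suite):
    """Return True only if the test suite passes in every environment.

    `environments` is a list of environment identifiers (hypothetical).
    `run_suite` is a callable that deploys to one environment, runs the
    browser tests there and returns True on success (hypothetical).
    """
    if not environments:
        # Nothing was verified, so nothing qualifies.
        return False
    # Run the suite against all environments in parallel.
    with ThreadPoolExecutor(max_workers=len(environments)) as pool:
        results = list(pool.map(run_suite, environments))
    return all(results)


if __name__ == "__main__":
    # Assumed version labels, for illustration only.
    envs = ["mysql-5.5", "mysql-5.6", "mysql-5.7"]
    print(qualifies_for_production(envs, lambda env: True))
```

A single failing environment blocks the release, which matches the "all tests pass in all environments" rule.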
As you may recall, we have to test our Agent on almost 500 different combinations of operating systems, Java versions/vendors and application servers. At first glance this might look truly scary, especially when you imagine all these 500 machines eating into your infrastructure budget.
In reality it is not that scary. We have sacrificed some build throughput and actually use just eight (virtual) machines to do the job. Of those eight, two reside in Amazon AWS, where we run some of our Windows builds.
Mac OS X builds, in contrast, run on actual hardware in our own server room. In the corner of this room we have a tiny Mac Mini. This was the first piece of hardware our company purchased, years ago. It was a good investment and still serves us well.
All remaining Windows builds and the various Linux builds are also carried out on our own physical infrastructure. There, Windows machines live inside VirtualBox virtual machines and Linux servers inside Docker containers.
All these machines run Jenkins slaves and are used to carry out the Agent acceptance tests. Here we have made another trade-off to achieve faster delivery: every night we run only a subset consisting of the 30 most popular combinations out of the entire 500.
Each acceptance test on each platform downloads the latest published Agent version from our Artifactory repository. Next, the required application server or runtime environment (such as the Play framework) is downloaded from the same Artifactory. The infrastructure is installed and the tests are run with our Agent attached to the JVM.
The goal of the acceptance tests is to verify that the Agent captures transactions and links the correct root causes to the problematic transactions. This tends to result in the following nice and tidy picture after about 20 minutes:
To compensate for the trade-off we made, we run all 500 combinations weekly. The process is exactly the same, only the matrix is “somewhat” larger:
The gray cells that you see denote impossible or infeasible combinations. For example, you cannot run WildFly 8 on JDK 6, and nobody in their right mind would run JBoss EAP 6 with OpenJDK 6 on Mac.
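To make the idea of a test matrix with excluded cells concrete, here is a small sketch. The axis values are a tiny, made-up sample (the real matrix is far larger), and the names `build_matrix` and `is_feasible` are hypothetical. Only the two exclusion rules are taken from the examples above.

```python
from itertools import product

# Hypothetical sample axes; the real matrix has many more entries.
OPERATING_SYSTEMS = ["linux", "windows", "mac"]
JDKS = [("oracle", 6), ("oracle", 7), ("oracle", 8), ("openjdk", 6), ("ibm", 7)]
APP_SERVERS = ["tomcat7", "wildfly8", "jboss-eap6"]


def is_feasible(os_name, jdk, server):
    """Return False for the 'gray cells' of the matrix."""
    vendor, major = jdk
    # WildFly 8 cannot run on JDK 6.
    if server == "wildfly8" and major == 6:
        return False
    # JBoss EAP 6 with OpenJDK 6 on Mac is not worth testing.
    if (server == "jboss-eap6" and vendor == "openjdk"
            and major == 6 and os_name == "mac"):
        return False
    return True


def build_matrix():
    """Enumerate every feasible (os, jdk, server) combination."""
    return [cell
            for cell in product(OPERATING_SYSTEMS, JDKS, APP_SERVERS)
            if is_feasible(*cell)]


if __name__ == "__main__":
    print(len(build_matrix()), "feasible combinations")
```

Keeping the exclusions in one predicate means a new "gray cell" is a one-line change rather than an edit to a hand-maintained list.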
The weekly job runs for around 15 hours because of the limited number of machines available to it. It is also quite flaky, producing many false negatives. To compensate for the flakiness, we rerun failed cells a couple of times to iron out infrastructure-induced false negatives and then investigate the remaining failures.
Once we are satisfied with the acceptance test results, we manually launch the Agent release job in Jenkins. This job uploads the new version to our Download Center and publishes it to the release repository in Artifactory. As the last step, the job makes sure we are using the latest version to monitor ourselves by automatically updating the Agents that we use to monitor our own production JVMs.
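The rerun strategy can be sketched as below. This is an illustration under assumptions, not our Jenkins configuration: `run_with_reruns` and `run_cell` are hypothetical names, and the default of two reruns simply stands in for "a couple of times".

```python
def run_with_reruns(cells, run_cell, max_reruns=2):
    """Run every cell once, then rerun only the failures.

    `run_cell` is a hypothetical callable returning True when the
    acceptance test for that cell passes. Cells that still fail after
    `max_reruns` extra attempts are returned for manual investigation.
    """
    failed = [cell for cell in cells if not run_cell(cell)]
    for _ in range(max_reruns):
        if not failed:
            break
        # Only previously failed cells are retried.
        failed = [cell for cell in failed if not run_cell(cell)]
    return failed
```

Cells that pass on a rerun are treated as infrastructure noise. Whatever remains in the returned list is what a developer actually needs to look at.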
Is this really enough?
As you may have noticed, we do not have any manual testing embedded in the build-test-release process. So how good a job are the automated tests really doing in keeping critical bugs from slipping into production?
To be honest, bugs still slip through. With the Server component, we counted six critical bugs that made their way into our production servers during the last year. These bugs are discovered fast: we eat our own dog food and monitor our Servers with Plumbr, so we are alerted whenever something is not behaving as it should. Coupled with a simple, automated release process, this meant we were able to patch them in under two hours. So, we are not exactly Netflix yet, but we are getting there.
Our Agents are more complicated. The struggle with the large & flaky multi-environment matrix means that for certain configurations bugs slip through more often. Again, from the data in our JIRA we identified 11 critical Agent-side bugs last year, each affecting a particular infrastructure combination.
As our Agents also talk to our Server, we have the means to capture such problems early. Agents connecting to the SaaS version of our service report to us when certain bytecode instrumentations are not behaving as they should or when native code changes are causing issues. When such issues are detected, we immediately pull the broken version from our Download Center, inform the impacted client(s) and proceed to release a fix within one or two days.
On the other hand, we have not kept a full count of the bugs caught by the automated tests. Last week alone the tests caught 14 different flaws in the Agent and Server before they made their way into our customers’ environments. There is no chance that a human QA team would have been able to keep up with the robots in this field.
The major benefit we get from the process above is the absence of human bottlenecks in our release pipeline. We don’t wait while QA tests and approves a version. We don’t require a separate build engineer or deploy manager to perform the actual release. Every developer is capable of pushing a new version or a critical fix into production at any time of day or night. No human intervention, no human error.
With continuous build-test-release coupled with the right monitoring tools, I can confirm that I sleep calmly at night. I am confident that our software is working as expected and that the rare errors are caught fast.
If you are not yet using the elements covered in this post in your build & release pipeline, or do not monitor your production deployments, I can only recommend that you start now. The increased quality and shorter time to production are well worth it.