Balancing releases and quality in your applications.
The most successful software engineering teams maintain high performance and availability standards. They also have to keep shipping new features to make their software more useful.
It is often observed that releases bring instability to products. Little has been done to quantify this instability or characterize the errors. With Plumbr monitoring, we attempted to do this. We ran a few experiments to collect and analyze data. Our findings were as follows:
1. Some errors appear immediately after releases.
We considered monitoring data from one of our own applications for the past six months. During this time, our team made 136 releases. Our goal with the data analysis was to determine whether each release was followed by an increased error rate. We first made some measurements to establish a statistical baseline of error rates over time. The average error rate, measured as the ratio of failed to total user interactions, was found to be 0.8% over the six-month period.
Next, we took a 3-hour time window following each release. Within each window, we calculated the same ratio of failed to total user interactions and expressed it as a percentage. Consolidated over all 136 such time windows, the error rate came to 1.1% (± 0.05%).
Finally, we calculated the consolidated error rate for the remaining time, excluding the three hours after every release. During this period, the error rate was 0.7% (± 0.04%). That is a one-eighth reduction from the 0.8% baseline, and a significantly lower error rate than in the post-release windows.
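The comparison above can be sketched in a few lines of Python. This is only an illustration of the calculation, not our actual analysis pipeline: the function name, the `(timestamp, failed)` record shape, and the sample data are all assumptions made for the example.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=3)  # post-release window used in the analysis


def split_error_rates(interactions, releases, window=WINDOW):
    """Return (post_release_pct, steady_state_pct).

    interactions: iterable of (timestamp, failed) pairs (hypothetical shape)
    releases: list of release timestamps
    """
    # in_window -> [failed count, total count]
    counts = {True: [0, 0], False: [0, 0]}
    for ts, failed in interactions:
        in_window = any(r <= ts < r + window for r in releases)
        counts[in_window][0] += int(failed)
        counts[in_window][1] += 1

    def pct(failed, total):
        return 100.0 * failed / total if total else float("nan")

    return pct(*counts[True]), pct(*counts[False])


# Made-up sample data: one release, four user interactions.
release = datetime(2020, 1, 1, 12, 0)
sample = [
    (release + timedelta(minutes=30), True),   # failed, inside the window
    (release + timedelta(hours=1), False),     # ok, inside the window
    (release + timedelta(hours=5), False),     # ok, outside the window
    (release + timedelta(hours=6), False),     # ok, outside the window
]
post_release_pct, steady_state_pct = split_error_rates(sample, [release])
```

With real data, the two percentages correspond to the 1.1% and 0.7% figures reported above.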
What helps us identify high error rates right after a release? Here’s our methodology:
(i) We build situational awareness about errors using Plumbr. Our teams are made aware of availability and performance objectives on a per-service basis.
(ii) We have an alerting mechanism that notifies us of any violation of these objectives.
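As a rough sketch of the second step, an alert check can be as simple as comparing the observed error rate of each service with its objective. The service names and thresholds below are hypothetical, and the function is an illustration rather than Plumbr's alerting API.

```python
def violated_objectives(observed_error_pct, objective_error_pct):
    """Return the services whose observed error rate exceeds their objective."""
    return {
        service: rate
        for service, rate in observed_error_pct.items()
        if service in objective_error_pct and rate > objective_error_pct[service]
    }


# Hypothetical per-service objectives and observations, in percent.
objectives = {"checkout": 1.0, "search": 2.0}
observed = {"checkout": 1.4, "search": 0.6}
alerts = violated_objectives(observed, objectives)
```

Here only `checkout` would raise an alert, because 1.4% is above its 1.0% objective.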
2. Some errors manifest gradually.
We took data spanning 3 months from 165 JVMs, each serving over 10 million API calls. During this time, over 300 restart events were recorded. By inspecting the duration between restarts and the time before errors appeared, we found the following median values:
- Time until a user faces a new error: over 40 hours.
- Time elapsed between shipping a defect and shipping its fix: 167 hours.
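For illustration, the first of these medians could be computed along the following lines. The record shape (restart time and first-new-error time, both in hours) and the sample values are assumptions for the sketch, not our measured data.

```python
from statistics import median


def median_hours_to_first_error(restarts):
    """restarts: list of (restart_hour, first_new_error_hour or None).

    Restarts that never produced a new error are skipped.
    """
    delays = [err - start for start, err in restarts if err is not None]
    return median(delays) if delays else None


# Made-up sample: three restarts produced a new error, one did not.
sample_restarts = [(0, 40), (10, 60), (20, None), (30, 66)]
median_delay = median_hours_to_first_error(sample_restarts)
```

The delays in this sample are 40, 50, and 36 hours, so the median is 40 hours.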
You might find yourself in a similar situation, where errors and error rates are on the rise. This could be from faulty code rolled out to production several days/weeks ago. How do you handle it?
(i) You will need a list of the places where performance and availability objectives have been violated. This list has to be ordered by impact on users and usage.
(ii) Next, you will need evidence which will allow for speedy root cause analysis. This will reduce the time and effort taken to repair the applications.
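The ordered list from step (i) might look like the sketch below. The `Violation` fields and the choice to rank by affected users first and failed interactions second are assumptions for the example. Your own definition of impact may differ.

```python
from dataclasses import dataclass


@dataclass
class Violation:
    service: str               # hypothetical service name
    objective: str             # "availability" or "performance"
    affected_users: int
    failed_interactions: int


def prioritize(violations):
    """Order violations by user impact, most severe first."""
    return sorted(
        violations,
        key=lambda v: (v.affected_users, v.failed_interactions),
        reverse=True,
    )


backlog = prioritize([
    Violation("search", "performance", affected_users=120, failed_interactions=900),
    Violation("checkout", "availability", affected_users=450, failed_interactions=600),
    Violation("profile", "availability", affected_users=120, failed_interactions=1500),
])
ordered_services = [v.service for v in backlog]
```

Working through the backlog from the top keeps the effort focused on the defects that hurt the most users.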
Improvement in software performance is a long but rewarding process. The customer success team at Plumbr is focused on helping you on this journey. We are committed to assisting you in achieving your performance goals and making your software fast and reliable for your users. We’ve identified four impactful areas (monitoring, alerting, prioritizing by impact, and evidence-based root cause analysis) to help you uncover and cope with performance issues in your application.
Please reach out to us at email@example.com. We’re happy to draft an action plan to tackle your performance issues head on.