Quality is paramount for any engineering team. Any drop in quality affects users and can end up costing your product dearly. It is inevitable that applications will contain errors. We’ve documented the effects of errors and how best to handle them on our blog previously.
New releases often bring instability to the product. Most engineering teams know from experience that when new features are built and pushed to production, there is a high likelihood of an increase in user-visible errors. The difficult question to answer is: if you are deploying continuously, are you introducing instability continuously too?
Some errors are detected early
We analysed six months of monitoring data from one of our applications. During this time, our team made 136 releases. Our goal with the data analysis was to determine whether the “release window” following each release was accompanied by increased error rates. We first established a statistical baseline of error rates over time: the average error rate over the six-month period was 0.8%.
Since we typically ship hotfixes in under 3 hours, we reviewed the 3-hour time window following each release. Within each window, we calculated the ratio of failed to total user interactions and expressed it as a percentage. Consolidated across all 136 such windows, the error rate came to 1.1% (± 0.05%), a 63% jump. A jump of that size means the service is measurably less reliable than normal in the hours right after a release.
Next, we calculated the consolidated error rate over the remaining time, excluding the three hours after every release. During these periods, the error rate was 0.7% (± 0.04%), a 1/8th reduction from the overall error rate and significantly lower than the error rate observed during release windows.
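To make the methodology concrete, here is a minimal Java sketch of the windowing calculation: classify each user interaction as falling inside or outside a three-hour post-release window, then compute the failed-to-total ratio for each group. The Interaction record, the field names, and the sample data are illustrative assumptions rather than Plumbr's actual data model.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Minimal sketch of the windowing approach described above. The Interaction record
// and the sample data are hypothetical; the three-hour window matches the hotfix
// turnaround mentioned in the text.
public class ReleaseWindowErrorRate {

    // One user interaction: when it happened and whether it failed.
    record Interaction(Instant timestamp, boolean failed) {}

    private static final Duration RELEASE_WINDOW = Duration.ofHours(3);

    // True if the interaction falls inside any three-hour window following a release.
    static boolean inReleaseWindow(Instant t, List<Instant> releases) {
        return releases.stream().anyMatch(r ->
                !t.isBefore(r) && t.isBefore(r.plus(RELEASE_WINDOW)));
    }

    // Error rate (failed / total, as a percentage) over the given interactions.
    static double errorRatePercent(List<Interaction> interactions) {
        long failed = interactions.stream().filter(Interaction::failed).count();
        return interactions.isEmpty() ? 0.0 : 100.0 * failed / interactions.size();
    }

    public static void main(String[] args) {
        List<Instant> releases = List.of(Instant.parse("2024-01-10T12:00:00Z"));
        List<Interaction> all = List.of(
                new Interaction(Instant.parse("2024-01-10T12:30:00Z"), true),   // inside window, failed
                new Interaction(Instant.parse("2024-01-10T13:00:00Z"), false),  // inside window, ok
                new Interaction(Instant.parse("2024-01-10T18:00:00Z"), false),  // outside window, ok
                new Interaction(Instant.parse("2024-01-10T19:00:00Z"), false)); // outside window, ok

        List<Interaction> inside = all.stream()
                .filter(i -> inReleaseWindow(i.timestamp(), releases)).toList();
        List<Interaction> outside = all.stream()
                .filter(i -> !inReleaseWindow(i.timestamp(), releases)).toList();

        System.out.printf("error rate in release windows: %.1f%%%n", errorRatePercent(inside));
        System.out.printf("error rate outside windows:    %.1f%%%n", errorRatePercent(outside));
    }
}
```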
Therefore, despite having several stages where the software is checked for quality, new releases tend to contribute to instability. How do you cope with these situations as an engineering team?
- Build situational awareness. Know which availability and performance objectives you want to meet, on a per-service basis.
- Enable a mechanism that alerts you to any violations of these objectives (see the sketch after this list).
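As a rough illustration of these two points, the sketch below keeps explicit per-service error-rate objectives and flags any measurement that violates them. The service names, thresholds, and the println-based alert are placeholders for whatever objectives and alerting channel your team actually uses.

```java
import java.util.Map;

// Hypothetical sketch: explicit per-service objectives plus a check that flags violations.
public class ObjectiveCheck {

    // Maximum acceptable error rate per service, in percent (illustrative values).
    private static final Map<String, Double> ERROR_RATE_OBJECTIVES = Map.of(
            "checkout-service", 0.5,
            "search-service", 1.0);

    static void check(String service, double observedErrorRatePercent) {
        Double objective = ERROR_RATE_OBJECTIVES.get(service);
        if (objective != null && observedErrorRatePercent > objective) {
            // In a real setup this would page someone or open an incident.
            System.out.printf("ALERT: %s error rate %.2f%% exceeds objective %.2f%%%n",
                    service, observedErrorRatePercent, objective);
        }
    }

    public static void main(String[] args) {
        check("checkout-service", 1.1); // violates the 0.5% objective
        check("search-service", 0.7);   // within the 1.0% objective
    }
}
```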
Some errors take a long time to manifest
Having looked at a single rapidly and frequently deployed application, let us now look at the bigger picture. We took three months of data from 165 JVMs, each serving over 10 million API calls. During this time, over 300 restart events were recorded. As an approximation, we assumed that a restart usually corresponds to a release. By inspecting the duration between restarts and the duration before errors appeared, we found that, in the median case:
- The time until a user first encounters a new error is over 40 hours.
- The time elapsed between shipping a defect and shipping its fix is 167 hours.
We derived these numbers from the observed frequency of JVM restarts. They imply that when teams are focused on application quality (QA, release engineering), the obvious bugs are caught before a release; otherwise, new errors would surface much sooner.
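For illustration, here is a simplified Java sketch of that derivation: treat each restart as a release, measure how long it takes until an error type first seen after that restart shows up, and take the median of those durations. The restart times, error names, and the ErrorEvent record are all invented.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Optional;

// Sketch of the derivation: median time from a restart (proxy for a release)
// to the first appearance of an error not seen before that restart.
public class TimeToFirstNewError {

    // An error type together with the moment it was first observed.
    record ErrorEvent(String errorType, Instant firstSeen) {}

    static Optional<Duration> timeToFirstNewError(Instant restart, List<ErrorEvent> errors) {
        return errors.stream()
                .filter(e -> e.firstSeen().isAfter(restart))  // only errors that first appeared after this restart
                .map(e -> Duration.between(restart, e.firstSeen()))
                .min(Duration::compareTo);
    }

    // Simple upper median; good enough for a sketch.
    static Duration median(List<Duration> durations) {
        List<Duration> sorted = durations.stream().sorted().toList();
        return sorted.get(sorted.size() / 2);
    }

    public static void main(String[] args) {
        List<Instant> restarts = List.of(
                Instant.parse("2024-02-01T08:00:00Z"),
                Instant.parse("2024-02-05T08:00:00Z"),
                Instant.parse("2024-02-09T08:00:00Z"));
        List<ErrorEvent> errors = List.of(
                new ErrorEvent("NullPointerException in OrderService", Instant.parse("2024-02-03T02:00:00Z")),
                new ErrorEvent("Timeout calling PaymentGateway",       Instant.parse("2024-02-07T20:00:00Z")),
                new ErrorEvent("SerializationException in ReportJob",  Instant.parse("2024-02-11T10:00:00Z")));

        List<Duration> durations = restarts.stream()
                .map(r -> timeToFirstNewError(r, errors))
                .flatMap(Optional::stream)
                .toList();

        System.out.println("median hours until a new error is seen: " + median(durations).toHours());
    }
}
```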
What does this mean? First, significant time elapses between a release and the team becoming aware of even the first occurrence of a failure. And even when engineers are made aware of those first failures, they cannot objectively decide whether a fix is worth it, because they lack evidence of the impact. If the error is allowed to persist, there is a risk of massive degradation in user experience. Time elapsed also means loss of context: reproducing the error, documenting variations in the operating environment, and gathering the other parameters an error report requires becomes a costly exercise.
You might find yourself in a situation where errors are on the rise, and the errors stem from faulty code rolled out to production days or weeks ago. How do you handle it?
- You need a list of the places where performance and availability objectives have been violated, ordered by their impact on users and usage (see the sketch after this list).
- Next, you need evidence that enables speedy root cause analysis, reducing the time and effort it takes to repair the application.
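One way such a prioritised backlog could look is sketched below: each violation records the service and objective it concerns, how many users it affected, and a pointer to supporting evidence, and the list is sorted so the most impactful item comes first. All names, numbers, and URLs are hypothetical.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of a violation backlog ordered by user impact.
public class ViolationBacklog {

    record Violation(String service, String description, long affectedUsers, String evidenceUrl) {}

    static List<Violation> orderByImpact(List<Violation> violations) {
        return violations.stream()
                .sorted(Comparator.comparingLong(Violation::affectedUsers).reversed())
                .toList();
    }

    public static void main(String[] args) {
        List<Violation> violations = List.of(
                new Violation("search-service",   "p99 latency above 2s",        1_200, "https://monitoring.example.com/incident/17"),
                new Violation("checkout-service", "error rate above objective", 15_400, "https://monitoring.example.com/incident/18"),
                new Violation("report-service",   "nightly job failing",            40, "https://monitoring.example.com/incident/19"));

        orderByImpact(violations).forEach(v ->
                System.out.printf("%-17s %-28s users affected: %6d  evidence: %s%n",
                        v.service(), v.description(), v.affectedUsers(), v.evidenceUrl()));
    }
}
```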
The answer to all of these issues lies in having a good monitoring solution. With Plumbr, we aim to provide you with these benefits, and more! Sign up for a trial if you'd like to experience the benefits first-hand. Alternatively, you can request a demo of Plumbr from one of our technical experts.