Planning for a risky infrastructure upgrade? Learn how to mitigate the risks.

December 17, 2019 by Gleb Smirnov Filed under: DevOps Monitoring Performance

In this post I will share the lessons we have learned from conducting major infrastructure upgrades of our production services. As an example I will use our recent database upgrade, which ended up improving the performance of the service by 20x and reducing the error rate by 10x. I will walk you through how we achieved this:

  • with no downtime for end users;
  • without any bugs or performance issues surfacing for end users;
  • with no overtime or working at 3AM for the engineering or operations teams.

Traditional approach

Whenever a major infrastructure upgrade is needed, the members of the IT operations team start shivering well in advance, remembering from first-hand experience how complex and risky the process is. Whenever a major database or virtual machine gets swapped, the process tends to look like this:

  • Test infrastructure is upgraded
  • Numerous (often manual) tests are run to verify that the functionality of the service is preserved
  • Performance tests are contemplated but often skipped due to lack of regularly maintained tests
  • Production upgrade is carried out during off-peak hours of the service, requiring the IT team to work in the wee hours
  • The upgrade is shaky, introducing numerous issues in production related to both the availability and the performance of the application
  • Multiple weeks of panic mode ensue, requiring the attention of the best engineers from both operations and engineering to patch all the issues impacting real users

So, it is completely natural to feel pain when preparing for the next major upgrade.

However, there is a way to relieve the pain and mitigate the risks. Bear with me and I will walk you through the set-up we have used on numerous occasions to upgrade our own infrastructure at Plumbr.

Case study – how we upgraded our Druid instance

I will be using the example of our last major upgrade, where we migrated one of our primary Druid data stores from Druid 0.10.1 to Druid 0.15. The upgrade was triggered by gradual performance degradation of the deployment: the increased uptake of our APM and RUM solutions meant that we had more and more data in the store, making certain queries slower and slower each month. This resulted in tail-end latencies surpassing tolerable levels, with the 99th percentile of read operations already exceeding 4 seconds. Additionally, some known bugs of the old version started manifesting more frequently.

Initially we tried to optimize and tune the existing instance, but by September it was clear that we were at our wit’s end. The new and shiny Druid 0.15 seemed to be the real answer, as it shipped numerous performance improvements that targeted the kind of issues we were facing. So the migration project was summoned into existence.

Migration steps

As this was not the first time we had carried out such a major upgrade, we were able to build upon our knowledge from previous times. This meant that instead of the painful process described above, we:

  • Installed a new Druid 0.15 instance and deployed it in production, parallel to the existing Druid 0.10.1. As we run on AWS, this was as easy as throwing together a CloudFormation template.
  • Mirrored all the inserts in production to both the old and the new cluster
  • Gradually copied historical data from the old cluster to the new one
  • Mirrored the read operations to the new cluster as well in the background
  • Made sure that real users were still using the old Druid 0.10.1 data
  • Made sure both the new and the old Druid instance were monitored by our APM. This meant that every API call arriving at any of the Druid query brokers was monitored for both its duration and its outcome.
  • Stabilized and optimized the new instance until it was on par with or better than the previous version
  • Flipped traffic to the new Druid instance and retired the old instance.

This process gave us an opportunity to have the new Druid version already working with production data and with production load while being monitored for performance and availability. Additionally, instead of surfacing the issues to our end users, we were able to iron out the detected issues behind the scenes.

It took us around ten days of parallel runtime until the error rate and latency in the new Druid instance were finally better than in the old instance. We used Plumbr APM to monitor the API calls in both instances with regard to error rate and latency distribution, so objective, evidence-based information was right in front of us. It took quite a few different patches to get rid of the bottlenecks and errors surfacing in the new Druid, but eventually the new instance stabilized.
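To make the "on par or better" comparison concrete, here is a minimal Python sketch of the check. It is an illustration only, not how Plumbr APM computes its metrics: the `Call` record, the function names and the nearest-rank percentile are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Call:
    """One monitored API call to a query broker (hypothetical record shape)."""
    duration_ms: float
    failed: bool


def error_rate(calls):
    """Fraction of calls that ended in an error."""
    return sum(c.failed for c in calls) / len(calls)


def percentile(calls, pct):
    """Nearest-rank percentile of the call durations, in milliseconds."""
    durations = sorted(c.duration_ms for c in calls)
    rank = math.ceil(pct * len(durations) / 100)
    return durations[max(rank, 1) - 1]


def on_par_or_better(old_calls, new_calls, pct=99):
    """True when the new cluster is no worse than the old one on both metrics."""
    return (
        error_rate(new_calls) <= error_rate(old_calls)
        and percentile(new_calls, pct) <= percentile(old_calls, pct)
    )


# Made-up sample data: the same 100 mirrored queries as seen by each cluster.
old = [Call(100, False)] * 98 + [Call(4200, False), Call(4500, True)]
new = [Call(80, False)] * 99 + [Call(900, False)]
ready_to_flip = on_par_or_better(old, new)
```

Because reads are mirrored, both lists describe the same queries, so the two clusters are compared on identical load rather than on different time windows.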

After no new issues were detected for 72 hours, we flipped the switch and the real users started being serviced by the new Druid. After keeping a close eye on the new instance for the next 24 hours, we were able to pat ourselves on the back – the migration had taken place:

  • without any downtime for end users;
  • without any bugs or performance issues for end users;
  • without any overtime or working at 3AM for the engineering or operations teams.

And, of course, we were able to enjoy the improved performance – the tail end latencies at 99th percentile were down 30%! In parallel to improved performance, the error rate decreased by more than 50%!

We also saw an unexpected side effect: our AWS bill dropped by 21%, because we were able to provision smaller instances for the new version.

Lastly, the new Druid instance allowed us to apply additional optimizations that only became available with the new version. At the time of writing, the median response time is down 60% and the tail-end latency at the 99th percentile is 20x smaller than in September. In addition to improved performance, the error rate has decreased by more than 90%!

Note that we switched to the new Druid version well before all the optimizations were applied, to keep the cost of running two parallel data sources at bay. The decision to switch was made based on the information extracted from the distributed traces captured by Plumbr APM: as soon as the performance and availability of the new instance were equivalent to (well, better than) those of the old instance, the new instance became the one serving real users.

How to set up the mirroring?

The mirroring setup consists of two parts: writes and reads. The writes are fairly straightforward, since we just stream real-time data via Kafka. All we had to do here was set up the new cluster to index the same Kafka topics:

[Figure: write operations mirrored to both Druid clusters via Kafka]
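Mirroring writes is this easy because a Kafka topic is an append-only log that several readers can consume independently, each tracking its own position. The Python sketch below illustrates only that idea with an in-memory stand-in. The `Topic` class and the reader names are hypothetical and this is neither the Kafka client API nor Druid's ingestion code.

```python
class Topic:
    """In-memory stand-in for a Kafka topic: an append-only log of events."""

    def __init__(self):
        self.log = []
        self.positions = {}  # reader name -> index of the next event to read

    def publish(self, event):
        self.log.append(event)

    def poll(self, reader):
        """Return the events the given reader has not seen yet."""
        start = self.positions.get(reader, 0)
        self.positions[reader] = len(self.log)
        return self.log[start:]


topic = Topic()
for event in ({"id": 1}, {"id": 2}, {"id": 3}):
    topic.publish(event)

# Each cluster reads the same topic at its own pace; the producers that
# publish the events do not need to know that a second reader exists.
old_cluster = topic.poll("druid-old")
new_cluster = topic.poll("druid-new")
```

Both clusters end up with the same events, and adding the second reader required no change on the producing side.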

Additionally, historical data had to be imported from the old cluster. In the case of Druid, copying the segment files to a new location in the deep storage and then copying the segment metadata did the trick.

Mirroring the reads required introducing a new element into the infrastructure. Since the queries arrive via HTTP POST, a simple nginx-based proxy using the ngx_http_mirror_module was placed in front of the Druid query brokers. The configuration looks like this:

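# Queries are answered by the old cluster; each request is also copied to /mirror.
location /druid/v2 {
    mirror /mirror;
    proxy_pass http://old.druid.internal;
}

# Internal-only location that replays the copied request against the new cluster.
# nginx ignores the responses to mirror subrequests.
location = /mirror {
    internal;
    proxy_pass http://new.druid.internal$request_uri;
}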

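So in the resulting infrastructure, every read operation first hits the nginx proxy. The proxy forwards the query to the old Druid cluster and returns that response to the client, while a copy of the same query is sent to the new cluster and its response is discarded.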
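The behaviour of the mirroring proxy can be summarized in a few lines of Python. This is a simplified sketch under stated assumptions: the brokers are plain callables invented for the example, and the mirror call runs sequentially here, whereas nginx issues it as a background subrequest.

```python
def make_mirroring_proxy(primary, mirror):
    """Build a handler that answers from `primary` and shadows calls to `mirror`.

    Both arguments are callables taking a query and returning a result;
    they stand in for the old and the new query brokers.
    """
    def handle(query):
        response = primary(query)  # the client only ever sees this response
        try:
            mirror(query)          # same query, result thrown away
        except Exception:
            pass                   # a failing mirror must never affect users
        return response

    return handle


seen_by_new = []


def old_broker(query):
    return {"served_by": "old", "query": query}


def new_broker(query):
    seen_by_new.append(query)
    raise RuntimeError("new cluster is still unstable")


proxy = make_mirroring_proxy(old_broker, new_broker)
result = proxy({"queryType": "timeseries"})
```

Even though the hypothetical new broker fails on every call, the client still gets the answer from the old one, which is exactly the property that lets the new cluster be stabilized behind the scenes.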

Last but not least, we set up Plumbr APM to fork all the distributed traces at each Druid broker into two independent traces, giving us a side-by-side comparison of the old and the new Druid cluster. You can read more about this feature on our support pages.

Take-away

If you are shivering with anxiety at the sight of the next major upgrade on your roadmap, follow these simple steps:

  • Set up a new version of your infrastructure. Assuming that you have up-to-date and working deployment scripts, this will be trivial on any cloud provider. If you are running your own data center, this might be slightly more complex and/or resource constrained.
  • Make a copy of the data and make sure the new version can read and write to it.
  • Mirror the read and write operations to both the new and the old infrastructure
  • Monitor both the new and the old instance using a good APM. If you do not have one at your fingertips, Plumbr is of course an awesome match for this.
  • Iron out the detected errors and bottlenecks until the error rate and latency of the new instance are at least on par with those of the old instance.
  • Flip the switch and divert real users to the new instance.
  • Monitor the situation for a few days to ensure no unexpected issues surface
  • Retire the old instance

I do hope reading this post will save our dear readers hundreds of sleepless nights spent on upgrading and troubleshooting. If you are contemplating such a switch and are having doubts, feel free to contact us via support@plumbr.io and we will help streamline your next update!
