How big should your on-call DevOps engineering team be?
Pretty much every DevOps engineer has worked for a company that has set up, or has started to set up, an on-call engineer rotation. From a business point of view, it’s entirely necessary to have someone available to quickly resolve any incidents that are negatively affecting customers. Let’s be honest, your software has bugs, so it makes sense to be able to respond to them.
But from a human point of view, being an on-call DevOps engineer is horrific! It’s a constant stress knowing you might be woken up at 4am for something trivial. Or worse still, sleeping through an alarm for a major incident! On top of that, focusing on normal work becomes almost impossible due to minor interruptions that take you out of the zone. That’s why you need to pay careful attention to how you build out your on-call DevOps engineer team if you want to stay on top of incident management issues.
What can go wrong?
The dangers of an ill-prepared plan should be obvious to anyone who has a Pavlovian response to their PagerDuty beep. It increases your anxiety, it disrupts your work, it destroys your sleep and it will eventually lead to burnout.
If you are the person keeping one eye on business performance, you should also be aware that issues can end up being ignored, routine work gets put to one side and critical knowledge becomes siloed.
So you should probably hire a few dedicated engineers to handle on-call incidents?
There is a reason that not many companies have a team of 1-2 on-call engineers, and it mainly comes down to a question of size. Imagine you are supporting a site that runs 24/7 with a global reach, and devices running into the low thousands. While your product engineers are left free to do the cool stuff, the on-call engineers will be getting PagerDuty alerts at least 3-4 times a week. It’s not a sustainable model and your incident management responses will suffer greatly.
So let’s put all our DevOps engineers on-call and share the burden with 25 people
In theory a large team could work. Everyone is on-call once a month and then a back-up another day. But theory doesn’t always equal reality. In practice, we see that two major problems come about with that approach.
1 – If you are on-call only once a month, what are the odds that you can get accustomed to the discipline required to answer incidents? For small issues this might not be a problem, but when you’re faced with a major emergency you probably won’t know what to do. Being on-call is not just about having skills, it’s also about coping with pressure and acting quickly. These aren’t the kinds of talents that an engineer can easily learn one day a month.
2 – When you’re on a large team like this and face an unfamiliar issue, you can always decide that it’s someone else’s problem. We all know that it’s hard to disseminate knowledge around a large group of people, so when you get an alert from a service you don’t know, the chances of you being able to adequately deal with it decrease.
The sweet spot: Your on-call team should consist of 5-6 people
At Plumbr, we see the best fit as being 5-6 people. Across a weekly rotation this means that every single person is guaranteed time off and shares an equal burden of being on-call and on backup. This way the common stresses and interruptions to work can easily be minimized.
Returning to the business side, clients can rest easy knowing that someone is taking responsibility there and then and owning the issue.
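To make the weekly rotation concrete, here is a minimal sketch of how it could be laid out for a six-person team. The names, the `weekly_rotation` helper and the rule that this week's backup is next week's primary are all assumptions for illustration, not a description of how any particular team or scheduling tool actually works.

```python
def weekly_rotation(engineers, weeks):
    """Return a list of (week_number, primary, backup) tuples.

    Assumption: the backup for a given week is whoever will be primary
    the following week, so each engineer is primary once and backup
    once per full cycle through the team.
    """
    n = len(engineers)
    if n < 2:
        raise ValueError("need at least two engineers for primary + backup")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % n]
        backup = engineers[(week + 1) % n]
        schedule.append((week + 1, primary, backup))
    return schedule


# Hypothetical team of six
team = ["Ana", "Ben", "Chen", "Dee", "Eli", "Fay"]
schedule = weekly_rotation(team, weeks=12)

for week, primary, backup in schedule:
    print(f"week {week}: primary={primary}, backup={backup}")
```

With six people, each engineer in this sketch is primary for one week and backup for one week out of every six, which leaves four weeks per cycle with no on-call duty at all.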
I’ve got 5-6 people on my team already! So I’m good, right?
Not having a well-structured on-call team leads to stress, but it is really a symptom of a wider problem. Getting constant alerts probably means something is structurally wrong and if you can’t identify it, it doesn’t really matter how big or small your on-call team is. The key to effective incident management is to make sure you’re being alerted to issues that actually matter. With a good real-user monitoring app, on-call engineers can discover whether an issue is customer facing and see how many people are affected by the same error. By separating signal from noise, you can enjoy your work without interruptions and focus on the important things.
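As a rough illustration of separating signal from noise, the sketch below pages someone only when an alert is customer facing and affects enough users. The `Alert` fields, the `should_page` function and the threshold of 10 users are hypothetical; they are not taken from any real monitoring product's API, and the right threshold will depend on your own traffic.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str           # hypothetical service name
    customer_facing: bool  # does the error reach real users?
    affected_users: int    # how many users hit the same error


def should_page(alert, min_affected_users=10):
    """Page on-call only for customer-facing issues above a user threshold.

    The default threshold is an arbitrary example value.
    """
    return alert.customer_facing and alert.affected_users >= min_affected_users


alerts = [
    Alert("checkout", customer_facing=True, affected_users=240),
    Alert("nightly-batch", customer_facing=False, affected_users=0),
    Alert("profile-page", customer_facing=True, affected_users=3),
]

to_page = [a for a in alerts if should_page(a)]
for alert in to_page:
    print(f"page on-call: {alert.service} ({alert.affected_users} users affected)")
```

Everything that does not meet the bar can still be logged and reviewed during working hours instead of waking someone up.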
Interested in finding out how you can support engineers with the right post-mortem culture?