To blog |

A Day On-Call

June 5, 2019 by Gleb Smirnov Filed under: DevOps Monitoring

It is morning. You are at your workplace and your on-call shift has just begun. The chat bot helpfully posts this information in the team channel, including the time when the shift ends and noting the next engineer to take it over from you.

You take a look at the monitoring dashboards. There are no ongoing incidents. There are no alarms firing. The application has served over 120,000 users in the last 24 hours. 0.09% of user interactions have ended in failure, and over 6% of user engagement time was wasted waiting for stuff to load. That’s above normal, but the issue was already dealt with yesterday. Apparently, queries to a shared mysql database became slow due to an ongoing table optimization. A few even timed out. Ironically, the maintenance was meant to make queries faster.

Just as you switch to the infra dashboard, a warning pops up: one of the machines is running out of disk space. It’s predicted to completely run out of it in under 24 hours. Since you have not really gotten into any specific task yet, you check it out right away.

Apparently, the disk is mostly used by the temporary files accumulating in the data partition of the application. Deleting files that have not been accessed for more than 7d reduced the disk utilization in half.


$ du -h -d 1 data/ | sort -h
<...>
120G data/tmp 
140G data/
 
$ find data/tmp -atime +1 -type f -exec rm -rf {} +

That’s a lot of temp files, something must have been leaking them. Unfortunately, the files were already deleted, so it’s hard to determine where they came from. You register this leaking issue as a task to be looked into after a while and move on. There are no more warnings and the rest of the systems appear to be working fine.

You start on the infrastructure task you were working on yesterday, simple mysql queries like count(*) hanging for hours. Just as you are halfway through one of the blog posts describing a similar problem, you hear a sad trombone sound: that’s a production alert.

It is the same machine again, this time both Disk and CPU are being over-utilized. It is not immediately clear from the dashboard, so you just directly access the instance. You type in the diagnostics commands, checking with the cheat sheet, but they show no anomalous activity or resource utilization.


$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0      0 7001056 222040 3454184    0    0     0     0 61501 55316 30 21 48  0  1
12  0      0 7000488 222040 3454404    0    0     0     0 56028 49327 32 20 47  0  0
 
 
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             16G     0   16G   0% /dev
tmpfs           3.2G  209M  3.0G   7% /run
/dev/xvda1       50G   27G   21G  56% /data

You glance at the dashboard again, and sure enough. The load has just subsided. The alert is automatically resolved. What was this, a fluke? This server is serving user traffic. You access the APM and see that no users have faced any errors in the last hour. So what could this have been? Maybe it was the log rotation, rotating something particularly big? That would match the patterns of short-term spikes of disk IO and CPU to read, compress and write large files. But does this instance even have local log rotation, or does it stream the logs for aggregation?

While you are looking around, the usage picks up again. You notice it from the corner of your eye at the auto-refresh dashboard you set up to track the usage in the background. You do not notice it from typing in the terminal, which is surprising – given a sustained near 100% utilization of all CPU cores, you would expect the terminal to get sluggish.

$ mpstat -P ALL 1
 
07:38:49 AM  CPU   %usr  %nice   %sys %iowait   %irq  %soft  %steal  %guest  %gnice  %idle
07:38:50 PM  all  34.37  61.48   4.15    0.00   0.00   0.00    0.00    0.00    0.00   0.11

Huh, almost ⅔ of the usage is marked as “nice”. This could mean some threads are running low-priority workloads in the background. In just a few seconds, the load subsides. As you trace the load back to the process that is creating it, the usage spikes a few more times, each under a second. Short bursts of intense computation, blink and you miss them.

It turns out that the nice process is the main web application. Every now and then, threads from an unnamed thread pool are starting to churn through a massive amount of compute tasks. Besides the abstract algorithms like sorting and matrix multiplication, the code mostly comes from the convert sub-package.

A quick look into the source code points to the recently-added data conversion module. Equipped with this evidence, you reach out to the team that introduced this feature. As you open the chat, you see a message from one of the team’s engineers:

xxx: Hey, I think this disk and CPU usage alert may be our doing. We just confirmed the first few users opting in to the new data model. It is supposed to run much faster, though. Temp usage should also be capped at a few hundred megs… We will disable the conversion for now and investigate this. You can reassign the alert to me.

Whoops, it was sent full 10 minutes ago. You just did not pay attention to private messages during the investigation, so it went completely unnoticed.

You confirm your fellow engineer’s suspicions and hand over the additional evidence you have gathered. It becomes apparent that a quadratic algorithm may have been accidentally used. You also point out how the thread pool is missing a name, and how that made troubleshooting a bit more complicated.

A new build is shipped shortly after lunch, and the CPU usage and disk footprint are all successfully capped. More and more users are starting to convert, no alert goes off.

By the time you go to bed, not a single user interaction has yet failed during this shift.

No alerts wake you up, but just as you start cooking breakfast, the sad trombone sound goes off from your phone. You put down the kitchen utensils, wipe your hands and glance at the alert summary. Apparently, 18% of the users are facing errors from the web application in the last hour. Bad timing, but could have been worse. You turn off the stove and move to the computer.

The RUM dashboard reports that a new error, EofException appeared in the application, some sort of DeserializationException accounting for all of the errors. It appears to affect pretty much all the screens of some users, whereas some other users are completely unaffected.

You notify people about this incident, giving a quick summary:

Up to 18% of the users are facing errors on multiple screens. Other users are completely unaffected. We are investigating this incident and will provide a status update in 30 minutes.

You go through the exception and notice that 90%+ of the stack traces go through the deserializeV2  call that was added yesterday. But the codepaths starting from this frame diverge to many different codepaths, all leading to the same EofException somewhere in the depths. This does not look like a simple bug in figuring out how many bytes to read, this looks more like corrupted data. Argh. the other 10% of the stack traces are also related to the V2 deserialization, so that seems like the most likely hypothesis. You glance at the time. It’s been 15 minutes.

OK, you got the most likely root cause. What next? If this is indeed data corruption, then you need to contact the team responsible for this feature. But they may be still sleeping. Is there any way to work around this? Could pre-conversion data be used to serve user queries? Looking around in the code, you see that the decision on which deserialization algorithm to use is made based on an account properties. So you could flip this property, and if the old data is not destroyed during conversion, it should bring them back to the old data format. Given the time pressure, you decide to just try it on one of the accounts. You flip the property of the one that had the most recent activity, and sure enough: you can see in your APM that the user sees no more errors in these views after a few minutes, and proceeds to use the application smoothly.

Time to give an update:

Likely root cause identified, apparent workaround is applied to some of the users. The errors stopped for them. Will monitor a bit and apply workaround for all users if it works fine. Next update in 15 minutes.

And then, soon after:

Workaround applied to all users. Error has not reproduced in the last 10 minutes. Full post mortem to follow after underlying root cause is investigated.

You summarize all of your findings and the evidence you have captured in a document, constructing a timeline of this incident. The current impact is mitigated, and your shift will end soon. You coordinate with the engineer who will replace you, ensuring a smooth hand-off.

You finish your breakfast and go for a walk.

To learn how Plumbr can make daily life for on-call DevOps engineers a little bit easier, click here.

ADD COMMENT