Feedback and Monitoring in DevOps
The goal of today’s post is to reaffirm the notion that Plumbr, as we envision it, is a key ingredient in a good DevOps recipe. It is among the few tools that provide real-user monitoring, a valuable way of closing the feedback loop for engineers. I have collected ten quotes from ten books about DevOps that have been authored in the past few years. These excerpts highlight a void in the tooling that we can fill. An unintended side effect of this post is a “List of DevOps Books To Read”.
In the book Effective DevOps, author Jennifer Davis asks the question – “How can you tell the difference between a company trying to sell a solution that could be effective in your environment versus a company trying to get on the trend of devops?” Well, there are tools that simply integrate into existing workflows. On the other hand, there are tools that require engineers to make accommodations. The first kind helps improve the efficacy of the DevOps methodology. The second attempts to introduce new layers, new methods, new processes, and additional overhead. This distinction can help anyone tell the difference between the two kinds of companies. If the core premise of the software fits the way you already work, adopting it is no longer a sales exercise. If the premise does not resonate with you, there is no sale either.
One of the favorite reads among everyone at Plumbr is the Site Reliability Engineering book by the team from Google. In the chapter on ‘Monitoring’ they write, “… white-box monitoring does not provide a full picture of the system being monitored; relying solely upon white-box monitoring means that you aren’t aware of what the users see. You only see the queries that arrive at the target…”. Modern software development practices treat feedback as central. Of what benefit is a feedback mechanism that does not involve production instances and actual usage? Synthetics don’t cut it. Regression testing, complex testing infrastructure, and complicated delivery pipelines are necessary, but on their own they are grossly insufficient.
In the book Starting and Scaling DevOps in the Enterprise, author Gary Gruver recounts this rather familiar tale – “Every time we pointed customer traffic to the new code base, it would start running out of memory and crashing. After several tries and collecting some data, we had to spend several hours rolling back to the old version of the applications. We knew where the defect existed, but even as we tried debugging the issues, we couldn’t reproduce it in our test environments.” An overwhelming question greets every engineer as they begin to solve production issues: where to begin troubleshooting? Ask engineers, and the unanimous answer is “Logs!”.
Logs, like everything else in software, are subject to a dichotomy. There is a definite need for them as evidence of transactions. However, logs can build up quickly and become a mess, so logging needs to be implemented with care. Two important problems arise when relying on logs as evidence of user interactions. First, they are never easy to assimilate. In his DevOps Adoption Playbook, author Sanjeev Sharma notes – “What is essential is to ensure that the feedback be provided in a form that is consumable by the stakeholders it is being provided to. Providing logs to business analysts does not provide much value to them. However, analytics related to the root causes of spikes in the usage of particular features, or changes in user behavior based on a configuration or UI change, are very relevant to the business analysts.”
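To make the point about consumable feedback concrete, here is a minimal sketch of turning raw log lines into a per-feature summary that a non-engineer could read. The log format, the sample lines, and the `summarize` helper are all assumptions made for illustration; they are not taken from any of the books or from Plumbr.

```python
from collections import Counter

# Hypothetical log format: "<ISO timestamp> <user id> <feature> <status>"
SAMPLE_LOGS = [
    "2018-06-01T10:00:00 u1 checkout OK",
    "2018-06-01T10:00:05 u2 checkout ERROR",
    "2018-06-01T10:01:00 u1 search OK",
    "2018-06-01T10:02:00 u3 checkout OK",
]


def summarize(lines):
    """Return {feature: (total_requests, failed_requests)}."""
    totals, failures = Counter(), Counter()
    for line in lines:
        _timestamp, _user, feature, status = line.split()
        totals[feature] += 1
        if status != "OK":
            failures[feature] += 1
    return {feature: (totals[feature], failures[feature]) for feature in totals}


summary = summarize(SAMPLE_LOGS)
for feature, (total, failed) in sorted(summary.items()):
    print(f"{feature}: {total} requests, {failed} failed")
```

The raw lines stay available for engineers, while the summary is the part that is meaningful to a business analyst.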
A second and rather serious problem manifests when a team decides to use microservices. In The DevOps 2.0 Toolkit, Viktor Farcic writes, “Monitoring one server is easy. Monitoring many services on a single server poses some difficulties. Monitoring many services on many servers requires a whole new way of thinking and a new set of tools. As you start embracing microservices, containers, and clusters, the number of deployed containers will begin increasing rapidly. The same holds true for servers that form the cluster. We cannot, anymore, log into a node and look at logs. There are too many logs to look at. On top of that, they are distributed among many servers. While yesterday we had two instances of a service deployed on a single server, tomorrow we might have eight instances deployed to six servers.” When the core infrastructure varies this much, depending on logs becomes impractical. Let’s face it – microservice architectures are here to stay. Consequently, this problem is unlikely to go away soon. A loosely coupled real-user monitoring system that keeps tabs on user interactions, and is agnostic to the server configuration, is an easy way to solve this problem. Evidence of the failure or performance bottleneck is preserved along with the context of the user.
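As a rough illustration of what “agnostic to the server configuration” can mean, the sketch below keys every recorded interaction by user rather than by node, so a failure can be traced without knowing which server handled it. The `Interaction` record, its fields, and the sample data are hypothetical; this is not how Plumbr or any particular product stores its data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    user_id: str
    action: str
    server: str       # whichever node happened to serve the request
    duration_ms: int
    failed: bool


def failures_by_user(events):
    """Group failed interactions by user, regardless of serving node."""
    grouped = defaultdict(list)
    for event in events:
        if event.failed:
            grouped[event.user_id].append(event)
    return dict(grouped)


EVENTS = [
    Interaction("alice", "login", "node-1", 120, False),
    Interaction("alice", "checkout", "node-4", 5300, True),
    Interaction("bob", "search", "node-2", 80, False),
    Interaction("alice", "checkout", "node-6", 4900, True),
]

report = failures_by_user(EVENTS)
for user, failed in report.items():
    print(user, [f"{e.action}@{e.server}" for e in failed])
```

The server name is kept as context on each event, but nothing in the lookup depends on how many nodes exist or where a service instance is scheduled.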
The foundation for the right notion of feedback is available in the book Leading the Transformation. The authors note, “The objective here is to change the feedback process so that there is a real-time process that helps them improve. Additionally, you want to move this feedback from simply validating the code to making sure it will work efficiently in production so you can get everyone focused on delivering value all the way to the customer.” Only this singular focus on delivering value to each user binds the various teams together. It is also what software engineering teams should monitor, measure, and improve.
Another seminal DevOps book reinforces this idea. “End-user service is what really matters, so checking that everything is working correctly from the user’s point of view is essential,” notes Kief Morris in Infrastructure as Code. Unfortunately, for the industry as a whole, the first response to this statement is to install Synthetics – software that lets you simulate users. Very few teams approach the problem from the other end of the spectrum, which is to measure it from the perspective of real users.
Another common story echoed in the hallways is the inability to pinpoint breakage. In the book Practical DevOps, author Joakim Verona observes, “… your software will probably break for other reasons altogether than those you prepared for. If your system breaks, it can be very expensive for your organization, either in lost revenue or in terms of lost credibility …” The problems above (reliance on logs, inability to reproduce, relevance to stakeholders) compound this situation further. The best way to mitigate this risk, which can affect every software engineering team, is to minimize the variables.
At enterprise scale, non-functional requirements (NFRs) such as auditability of interactions become an important facet. Authors Jez Humble and David Farley write in their book Continuous Delivery that a pivotal stage in software maturity is the ability to have all interactions with the system audited. Only with robust real-user monitoring tools can we hope to achieve this.
In summary, a world that is slowly erasing the borders between development and operations needs a better way to measure performance. “A successful measure of performance should have two key characteristics. First, it should focus on a global outcome to ensure teams aren’t pitted against each other… Second, our measure should focus on outcomes not output …”, write authors Nicole Forsgren, Jez Humble, and Gene Kim in their recent book Accelerate.
Effective DevOps: Building a Culture of Collaboration, Affinity, and Tooling at Scale, O’Reilly Media, 2016, ISBN 9781491926307
Site Reliability Engineering: How Google Runs Production Systems, O’Reilly Media, 2016, ISBN 9781491929124
Starting and Scaling DevOps in the Enterprise, BookBaby, 2016, ISBN 9781483583587
The DevOps Adoption Playbook: A Guide to Adopting DevOps in a Multi-Speed IT Enterprise, Wiley, 2017, ISBN 9781119308744
The DevOps 2.0 Toolkit: Automating the Continuous Deployment Pipeline with Containerized Microservices, CreateSpace Independent Publishing Platform, 2016, ISBN 9781523917440
Leading the Transformation: Applying Agile and DevOps Principles at Scale, IT Revolution Press, 2015, ISBN 9781942788010
Infrastructure as Code: Managing Servers in the Cloud, O’Reilly Media, 2016, ISBN 9781491924358
Practical DevOps, Packt Publishing, 2016, ISBN 9781785882876
Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation, Addison-Wesley Professional, 2010, ISBN 9780321601919
Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations, IT Revolution Press, 2018, ISBN 9781942788331