How DevOps and development in general can benefit from a new way of looking at provisioning and configuring systems by viewing common terminology in a new light.
I recently watched a presentation on configuration management tools and techniques, which sparked an idea worth exploring.
There is broad agreement that infrastructure-as-code configuration management tools are essential and valuable, yet the frustration with using them is relatively high. What if the existing mechanisms for configuration management or infrastructure-as-code were no longer such a dumpster fire or such a pain to use? What would that look like?
This article explores how DevOps and software development, in general, can benefit from a new way of looking at provisioning and configuring systems by explaining common terminology in a new light.
Let’s start with some terminology that will help make sense of the world around us, and approach some of the problems so many are experiencing.
Terminology explained
Software engineering: Let’s define this as the sum of actions used to build and create products in software and technology. The tasks employed by software engineers usually include writing code, checking for the correctness of that code and making it available for users as a system.
Operations: The practice of ensuring that software systems are in a good working state and serving user needs. The operations engineer is usually less concerned with writing new code; they are primarily concerned with how existing software behaves in an existing system.
Having defined these two terms this way, we can now explain how “forward engineering” is different from “reverse engineering”.
Forward engineering
Forward engineering (FE) is building a system based on a model of what should exist in that system. The model is usually created first, then deployed and configured into operation. The expectation is that the actual resulting system will closely match the model.
In many cases, the declared model is ambiguous, and the system might behave in unexpected ways that were not the intention of the initial model. That’s already a significant obstacle in using such techniques, especially for newcomers.
To add insult to injury, the system itself will often differ from the model due to interaction with its environment, for example, when multiple parts change each other's configuration, or when ad-hoc procedures performed by engineers shift the system in various ways.
In forward engineering, there is no fundamental notion of feedback into the initial model. The model is “set in stone” first and deployed after. Any feedback later received from the system about its behavior can only inform future iterations of the model and future deployments.
Reverse engineering
Reverse engineering is developing a partial model from an existing system and using that as the basis for system modifications. There is no initial model, and the system’s current state enables the discovery of how things should work.
Making changes to such a system, in many cases, is done ad-hoc directly on that system, often without a complete understanding of the whole system. Here, the feedback is immediate, and exploring a system is mainly done by poking it and looking at the generated feedback.
Where do we find reverse engineering? You have probably heard that security experts use reverse engineering (RE) practices to dig into existing software and find holes in its security. For example, they scan network services or memory to find abnormal behavior in a system and exploit it by making the system behave in unintended ways.
Reverse engineering is all about looking at an existing system or software and, based on what it is doing in practice, trying to understand where it diverges from the original intent for that software.
Where do we find the above ideas in the world of DevOps?
Infrastructure as code
Ever since the first version of CFEngine was released back in 1993, the idea of modeling infrastructure as code has risen explosively in popularity across the software engineering industry. It started with declaring how the contents of a server's operating system should look and was quickly adapted to describe the layout and configuration of cloud resources today. Companies and tools have come in and out of favor over the years, such as Puppet, Chef, Salt, Ansible, CloudFormation, Terraform, CDK, Pulumi and numerous other homebrew solutions.
What most of the tools mentioned have in common is that they are declarative. Declarative tools let us specify only how things should be, without being explicit about the steps to get there. In recent years several tools have surfaced, claiming to be more imperative. However, they are still primarily declarative and different from the traditional imperative languages used for software development.
Why are most of these tools declarative?
The simple answer is that this is the easiest way of creating maps, models and templates. An engineer only has to declare how things should look and where each item should be, and can delegate to the tool the hard work of figuring out how to get each item into the right shape and form. It allows for iterative improvement over time and eventually leads to more robust and stable systems.
Another common advantage of declarative infrastructure-as-code tools is the property of idempotence. An operation is idempotent when applying it multiple times has the same effect as applying it once. When using an idempotent tool to change a system numerous times, the system will eventually arrive at the state declared by the template. Using the same template again after that will have no further effect on the system. Idempotence is the property that allows us to apply the model multiple times and only have the system change when differences from the model are present.
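To make idempotence concrete, here is a minimal sketch in Python. It is not how any particular tool works: the `apply` function and the `nginx.*` keys are made up for illustration, and both the model and the system are reduced to flat dictionaries.

```python
def apply(model: dict[str, str], system: dict[str, str]) -> list[str]:
    """Converge `system` toward `model` in place and return the changes made.

    Hypothetical helper: real tools talk to packages, files or cloud APIs,
    here the "system" is just a dictionary.
    """
    changes = []
    for key, desired in model.items():
        if system.get(key) != desired:
            system[key] = desired
            changes.append(f"set {key}={desired}")
    return changes


# Illustrative data only.
model = {"nginx.version": "1.24", "nginx.workers": "4"}
system = {"nginx.version": "1.22"}

first = apply(model, system)   # the system differs, so changes are made
second = apply(model, system)  # the system already matches, so nothing happens

print(first)
print(second)
```

The second call returns an empty list because the system already matches the model, which is exactly the behavior described above.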
An imperative approach would require first finding all the differences across the model and system, and then implementing each step required for each change. Declarative infrastructure-as-code tools achieve this without the developer having to write all the actions themselves.
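The work a declarative tool takes off the developer's hands can be sketched roughly as follows. This is an assumption-laden illustration, not the algorithm of any specific tool: the `plan` function, the step names and the sample keys are all hypothetical, and I assume that anything present in the system but absent from the model should be deleted.

```python
def plan(model: dict[str, str], system: dict[str, str]) -> list[tuple[str, str, str]]:
    """Compare model and system and list the steps needed to converge.

    Hypothetical sketch: each step is an (action, key, value) tuple.
    """
    steps = []
    for key in sorted(model.keys() | system.keys()):
        if key not in system:
            steps.append(("create", key, model[key]))
        elif key not in model:
            # Assumption: resources unknown to the model get removed.
            steps.append(("delete", key, system[key]))
        elif model[key] != system[key]:
            steps.append(("update", key, model[key]))
    return steps


# Illustrative data only.
desired = {"web.port": "443", "web.tls": "on"}
running = {"web.port": "80", "debug.shell": "enabled"}

steps = plan(desired, running)
for step in steps:
    print(step)
```

With an imperative approach, the developer would have to write this comparison and every resulting action by hand for each kind of resource.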
Infrastructure-as-code tools are popular and vital; they save thousands of person-hours by creating templates that can be used repeatedly and creating replicas of almost-identical systems across different environments with minimal marginal effort.
Infrastructure-as-code promises that iterative improvement of the model over time leads to robust and stable systems based on this model.
A previous article of mine about configuration for multiple systems explains how replicas of almost-identical systems are beneficial for a software business.
But! These tools use a forward engineering mindset: they are declarative and have no mechanism to receive feedback from a live system. Over the years, this approach has given rise to many problems and complaints from their users, which we can now explore.
Problems with forward-only engineering infra-as-code
Have you ever heard the term Configuration Drift? It happens when the declared model no longer matches the state of a system. Given enough updates, almost every system ends up differing from the model used to create it.
Drift can happen when a developer changes the model code without updating all the systems built using that model, or when an engineer does exploratory ad-hoc operations and changes a system without going back to the template to update its code. Both activities are essential: developers do the former to introduce improvements into future iterations of deployments, and operations engineers often do the latter to discover unknown problems and address them.
Configuration drift is, of course, a common land mine waiting to go off, which is why the standard advice is to avoid it. But is it realistic to ban operators from operating the systems and disallow any ad-hoc exploration? Some companies do exactly that and mandate that no operator or developer can touch a “living production” system. You can imagine what such a policy does to the mean time to recovery (MTTR) of such systems. From personal experience, when things go badly wrong and a “production” system breaks for any reason, this is the first rule that gets thrown out the window, and ad-hoc exploration is immediately allowed for anyone who can figure out what is going on.
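The two sources of drift can be told apart if we keep a snapshot of the model as it was last deployed. The sketch below assumes such a snapshot exists, which not every tool records; `classify_drift` and the sample keys are hypothetical names used only for illustration.

```python
def classify_drift(
    model: dict[str, str],
    applied: dict[str, str],
    live: dict[str, str],
) -> dict[str, str]:
    """Explain, per key, why the live system differs from the current model.

    model   - the model code as it is today
    applied - snapshot of the model at the last deployment (an assumption)
    live    - state observed on the running system
    """
    report = {}
    for key in sorted(model.keys() | applied.keys() | live.keys()):
        declared, deployed, observed = model.get(key), applied.get(key), live.get(key)
        if declared == observed:
            continue  # no drift for this key
        if deployed == observed:
            report[key] = "model changed, system not redeployed"
        elif deployed == declared:
            report[key] = "system changed ad hoc"
        else:
            report[key] = "both changed"
    return report


# Illustrative data only.
model = {"port": "443", "workers": "4", "tls": "on"}
applied = {"port": "80", "workers": "4", "tls": "on"}
live = {"port": "80", "workers": "8", "tls": "on"}

report = classify_drift(model, applied, live)
print(report)
```

Here `port` drifted because a developer changed the model, while `workers` drifted because someone tuned the live system by hand.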
Engineers lured into adopting infrastructure-as-code complain about too much work remodeling an existing system from scratch. There is always someone creating a tool to help with this. Google has GCP Terraformer, AWS has the discontinued AWS CloudFormer, even Azure has its ARM sprinkled all over its cloud console. There is always someone building such a tool for your favorite configuration management language because there is just such a high demand for this.
But unfortunately, once an engineer has used such a tool, they are usually disappointed. The result is either too much noise that makes no sense or, at best, a template that is already obsolete a day later. The best one can do with such a reverse-engineered template is copy-paste snippets out of it into a manually written template somewhere else.
Improving the way forward
This article defined two terms and explained how forward-only engineering is suboptimal for the actual operations of systems and how, in most cases, reverse-engineering is still mandatory to find and resolve problems.
Now the proposal is to create a new breed of Infrastructure-as-Code tools: tools that put reverse engineering at their center and allow feedback from living systems to update the models that made them. For each ad-hoc change found on a system, an operator could either adopt it into the model or reject it, reverting the system to the state declared in the original model.
Existing industry-standard tools fail at helping engineers solve problems that arise with system operations. Yes, several tools exist that can audit your Terraform templates and notify you of mistakes or misconfiguration. These audits are helpful, but the real value would come from inspecting an existing living system, not just the template engineers once used to create it.
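What such a feedback loop might look like can be sketched in a few lines of Python. To be clear, this is a thought experiment and not an existing tool: `reconcile`, the `decide` callback and the sample keys are all hypothetical, and the operator's decision is reduced to a function that returns "adopt" or "reject".

```python
from typing import Callable, Optional

# (key, declared value, observed value) -> "adopt" or "reject"
Decision = Callable[[str, Optional[str], Optional[str]], str]


def reconcile(
    model: dict[str, str],
    live: dict[str, str],
    decide: Decision,
) -> tuple[dict[str, str], dict[str, str]]:
    """Return an updated (model, live) pair after reviewing every difference."""
    new_model, new_live = dict(model), dict(live)
    for key in sorted(model.keys() | live.keys()):
        declared, observed = model.get(key), live.get(key)
        if declared == observed:
            continue
        if decide(key, declared, observed) == "adopt":
            # Feedback path: the living system updates the model.
            if observed is None:
                new_model.pop(key, None)
            else:
                new_model[key] = observed
        else:
            # Forward path: the system is reverted to the declared state.
            if declared is None:
                new_live.pop(key, None)
            else:
                new_live[key] = declared
    return new_model, new_live


# Illustrative data and policy only: keep tuning of web.* settings,
# reject everything else.
model = {"web.port": "443", "web.workers": "4"}
live = {"web.port": "443", "web.workers": "8", "debug.shell": "enabled"}


def operator(key: str, declared: Optional[str], observed: Optional[str]) -> str:
    return "adopt" if key.startswith("web.") else "reject"


new_model, new_live = reconcile(model, live, operator)
print(new_model)
print(new_live)
```

After reconciliation the model and the live system agree again: the hand-tuned worker count has been adopted into the model, and the stray debug shell has been removed from the system.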
As far as I know, we don’t have a good enough tool that allows integrating more reverse engineering practices into the daily work of operations engineers. We have a lot of monitoring and observability tools on one side and many infrastructure-as-code tools on the other, with a significant gap in between.
Do you have ideas on how we can fill that gap? Be sure to let us know!
Cover image created by Tabea Schimpf at Unsplash.