What is Terraform Drift?
Drift is the term for when the real-world state of your infrastructure differs from the state defined in your configuration.
― Christie Koehler, HashiCorp Blog
It can be caused by many things, but urgent hotfixes and teams being unfamiliar with Infrastructure as Code (IaC) practices are the most likely causes. It’s a significant problem because it undermines the core benefits of IaC.
The main issue is the loss of a single source of truth. When your code and infra tell different stories, you can no longer trust your code to be an accurate representation of your environment.
Applying GitOps Principles to Infrastructure
GitOps is an operational framework that takes DevOps best practices used for application development such as version control, collaboration, compliance, and CI/CD, and applies them to infrastructure automation.
― GitLab Topics
The core idea is simple:
- Describe your entire desired infrastructure in a Git repository using declarative code.
- Automate the process that continuously compares this desired state with the actual state of your live infrastructure.
- Correct any detected differences, ensuring the live environment always reflects the state defined in Git.
By adopting this workflow, you treat your infrastructure the same as your application code. Changes are made via pull requests, reviewed by peers, and automatically deployed. This creates an audit trail and, most importantly, provides a mechanism for automatically correcting drift.
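The compare-and-correct loop can be illustrated without any cloud provider. The toy sketch below treats one file as the desired state and another as the live state. The file-based setup and the function name reconcile_once are purely illustrative assumptions; in the real pipeline Terraform plays this role.

```sh
# reconcile_once DESIRED_FILE ACTUAL_FILE
# Toy stand-in for one reconciliation pass (hypothetical helper).
# Prints "in-sync" when the files match; otherwise overwrites the actual
# file with the desired one and prints "corrected".
reconcile_once() {
  if cmp -s "$1" "$2"; then
    echo "in-sync"
  else
    cp "$1" "$2"
    echo "corrected"
  fi
}
```

Running it twice against a drifted file reports a correction the first time and nothing to do the second time, which is the property a reconciliation loop should have.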
Designing the Reconciliation Pipeline
To combat drift, we can build an automated reconciliation pipeline. This pipeline will periodically check for discrepancies and take action. Here are two approaches:
- Detection-only
  - The pipeline runs terraform plan on a schedule. If it detects any differences, it doesn’t apply them. Instead, it sends an alert to your team via Slack, creates a GitHub issue, or logs a warning. This approach is safer and gives your team full control over when and how to resolve the drift.
- Auto-correction
  - This is the full GitOps approach. The pipeline runs terraform plan to detect drift and, if any is found, immediately runs terraform apply to automatically revert the infrastructure to the state defined in your code. This keeps your infrastructure in sync, but it requires a high degree of confidence in your automation and testing.
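The two approaches differ only in what happens once drift is found. A minimal sketch of that decision, assuming the plan is run with Terraform's -detailed-exitcode flag (exit code 0 means no changes, 1 means an error, 2 means changes are present). The function name decide_action and the mode labels "detect" and "correct" are hypothetical.

```sh
# decide_action MODE PLAN_EXIT_CODE
# MODE is "detect" or "correct" (hypothetical labels for the two approaches).
# PLAN_EXIT_CODE is the exit status of `terraform plan -detailed-exitcode`.
decide_action() {
  case "$2" in
    0) echo "none" ;;        # no drift: nothing to do
    2)
      if [ "$1" = "correct" ]; then
        echo "apply"         # auto-correction: run terraform apply next
      else
        echo "alert"         # detection-only: notify, do not apply
      fi
      ;;
    *) echo "fail" ;;        # 1 or anything unexpected: the plan itself failed
  esac
}
```

Treating every code other than 0 and 2 as a failure keeps a broken plan from being mistaken for a clean one.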
Building the Pipeline with GitHub Actions
Our goal is to create the auto-correction pipeline that runs on a schedule, checks for drift, and automatically applies the correct config if any drift is found.
Create a new file in your repository at .github/workflows/terraform-reconcile.yml and add the following code:
[Workflow listing not preserved.]
This workflow builds a custom output (steps.plan.outputs.drift_detected) during the plan step that is used by the other steps.
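As a sketch of how a plan step might publish such an output (not necessarily how the workflow above does it): GitHub Actions reads step outputs from name=value lines appended to the file named by the GITHUB_OUTPUT environment variable. The example assumes the step id is plan and that the plan's exit code is already known; the function name publish_drift_output is hypothetical.

```sh
# publish_drift_output PLAN_EXIT_CODE
# Appends drift_detected=true|false to the file GitHub Actions exposes
# as $GITHUB_OUTPUT, so later steps can read
# steps.plan.outputs.drift_detected (assuming the step id is "plan").
publish_drift_output() {
  if [ "$1" -eq 2 ]; then
    echo "drift_detected=true" >> "$GITHUB_OUTPUT"
  else
    echo "drift_detected=false" >> "$GITHUB_OUTPUT"
  fi
}
```

A later step could then be gated with a condition such as if: steps.plan.outputs.drift_detected == 'true'.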
I tried using the -detailed-exitcode flag for terraform plan in combination with conditionals for posting summaries, but in GitHub Actions the step did not report the exit codes I expected.
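One possible cause, offered as an assumption rather than a diagnosis: GitHub Actions runs bash steps with the -e option, so a plan that exits with 2 (changes present) ends the step as a failure before any later line can inspect the code. Below is a sketch of capturing the code without tripping -e; run_capturing_exit and LAST_EXIT are hypothetical names.

```sh
# run_capturing_exit COMMAND [ARGS...]
# Runs the command and stores its exit status in LAST_EXIT.
# The `||` keeps a non-zero status from aborting a shell running with `set -e`.
run_capturing_exit() {
  LAST_EXIT=0
  "$@" || LAST_EXIT=$?
}

# Intended use inside a workflow step (not executed here):
#   run_capturing_exit terraform plan -detailed-exitcode -input=false
#   echo "plan exited with $LAST_EXIT"
```

If Terraform is installed through the hashicorp/setup-terraform action, its wrapper script is another place where exit codes may be handled differently, so that action's documentation is worth checking as well.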
Best Practices and Considerations
Before deploying this in a production environment, consider the following:
- Start with detection-only
  - Begin by removing the terraform apply step and adding a notification step instead. Let the pipeline run for a while to see how often drift occurs.
- Limit scope
  - Initially, run this pipeline only on non-critical environments, like development.
- Secure your credentials
  - Prefer OpenID Connect (OIDC) for authenticating with your provider. It replaces long-lived stored secrets with short-lived credentials issued for each workflow run.
- Use notifications
  - Even with auto-correction, you can still notify your team when drift is detected and corrected. Consider adding a step to send a message to a Slack channel to keep everyone informed.
- Role-based access control (RBAC)
  - Ask whether the user who made a manual change (ClickOps) really needs permission to do so.
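A notification step can be as small as one HTTP request. The sketch below builds a JSON body for a Slack incoming webhook. The function name slack_payload, the message wording, and the SLACK_WEBHOOK_URL secret are assumptions for illustration.

```sh
# slack_payload ENVIRONMENT EVENT
# Prints a JSON body such as {"text":"Terraform drift corrected in dev"}.
# ENVIRONMENT and EVENT are free-form labels (hypothetical); newlines in
# them are not handled by this sketch.
slack_payload() {
  _text="Terraform drift $2 in $1"
  # Escape backslashes and double quotes so the JSON stays valid.
  _text=$(printf '%s' "$_text" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"text":"%s"}' "$_text"
}

# Intended use in a workflow step (not executed here):
#   curl -sS -X POST -H 'Content-type: application/json' \
#     --data "$(slack_payload dev corrected)" "$SLACK_WEBHOOK_URL"
```

Keeping the payload builder separate from the curl call makes the message format easy to check without sending anything.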
Conclusion
Drift is natural when managing complex systems, especially in immature environments where processes haven’t been fully developed. By implementing an automated reconciliation pipeline, you can detect and correct configuration drift before it accumulates and create a more stable and predictable environment.