A Practical Guide to Structuring Terraform Refactoring in Azure
Many Terraform projects start lean and end up as monoliths. This guide explains how to migrate Azure infrastructure to modular states, clear module boundaries, and resilient CI/CD governance without downtime.

Introduction
Almost every Terraform project starts the same way: a main.tf, a few
Azure resources, and quick wins. After a few months, it often evolves
into an unmanageable monolith with long plan runtimes, global state
locks, and increasing risk with every change.
In enterprise environments, this is not a cosmetic issue but an operational risk. When network, IAM, data platform, and workload deployment live in the same state, every small adjustment becomes a potential major disruption.
This guide presents a proven refactoring strategy for Azure with clear module boundaries, a resilient state structure, migration without downtime, governance integration, and CI/CD safeguards.
Why Refactoring in Azure Often Starts Too Late
Typical warning signs in mature Terraform landscapes:
- State lock bottlenecks: One team blocks another because everything lives under the same backend key.
- Unclear ownership: Platform and application teams modify the same files.
- Drift between environments: dev, staging, and prod are no longer functionally comparable.
- Compliance risk: Changes are difficult to trace, and roles are not cleanly separated.
- Increasing recovery time: In case of failed deployments, it is unclear which resources are actually affected.
Target Architecture: Structure by Lifecycle and Blast Radius
The most important design decision is: Do not split by Azure services, but by lifecycle and blast radius.
The first area to address is the state structure. For larger Azure setups, I recommend separating state:
- per environment (dev, staging, prod)
- per domain/component (for example, network, platform, identity, workload-x)
- with a dedicated backend key per state
```hcl
terraform {
  backend "azurerm" {
    subscription_id      = "xyz"
    resource_group_name  = "rg-tfstate-prod"
    storage_account_name = "sttfstateprod001"
    container_name       = "tfstate"
    key                  = "prod/network/core.tfstate"
  }
}
```

This makes changes smaller, parallelizable, and auditable.
The second area of focus is establishing clear module boundaries. Meaningful module boundaries should align with a shared lifecycle and change cadence:
- Network baseline: VNet, subnets, NSGs, UDRs
- Security baseline: Key Vault, managed identities, policy assignments
- Workload modules: App Service, AKS, Container Apps, including their required dependencies
- Observability: Log Analytics, diagnostic settings, alerts
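As an illustration, a network baseline module can expose exactly the handover points downstream modules need. A minimal sketch — variable and output names here are assumptions, not a fixed convention:

```hcl
# modules/network/variables.tf (illustrative)
variable "address_space" {
  type        = list(string)
  description = "Address space of the VNet"
}

variable "subnets" {
  type        = map(object({ address_prefixes = list(string) }))
  description = "Subnets keyed by logical name"
}

# modules/network/outputs.tf — the module's stable contract
output "subnet_ids" {
  description = "Logical subnet name => Azure subnet resource ID"
  value       = { for name, s in azurerm_subnet.this : name => s.id }
}
```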
In some cases, I even go as far as including the associated identity of an AKS cluster, its RBAC permissions, and related dependencies within the same AKS module. We explicitly want to avoid "GOD modules," but a healthy grouping can make sense when the primary resource—such as an AKS cluster—is functionally useless without its supporting resources.
The third area of focus is orchestration across stages. For example, environments/prod calls the same modules as environments/dev, but with different input values.
This approach prevents configuration drift and significantly reduces maintenance effort when changes are required.
This approach depends entirely on the successful mirroring of stages. Once business units require special handling for individual stages, workarounds such as feature flags become necessary. This will inevitably become harder to maintain over time.
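A minimal sketch of this mirroring, with illustrative paths and values:

```hcl
# environments/dev/main.tf
module "network" {
  source        = "../../modules/network"
  environment   = "dev"
  address_space = ["10.10.0.0/16"]
}

# environments/prod/main.tf — same module, different inputs only
module "network" {
  source        = "../../modules/network"
  environment   = "prod"
  address_space = ["10.0.0.0/16"]
}
```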
The fourth area of focus is a migration strategy without downtime. Refactoring production infrastructure is a controlled rebuild—not a big-bang event. Consumers and users of the infrastructure should ideally not even notice that a refactoring has taken place. Zero downtime is the key objective.
State File Migration Approach
Phase 1: Discovery and Safety Net
In the first phase of the migration, the focus is on achieving full transparency into the current state while establishing a reliable safety net. First, the existing Terraform state is backed up using terraform state pull. This backup should be versioned and additionally stored in encrypted form to ensure both traceability and protection of sensitive information.
Next, the existing resources are grouped based on their criticality and function. It is recommended to initially analyze especially sensitive or foundational components—such as networking resources—in a strictly read-only context to avoid unintended changes and clearly identify dependencies.
At the same time, evaluate which resources risk being destroyed and recreated during migration (forced recreation). Maintenance windows should be planned only for these specific cases to minimize operational impact and avoid unnecessary service interruptions.
The key success metric for each migration step is that a subsequent Terraform plan shows no changes (Plan = No changes). This ensures that the state was migrated correctly and fully matches the real infrastructure.
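This success metric can be enforced mechanically: `terraform show -json tfplan` emits a machine-readable plan, and a small gate script can fail the migration step on anything other than no-op actions. A sketch — the JSON is passed as a string here; in a pipeline it would come from the exported plan file:

```python
import json

def plan_is_no_op(plan_json: str) -> bool:
    """Return True if a Terraform JSON plan contains only no-op/read actions."""
    plan = json.loads(plan_json)
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if any(a not in ("no-op", "read") for a in actions):
            return False
    return True

# Example: a plan that would replace a subnet must fail the gate.
sample = json.dumps({
    "resource_changes": [
        {"address": 'module.network.azurerm_subnet.this["app"]',
         "change": {"actions": ["delete", "create"]}}
    ]
})
print(plan_is_no_op(sample))  # False
```

Wiring this into CI turns "Plan = No changes" from a manual review habit into a hard gate.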
Phase 2: Build Modules and Mirror Exactly
Before executing the first terraform state mv, it is critical that the new module fully and precisely mirrors the existing behavior. This means that all relevant properties of the existing resources must be replicated exactly to avoid unintended changes or resource recreation.
In particular, resource names must match exactly, as any deviation may cause Terraform to interpret resources as new. Additionally, all default values in the new module must align with the previous configuration—even if those values were previously implicit. All tags must also be preserved without modification to ensure consistency for governance, billing, and automation.
Another key aspect is correctly modeling all dependencies between resources. These dependencies must be defined in the new module exactly as they were in the previous structure so Terraform can correctly understand relationships and avoid unintended changes or reordering of infrastructure state.
Only when the new module fully and identically reflects the existing behavior can the state migration be performed safely and without side effects.
Example commands for reference:
```shell
terraform state mv \
  'azurerm_subnet.app' \
  'module.network.azurerm_subnet.this["app"]'

terraform plan
# Goal: No changes
```

If any changes appear, stop immediately, correct the module parameters, and validate again.
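On Terraform 1.1 and newer, the same move can also be expressed declaratively with a `moved` block, which keeps the refactoring reviewable in the plan instead of relying on out-of-band state surgery:

```hcl
# Reviewed like any other change: the plan shows the move, apply performs it.
moved {
  from = azurerm_subnet.app
  to   = module.network.azurerm_subnet.this["app"]
}
```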
Phase 3: State Splitting with Clear Interface Boundaries
Once the newly introduced modules operate reliably and produce consistent results, individual domains can be gradually migrated into separate Terraform states. The goal of this state splitting is to logically decouple the infrastructure, improve maintainability, and limit changes to clearly defined ownership boundaries.
It is particularly important that dependencies between states are handled cleanly and in a controlled manner through defined interfaces. Prefer explicit outputs from upstream states for this purpose. These outputs serve as well-defined handover points and ensure that downstream states receive only the information they actually require. The use of terraform_remote_state should be intentional and minimized to avoid unnecessary coupling and complexity.
A key principle is the strict avoidance of circular dependencies between states. Such bidirectional dependencies create fragile structures, can block Terraform executions, and may result in unpredictable behavior.
A typical example of a clean handover is providing a subnet ID from the network domain state (network) to a downstream workload domain state (workload). The subnet ID is defined as an output in the network state and then consumed as an input variable in the workload state. This keeps ownership boundaries clear while enabling controlled and transparent integration.
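Sketched in code, with illustrative names:

```hcl
# Network state — the output is the defined handover point:
output "subnet_id_app" {
  value = azurerm_subnet.app.id
}

# Workload state — prefer a plain input variable, populated by the
# pipeline (e.g. from `terraform output -raw subnet_id_app`), over a
# direct terraform_remote_state lookup:
variable "app_subnet_id" {
  type        = string
  description = "Subnet ID handed over from the network state"
}
```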
Phase 4: Parallelization in CI/CD
After splitting the infrastructure into separate states, CI/CD pipelines should also be structured along domain boundaries. Instead of a single centralized pipeline managing all resources, independent pipelines are created per domain—for example, plan-network-prod, plan-platform-prod, or plan-workload-a-prod. Each pipeline is responsible only for planning and validating its assigned infrastructure components.
This domain-based separation significantly shortens feedback cycles, since changes affect only the relevant domain instead of the entire infrastructure. It also prevents teams from blocking each other due to shared state locks or conflicting changes. As a result, both delivery speed and organizational scalability improve, allowing multiple teams to work independently and in parallel.
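In Azure Pipelines, for example, this can be wired up with path filters so each domain pipeline triggers only on its own files. The paths below are assumptions about the repository layout:

```yaml
# plan-network-prod: runs only when the network domain changes
trigger:
  branches:
    include:
      - main
  paths:
    include:
      - environments/prod/network/**
      - modules/network/**
```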
Governance in Azure: Not Optional
Refactoring Terraform primarily improves structure and readability. However, without consistently enforced governance, it does not automatically result in a more stable, secure, or controlled platform. Sustainable architecture quality is achieved only when rules are not just documented but technically enforced.
A core component is implementing Azure Policy as Code. This enables systematic enforcement of organizational and security requirements, such as restricting allowed regions, enforcing mandatory tags, or requiring Private Endpoints. These controls ensure that all resources comply with the same standards regardless of the responsible team.
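With the azurerm provider, such a control can live in Terraform itself. A sketch assigning the built-in "Allowed locations" policy at subscription scope — the region list is an example:

```hcl
data "azurerm_subscription" "current" {}

data "azurerm_policy_definition" "allowed_locations" {
  display_name = "Allowed locations" # built-in definition
}

resource "azurerm_subscription_policy_assignment" "allowed_locations" {
  name                 = "allowed-locations"
  subscription_id      = data.azurerm_subscription.current.id
  policy_definition_id = data.azurerm_policy_definition.allowed_locations.id

  parameters = jsonencode({
    listOfAllowedLocations = { value = ["westeurope", "northeurope"] }
  })
}
```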
Equally important is enforcing the RBAC least privilege principle for every pipeline and state. Each execution unit should receive only the permissions strictly required for its task. This reduces the risk of unintended changes and limits the blast radius of potential failures.
Additionally, mandatory tagging standards must be established and enforced. Consistent tags such as cost center, criticality, or resource owner enable clear ownership tracking, improve cost management, and support operational and governance processes.
Another essential practice is strict version pinning for providers and modules. Explicit version definitions ensure reproducibility and prevent unintended changes caused by provider or module updates.
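A typical pinning block — the exact versions are placeholders:

```hcl
terraform {
  required_version = ">= 1.5.0, < 2.0.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.100" # >= 3.100, < 4.0 — no surprise major upgrades
    }
  }
}
```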
Automated policy and compliance checks should also be integrated into the pull request process using tools such as Checkov or Open Policy Agent (OPA). These checks detect governance violations early and prevent non-compliant changes from being merged.
This ensures that architecture quality is not just an abstract goal, but is enforced and sustained through technical controls.
CI/CD Integration: Reference Workflow
A robust pipeline flow per state:
- terraform fmt -check
- terraform init -backend=false
- terraform validate
- tflint + security scan (Checkov/tfsec)
- terraform plan (store as artifact)
- Manual approval for prod
- terraform apply only using the approved plan
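The binding between the approved plan and the apply step looks like this in practice — a sketch, with backend configuration and the approval gate itself omitted:

```shell
# Plan stage: write the plan to a file and publish it as a pipeline artifact
terraform plan -input=false -out=tfplan.bin
terraform show -json tfplan.bin > tfplan.json   # input for policy checks/review

# Apply stage, after manual approval: apply exactly the reviewed plan file.
# Terraform rejects the saved plan if the state has changed in the meantime.
terraform apply -input=false tfplan.bin
```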
Additionally recommended:
- Drift detection as a nightly read-only job
- Mandatory platform team reviewers for prod
- Blocking applies when provider versions are not pinned
Common Failure Patterns in Real Projects
1) "Harmless" module rename: Renaming a resource block without a moved block results in destroy/create plans.
2) State splitting too early: Splitting before module interfaces are stable creates fragile cross-state dependencies.
3) Misusing remote state as a database: Excessive implicit dependencies make changes unpredictable.
4) Pipeline without plan artifact binding: If apply does not use the exact approved plan, change control is significantly weakened.
5) Governance applied too late: Without early policy checks, violations are discovered only shortly before go-live.
Real-World Example (Azure)
In a regulated customer environment, approximately 300 resources were migrated from a monolithic state into six domain-specific states.
The results included:
- Plan times reduced from ~40 minutes to 4–8 minutes per domain
- Parallel deployments for multiple teams without lock conflicts
- Significantly faster root cause analysis during incidents
- Clearly auditable changes aligned with ownership boundaries
The key success factor was not just the technical migration, but the combination of state architecture, stable module APIs, and governance enforcement within the pipeline.
Conclusion
Terraform refactoring in Azure is an architectural responsibility with operational impact. The greatest leverage comes from three key decisions:
- Structure states based on blast radius and ownership boundaries
- Define modules as stable contracts
- Perform migration incrementally with strict no-change plan validation
This transforms a risky monolithic structure into a scalable IaC platform that accelerates teams instead of slowing them down.
If you need support refactoring your Azure infrastructure, learn more about our Cloud Audit Service.

