Cleaning up Terraform repos with layers
Terraform is the most popular infrastructure-as-code framework. I wasn’t able to find a good source of information on Terraform’s market share. The closest one is a report from 6Sense that gives Terraform 37% of the market. That figure doesn’t include OpenTofu, a fork of Terraform, and it compares Terraform with Ansible (30%), which is a bit of a weird comparison. Either way, Terraform holds the largest share of the IaC market. Most cloud projects you work on will involve Terraform one way or another.
#
The problem
Terraform is not the most convenient tool, even if it is the most popular. One of its advantages is that it relies on providers. This approach lets you extend Terraform easily by adding new types of resources. You still write HCL, and much the same code, to manage resources for different providers. It also lets you pass data from one kind of resource to another. For example, you have an application in a Kubernetes cluster that uses OAuth2. To use OAuth2, you need to register an application in your OAuth2 provider (such as Okta or Azure Entra ID). Then you need to pass the client ID, tenant ID, and client secret (or PKCE URL) to your application, so you create a secret in Kubernetes that the application reads to start using the OAuth client. Terraform allows you to do this because it operates on one large state. Within a root module, every resource, local, and variable can be referenced from any other file, so you can wire resources together as you please.
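As a rough illustration of the OAuth2 example above, here is a minimal single-state sketch. It assumes the azuread provider (3.x argument names) and the kubernetes provider are already configured elsewhere; the resource names, the display name, and the namespace are hypothetical.

```hcl
# Assumption: azuread 3.x and kubernetes providers are configured elsewhere.
data "azuread_client_config" "current" {}

# Hypothetical OAuth2 application registered in Entra ID.
resource "azuread_application" "demo" {
  display_name = "demo-app"
}

# Client secret generated for the application above.
resource "azuread_application_password" "demo" {
  application_id = azuread_application.demo.id
}

# The same state also manages the Kubernetes secret, so values
# from one provider can be referenced directly by another.
resource "kubernetes_secret_v1" "demo_oauth" {
  metadata {
    name      = "demo-oauth"
    namespace = "default"
  }

  data = {
    "client_id"     = azuread_application.demo.client_id
    "client_secret" = azuread_application_password.demo.value
    "tenant_id"     = data.azuread_client_config.current.tenant_id
  }
}
```

The convenience is real, but note that the client secret now lives in the same state as everything else.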
##
Terraform state
Everything Terraform works with is stored in the state file (tfstate). It is an enormous JSON document that Terraform unmarshals into Go structures in memory on start. The first problem with the state is its size. The size of the state is roughly the average size of an object multiplied by the number of objects Terraform manages. The size of each object is defined by the provider and, through it, by the API that lists all possible properties of the object. For example, Google is very careful with its API design; resources only contain the properties that are absolutely required. Microsoft and Oracle, on the other hand, are not that meticulous. Just have a look at oci_core_instance to see how many unclear and non-obvious properties a resource can have. All the properties of the object must be in the state.
That causes state inflation: the more objects you have, the bigger the state. A large state consumes memory, because you cannot read just a part of it. Terraform has to read the whole file, unmarshal it, and keep it in memory. I’ve seen infrastructures that required 24 GB of RAM just to run terraform plan; with less, the Terraform job failed with an out-of-memory error.
A bonus problem hides in reconciliation. Terraform refreshes the objects in the state as a part of every run, and the more objects you have, the longer it takes. Some public APIs have rate limits and other protections against DoS attacks. Some are simply slow (infamous Azure Entra ID, I’m talking to you). Sometimes you have to wait more than 10 minutes before everything is in order.
##
Blast radius
Terraform was initially designed to be used by individuals, not teams. HCL (the HashiCorp Configuration Language Terraform uses to describe the state you want to achieve) doesn’t have the protection mechanisms common to general-purpose languages. It doesn’t have proper encapsulation: within a configuration, values are visible everywhere and may be referenced or changed from anywhere. A small change may affect unpredictable places and cause significant infrastructure damage. Of course, terraform plan shows you the changes Terraform is going to apply. However, it is easy to miss a critical change, especially when many objects change at the same time or when you rely on AI code review. People tend to trust other people’s judgement and not pay enough attention to the tool output. And the result may be catastrophic.
Terraform doesn’t perform changes by itself; it uses providers. Each provider has its own configuration and performs its own authentication. Of course, you must grant the provider the permissions necessary for the tasks you use Terraform for. By default, you have one provider per kind of resource (for example, one to configure Azure or Google Cloud). Any mistake made through this provider is made with the provider’s elevated credentials. It is possible to implement a least-privilege approach by configuring multiple instances of a provider and assigning each one to a specific group of resources. However, this rarely helps in practice: people tend to use the simplest approach possible, and in the end everything is done by one privileged provider. You can’t remove the default provider. You can’t prohibit highly qualified people from doing something (and your colleagues, I’m sure, are highly qualified). Smart people will find their way one way or another.
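The multiple-provider setup mentioned above could look roughly like this. It is a sketch using the google provider’s alias and impersonate_service_account arguments; the project ID, service account, zone, and record values are hypothetical.

```hcl
# Default configuration: used by every google_* resource
# that does not name a provider explicitly.
provider "google" {
  project = "example-project" # hypothetical project ID
}

# Aliased configuration that impersonates a narrowly scoped service account.
provider "google" {
  alias                       = "dns"
  project                     = "example-project"
  impersonate_service_account = "dns-admin@example-project.iam.gserviceaccount.com" # hypothetical
}

# Only resources that opt in use the restricted credentials.
resource "google_dns_record_set" "app" {
  provider     = google.dns
  managed_zone = "example-zone" # hypothetical zone name
  name         = "app.example.com."
  type         = "A"
  ttl          = 300
  rrdatas      = ["203.0.113.10"]
}
```

Nothing in this configuration forces a resource to pick the restricted alias, which is exactly why the privileged default tends to win.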
##
Bonus track: security
Terraform was not designed with security in mind. The state file contains every object Terraform manages, and Terraform itself does not encrypt any of it. Any time you generate encryption keys or pass API credentials, certificates, or any other secret through a resource or data source, you save that information in the state in plain text. Any person with access to the state can pull any data from it with terraform state pull or by simply copying the file. What is worse, there is no way to limit access to a specific area of the state. The operator (Terraform itself) needs the whole state to work with, because there is no way to read the state piece by piece.
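A small sketch of the point above, using the hashicorp/random provider; the resource and output names are hypothetical. Marking a value as sensitive only hides it from CLI output, it does not keep it out of the state.

```hcl
terraform {
  required_providers {
    random = {
      source = "hashicorp/random"
    }
  }
}

# The generated password is flagged as sensitive by the provider...
resource "random_password" "db" {
  length  = 24
  special = true
}

# ...and the output is redacted in plan/apply output,
# but the value is still written to the state as plain text.
output "db_password" {
  value     = random_password.db.result
  sensitive = true
}
```

After an apply, anyone who can run terraform state pull against this state can read the password.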
#
Solution: layers
First of all, there is no perfect solution that fits everyone and works everywhere. The layered approach is one option I find interesting. I have seen different teams invent it independently, so I think it is worth considering. The general idea is to split the code into layers. Each layer represents a logical group of resources that provisions some part of the environment. The state of each layer is mostly independent, and only explicit dependencies are passed via data sources or the terraform_remote_state primitive. The layers are integrated vertically: data is passed from “top” layers to “bottom” ones, but never in the opposite or a lateral direction. This setup creates a natural resource hierarchy. For example, we have resources in a Google Kubernetes Engine cluster deployed in a Google Cloud project. This setup may be represented in three layers:
- Layer 1 defines the project itself, its properties and billing policies.
- Layer 2 represents the cluster deployment, node pools and other in-project resources.
- Layer 3 represents resources inside the GKE cluster.
Every layer has its own specific responsibility. Every layer can use a set of permissions narrowed to the requirements of that layer. Any information is passed between layers explicitly: we use output on the upper level as the outbound interface and terraform_remote_state on the lower level as the inbound one. For multiple groups on the same level, you can use multiple folders (or separate repos) if needed. It helps with separation of concerns and through this reduces the blast radius as well as the size of each state. This approach requires some shift in thinking, but it works quite well for big repositories with a complex resource graph. It works best for teams, because each part of the infrastructure you manage becomes mostly independent.
A very important note is to avoid lateral data transfer. Data passes only from top to bottom. Otherwise, you can easily get a dependency loop: service A depends on service B while service B depends on service A. Terraform tracks the resource graph within a single state, but it cannot do so across layers, because you are not passing resources between layers, only their values. The risk increases with the number of services you manage, and without this rule it will be impossible to avoid such loops in large infrastructures with thousands and thousands of resources.
The layered approach gives a set of advantages:
- Each layer is compact and small. It is simple to observe and fix. Compact code produces a compact plan, and any potential problem in the plan is much more visible.
- Data access management is much more… manageable, as you pass data explicitly and know what you pass and where.
- Reduced resource usage and faster reconciliation thanks to the smaller code and state.
- The layered design approach naturally improves code discipline. It requires careful design and coding. It reduces sloppiness with ad-hoc solutions full of #@TODO: and #@FIXME:, and it requires you to think of the code as a module with its API as a contract. It makes code more reliable.
- The layered approach reduces blast radius. In the worst case, the damage is isolated to the current layer and its descendants, while parallel and upper layers stay unaffected.
- As every layer has its own area of responsibility, you can design explicit permissions minimized to that layer’s needs. This reduces blast radius and makes the security team happy.
- In effect, the layered approach encapsulates code in layers, bringing something close to OOP to Terraform, like a grown-up language has. It helps to change the code in a specific layer without affecting other layers. You only need to keep the output data format compatible with the previous revision. The output data format becomes a public contract. It helps in big teams where many people work with the same infrastructure. You can also roll out a new version seamlessly by publishing multiple versions of the output and migrating consumers gradually. It moves infrastructure-as-code closer to enterprise coding patterns.
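The idea of outputs as a versioned contract can be sketched as follows. This is only an illustration: the output names (cluster_v1, cluster_v2), the local values, and the GCS bucket and prefix are all hypothetical, and the two halves would live in two separate layers.

```hcl
# --- upstream layer (e.g. layer 1): outputs.tf ---
# Hypothetical values; in a real layer these would come from resources.
locals {
  cluster_name     = "demo-cluster"
  cluster_endpoint = "https://203.0.113.20"
}

# v1 of the contract: kept unchanged while old consumers still read it.
output "cluster_v1" {
  value = {
    name = local.cluster_name
  }
}

# v2 of the contract: adds a field without breaking v1 consumers.
output "cluster_v2" {
  value = {
    name     = local.cluster_name
    endpoint = local.cluster_endpoint
  }
}

# --- downstream layer (e.g. layer 2): main.tf ---
data "terraform_remote_state" "upstream" {
  backend = "gcs"
  config = {
    bucket = "example-tf-state" # hypothetical bucket
    prefix = "layer1"           # hypothetical state prefix
  }
}

locals {
  # Switch from cluster_v1 to cluster_v2 when this consumer is ready.
  cluster = data.terraform_remote_state.upstream.outputs.cluster_v2
}
```

Once no consumer reads cluster_v1 any more, the upstream layer can drop it in a separate change.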
##
In practice
One picture works better than a thousand words. Let’s try it in practice. We have a simple setup built of 3 layers:
- Layer 0 sets up the project in the cloud.
- Layer 1 creates the Kubernetes cluster in this cloud. It deploys VMs, builds load balancers and disks, and so on. This layer also manages OAuth credentials in the authentication provider.
- Layer 2 manages resources inside the Kubernetes cluster deployed by layer 1. As our applications use OAuth2 to authenticate users with SSO, this layer needs the key pairs from the previous layer.
Honestly, it is not the best approach, as layer 1 has many roles at the same time. However, it is an example from my personal lab, and I’m the only person who manages this setup, so we can consider it manageable. The main reason we use layers is to simplify our life. Layered design is not a dogma, and you need to find a balance between the number of layers and the complexity of the data you pass from one layer to another. As layers are applied one by one, many layers will slow down the pipeline even if each layer runs faster.
I use a monorepo in this example, with all the code in one place. However, it is up to you whether to split it into multiple repositories. Here’s an example repository structure:
.
└── layer-0-project
    └── layer-1-cloud
        ├── layer-2-front
        └── layer-2-k8s
This file structure simplifies CI/CD configuration, because it mirrors the pipeline dependencies we need to keep in mind. It is also easier to navigate, at least for me. We create Entra ID OAuth applications in layer 1 and need to pass them “downstream”. We’ll do it with output:
locals {
  [...]
  entra_out = { for k, v in local.entra_config : k => {
    "root" : v.root,
    "group_id" : azuread_group.entra_group[k].object_id
    "application_id" : module.entra_app[k].application_id
    "client_secret" : module.entra_app[k].client_secret
    "tenant_id" : module.entra_app[k].tenant_id
  } }
}

output "entra_config" {
  value     = local.entra_out
  sensitive = true
}
On layer 2, we can receive this data using terraform_remote_state:
data "terraform_remote_state" "layer1" {
  backend = "azurerm"
  config = {
    resource_group_name  = var.tf_group_name
    storage_account_name = var.tf_sa_ro
    subscription_id      = var.subscription_id
    container_name       = var.container
    key                  = "layer1.tfstate" # state file name, same as on layer1
  }
}

# now we're passing the data to K8s secret
resource "kubernetes_secret_v1" "grafana-oauth2" {
  metadata {
    name      = "auth-generic-oauth-secret"
    namespace = kubernetes_namespace_v1.gafana.metadata[0].name
  }
  wait_for_service_account_token = false
  data = {
    # references must use the data source name declared above ("layer1")
    "client_id"     = data.terraform_remote_state.layer1.outputs.entra_config["grafana"].application_id
    "client_secret" = data.terraform_remote_state.layer1.outputs.entra_config["grafana"].client_secret
    "tenant_id"     = data.terraform_remote_state.layer1.outputs.entra_config["grafana"].tenant_id
  }
}
In this example, we pass data from Azure to Kubernetes without interacting with Azure directly. Layer-0 and Layer-1 don’t have access to the Kubernetes cluster, and Layer-2 doesn’t need access to Azure resources; it only reads the Layer-1 state. Also, Layer-2 can’t pass data upstream or change anything in the state files of upstream layers. This keeps upstream layers intact even after a catastrophic failure on Layer-2. It is a greatly simplified example, but it illustrates the approach.
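For completeness, the layer-1 side of the state hand-off might look like the sketch below. It is an assumption about how the backend is declared, not the actual lab configuration; only the key value is taken from the example above.

```hcl
# layer-1 backend: literal values only, because Terraform backend
# blocks cannot reference variables or locals.
terraform {
  backend "azurerm" {
    # resource_group_name, storage_account_name and container_name are
    # left out here and supplied at init time (partial configuration).
    key = "layer1.tfstate" # must match the key that layer 2 reads
  }
}
```

The omitted settings can be passed with terraform init -backend-config=..., which lets each layer write to its own state file while downstream layers get read-only access to it.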
Here’s a bit more complicated setup:
.
└── layer_0_project
    ├── layer_1_gke
    │   ├── layer_2_extdns
    │   ├── layer_2_observability
    │   └── layer_2_smesh
    │       └── layer_3_vault
    └── layer_1_storage
In this example we have Layer-0 on the enterprise level, layer-1 on the project level and layer-2 on the subsystem level. Going down through the layers, you narrow the scope of your code and keep it manageable. The design of the layers defines the dependency graph of resources between layers.
#
Outcome
The layered approach in Terraform helps to simplify the code and improve its security and resiliency. Each layer consumes fewer resources than a large single-state configuration. Separated layers are safer and more secure: your code is isolated by the role and scope of the layer. It improves the development experience: each layer may be developed independently, and you only need to care about the data format of the outputs. It also improves security, as you can specify the permissions required for a specific layer and avoid over-provisioned credentials, which are very common in IaC setups. The main disadvantage of the layered approach is that the code has to be designed for it. The layered approach also increases total pipeline runtime, because every layer is a full-sized Terraform run with module and provider installation, plan, and apply. Well-organised code is a subtle balance between the complexity and functionality of the code on each layer and the number of layers you have.