Test-driven development for infrastructure as code

15 Oct 2025 🔖 devops
💬 EN

Systems and tests are both Booleans
Simon Willison’s fully-logged “empty” project trick
Example TDD workflow for IaC
Why TDD works for IAC
Recommended reading

Much as I’m hooked on it, I’ll admit test-driven development (TDD) has a reputation for being a bit intimidating to adopt in traditional software application development.

Luckily, it’s not just easier, but arguably almost tailor-made for the work of sysadmins taking on the challenge of infrastructure as code (IaC).

Systems and tests are both Booleans

Most sysadmins are rightly cautious about automation. After all, a typo in a script can take down a server, not just misalign a button. The stakes are higher, and the fear of scaling up mistakes is real.

But here’s the twist: sysadmin tasks are inherently Boolean (true or false). Either the system is configured correctly, or it isn’t.

This clarity makes it easy to write tests that validate desired state – perfect for TDD, which is all about writing tests before writing any “real” code.

Simon Willison’s fully-logged “empty” project trick

Simon Willison, on Teaching Python episode 150, recently shared a productivity hack:

“I like (starting) my new projects with a single passing test – and the test can be assert(1, 1, 2).

“But the important thing is the testing framework is all configured.”

IaC lends itself to Simon’s approach so well. Everyone on the team doesn’t really care how the system gets into its desired state – they just want visibility into whether it is in the desired state.

If each new IaC codebase you create starts mostly “empty,” but containing a working CI/CD pipeline that runs tests and beautifies sharing the results with colleagues (like Azure DevOps’s PublishTestResults), your team can be set up for success before you even start writing your first line of “code.”

Screenshot of the Azure Test Plans listing 8 test results, 7 passing and 1 failing.

Any old “1+1=2” test case can:

prove your CI/CD works, and
be a template, when you’re ready to add meaningful tests.

Example TDD workflow for IaC

Here’s a game plan to adopt TDD for IaC without risking your infrastructure.

Step 1: Learn to assert 1 + 1 equals 2

Write 1 piece of read-only desired state validation code, and learn to run it.

Try it with a 1+1==2-type assertion if you want to learn what “passing” tests report out like.
Try it with 1+1==3 if you want to observe failing tests.

Language features

Terraform/OpenTofu: data lifecycle postcondition
Ansible: ansible.builtin.assert
PowerShell: I’m still working on this … I think you might have to roll your own assertion logging, because this isn’t exactly the kind of validation that Pester is meant for. I’m thinking it might involve some hand-logging to local file storage in an XML format.
- Anyway, once you figured out the logging bit, you’d wrap your loggers around read-only Boolean commands that make sense for whatever you’re up to, like Test-Path.

Step 2: Make and run CI/CD

Now it’s time to write a CI/CD pipeline.

Upon every source code check-in, your CI/CD pipeline should:

Run the validation you just wrote.
Run static code analysis tools like linting and security scanners.
Cleanly+beautifully reports the results to all of your colleagues.

Pro tip

Flip your test test #1 from deliberately passing to deliberately failing, to make sure that the beautified reporting works both for passing and failing tests.

Value

Once you finish this step, you can consider yourself to have written a “template” along the lines of the ones Simon talked about.

(In fact, you might want to publish it somewhere that your colleagues can spin up new codebases from, to save them the trouble.)

Step 3: Add your tests

Imagine you’re training a new sysadmin on the following “trainee” steps:

Delete the C:\example folder and all its contents.
Create a C:\example folder.
Change C:\example’s permissions to read-only except by machine administrators.

Notes:
- Don’t automate these – that’ll be the very last step. Just ponder them, for now.
- Yes, I know it’s probably wasteful to delete and recreate the folder rather than just emptying it – this is just a silly tutorial example. 🙂

Your “trainer” validation steps, interspersed one at a time after each “trainee” step, might look like:

Validate that C:\example cannot be found.
Validate that C:\example can be found.
Validate that C:\example is only writeable by machine administrators.

Language features

Go ahead and add these desired state validation tests to your codebase, again, using IaC language features like postcondition or assert.

Useless tests are normal

It’s okay if some of them fail (for example, without a chance to do any work between validation #1 and validation #2, yes, one of them is going to fail, because they’re literally opposites of each other). Run the tests back-to-back with the system in opposite states, just to make sure they pass/fail as expected, and move on.

You can also comment out, or use the IaC framework’s skip-style notation if it has one, for tests that don’t yet apply because you’re currently running them all back-to-back with no “work” steps in between. So in the above example, that would mean disabling validation #1, presuming that your system already is in its desired state.

But code them all up, anyway, because it’ll be handy to have them ready to go, later, when you finally get around to automating the little steps that get your system the way you want it. (You can always delete them for good if they turn out useless forever, but start by writing them all. It’s great practice at writing code in your IaC language.)

Value

If you never make it any farther than this, congratulations – you’ve created “living documentation” that helps you do “Sunday” validation work a lot faster after a “Saturday” outage.

Not to mention, the new trainee can kick off these validations and, within minutes/hours of completing their first “Saturday” outage work, brag to the whole team that they did a good job.

No more nail-biting all Sunday, waiting for manual validations.

Step 4: Perhaps intersperse pauses for manual work

Some IaC languages let you intersperse prompted placeholders for manual work (e.g. ansible.builtin.pause).

If you’re taking a slow, cautious approach to automation, with sysadmins who are still feeling very tentative about becoming experienced software developers (because honestly, that is what they’re becoming as they learn to write IaC automationd), these placeholders can be great to scatter amongst your validations. Your codebase could now look like:

Prompt the admin to delete the C:\example folder and all its contents and type “done” when done.
- (new code)
Validate that C:\example cannot be found.
- (preexisting code from step 3)
Prompt the admin to create a C:\example folder and type “done” when done.
- (new code)
Validate that C:\example can be found.
- (preexisting code from step 3)
Prompt the admin to change C:\example’s permissions to read-only except by machine administrators and type “done” when done.
- (new code)
Validate that C:\example is only writeable by machine administrators.
- (preexisting code from step 3)

Reenable formerly useless tests

Note that now it’d make sense to re-enable all those formerly “useless” tests, since the state of the system actually would go through points where C:\example is missing and also points where C:\example is present.

After all, if a sysadmin makes a mistake while doing their “prompted” manual work, they’ll want fast feedback that they failed at that step.

Refine tests

Real-life experience also gives you a chance to realize that you didn’t quite imagine your “validation” steps correctly.

Go ahead and fix up any read-only validations in your code that aren’t reporting out the way you’d expect them to because, now that you’re doing real-world systems work, you realize that you didn’t quite design them correctly.

That’s normal. You can’t think of everything ahead of time. Fix them now.

Runtime considerations

If you’ve only been running your IaC on the CI/CD pipeline built into your source code version control system, in response to code check-ins, you might find that introducing pauses for manual labor pushes its limits more than you’d like.

CI/CD pipelines aren’t really meant to run for hours and hours, waiting for interaction.

You might want to spin up some sort of secondary runtime meant for more interactive executions of your codebase.

Make sure you’re still running the latest version as found in version control – the whole point of IaC is to introduce consistency – but don’t be surprised if you need to figure out a slightly longer-lived machine, with an interactive CLI, if you introduce pause steps into your codebase.

Also, please don’t let your secondary runtime skip the important step of beautifully reporting back its validation results into the same centralized place that your normal CI/CD pipelines report it into.

Just because you’re running your code on a different machine than usual doesn’t mean your whole team should suddenly be blind to what happened when it ran!

In fact, if you want to be really thorough, you could make setting up both of these runtimes, and making sure that invoking the IaC reports QA results out equally well no matter which runtime you’re invoking your IaC from, part of step 2 (your “Simon-style templates” work). Maybe you find it inspiring to eliminate all of the “but the code only works on this machine” problems while you’re still at assert(1,1,2) – I support you, that sounds awesome!

Value

Interesting conceptual discussions here and here about the value added even if you never make it past this step.

Step 5: Actually automate the work

Finally, it’s time to implement code to automate manual work.

This is where it’s intuitive to be tempted to start, but you’ll thank yourself later if it’s the last thing you do.

Code quality mostly comes from the scaffolding of collaboration and quality processes you’ve put in place – the “Simon” stuff – not from the moments you spend writing “real” code.

Your codebase could now look like:

Auto-delete the C:\example folder and all its contents.
- (new code replacing pause if you ever did that step)
Auto-validate that C:\example cannot be found.
- (preexisting code from step 3)
Auto-create a C:\example folder.
- (new code replacing pause if you ever did that step)
Validate that C:\example can be found.
- (preexisting code from step 3)
Auto-change C:\example’s permissions to read-only except by machine administrators.
- (new code replacing pause if you ever did that step)
Validate that C:\example is only writeable by machine administrators.
- (preexisting code from step 3)

Language features

Terraform/OpenTofu: resource
Ansible: ansible.builtin.copy
PowerShell: This is where you finally get to write, say, Remove-Item.

Caution: code declaratively

Here’s a pro tip:

When you finally start implementing code that “changes things,” avoid embedding raw PowerShell or Bash in IaC code (e.g. Terraform, Ansible) unless absolutely necessary.

It’s an antipattern.

Instead, use native modules/providers. They’re typically safer, idempotent, and auto-logged.

Favor: Ansible’s copy module.
Avoid: PowerShell’s Copy-Item inside Ansible’s win_shell.

Why TDD works for IAC

Sysadmins get hands-on training writing IaC code, without risking system state (because they’re only coding validations, not actions, at first).
Tests are easy to define: Just imagine how you’d check a trainee’s work.
Living documentation: Even if you never automate the “real work,” your automated validation code can serve as “living documentation,” which builds trust that even trainees’ mistakes will be caught promptly enough to let them safely perform upgrades.

Sysadmin work is naturally declarative, idempotent, and log-worthy.

You want the system to be in a certain state, you want to be able to check that state repeatedly without side effects, and you want your colleagues to share your confidence that everything is correct.

TDD and IaC are both about declaring what “done” looks like, then proving it.

If you’re overwhelmed by new IaC tools, wondering how on earth they’re supposed to add any more value than you already could with a shell script, give TDD a try.

It’ll help you flip your brain from “thinking forward from steps” to “thinking backward from desired state.”

In turn, you should find lower learning friction, because that’s how most IaC languages/frameworks/tools (e.g. Terraform, Ansible) are meant to be coded.

Salesforce, Python, SQL, & other ways to put your data where you need it

Need event music? 🎸

Test-driven development for infrastructure as code

Table of Contents

Systems and tests are both Booleans

Simon Willison’s fully-logged “empty” project trick

Example TDD workflow for IaC

Step 1: Learn to assert 1 + 1 equals 2

Language features

Step 2: Make and run CI/CD

Pro tip

Value

Step 3: Add your tests

Language features

Useless tests are normal

Value

Step 4: Perhaps intersperse pauses for manual work

Reenable formerly useless tests

Refine tests

Runtime considerations

Value

Step 5: Actually automate the work

Language features

Caution: code declaratively

Why TDD works for IAC

Recommended reading