Infrastructure as code homework #2
16 Oct 2020
Live notes as I take a hands-on workshop about Azure and Terraform:
You push something into a cloud-based Git repository and Terraform picks it up and implements the logic you specified in these config files.
Teacher suggests something like Ansible for, say, stopping and starting a server, particularly a DB one. Not Terraform. Just provision w/ Terraform.
Some companies have a nightly script that kills any infrastructure deployed point-and-click rather than through code, to make sure everyone does things in controllable ways.
Azure-CLI is a Python wrapper; Azure-PowerShell is a .NET wrapper?
We create a Resource Group, then a VLAN (a vnet, or “virtual network,” in Azure terms) and a Subnet in separate steps, then a Network Security Group.
As part of creating the NSG [basically a stateful firewall cloud appliance – can operate on a VLAN, subnet, or actual cloud resource such as a machine instance], we’ll create the rules.
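For comparison with the CLI steps, here is a rough sketch of the same resource group, vnet, subnet, and NSG-with-a-rule sequence written as Terraform. This is not the teacher's code: every name, the region, the address ranges, and the SSH rule are hypothetical placeholders, and it assumes the `azurerm` provider.

```hcl
# Sketch only: all names, the region, and the address ranges are hypothetical.
provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "example" {
  name     = "TEST_rg"
  location = "centralus"
}

resource "azurerm_virtual_network" "example" {
  name                = "TEST_vnet"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  address_space       = ["172.16.0.0/16"]
}

# The subnet is its own resource, carved out of the vnet's address space.
resource "azurerm_subnet" "example" {
  name                 = "TEST_subnet"
  resource_group_name  = azurerm_resource_group.example.name
  virtual_network_name = azurerm_virtual_network.example.name
  address_prefixes     = ["172.16.1.0/24"]
}

resource "azurerm_network_security_group" "example" {
  name                = "TEST_nsg"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name

  # Rules can be declared inline as part of creating the NSG.
  security_rule {
    name                       = "allow-ssh-inbound"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "22"
    source_address_prefix      = "203.0.113.0/24" # hypothetical admin range
    destination_address_prefix = "*"
  }
}
```

Terraform works out the creation order from the references between the blocks, so the order in which they appear in the file does not matter.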
Next, create a public load balancer within the resource group w/ properties indicating it’s a “basic” load balancer, and specifying the “vnet” it’s connected to and the “subnet” (from that VNET) it’s connected to.
The load balancer goes from, say, your containers, to, say, the internet or a private network.
Generally speaking, you would want to allocate a static public IP address to a load balancer.
Passing `--public-ip-address-allocation` to the tools implies creating a public-typed load balancer; when coding against APIs, you have to be more explicit.
Writing your own Terraform in PowerShell doesn’t scale well. Just use Terraform.
Okay, so now create a “storage account.” Make sure to put it into the resource group. I think he was saying he used the “object” kind, which can be cheap.
Wait, why is he showing me his “containers” as part of showing me the “storage account” he set up? What’s that? He barely had any code in the storage account.
Okay, so anyway, create a managed disk from a RedHat Enterprise Linux image, `--image-reference "RedHat:RHEL:7-LVM:latest"` (he named his `DKCESMT01_OsDisk_00` and always suggests putting something like `OsDisk` in your OS disk names and `DataDisk` in your data disk names).
This also helps your infrastructure-code pipelines do things like quickly parse available things to loop over … I think?
Now we create the VM’s “NIC” (network interface).
Then we update that NIC to put it into the network security group (NSG) we set up.
Now we create a VM, put it into a resource group, give it a location, attach the OS disk from above, give the computer (/etc/hostname) a name, specify the NIC(s) it’s to use, specify explicitly whether Windows or Linux, and specify machine size.
Note that what he didn’t set is a password or an SSH public-private key pair in this example of using the CLI … so we sort of created a useless VM. In the Terraform code he’ll show later, there will be a key pair.
He points out, don’t forget to set a proper root password and otherwise harden the OS.
If we try to look at “boot diagnostics,” it says it’s not configured.
His next command enables boot-diagnostics for that VM. He says it could’ve been done in the VM create, but he likes to string together little single-purpose commands, not big nested monsters, in his configs.
Under “Support + troubleshooting” there’s a thing called “Serial Console,” but I missed what it does.
Now we create a public IP address cloud resource, and next we update the NIC attached to the VM with that IP address.
Public IP addresses cost a few dollars a month.
Remember that you don’t attach a public IP address to a VM, you attach it to a NIC that’s attached to a VM.
Note that “quick-create-cli” under “virtual-machines” documentation is, in his opinion, pretty incomplete compared to what he just showed us.
He tried to telnet to his IP address, but he realized that he hadn’t yet “applied the security group to it,” so it blocked him … about which he’s happy.
Now he created a “backup vault” – for Azure Site Recovery (ASR). A backup of the machine image goes into “object storage.” Teacher thinks that, of the big 4 (A/G/M/O), Microsoft really got it right. “Make me 3 copies of the image, 1 in each of the 3 datacenters of the region, then replicate it to another region and make 3 more copies over there.”
In object storage, there’s a term called “silent data loss.” It’s when a reference to where the storage object actually lives becomes corrupted due to a software defect, which comes down to probability. So now there’s an object taking up 0’s and 1’s with no way to get to it, so it might as well have been deleted. When you look at millions/billions/trillions of objects stored in object storage, suddenly those small probabilities start to look sort of likely.
In IT, we overcompensate for this risk with multiple copies.
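To put rough numbers on why scale makes this “sort of likely” and why extra copies help, here is a back-of-the-envelope calculation. The per-object loss probability and the object count are made-up values for illustration, not figures from the workshop, and it assumes losses are independent of one another.

```latex
% Assumptions (hypothetical): each object is silently lost independently
% with probability p, and N objects are stored.
P(\text{at least one loss}) = 1 - (1 - p)^{N} \approx 1 - e^{-pN}

% Made-up numbers: p = 10^{-9}, N = 10^{9}
P \approx 1 - e^{-1} \approx 0.63

% With 3 independent copies, an object is gone only if all 3 copies are lost:
p_{3} = p^{3} = 10^{-27}, \qquad
P_{3} \approx N \, p^{3} = 10^{-18}
```

Under those assumptions, a one-in-a-billion defect is more likely than not to bite somewhere across a billion objects, while three independent copies push the chance of truly losing any object down to a negligible level.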
Azure Site Recovery also cleans things up – does error correction, not just detection – when this happens.
Some cloud providers, teacher says, put the onus on YOU to take care of all this. “ASR” from Microsoft is REALLY STRONG, per teacher.
Now we enable ASR protection for our VM, pointing it at the vault we created.
“Oracle Cloud is pretty cheap compared to Azure.”
Pay attention to the API warning you that commands are on their way to being desupported, not yet fully supported, etc.
Consider going over to GitHub and looking at the conversation between MS devs & MS solution providers for things in preview to get a sense of how it might change in the future or to weigh in.
Now we’ll get the versions of Kubernetes (“aks”) available in the Central US, and then we’ll create a Kubernetes cluster in the Central US region at a version from that list (and put it into our Resource Group, of course, etc.). As part of that command, we’ll give it a VM size, specify whether it has an SSH key, say how many nodes we want in it, enable Azure Active Directory (“AAD”), etc.
If you have someone who knows Active Directory really well, train them up in “role-based access control” for AAD, because it’s a PAIN for developers to worry about, teacher says – train up someone who can do it for you at your org.
RBAC is super important to securing things like Kubernetes correctly. Don’t disable AAD. Just go through the pain with an expert.
When you’re not running a container, delete it. Don’t worry about losing stuff – if you’ve done things right, that’s not a concern!
BREAK
With Terraform, if you’ve lost your state, you can lose your infrastructure the next time you run `terraform apply`.
If you corrupt the state, you may not be able to manage your existing infrastructure w/o an arduous “import state from the cloud” process.
If you don’t properly protect the state, 2 devs can corrupt it trying to edit it at once.
If you don’t back up the state, how will you roll back if you need to?
When you put a Terraform state onto a shared store, what you do in that file may not be what a colleague sees. Know the trap doors so you don’t fall through them.
Azure DevOps combined w/ cloud storage and Terraform Enterprise do a great job providing viable means to protect your terraform state.
Having the state in a VM instance in the cloud being shared by multiple devs is NOT going to be pretty. Nor is having it on one dev’s computer. When you start Terraform in the first place, START IT RIGHT.
He showed us where to read what state is on TF’s docs, then showed us what one looks like. (He downloaded it to his desktop out of his Azure cloud. Speaking of keeping state in Azure…)
In its simplest state, his `TEST_xx.terraform.tfstate` was just a JSON file. Its “lineage” property is a unique ID (a UUID) that Terraform assigns when the state is first created, BTW.
He took a moment aside to go into Azure CLI and tear down the work he’d already been doing. (He had to disable vault to effectively do so?)
At the least granular, break up your Terraform state files by your Azure “subscription” – but recommended to go more granular like the “resource group” level (or, in Oracle Cloud, your “compartment”).
The smaller the increments you shrink your changes into, the better.
Consider splitting up your megalithic-system into multiple “resource groups” in Azure. That way, if you blow up Test, all you’ve blown up is Test.
When you create code in Terraform, you create `.tf` files that make up “modules.”
It’s conventional to have a `main.tf`, but you can let anything (e.g. `000_azure.tf`) serve as “main.” Note that the ENTIRE set of `.tf` files in a directory counts as one module – and TF treats them as one big piece of code. Write accordingly.
The use of variables – including object & array typed contents – is a strength of TF.
`variable "location" {}` is the equivalent of `var location = NULL;` or something.
See learn.hashicorp.com/tutorials/terraform/azure-variables, but know that it doesn’t really give you any practical tips on applying variables – that’s why they’re throwing us a repo at the end of this.
Consider using “type constraints” with your variables – you’d rather find out you goofed w/ a type error at plan time than an infrastructure-crashing error, right?
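A minimal sketch of what type constraints can look like. The variable names, defaults, and the allowed-region list are hypothetical, and the `validation` block assumes Terraform 0.13 or later.

```hcl
# Hypothetical variables; the point is the `type` and `validation` arguments.
variable "azure_region" {
  type    = string
  default = "centralus"

  # Reject values outside a known-good list before anything is deployed.
  validation {
    condition     = contains(["centralus", "eastus2"], var.azure_region)
    error_message = "The azure_region value must be centralus or eastus2."
  }
}

variable "node_count" {
  type    = number
  default = 3
}
```

With this in place, something like `terraform plan -var 'node_count=three'` should fail with an invalid-value error during planning, before any cloud resource is touched.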
ERP systems are relatively small compared to retail applications, so Terraform `index()` isn’t really something he worries about when deploying an ERP to the cloud.
Point is, be aware that there are built-in functions (e.g. `concat()`) and you can combine them into your own expressions.
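As a small illustration of the built-in functions mentioned here, this sketch uses `concat()` and `index()` inside a `locals` block. The list contents and local names are invented for the example; it needs no provider.

```hcl
locals {
  # Hypothetical lists of subnet names.
  base_subnets  = ["web", "app"]
  extra_subnets = ["db"]

  # concat() joins lists: ["web", "app", "db"]
  all_subnets = concat(local.base_subnets, local.extra_subnets)

  # index() returns the zero-based position of a value in a list: 1
  app_position = index(local.all_subnets, "app")
}

output "all_subnets" {
  value = local.all_subnets
}

output "app_position" {
  value = local.app_position
}
```

Running `terraform console` in the same folder is a handy way to try expressions like these interactively.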
“I’m not gonna figure out everything by looking at this (the `man` home screen) – that’s why I need this (the web documents)”
(You betcha!) ;)
These 4 commands are everything I need to know to use the TF CLI to plan for, create, modify, and destroy cloud resources.
He likes a `.tf` naming pattern of:
- Digit, Digit, Digit (run sequence)
- Underscore
- useful_nouns_describing_resources
- `.tf`
Teacher’s configured Terraform to write his `.tfstate` file to the cloud, not his disk. I think.
With the way teacher works in Azure, he doesn’t find “output plan” necessary, but you might want it in yours. Also note that you can limit parallelism (the `-parallelism=n` flag on `terraform plan` and `terraform apply`). DigitalOcean, for example, doesn’t like too much. Sometimes, Azure drops some buggy code (e.g. 2.2.8) where you could only get things running w/ parallelism dropped down to 1.
So, he is showing us adding just 1 thing to the TF state file … he’s adding a random number to it. (It’s handy for naming Azure storage resources. Read more about Azure for more on that.)
He had to re-download it from the Azure web console, so I guess he did indeed have it getting stored in Azure.
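A sketch of the “random number for naming storage” idea, assuming the `hashicorp/random` provider. The resource name, number range, and name prefix are hypothetical; the teacher's actual code wasn't captured in these notes.

```hcl
terraform {
  required_providers {
    random = {
      source = "hashicorp/random"
    }
  }
}

# The generated number is stored in the state file, so it stays the same
# on later runs unless the resource is destroyed or replaced.
resource "random_integer" "suffix" {
  min = 10000
  max = 99999
}

locals {
  # Storage account names must be globally unique, 3 to 24 characters,
  # lowercase letters and digits only, hence the random suffix.
  storage_account_name = "teststate${random_integer.suffix.result}"
}

output "storage_account_name" {
  value = local.storage_account_name
}
```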
Terraform examples of “resource” code blocks often seem a little AWS-heavy, but there’s stuff out there for every cloud.
Terraform can be pretty darned version-sensitive, so be careful to note in blog posts what version someone was writing for. Example, http://www.devlo.io/if-else-terraform.html
Irritating: looping over a [{},{},{},{}], see https://stackoverflow.com/questions/57570505/terraform-how-to-use-for-each-loop-on-a-list-of-objects-to-create-resources
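One common way around the list-of-objects irritation is to turn the list into a map first, since `for_each` accepts a map or a set of strings rather than a list. A minimal sketch, assuming Terraform 0.12.6 or later; the variable contents are hypothetical, and `null_resource` merely stands in for a real resource such as a subnet.

```hcl
variable "subnets" {
  type = list(object({
    name   = string
    prefix = string
  }))
  default = [
    { name = "web", prefix = "172.16.1.0/24" },
    { name = "app", prefix = "172.16.2.0/24" },
  ]
}

locals {
  # Key each object by its (unique) name so for_each can iterate over it.
  subnets_by_name = { for s in var.subnets : s.name => s }
}

# One instance per list entry, addressed as null_resource.subnet["web"], etc.
resource "null_resource" "subnet" {
  for_each = local.subnets_by_name

  triggers = {
    name   = each.key
    prefix = each.value.prefix
  }
}
```

Keying by name rather than by list position means removing one entry from the middle of the list doesn't force the others to be recreated.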
https://blog.gruntwork.io/how-to-manage-terraform-state-28f5697e68fa
Rather than muck with the `main` Git branch’s “terraform state,” spawn a new main working area (“workspace”) within the Terraform State File. Or maybe a different Git branch? Didn’t quite catch.
Okay, so teacher just copied the TF to create a resource group into another folder and ran `terraform plan` from there. He noted to remember that he destroyed this morning’s stuff.
We looked at the plan, then he did `terraform apply -auto-approve`.
Going back to the Azure portal, he showed us that, indeed, his `TEST_rg` resource group exists!
And if he looks at his centralized `TEST_rg.terraform.tfstate` file, it shows that it’s at the 20th iteration of changes to this state file (`serial`). Looking at the `resources`, the first one is an `azurerm_resource_group` called `myResourceGroup` in the state file, and its provider is from `registry.terraform.io/hashicorp/azurerm`, and looking at the ID of it, sure enough, it’s the one you see in the cloud.
Okay, moving on to looking at `010_azurerm_vnet.tf`, you can see that it references `azurerm_resource_group.myResourceGroup.name`, which means that it’s looking in the TF state file for a `resources` object with that `type` and `name`, then under `instances[0].attributes`, it’ll look and grab a name.
And `001_....tf` references a `var`, not a hardcode.
Variables & references all the way down!
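Roughly, the chain of variables and references looks like the sketch below, shown as one block with comments marking where each file might begin. Only `myResourceGroup`, `azure_region`, `centralus`, `TEST_rg`, `010_azurerm_vnet.tf`, and `000_vars.tf` come from the session; the other file name, the vnet name, and the address space are placeholders.

```hcl
provider "azurerm" {
  features {}
}

# --- 000_vars.tf ---
variable "azure_region" {
  type    = string
  default = "centralus"
}

# --- 001_resource_group.tf (hypothetical file name) ---
resource "azurerm_resource_group" "myResourceGroup" {
  name     = "TEST_rg"
  location = var.azure_region # a variable, not a hardcode
}

# --- 010_azurerm_vnet.tf ---
resource "azurerm_virtual_network" "myVnet" {
  name          = "TEST_vnet"
  address_space = ["172.16.0.0/16"]

  # References of the form <type>.<name>.<attribute> point at the
  # resource group above, so Terraform creates it first.
  resource_group_name = azurerm_resource_group.myResourceGroup.name
  location            = azurerm_resource_group.myResourceGroup.location
}
```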
- How to avoid accidentally wiping out your TFState w/ TF itself:
  - Create a resource group JUST for your TF state files, get together w/ your Active Directory expert, and harden the security on that resource group so that the only way to modify state is through your proper channels. There are no “dev staffer” user IDs that can write to the production TFState Azure container. Only the Azure Service Principal User Account (a service account) can write to the production container with TFState.
BREAK
Okay, I came back 2 min late…suddenly we’re talking about Ansible. Ah – maybe only just in the context of “you’ll need Python on your local machine.”
`py-az2tf-master` is for turning existing Azure infrastructure into TF files.
Kyle “Let’s Do Devops” pt 2 on Medium is actually accurate about setting up a Terraform pipeline in Azure DevOps. Microsoft’s instructions are anything but.
Hashicorp’s Terraform Enterprise is $20/month but can’t do Ansible code (needed for reasonable Kubernetes cluster management once built, start/stop, patch upgrades…) – it’s a one-trick pony – Microsoft Azure DevOps comes out to about $52/user/month for anything usable but actually lets you do these kinds of things. If you’re in Azure’s infrastructure (including being multi-cloud), just be ready to spend this money.
Make your main `.tf` file (his, `000_azure.tf`) pin the version of Terraform that you’ll be using. That forces the machine on which you’re running the TF to match what’s in the file. That’s not as bad as it sounds, since Terraform is just a single executable and you can keep a bunch of versions on your computer at once.
The property `key` of `backend` specifies which `.tfstate` file this “main” `.tf` goes with.
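A sketch of what such a “main” file might contain, assuming the `azurerm` backend (state kept as a blob in an Azure storage container). The version number, resource group, storage account, and container names are hypothetical; only the `key` follows the state file name seen earlier.

```hcl
terraform {
  # Pin the exact Terraform version; other versions refuse to run this code.
  required_version = "= 0.13.4" # hypothetical version

  # Keep state in Azure blob storage rather than on a dev's disk.
  backend "azurerm" {
    resource_group_name  = "TFSTATE_rg"   # hypothetical
    storage_account_name = "tfstate12345" # hypothetical
    container_name       = "tfstate-test" # hypothetical
    key                  = "TEST_rg.terraform.tfstate"
  }
}
```

The `key` is the blob name of the state file inside the container, which is how one backend container can hold separate state files for separate resource groups. This backend also locks the state while Terraform is writing to it, which speaks to the earlier worry about 2 devs editing at once.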
It’s good to make comments right at the top about the local-dev dependencies with which you wrote & tested a `.tf` file.
He also finds it useful to have top-of-file instructions about how one would go about getting access to the `.tfstate` file, plus a warning to pay close attention to things before mucking around.
He has a `000_vars.tf` file with all his constants like `azure_region` as `centralus`, etc.
When setting a TF variable, leaving out `default` makes you get prompted for a value:
variable "azure_region" {
  type = string
  # no default here, so Terraform prompts for a value if none is supplied
}
Moving on to things like `network`, things get more complex – here, he uses typing to get strict.
variable "network" {
  type = object({
    # blah blah blah object properties schema
  })
  default = {
    # blah blah blah that matches (note: `null` is valid for a subproperty)
  }
}
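To make the “blah blah blah” a bit more concrete, here is a sketch with an invented schema. The attribute names and values are hypothetical and are not the teacher's actual `network` definition.

```hcl
variable "network" {
  # Every attribute must be present with the right type, or planning fails.
  type = object({
    vnet_name     = string
    address_space = list(string)
    dns_servers   = list(string)
  })

  default = {
    vnet_name     = "TEST_vnet"
    address_space = ["172.16.0.0/16"]
    dns_servers   = null # null is accepted for a subproperty
  }
}

output "vnet_address_space" {
  value = var.network.address_space
}
```

Individual properties are then read with dot notation, such as `var.network.vnet_name`.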
If you set up TEST to look like Production, down to the `172.16.0.0/16` address spaces and whatnot, you’ll like your life when it’s time to do things in prod after testing them in test. Just change a few keyword names.
Use DuckDuckGo not just for privacy, but because its search results can be more straightforward. If there’s nearly-duplicate information (e.g. different versions), it’s less likely to hide it from you.
He took us through a lot of looking up documentation – e.g. valid SKUs for `account_replication_type`, etc.
ERP: containers?
Don’t create a VM instance and install an ORACLE_HOME on it and create an Oracle DB and containerize it.
You can’t containerize a Windows app w/ Docker tools.
You can containerize anything that’s on Tomcat.
Theoretically, you could containerize WebLogic but it’s got a licensing cost so ewwww, get rid of it.
Mostly happening in the classic web apps
Teacher does see some containerization in DW, but it’s a beast, so get good at containerization w/ classic web apps first.
Don’t try to containerize upgrade manager.
Containerization helps w/ consistency, quick workload scale-out…
- (note: about 30min outage to do a machine shape change), then add another node to your container pool and have more running containers running the workload.
Being “agile & accurate” – a “Dept1 Regulatory Upgrade” example:
- I’m applying Regulatory Upgrade to TEST.
- I build 1 container to do that.
- I also want to test it across DEPT2, DEPT3, etc.
- I want them all to have the same “container” at the same time and not know it.
- I create the container and push it to the test repo.
- Kemp / Fortinet web application firewall also configurable as part of Docker w/ YAML
- And if I blow something up, I just roll back.
- Whereas I have to figure out HOW to roll back if I ansibled my way to firewall changes. (Though teacher isn’t sure where exactly the ease/difficulty line is between Ansible-against-machine-images & Docker containers; says he’s more of a deep-infrastructure guy than an apps guy.)
- Then when it’s time to go to production, you take the approved (“golden”) container, you update the PROD container registry, then you schedule the outage and replace the containers during the outage.
- If vendor’s Regulatory Upgrade has a big fat bug, you just roll back.
Containerization Item 3: having an exit strategy – you can’t just pick up an Azure machine image & take it to AWS. The kernels have been tweaked for the hypervisor. With Docker images, pick up & switch if you like.
With containers, you’re thinking “Tomcat & above,” the OS issues are abstracted away.
The strength of Ansible is configuring the container on its way to “spin-up” (and also reconfiguring the load balancer and such)
The way a classmate structured things is so that the container images don’t know what environment they’re “for.” All that stuff is injected on their way to spin-up. They use Kubernetes for that, though, not Ansible.