Backing up AKS cluster with Velero

ojas kale
Egen Engineering & Beyond

Velero + AKS = Good night's sleep

Docker and Kubernetes are the modern-day runtime environments at almost every company that aims to implement cutting-edge technology. Kubernetes gives you so many conveniences out of the box: rolling deployments, high availability, restarting failed containers (aka self-healing), well-managed secrets, and the list goes on...

Kubernetes also provides something called persistent volumes, which store container data outside the container. So if a container dies, a new container is brought up and linked to the persistent volume without losing data. Isn't that great? Kubernetes does that for you. If Kubernetes is already doing so much, why do you need something like Velero? Of course, there's more to it. What if the whole node dies? Or one of the developers accidentally deletes the nodes (happens more often than you think :P guilty)? Then you lose the persistent volumes as well, and if the whole cluster dies, nobody knows what state the pods and their data were in just before it died.

That's where Velero comes in. Velero gives you tools to back up and restore your Kubernetes cluster resources and persistent volumes. Here's what Velero can do for you:

  • Take scheduled backups of your cluster and restore it in case of loss.
  • Migrate cluster resources to other clusters.
  • Replicate your production cluster to development and testing clusters.

How Velero works:

Not going full Rambo here, just a quick overview. Kubernetes stores cluster state (basically the k8s API objects) in etcd, a key-value store that lives within the cluster.

Velero backs up this state to a remote location of your choice. The only condition is that the remote location must be an object store. Common examples are AWS S3 buckets and Azure Blob Storage. Now, even if your cluster dies, its state is preserved in the remote location and can be restored, again with the help of Velero.
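Curious what that actually looks like? Once backups exist, you can list what Velero wrote to the blob container straight from the Azure CLI. A minimal sketch; the account and container names are placeholders for the ones we'll create further down:

# List the objects Velero has written to the backup container
az storage blob list \
    --account-name <STORAGE-ACCOUNT-NAME> \
    --container-name <CONTAINER-NAME> \
    --auth-mode login \
    --output table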

Common Use Case:

Say you have an Elasticsearch cluster running within your Kubernetes cluster, with hundreds of GBs of data stored on persistent volumes. Velero will store the k8s API objects in a blob storage container, and the persistent volumes will be stored as snapshots (say we are using Azure).

A simple command can fetch and restore these files from the blob storage container, which holds pointers to the snapshots, and voilà! Your cluster is up and running.
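The persistent-volume side of such a backup is just Azure managed-disk snapshots, so you can find them with the Azure CLI. The resource group here is the AKS node resource group, which is explained later in this post:

# PV snapshots show up as managed-disk snapshots in the AKS node resource group
az snapshot list --resource-group <AKS-NODE-RESOURCE-GROUP> --output table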

Action:

There are two steps for using Velero: first, install the Velero binary on your host machine, and then run velero install to put the server components on your cluster.

To install it on a Mac, it's pretty straightforward:

brew install velero

To check that the installation was successful, try running:

velero version

Installing Velero on AKS cluster

Prerequisites:

  1. a blob storage container where Velero can store the backups (a quick creation sketch follows the variables below)
# Name of the storage account that the storage container lives in
export AZURE_STORAGE_ACCOUNT_ID="velerobackup"
# Name of the storage account's resource group
# I have a very good reason to call it RG and not Resource Group;
# you will learn it soon
export RG="backup"
# Name of your backup container
export BLOB_CONTAINER="aks-backup"
# This resource group is different from the one you set above.
# Explanation after these commands.
AZURE_RESOURCE_GROUP=$(az aks show --query nodeResourceGroup --name <AKS-CLUSTER-NAME> --resource-group <AKS-CLUSTERS-RESOURCE-GROUP> --output tsv)
# Azure subscription ID you are working in
AZURE_SUBSCRIPTION_ID=$(az account list --query '[?isDefault].id' -o tsv)
# Azure tenant ID
AZURE_TENANT_ID=$(az account list --query '[?isDefault].tenantId' -o tsv)
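If the storage account and container from step 1 don't exist yet, here's a rough one-time setup using the names exported above. The location is just an example, and storage account names have to be globally unique and lowercase, so "velerobackup" may already be taken:

# One-time creation of the backup storage (names come from the exports above,
# the location is only an example)
az group create --name $RG --location eastus
az storage account create \
    --name $AZURE_STORAGE_ACCOUNT_ID \
    --resource-group $RG \
    --sku Standard_LRS \
    --kind StorageV2
az storage container create \
    --name $BLOB_CONTAINER \
    --account-name $AZURE_STORAGE_ACCOUNT_ID \
    --auth-mode login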

Did you notice there are two different resource groups in the above commands? Well, if you did, congrats! You just saved yourself at least an hour.

The reason behind this is that when you spin up an AKS cluster in a resource group of your choice, Azure creates another resource group behind the scenes: the node resource group (also called the cluster resource group), which holds the infrastructure resources underneath the cluster and follows its lifecycle. This is weird, but it is by design and we have to deal with it. More on this in Azure's documentation on the node resource group.
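If you want to see it for yourself, just echo the variable you captured above; the node resource group usually follows an MC_ naming pattern:

# Typically looks like MC_<your-resource-group>_<cluster-name>_<region>
echo $AZURE_RESOURCE_GROUP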

Next, you will need a service principal that will allow Velero to read from and write to this storage account.

# Get the service principal's password
AZURE_CLIENT_SECRET=$(az ad sp create-for-rbac -n $AZURE_STORAGE_ACCOUNT_ID --role contributor --query password --output tsv)
# Get the service principal's ID
AZURE_CLIENT_ID=$(az ad sp show --id http://$AZURE_STORAGE_ACCOUNT_ID --query appId --output tsv)
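To sanity-check the service principal, you can list its role assignments (the contributor assignment can take a minute to propagate):

# The service principal should show the Contributor role at subscription scope
az role assignment list --assignee $AZURE_CLIENT_ID --output table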

By now you have all the variables required to install Velero on the AKS cluster. Next, you'll dump all the required values from above into a file called credentials-velero. You can name it whatever you want; just make sure to update the commands accordingly.

echo "\
AZURE_SUBSCRIPTION_ID=$AZURE_SUBSCRIPTION_ID \n\
AZURE_TENANT_ID=$AZURE_TENANT_ID \n\
AZURE_CLIENT_ID=$AZURE_CLIENT_ID \n\
AZURE_CLIENT_SECRET=$AZURE_CLIENT_SECRET \n\
AZURE_RESOURCE_GROUP=$AZURE_RESOURCE_GROUP" \
> ./credentials-velero

Prep work is done. Now comes the crucial part: installing Velero on the AKS cluster.

velero install \
--provider azure \
--plugins velero/velero-plugin-for-microsoft-azure:v1.0.1 \
--bucket $BLOB_CONTAINER \
--secret-file ./credentials-velero \
--backup-location-config resourceGroup=$RG,storageAccount=$AZURE_STORAGE_ACCOUNT_ID,subscriptionId=$AZURE_SUBSCRIPTION_ID \
--snapshot-location-config resourceGroup=$RG,subscriptionId=$AZURE_SUBSCRIPTION_ID

In the above command, you are passing the credentials-velero file as the value of the --secret-file argument. You are also specifying where to store the backed-up cluster objects (the backup location) and where to put snapshots of persistent volumes (the snapshot location).

After running this command you will see a long list of resources being created in your AKS cluster, most of which you don't need to understand. Everything is created in a newly created namespace called velero.

What you should look for is if the last line says

Velero is installed! ⛵ Use 'kubectl logs deployment/velero -n velero' to view the status.

If you see this you are good.
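You can also check the deployment itself with plain kubectl:

# Velero runs as a regular deployment in the velero namespace
kubectl get deployment velero -n velero
kubectl get pods -n velero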

Now, one crucial step: tagging the resources that need to be backed up. Whattttt??? Doesn't Velero back up everything? Well, it does, but not persistent volumes: you have to explicitly specify which pods' persistent volumes to back up.

You will have to put a label on the pods to tell Velero to back up their persistent volumes as snapshots. A label can be anything that makes sense to you; we will use backup=true for our purpose.

kubectl label pods <NAME-OF-POD> backup=true
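To see which pods the selector will catch, a quick check:

# List the pods that carry the backup=true label
kubectl get pods -l backup=true --show-labels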

Finally, you can take a backup. Here's the command:

velero backup create <NAME> --selector backup=true

This will back up every pod that has the label backup=true in its metadata, along with the corresponding cluster API objects. Visit your storage container and the snapshots listing to verify your backup was successful.

PS: It takes a while for snapshot backups to show up in the snapshots listing in Azure. In my case, it was about 5 minutes.
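While you wait, you can follow the backup from the CLI with Velero's own commands:

# Show the backup's status, errors and (with --details) the resources it captured
velero backup describe <NAME> --details
# Stream the logs for that backup
velero backup logs <NAME>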

To list your backups in the terminal, run:

velero get backups

You should see one backup.

To restore this backup, you will simply have to run:

velero restore create --from-backup <NAME>

This will restore from a specific backup.

To test this out, try deleting a deployment after backing up the cluster and restoring it; a rough run-through follows.
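Here's a minimal sketch of that test, assuming a hypothetical deployment called nginx-demo whose pods carry an app=nginx-demo label; swap in your own names:

# Label the deployment and its pods so the selector-based backup includes them
kubectl label deployment nginx-demo backup=true
kubectl label pods -l app=nginx-demo backup=true

# Take a backup, then simulate a disaster
velero backup create test-backup --selector backup=true
kubectl delete deployment nginx-demo

# Restore and confirm the deployment comes back
velero restore create --from-backup test-backup
velero restore get
kubectl get deployment nginx-demo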

Usually, you will want to run backups on a schedule. Velero provides a utility for that as well:

velero create schedule daily --selector backup=true --schedule="@every 24h"

This will create a scheduled backup that runs every 24 hours.
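To confirm the schedule is in place and see the backups it produces (they are typically named <schedule-name>-<timestamp>):

# List schedules and the backups they have produced so far
velero get schedules
velero get backups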

I hope you find this helpful. Thanks.
