Sample implementation of Azure Kubernetes Service "anti-DRY" bootstrap & maintenance strategy

Background

Keeping up with the evolution of Kubernetes is hard. Upgrade strategy, in particular, is a headache.

Blue/Green deployment is an effective strategy for mitigating the risk of upgrades. The challenge, however, is managing the differences between the Blue and Green infrastructure code. DRY (Don't Repeat Yourself) is the typical answer to this problem, but implementing it tends to be cumbersome.

For example, there are several ways to achieve DRY with Terraform: Workspaces, Modules, specifying a git tag as a source, git branching strategies and flows, and so on. These are helpful, but they are not easy to understand, design, operate, and maintain for developers who are early in the IaC and Kubernetes learning curve. In addition, because Kubernetes evolves rapidly, it is common to want to redesign the code that creates the cluster, so standardizing Blue/Green with a Module often breaks down.

The code in this repository is a sample implementation of Blue/Green bootstrap & maintenance for Azure Kubernetes Service without adopting DRY, using Terraform, Flux (v2), and GitHub Actions. The Blue and Green code is not standardized; instead, the directories are split, and each environment has its own Terraform state. This lets you treat AKS clusters as immutable.

With this strategy, you should not persist data, state, or configuration in a cluster. All of it should be stored outside the cluster and connected for bootstrapping, configuration, running apps, and operation.

Overview

However, the differences between the Blue and Green code must be easy to see. Therefore, this sample includes supporting steps in CI, such as posting the diff as a comment on Pull Requests.

DRY is a great concept, and it is worth keeping in mind as a future goal. In the meantime, I hope this sample serves as a starting point.

Prerequisites

Prerequisites & tested

Privileges required for execution

  • Admin
    • Azure Subscription Owner (Azure role)
      • Needs User Access Administrator for role assignment
    • Azure Kubernetes Service Cluster Admin Role (Azure role)
      • For admin operations & Flux execution
      • Assign the role to an Azure AD group and specify the group as a Terraform variable
    • GitHub Repo control (GitHub PAT)
      • For execution of Flux with GitHub
  • GitHub Actions CI (Azure Service Principal)

This sample assigns strong privileges to the admin so that you can try it smoothly in your PoC. In actual operation, follow the principle of least privilege and use fine-grained scopes that suit your environment.

Usage

Prepare variables

The policy of this sample for variables such as IDs and secrets is as follows.

  • Operate in a private repository
  • Static IDs like Azure resource IDs can be written in the source code
    • To clarify the operation target and share it with the team as code
    • On the other hand, avoid hard-coding the entire ID as much as possible
      • Take advantage of Terraform interpolation and Flux substitution
    • Encrypting code in the repo is sometimes overkill, and complex procedures can trigger accidents
  • Secrets and values generated without regularity are not written in the source code

You have to prepare the following variables for each environment (e.g., dev, prod).

You can also use environment variables instead of a tfvars file.
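
Terraform reads input variables from environment variables named `TF_VAR_<name>`. The following is a minimal sketch that uses the `demoapp` variable from the Switch Blue/Green section purely as an illustration; the variables your environment actually needs may differ.

```shell
# Terraform maps the environment variable TF_VAR_<name> to the input variable <name>.
# Values of complex types (lists, objects) are written as HCL expressions.
# "demoapp" is used here only as an illustrative example.
export TF_VAR_demoapp='{ domain = "internal.example", target = ["blue", "green"] }'

# A `terraform plan` or `terraform apply` run from the same shell would then
# see var.demoapp with the value above.
```

Note that values given in a tfvars file or with `-var` take precedence over environment variables.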

Bootstrap order

  1. Shared: Terraform dir
  2. Blue/Green: Terraform dir

You can operate Blue and Green in any order, but always be aware of which cluster context you are working in.
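
The order above could be scripted roughly as follows. This is a sketch under assumptions, not the repository's own tooling: the function name is hypothetical, the paths in the usage line are assumed from the directories named in the CI section, and variables are expected to come from tfvars files or the environment.

```shell
# Hypothetical helper: run `terraform init` and `terraform apply` in each
# directory, in the order given, and stop at the first failure.
bootstrap_in_order() {
  for dir in "$@"; do
    echo "==> ${dir}"
    # The subshell keeps the caller's working directory unchanged.
    # `terraform apply` asks for confirmation before changing anything.
    ( cd "$dir" && terraform init && terraform apply ) || return 1
  done
}

# Assumed layout: shared resources first, then the cluster(s).
# bootstrap_in_order terraform/shared terraform/blue
```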

Test

This repo has two types of tests. The concept is based on the Microsoft documentation.

Testing Terraform code

Integration

Integration tests should be run frequently to detect minor errors early. So, the integration tests in this repo:

  • Focus on formatting, static checks, and tests that finish in a short time
    • terraform fmt, validate, plan
    • TFLint
  • Are lightweight, so feel free to run them

Set variables in integration.tfvars in the shared/blue/green fixtures before testing, or set environment variables.
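
For orientation, the checks listed above correspond roughly to the commands in the sketch below. The function name is hypothetical, and how this repository actually wires the checks together (for example, through its test programs or make targets) may differ.

```shell
# Hypothetical helper: fast, non-destructive checks for one Terraform directory.
#   $1: Terraform directory
#   $2: tfvars file (absolute path, or relative to the Terraform directory)
# Steps: format check, init, static validation, lint, plan.
# The && chain stops at the first failing step and returns its failure.
integration_check() {
  tf_dir="$1"
  var_file="$2"
  (
    cd "$tf_dir" &&
      terraform fmt -check -recursive &&
      terraform init -input=false &&
      terraform validate &&
      tflint &&
      terraform plan -input=false -var-file="$var_file"
  )
}

# Example with assumed paths:
# integration_check terraform/blue integration.tfvars
```

`terraform plan` still needs valid Azure credentials, because it reads the current state of the target resources.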

E2E

E2E tests should also be automated and always ready to run, so that you can see the impact of infrastructure changes on applications.

  • Actually create the infrastructure resources and run an application on test fixtures
    • terraform apply (from Go test program)
    • Create a sample app with Flux GitOps & check the endpoint (from Go test program)
      • Chaos testing with Chaos Mesh
  • Feel free to run them
    • Just run "make test"
    • The resources are cleaned up automatically after the test

Set variables in e2e.tfvars in the shared/blue/green fixtures before testing, or set environment variables.

CI

Pull Requests trigger the following GitHub Actions workflows as CI. These workflows post their results as comments on the PR.

  • Diff between Blue/Green Flux files: GitHub Actions workflow
    • Runs on PRs that change files under the /flux directory
  • Diff between Blue/Green Terraform files: GitHub Actions workflow
    • Runs on PRs that change files under the /terraform/blue|green directories
  • Format (check)/validate/lint/plan of Terraform files: GitHub Actions workflow
    • Runs on PRs that change files under the /terraform/shared|blue|green directories

    Set variables in integration.tfvars in the shared/blue/green fixtures before testing, or set environment variables.
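
The Blue/Green comparison in the first two items can be approximated locally with a recursive diff. The sketch below is not the repository's workflow; the function name is hypothetical and the paths in the usage line are assumptions based on the directories named above.

```shell
# Hypothetical helper: print a unified, recursive diff between two directories.
# `diff` exits with 1 when the inputs differ, which is the expected case here,
# so only exit codes greater than 1 (real errors) are reported as failures.
bluegreen_diff() {
  blue_dir="$1"
  green_dir="$2"
  rc=0
  diff -ru "$blue_dir" "$green_dir" || rc=$?
  if [ "$rc" -gt 1 ]; then
    return "$rc"
  fi
  return 0
}

# Example with assumed paths:
# bluegreen_diff terraform/blue terraform/green
```

An empty result means the Blue and Green code are identical; any output is what a reviewer would want to see on the PR.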

Note that this CI does not include the E2E test. Consider adding it if necessary.

Switch Blue/Green

You can add or remove the services of each cluster to or from the backend addresses of Application Gateway by changing demoapp.target in the Terraform variables and applying the change, without interrupting the service.

A sample app and a test script are provided to help you switch between Blue and Green and observe sessions across the clusters.

If you have both Blue and Green joined in the backend, then:

% kubectl cluster-info
Kubernetes control plane is running at https://hoge-aks-anti-dry-green-fuga.hcp.japaneast.azmk8s.io:443
[snip]
% kubectl -n session-checker get po
NAME                               READY   STATUS    RESTARTS   AGE
session-checker-76799c4797-8gq9x   1/1     Running   0          15m
session-checker-76799c4797-r4blx   1/1     Running   0          15m

% kubectl config use-context hoge-aks-anti-dry-blue-admin
Switched to context "hoge-aks-anti-dry-blue-admin".
% kubectl cluster-info
Kubernetes control plane is running at https://hoge-aks-anti-dry-blue-fuga.hcp.japaneast.azmk8s.io:443
[snip]
% kubectl -n session-checker get po
NAME                               READY   STATUS    RESTARTS   AGE
session-checker-76799c4797-kc896   1/1     Running   0          108s
session-checker-76799c4797-wjszz   1/1     Running   0          108s

% ./session-check.sh
{"count":0,"hostname":"session-checker-76799c4797-kc896"}
{"count":1,"hostname":"session-checker-76799c4797-8gq9x"}
{"count":2,"hostname":"session-checker-76799c4797-wjszz"}
{"count":3,"hostname":"session-checker-76799c4797-r4blx"}
{"count":4,"hostname":"session-checker-76799c4797-kc896"}
{"count":5,"hostname":"session-checker-76799c4797-8gq9x"}

Requests are distributed across both clusters and multiple pods, but because the session is shared via Redis, the count increments correctly.

Then, comment out blue from target and apply it.

demoapp = {
  domain = "internal.example"
  target = [
    # "blue",
    "green"
  ]
}

{"count":41,"hostname":"session-checker-76799c4797-wjszz"}
{"count":42,"hostname":"session-checker-76799c4797-r4blx"}
{"count":43,"hostname":"session-checker-76799c4797-kc896"}
{"count":44,"hostname":"session-checker-76799c4797-8gq9x"}
{"count":45,"hostname":"session-checker-76799c4797-r4blx"}
{"count":46,"hostname":"session-checker-76799c4797-8gq9x"}
{"count":47,"hostname":"session-checker-76799c4797-r4blx"}
{"count":48,"hostname":"session-checker-76799c4797-8gq9x"}
{"count":49,"hostname":"session-checker-76799c4797-r4blx"}
{"count":50,"hostname":"session-checker-76799c4797-8gq9x"}
{"count":51,"hostname":"session-checker-76799c4797-r4blx"}
^C
Number of unrecoverable HTTP errors: 0

The Service IP of Blue was removed from the backend without disruption, so you can now destroy the Blue cluster.

Notes