
Dockerization, Updated Pipeline Workflow and DORA Metrics


Dockerization

This document outlines the process of Dockerizing our existing applications, setting up environment-specific configurations, and refactoring our Continuous Integration/Continuous Deployment (CI/CD) pipeline to meet our new deployment needs.

We created a Dockerfile to build our application into an image, and environment-specific Docker Compose files to deploy the container in various environments. Each Compose file builds containers for our code's backend and its database. These containers are then mapped to the Nginx server already running on our host.
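As an illustration, here is a minimal sketch of what one of these environment-specific Compose files could look like. The service names, image tag, port mapping, and database image below are placeholders, not our actual configuration:

# docker-compose.staging.yml -- illustrative sketch only
services:
  backend:
    image: anchor-backend:staging        # hypothetical image name and tag
    env_file: .env
    ports:
      - "8000:8000"                      # exposed so the host Nginx can proxy to it
    depends_on:
      - db
  db:
    image: postgres:15                   # hypothetical database image
    volumes:
      - db_data:/var/lib/postgresql/data
volumes:
  db_data: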

Updated Pipeline Workflow

Old Pipeline Workflow

Our old pipeline involved two workflows: one to build and test the application on pull requests made to the repository, and another to deploy the application to our server once the PR was merged and the build-and-test workflow succeeded. We had environment-specific workflows for the dev, staging, and main branches (main being our production branch).

New Pipeline Workflow

Our new pipeline workflow involves three workflows. The first tests the application to ensure it meets standards; a pr-deploy workflow is then run to deploy the application as a Docker container in an isolated environment. Once these workflows pass and the PR is merged, the last workflow runs: it builds the application image on the pipeline and uses docker save and gzip to compress the image. The compressed image is then securely copied to our server, where docker load loads the image and docker-compose up starts the container, as sketched below.
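A sketch of the commands that last workflow runs; the image name, archive name, and server address are placeholders:

# On the CI runner: build the image and compress it
docker build -t anchor-backend:latest .
docker save anchor-backend:latest | gzip > anchor-backend.tar.gz

# Securely copy the compressed image to the server
scp anchor-backend.tar.gz user@server:/path/to/deploy/

# On the server: load the image and start the container
gunzip -c anchor-backend.tar.gz | docker load
docker-compose up -d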

Setting up DORA metrics

Overview

The DevOps Research and Assessment (DORA) metrics are used to measure DevOps performance. These metrics help to improve DevOps efficiency and communicate performance to business stakeholders. Four key metrics, divided into two core areas of DevOps, are being monitored for our project:

  • Deployment frequency and Lead time for changes (which measure team velocity)

  • Change failure rate and Time to restore service (which measure stability)

To capture these core metrics for our project, we use the GitHub API to collect data about our repository's workflow runs, pull requests, and commits. We also wrote a Python script that uses the Prometheus client library to expose these metrics, which can then be scraped by a Prometheus server for monitoring and alerting with Grafana.

Steps:

Step 1. Create a folder for the custom Dora exporter:

mkdir anchor-dora-exporter

Step 2. Create necessary files

Within this folder, create the files for the Python code, the dependencies, and the systemd service, named main.py, requirements.txt, and anchor-dora.service respectively. Also create a file named .env from which the script will read the environment variables it needs.

Sample .env file

GITHUB_TOKEN=[Input github token here]
REPO_OWNER=hngprojects
REPO_NAME=hng_boilerplate_python_fastapi_web
PORT=8084

The contents of the files are shown below:

main.py

import requests
from datetime import datetime
from prometheus_client import start_http_server, Gauge
import time
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# GitHub API variables from environment variables
GITHUB_TOKEN = os.getenv('GITHUB_TOKEN')
REPO_OWNER = os.getenv('REPO_OWNER')
REPO_NAME = os.getenv('REPO_NAME')
PORT = os.getenv('PORT', '8084')  # default to 8084 (matching the sample .env) if PORT is unset

# Prometheus metrics
deployment_frequency_gauge = Gauge('deployment_frequency', 'Total successful deployments', ['environment'])
cfr_gauge = Gauge('change_failure_rate_percentage', 'Change Failure Rate (CFR) in percentage', ['environment'])
lead_time_gauge = Gauge('average_lead_time_for_changes_seconds', 'Average Lead Time for Changes in seconds', ['environment'])
mttr_gauge = Gauge('mean_time_to_recovery_seconds', 'Mean Time to Recovery (MTTR) in seconds', ['environment'])

# Mapping of workflow filenames to environment labels
WORKFLOW_ENV_MAPPING = {
    'Staging': 'cd.staging.yml',
    'Prod': 'cd.prod.yml',
    'Dev': 'cd.dev.yml',
    'CI': 'ci.yml',
    'PR Deploy': 'pr-deploy.yml'
}

# Store the last processed timestamp for each environment
last_processed_timestamps = {}

def fetch_workflow_runs(workflow_filename):
    url = f"https://api.github.com/repos/{REPO_OWNER}/{REPO_NAME}/actions/workflows/{workflow_filename}/runs"
    headers = {
        'Authorization': f'Bearer {GITHUB_TOKEN}',
        'Accept': 'application/vnd.github+json',
        'X-GitHub-Api-Version': '2022-11-28'
    }
    params = {'status': 'completed', 'per_page': 100}

    runs = []
    page = 1

    while True:
        response = requests.get(url, headers=headers, params={**params, 'page': page})
        if response.status_code != 200:
            print(f"Failed to retrieve workflow runs for {workflow_filename}: {response.status_code} - {response.text}")
            break

        result = response.json()
        runs.extend(result.get('workflow_runs', []))

        if 'next' not in response.links:
            break

        page += 1

    return runs

def get_commits():
    url = f"https://api.github.com/repos/{REPO_OWNER}/{REPO_NAME}/commits"
    headers = {
        'Authorization': f'Bearer {GITHUB_TOKEN}',
        'Accept': 'application/vnd.github+json',
        'X-GitHub-Api-Version': '2022-11-28'
    }
    params = {'per_page': 100}

    commits = []
    page = 1

    while True:
        response = requests.get(url, headers=headers, params={**params, 'page': page})
        if response.status_code != 200:
            print(f"Failed to retrieve commits: {response.status_code} - {response.text}")
            break

        commit_data = response.json()
        if not commit_data:
            break

        for commit in commit_data:
            commit_sha = commit['sha']
            commit_date = datetime.strptime(commit['commit']['committer']['date'], "%Y-%m-%dT%H:%M:%SZ")
            commits.append({'sha': commit_sha, 'date': commit_date})

        page += 1

    return commits

def get_deployment_data(workflow_filename, last_timestamp):
    url = f"https://api.github.com/repos/{REPO_OWNER}/{REPO_NAME}/actions/workflows/{workflow_filename}/runs"
    headers = {
        'Authorization': f'Bearer {GITHUB_TOKEN}',
        'Accept': 'application/vnd.github+json',
        'X-GitHub-Api-Version': '2022-11-28'
    }
    params = {'status': 'completed', 'per_page': 100}

    deployments = []
    page = 1

    while True:
        response = requests.get(url, headers=headers, params={**params, 'page': page})
        if response.status_code != 200:
            print(f"Failed to retrieve deployments for {workflow_filename}: {response.status_code} - {response.text}")
            break

        run_data = response.json().get('workflow_runs', [])
        if not run_data:
            break

        for run in run_data:
            run_date = datetime.strptime(run['created_at'], "%Y-%m-%dT%H:%M:%SZ")
            if run['conclusion'] == 'success' and (last_timestamp is None or run_date > last_timestamp):
                deployments.append({'sha': run['head_sha'], 'date': run_date})

        page += 1

    return deployments

def calculate_cfr(runs):
    total_deployments = len(runs)
    failed_deployments = sum(1 for run in runs if run['conclusion'] == 'failure')
    return (failed_deployments / total_deployments * 100) if total_deployments > 0 else 0

def calculate_average_lead_time(commits, deployments):
    if not commits or not deployments:
        return 0

    commit_dict = {commit['sha']: commit['date'] for commit in commits}
    deployment_dict = {deployment['sha']: deployment['date'] for deployment in deployments}

    lead_times = [
        (deploy_date - commit_dict[sha]).total_seconds()
        for sha, deploy_date in deployment_dict.items() if sha in commit_dict
    ]

    return sum(lead_times) / len(lead_times) if lead_times else 0

def calculate_mttr(runs):
    failed_deployments = []
    successful_deployments = []

    for run in runs:
        run_date = datetime.strptime(run['created_at'], "%Y-%m-%dT%H:%M:%SZ")
        if run['conclusion'] == 'failure':
            failed_deployments.append(run_date)
        elif run['conclusion'] == 'success':
            successful_deployments.append(run_date)

    mttr_values = [
        (min(next_successful_deployments) - failure_date).total_seconds()
        for failure_date in failed_deployments
        if (next_successful_deployments := [date for date in successful_deployments if date > failure_date])
    ]

    return sum(mttr_values) / len(mttr_values) if mttr_values else 0

def measure_metrics():
    global last_processed_timestamps

    for environment, workflow_filename in WORKFLOW_ENV_MAPPING.items():
        print(f"Processing Workflow: {workflow_filename}")

        # Initialize deployment frequency count
        successful_deployment_count = 0

        # Measure Deployment Frequency
        last_timestamp = last_processed_timestamps.get(environment, None)
        deployments = get_deployment_data(workflow_filename, last_timestamp)
        if deployments:
            if last_timestamp is None:
                # If last_timestamp is None, consider all deployments as new
                successful_deployment_count = len(deployments)
            else:
                successful_deployment_count = len([d for d in deployments if d['date'] > last_timestamp])

            # Record the newest deployment timestamp so the next cycle only counts newer runs
            last_processed_timestamps[environment] = max(d['date'] for d in deployments)

            deployment_frequency_gauge.labels(environment=environment).set(successful_deployment_count)

        # Measure Change Failure Rate (CFR)
        runs = fetch_workflow_runs(workflow_filename)
        cfr = calculate_cfr(runs)
        cfr_gauge.labels(environment=environment).set(cfr)

        # Measure Lead Time for Changes (LTC)
        commits = get_commits()
        deployments = get_deployment_data(workflow_filename, last_timestamp)
        average_lead_time = calculate_average_lead_time(commits, deployments)
        lead_time_gauge.labels(environment=environment).set(average_lead_time)

        # Measure Mean Time to Recovery (MTTR)
        runs = fetch_workflow_runs(workflow_filename)
        average_mttr = calculate_mttr(runs)
        mttr_gauge.labels(environment=environment).set(average_mttr)

        print(f"Environment: {environment}")
        print(f"  Deployment Frequency: {successful_deployment_count}")
        print(f"  Change Failure Rate: {cfr:.2f}%")
        print(f"  Average Lead Time for Changes: {average_lead_time:.2f} seconds")
        print(f"  Mean Time to Recovery: {average_mttr:.2f} seconds\n")

if __name__ == '__main__':
    # Start the Prometheus HTTP server on the specified port 
    start_http_server(int(PORT))
    print(f"Prometheus server started on port {PORT}")

    # Measure and report metrics every 6 minutes
    while True:
        measure_metrics()
        time.sleep(360)

Explanation of the Python script

  1. Environment Variable Loading: The script loads environment variables (e.g., GITHUB_TOKEN, REPO_OWNER, REPO_NAME, PORT) from a .env file using the dotenv package.

  2. Prometheus Metrics Setup:

It defines four Prometheus metrics:
  • deployment_frequency_gauge: Tracks the number of successful deployments.
  • cfr_gauge: Tracks the Change Failure Rate (percentage of failed deployments).
  • lead_time_gauge: Tracks the Average Lead Time for Changes.
  • mttr_gauge: Tracks the Mean Time to Recovery after a failure.

  3. Workflow Environment Mapping:

The script uses a dictionary (WORKFLOW_ENV_MAPPING) to map environment names to their corresponding GitHub Actions workflow filenames.

  4. GitHub API Interaction:

The script interacts with the GitHub API to:
  • Fetch workflow runs for specific environments.
  • Retrieve the commit history of the repository.
  • Gather deployment data, including timestamps and statuses of runs.

  5. Metric Calculation:
  • **Deployment Frequency**: Counts the number of successful deployments within a certain time frame.
  • **Change Failure Rate (CFR)**: Calculates the percentage of failed deployments relative to the total number of deployments.
  • **Lead Time for Changes (LTC)**: Calculates the average time between a commit and its deployment.
  • **Mean Time to Recovery (MTTR)**: Measures the average time it takes to recover from a failed deployment (time between a failure and the next successful deployment).

  6. Prometheus Server:

The script starts an HTTP server on a specified port (defined in PORT), allowing Prometheus to scrape the metrics.

  7. Continuous Monitoring:

The script enters an infinite loop where it measures and reports metrics every 6 minutes (360 seconds).

  8. Error Handling: Basic error handling is included for failed API requests (e.g., printing error messages).

requirements.txt

requests==2.31.0
prometheus_client==0.16.0
python-dotenv==1.0.0

The requirements.txt file contains the necessary dependencies. Install them into a virtual environment before starting the service, as shown below.
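One way to install them, assuming the virtual environment referenced by the systemd unit below lives inside the exporter folder:

cd anchor-dora-exporter
python3 -m venv venv
venv/bin/pip install -r requirements.txt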

anchor-dora.service

[Unit]
Description=Anchor Dora Exporter
After=network.target

[Service]
Type=simple
User=[user]
Group=[group]
WorkingDirectory=/path/to/directory/
ExecStart=/path/to/venv/bin/python3 /path/to/directory/main.py
Restart=always
RestartSec=10


[Install]
WantedBy=multi-user.target

Note that User, Group, WorkingDirectory, and ExecStart should be configured to fit your use case.

Start the systemd service, enable it, and check its status with the following commands:

systemctl start anchor-dora.service
systemctl enable anchor-dora.service
systemctl status anchor-dora.service
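Once the service is running, you can confirm that the exporter is exposing metrics (assuming the default port 8084 from the sample .env):

# Should list the deployment_frequency, change_failure_rate_percentage,
# average_lead_time_for_changes_seconds and mean_time_to_recovery_seconds gauges
curl http://localhost:8084/metrics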

Step 3. Edit the Prometheus config file.

We will edit the Prometheus configuration file (prometheus.yml) to include a new scrape job for our metrics:

  • Navigate to /etc/prometheus/prometheus.yml and append the following job under the scrape_configs section:
  - job_name: 'anchor-dora-exporter'
    static_configs:
      - targets: ['localhost:8084']
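After saving the change, you can validate the file and restart Prometheus so it picks up the new job (assuming Prometheus runs as a systemd service named prometheus):

promtool check config /etc/prometheus/prometheus.yml   # validate the edited configuration
sudo systemctl restart prometheus                      # apply the new scrape job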

Step 4. Set up Grafana Dashboards

Now that Prometheus is scraping the DORA metrics, we can create dashboards in Grafana to visualize these metrics.

  1. Add Prometheus as a Data Source in Grafana:
  • Log in to your Grafana instance.
  • Navigate to Configuration > Data Sources from the left sidebar.
  • Click Add data source and select Prometheus.
  • Set the URL to http://localhost:9091 (or wherever your Prometheus instance is running).
  • Click Save & Test to confirm the data source is working.
  2. Create a Dashboard:
  • From the Grafana home screen, click + (Create) and select Dashboard.
  • Click Add new panel to begin setting up your first panel.
  3. Set Up Panels for Each Metric:

For each DORA metric, you can create a separate panel:

  • Deployment Frequency:
    • In the Query section, enter:
sum(deployment_frequency{environment="Staging"}) by (environment)
    • In the Visualization section, select a Time series graph or a Stat panel, depending on how you want to visualize the data.
  • Change Failure Rate (CFR):
    • In the Query section, enter:
avg(change_failure_rate_percentage{environment="Staging"}) by (environment)
    • Again, choose your preferred visualization.
  • Average Lead Time for Changes:
    • In the Query section, enter:
avg(average_lead_time_for_changes_seconds{environment="Staging"}) by (environment)
    • Select the appropriate visualization.
  • Mean Time to Recovery (MTTR):
    • In the Query section, enter:
avg(mean_time_to_recovery_seconds{environment="Staging"}) by (environment)
    • Select the appropriate visualization.
  4. Repeat for Other Environments:

If you have multiple environments (like Production, Staging, etc.), repeat the above steps, changing the environment label in the PromQL queries accordingly; see the example below.
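For example, to chart deployment frequency for the production workflow (labeled Prod in WORKFLOW_ENV_MAPPING):

sum(deployment_frequency{environment="Prod"}) by (environment)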

  5. Save the Dashboard

Once you have set up all the panels, give your dashboard a meaningful name (e.g., "DORA Metrics Dashboard") and click Save.
