Dockerization, Updated Pipeline Workflow and DORA Metrics
This document outlines the process of Dockerizing our existing applications, setting up environment-specific configurations, and refactoring our Continuous Integration/Continuous Deployment (CI/CD) pipeline to meet our new deployment needs.
We created a Dockerfile to build our application into an image, along with environment-specific Docker Compose files to deploy the container in each environment. The Compose files bring up containers for our backend code and its database, and those containers are then mapped to the Nginx server already running on our host.
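As an illustration, a minimal environment-specific Compose file could look like the sketch below. The service names, image name, ports, and credentials here are placeholders, not the actual values from our repository:

# docker-compose.staging.yml - illustrative sketch only
services:
  backend:
    image: anchor-backend:staging        # placeholder image name
    env_file: .env
    ports:
      - "7001:8000"                      # host port that the existing Nginx server proxies to
    depends_on:
      - db
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: app                 # placeholder credentials
      POSTGRES_PASSWORD: change-me
      POSTGRES_DB: app_db
    volumes:
      - db_data:/var/lib/postgresql/data
volumes:
  db_data: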
Our old pipeline involved two workflows: one to build and test the application on pull requests made to the repo, and another to deploy the application to our server once the PR was merged and the build-and-test workflow had succeeded. We had environment-specific workflows for the dev, staging, and main branches (main being our production branch).
Our new pipeline involves three workflows. The first tests the application to ensure it meets our standards; a pr-deploy workflow then runs to deploy the application as a Docker container in an isolated environment. Once these workflows pass and the PR is merged, the last workflow runs: it builds the application image on the pipeline, uses docker save and gzip to compress the image, securely copies the image to our server, then uses docker load to load it and docker-compose up to start the container.
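A rough sketch of the commands that final workflow runs is shown below. The image name, tag, SSH target, and remote paths are placeholders; in the actual workflow these steps are wired together through GitHub Actions steps and secrets:

# Build and compress the image on the pipeline runner (illustrative names)
docker build -t anchor-backend:prod .
docker save anchor-backend:prod | gzip > anchor-backend.tar.gz

# Securely copy the compressed image to the server
scp anchor-backend.tar.gz user@server:/opt/anchor/

# On the server: load the image and start the container
ssh user@server "gunzip -c /opt/anchor/anchor-backend.tar.gz | docker load \
  && docker-compose -f /opt/anchor/docker-compose.prod.yml up -d"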
The DevOps Research and Assessment (DORA) metrics are used to measure DevOps performance; they help improve DevOps efficiency and communicate performance to business stakeholders. Four key metrics, divided into two core areas of DevOps, are being monitored for our project:
- Deployment frequency and Lead time for changes (which measure team velocity)
- Change failure rate and Time to restore service (which measure stability)
To access these core metrics for our project, we will make use of the GitHub API to collect data about our repository's workflow runs, pull requests, and commits. We will also write a Python script that uses the Prometheus client library to expose these metrics, which can then be scraped by a Prometheus server for monitoring and alerting, and visualized with Grafana.
Create a directory for the exporter:
mkdir anchor-dora-exporter
Within this folder, create the files for the Python code, the dependencies, and the systemd service, named main.py, requirements.txt, and anchor-dora.service respectively. Also create a file named .env from which the script will read the environment variables it needs.
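For example, from inside the directory created above:

cd anchor-dora-exporter
touch main.py requirements.txt anchor-dora.service .env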
A sample .env file:
GITHUB_TOKEN=[Input github token here]
REPO_OWNER=hngprojects
REPO_NAME=hng_boilerplate_python_fastapi_web
PORT=8084
The contents of the files are shown below:
main.py
import requests
from datetime import datetime
from prometheus_client import start_http_server, Gauge
import time
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# GitHub API variables from environment variables
GITHUB_TOKEN = os.getenv('GITHUB_TOKEN')
REPO_OWNER = os.getenv('REPO_OWNER')
REPO_NAME = os.getenv('REPO_NAME')
PORT = os.getenv('PORT')

# Prometheus metrics
deployment_frequency_gauge = Gauge('deployment_frequency', 'Total successful deployments', ['environment'])
cfr_gauge = Gauge('change_failure_rate_percentage', 'Change Failure Rate (CFR) in percentage', ['environment'])
lead_time_gauge = Gauge('average_lead_time_for_changes_seconds', 'Average Lead Time for Changes in seconds', ['environment'])
mttr_gauge = Gauge('mean_time_to_recovery_seconds', 'Mean Time to Recovery (MTTR) in seconds', ['environment'])

# Mapping of environment labels to their workflow filenames
WORKFLOW_ENV_MAPPING = {
    'Staging': 'cd.staging.yml',
    'Prod': 'cd.prod.yml',
    'Dev': 'cd.dev.yml',
    'CI': 'ci.yml',
    'PR Deploy': 'pr-deploy.yml'
}

# Store the last processed timestamp for each environment
last_processed_timestamps = {}


def fetch_workflow_runs(workflow_filename):
    # Fetch all completed runs for a workflow, following pagination
    url = f"https://api.github.com/repos/{REPO_OWNER}/{REPO_NAME}/actions/workflows/{workflow_filename}/runs"
    headers = {
        'Authorization': f'Bearer {GITHUB_TOKEN}',
        'Accept': 'application/vnd.github+json',
        'X-GitHub-Api-Version': '2022-11-28'
    }
    params = {'status': 'completed', 'per_page': 100}
    runs = []
    page = 1
    while True:
        response = requests.get(url, headers=headers, params={**params, 'page': page})
        if response.status_code != 200:
            print(f"Failed to retrieve workflow runs for {workflow_filename}: {response.status_code} - {response.text}")
            break
        result = response.json()
        runs.extend(result.get('workflow_runs', []))
        if 'next' not in response.links:
            break
        page += 1
    return runs


def get_commits():
    # Retrieve the repository's commit history (SHA and committer date)
    url = f"https://api.github.com/repos/{REPO_OWNER}/{REPO_NAME}/commits"
    headers = {
        'Authorization': f'Bearer {GITHUB_TOKEN}',
        'Accept': 'application/vnd.github+json',
        'X-GitHub-Api-Version': '2022-11-28'
    }
    params = {'per_page': 100}
    commits = []
    page = 1
    while True:
        response = requests.get(url, headers=headers, params={**params, 'page': page})
        if response.status_code != 200:
            print(f"Failed to retrieve commits: {response.status_code} - {response.text}")
            break
        commit_data = response.json()
        if not commit_data:
            break
        for commit in commit_data:
            commit_sha = commit['sha']
            commit_date = datetime.strptime(commit['commit']['committer']['date'], "%Y-%m-%dT%H:%M:%SZ")
            commits.append({'sha': commit_sha, 'date': commit_date})
        page += 1
    return commits


def get_deployment_data(workflow_filename, last_timestamp):
    # Collect successful deployments (head SHA and date) newer than last_timestamp
    url = f"https://api.github.com/repos/{REPO_OWNER}/{REPO_NAME}/actions/workflows/{workflow_filename}/runs"
    headers = {
        'Authorization': f'Bearer {GITHUB_TOKEN}',
        'Accept': 'application/vnd.github+json',
        'X-GitHub-Api-Version': '2022-11-28'
    }
    params = {'status': 'completed', 'per_page': 100}
    deployments = []
    page = 1
    while True:
        response = requests.get(url, headers=headers, params={**params, 'page': page})
        if response.status_code != 200:
            print(f"Failed to retrieve deployments for {workflow_filename}: {response.status_code} - {response.text}")
            break
        run_data = response.json().get('workflow_runs', [])
        if not run_data:
            break
        for run in run_data:
            run_date = datetime.strptime(run['created_at'], "%Y-%m-%dT%H:%M:%SZ")
            if run['conclusion'] == 'success' and (last_timestamp is None or run_date > last_timestamp):
                deployments.append({'sha': run['head_sha'], 'date': run_date})
        page += 1
    return deployments


def calculate_cfr(runs):
    # Change Failure Rate: failed runs as a percentage of all completed runs
    total_deployments = len(runs)
    failed_deployments = sum(1 for run in runs if run['conclusion'] == 'failure')
    return (failed_deployments / total_deployments * 100) if total_deployments > 0 else 0


def calculate_average_lead_time(commits, deployments):
    # Lead Time for Changes: average seconds between a commit and its deployment
    if not commits or not deployments:
        return 0
    commit_dict = {commit['sha']: commit['date'] for commit in commits}
    deployment_dict = {deployment['sha']: deployment['date'] for deployment in deployments}
    lead_times = [
        (deploy_date - commit_dict[sha]).total_seconds()
        for sha, deploy_date in deployment_dict.items() if sha in commit_dict
    ]
    return sum(lead_times) / len(lead_times) if lead_times else 0


def calculate_mttr(runs):
    # Mean Time to Recovery: average seconds between a failure and the next success
    failed_deployments = []
    successful_deployments = []
    for run in runs:
        run_date = datetime.strptime(run['created_at'], "%Y-%m-%dT%H:%M:%SZ")
        if run['conclusion'] == 'failure':
            failed_deployments.append(run_date)
        elif run['conclusion'] == 'success':
            successful_deployments.append(run_date)
    mttr_values = [
        (min(next_successful_deployments) - failure_date).total_seconds()
        for failure_date in failed_deployments
        if (next_successful_deployments := [date for date in successful_deployments if date > failure_date])
    ]
    return sum(mttr_values) / len(mttr_values) if mttr_values else 0


def measure_metrics():
    global last_processed_timestamps
    for environment, workflow_filename in WORKFLOW_ENV_MAPPING.items():
        print(f"Processing Workflow: {workflow_filename}")

        # Initialize deployment frequency count
        successful_deployment_count = 0

        # Measure Deployment Frequency
        last_timestamp = last_processed_timestamps.get(environment, None)
        deployments = get_deployment_data(workflow_filename, last_timestamp)
        if deployments:
            latest_timestamp = max(d['date'] for d in deployments)
            if last_timestamp is None:
                # If last_timestamp is None, consider all deployments as new
                successful_deployment_count = len(deployments)
            else:
                successful_deployment_count = len([d for d in deployments if d['date'] > last_timestamp])
            last_processed_timestamps[environment] = latest_timestamp
        deployment_frequency_gauge.labels(environment=environment).set(successful_deployment_count)

        # Measure Change Failure Rate (CFR)
        runs = fetch_workflow_runs(workflow_filename)
        cfr = calculate_cfr(runs)
        cfr_gauge.labels(environment=environment).set(cfr)

        # Measure Lead Time for Changes (LTC)
        commits = get_commits()
        deployments = get_deployment_data(workflow_filename, last_timestamp)
        average_lead_time = calculate_average_lead_time(commits, deployments)
        lead_time_gauge.labels(environment=environment).set(average_lead_time)

        # Measure Mean Time to Recovery (MTTR)
        runs = fetch_workflow_runs(workflow_filename)
        average_mttr = calculate_mttr(runs)
        mttr_gauge.labels(environment=environment).set(average_mttr)

        print(f"Environment: {environment}")
        print(f"  Deployment Frequency: {successful_deployment_count}")
        print(f"  Change Failure Rate: {cfr:.2f}%")
        print(f"  Average Lead Time for Changes: {average_lead_time:.2f} seconds")
        print(f"  Mean Time to Recovery: {average_mttr:.2f} seconds\n")


if __name__ == '__main__':
    # Start the Prometheus HTTP server on the specified port
    start_http_server(int(PORT))
    print(f"Prometheus server started on port {PORT}")

    # Measure and report metrics every 6 minutes
    while True:
        measure_metrics()
        time.sleep(360)
The script works as follows:
- Environment Variable Loading: The script loads environment variables (GITHUB_TOKEN, REPO_OWNER, REPO_NAME, PORT) from a .env file using the dotenv package.
- Prometheus Metrics Setup: It defines four Prometheus metrics:
  - deployment_frequency_gauge: Tracks the number of successful deployments.
  - cfr_gauge: Tracks the Change Failure Rate (percentage of failed deployments).
  - lead_time_gauge: Tracks the Average Lead Time for Changes.
  - mttr_gauge: Tracks the Mean Time to Recovery after a failure.
- Workflow Environment Mapping: The script uses a dictionary (WORKFLOW_ENV_MAPPING) to map environment names to their corresponding GitHub Actions workflow filenames.
- GitHub API Interaction: The script interacts with the GitHub API to fetch workflow runs for specific environments, retrieve the commit history of the repository, and gather deployment data, including timestamps and statuses of runs.
- Metric Calculation:
  - **Deployment Frequency**: Counts the number of successful deployments within a certain time frame.
  - **Change Failure Rate (CFR)**: Calculates the percentage of failed deployments relative to the total number of deployments.
  - **Lead Time for Changes (LTC)**: Calculates the average time between a commit and its deployment.
  - **Mean Time to Recovery (MTTR)**: Measures the average time it takes to recover from a failed deployment (the time between a failure and the next successful deployment).
- Prometheus Server: The script starts an HTTP server on the port defined in PORT, allowing Prometheus to scrape the metrics.
- Continuous Monitoring: The script enters an infinite loop in which it measures and reports metrics every 6 minutes (360 seconds).
- Error Handling: Basic error handling is included for failed API requests (e.g., printing error messages).
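Once the script is running, the metrics it exposes look roughly like the following when scraped (the values shown here are placeholders; the client library also exposes some default process metrics):

# HELP deployment_frequency Total successful deployments
# TYPE deployment_frequency gauge
deployment_frequency{environment="Staging"} 4.0
# HELP change_failure_rate_percentage Change Failure Rate (CFR) in percentage
# TYPE change_failure_rate_percentage gauge
change_failure_rate_percentage{environment="Staging"} 12.5
# HELP average_lead_time_for_changes_seconds Average Lead Time for Changes in seconds
# TYPE average_lead_time_for_changes_seconds gauge
average_lead_time_for_changes_seconds{environment="Staging"} 5400.0
# HELP mean_time_to_recovery_seconds Mean Time to Recovery (MTTR) in seconds
# TYPE mean_time_to_recovery_seconds gauge
mean_time_to_recovery_seconds{environment="Staging"} 1800.0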
requirements.txt
requests==2.31.0
prometheus_client==0.16.0
python-dotenv==1.0.0
The requirements.txt file contains the necessary dependencies.
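Before wiring the script into systemd, you can sanity-check it by installing the dependencies into a virtual environment and running it directly; the paths below are just an example:

cd anchor-dora-exporter
python3 -m venv venv
./venv/bin/pip install -r requirements.txt
./venv/bin/python3 main.py

# In another terminal, confirm the metrics endpoint responds
curl http://localhost:8084/metrics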
anchor-dora.service
[Unit]
Description=Anchor Dora Exporter
After=network.target
[Service]
Type=simple
User=[user]
Group=[group]
WorkingDirectory=/path/to/directory/
ExecStart=/path/to/venv/bin/python3 /path/to/directory/main.py
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Note that User, Group, WorkingDirectory, and ExecStart should be configured to fit your use case.
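Depending on your setup, you will typically need to copy the unit file into the systemd directory and reload the daemon before starting the service:

sudo cp anchor-dora.service /etc/systemd/system/anchor-dora.service
sudo systemctl daemon-reload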
Start the systemd service, enable it, and check its status with the following commands:
systemctl start anchor-dora.service
systemctl enable anchor-dora.service
systemctl status anchor-dora.service
We will edit the Prometheus configuration file to include a new job for our metrics:
- Open /etc/prometheus/prometheus.yml and append the following under the scrape_configs section:
  - job_name: 'anchor-dora-exporter'
    static_configs:
      - targets: ['localhost:8084']
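Prometheus only picks up configuration changes on a restart or reload, so after saving the file run one of the following (adjust to how your Prometheus instance is managed; the reload endpoint only works if Prometheus was started with --web.enable-lifecycle):

sudo systemctl restart prometheus
# or, adjusting host/port to your Prometheus instance:
curl -X POST http://localhost:9091/-/reload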
Now that Prometheus is scraping the DORA metrics, we can create dashboards in Grafana to visualize these metrics.
- Add Prometheus as a Data Source in Grafana:
- Log in to your Grafana instance.
- Navigate to Configuration > Data Sources from the left sidebar.
- Click Add data source and select Prometheus.
- Set the URL to http://localhost:9091 (or wherever your Prometheus instance is running).
- Click Save & Test to confirm the data source is working.
- Create a Dashboard:
- From the Grafana home screen, click + (Create) and select Dashboard.
- Click Add new panel to begin setting up your first panel.
- Set Up Panels for Each Metric:
For each DORA metric, you can create a separate panel:
- Deployment Frequency:
- In the Query section, enter:
sum(deployment_frequency{environment="Staging"}) by (environment)
- In the Visualization section, select a Time series graph or a Stat panel, depending on how you want to visualize the data.
- Change Failure Rate (CFR):
- In the Query section, enter:
avg(change_failure_rate_percentage{environment="Staging"}) by (environment)
- Again, choose your preferred visualization.
- Average Lead Time for Changes:
- In the Query section, enter:
avg(average_lead_time_for_changes_seconds{environment="Staging"}) by (environment)
- Select the appropriate visualization.
- Mean Time to Recovery (MTTR):
- In the Query section, enter:
avg(mean_time_to_recovery_seconds{environment="Staging"}) by (environment)
- Select the appropriate visualization.
- Repeat for Other Environments:
If you have multiple environments (like Production, Staging, etc.), repeat the above steps, changing the environment label in the PromQL queries accordingly (or use a dashboard variable, as sketched after this list).
- Save the Dashboard
Once you have set up all the panels, give your dashboard a meaningful name (e.g., "DORA Metrics Dashboard") and click Save.
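As an alternative to per-environment copies of each panel, you can define a Grafana dashboard variable and reference it in the queries. A sketch of that setup (assuming the Prometheus data source added earlier):

Variable name:  environment
Type:           Query
Data source:    Prometheus
Query:          label_values(deployment_frequency, environment)

Example panel query using the variable:
sum(deployment_frequency{environment="$environment"}) by (environment)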