Server Monitoring And Alerts
This guide provides step-by-step instructions for setting up a comprehensive monitoring infrastructure using Node Exporter, Prometheus, and Grafana. This setup is designed to enable efficient system metric collection, storage, and visualization, with a focus on accommodating the needs of a Python DevOps team.
Key components covered in this documentation:
- Node Exporter: For collecting and exposing system metrics
- Prometheus: For scraping and storing time-series data
- Grafana: For visualizing metrics and setting up alerts
The guide is structured to walk you through the installation and configuration of each component, including:
- Detailed installation steps for each tool
- Creation of necessary system users and directories
- Configuration of systemd service files for automatic startup
- Basic setup of Prometheus to scrape Node Exporter metrics
- Initial Grafana configuration
Additionally, this document includes a section on Grafana user management and access control, ensuring that you can properly manage user access and permissions within your monitoring setup.
By following this guide, you'll establish a robust monitoring system capable of tracking system performance, visualizing key metrics, and setting up alerts for your DevOps environment.
Let's start with the installation of Node Exporter.
- Download and install Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xvfz node_exporter-1.8.2.linux-amd64.tar.gz
sudo mv node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin
rm -rf node_exporter-1.8.2.linux-amd64.tar.gz
- Create a system user for Node Exporter:
sudo useradd -rs /bin/false node_exporter
The -rs flags in the useradd command serve the following purposes:
- -r: adds a system account. System accounts are typically used for services and have user IDs below 1000.
- -s /bin/false: sets the user's login shell to /bin/false, preventing the user from logging in interactively.
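You can confirm the account was created as intended with a quick check (getent is available on most Linux systems):
getent passwd node_exporter
The returned entry should end with /bin/false, confirming the account has no interactive login shell.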
- Create a systemd service file
sudo nano /etc/systemd/system/node_exporter.service
Add the following contents to the service file:
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Start and enable the node_exporter.service systemd service:
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter
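To verify that Node Exporter is running and exposing metrics on its default port (9100), check the service and fetch a few metric lines:
sudo systemctl status node_exporter
curl -s http://localhost:9100/metrics | head -n 5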
Now that Node Exporter has been installed, we will set up Prometheus and Grafana for the python-devops team.
Prometheus setup
- Switch to the python-devops account:
sudo su - python-devops
- Install Prometheus for the Python DevOps team. Download and install Prometheus:
wget https://github.com/prometheus/prometheus/releases/download/v2.53.1/prometheus-2.53.1.linux-amd64.tar.gz
tar xvfz prometheus-2.53.1.linux-amd64.tar.gz
sudo mv prometheus-2.53.1.linux-amd64 /opt/prometheus-python
rm -rf prometheus-2.53.1.linux-amd64.tar.gz
- Create a system user for Prometheus
sudo useradd -rs /bin/false prometheus
- Create necessary directories and set permissions
sudo mkdir /etc/prometheus /var/lib/prometheus-python
sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus-python
sudo chown -R prometheus:prometheus /opt/prometheus-python/
The command sudo mkdir /etc/prometheus /var/lib/prometheus-python creates two directories:
/etc/prometheus: This is typically used for configuration files related to Prometheus. This is where your Prometheus configuration file (prometheus.yml) is placed.
/var/lib/prometheus-python: This directory is used for storing Prometheus data, such as time-series data.
The command sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus-python changes the ownership of the directories. This ensures that the prometheus user has the necessary permissions to read and write to these directories, which is essential for Prometheus to operate correctly when running under this specific user.
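As a quick sanity check, you can list the directories and confirm the owner and group:
ls -ld /etc/prometheus /var/lib/prometheus-python /opt/prometheus-python
Each entry should show prometheus prometheus as owner and group.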
- Finally, create a configuration file in the /etc/prometheus directory:
sudo nano /etc/prometheus/prometheus.yml
Add the following to the configuration file:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
This configuration tells Prometheus to scrape metrics from the Node Exporter instance on localhost:9100 every 15 seconds and store them in its TSDB (time-series database); the retention period is set via the --storage.tsdb.retention.time flag in the systemd service below.
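Before starting Prometheus, it can help to validate the configuration with promtool, which ships in the tarball we moved to /opt/prometheus-python:
/opt/prometheus-python/promtool check config /etc/prometheus/prometheus.yml
It should report SUCCESS if the file is valid.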
- Create a systemd service file for the prometheus-python.service:
sudo nano /etc/systemd/system/prometheus-python.service
Add the following content:
[Unit]
Description=Prometheus for Python DevOps Team
After=network.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus-python/prometheus \
--config.file /etc/prometheus/prometheus.yml \
--storage.tsdb.path /var/lib/prometheus-python/ \
--web.console.templates=/opt/prometheus-python/consoles \
--web.console.libraries=/opt/prometheus-python/console_libraries \
--web.listen-address=:9091 \
--storage.tsdb.retention.time=1y
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
This systemd service file configures a dedicated Prometheus instance for the Python DevOps team.
Service Description
- Starts a Prometheus instance specifically for the Python DevOps Team.
Service Configuration
- Runs as the prometheus user and group.
- Uses the prometheus.yml configuration file located in /etc/prometheus/.
- Stores metrics in /var/lib/prometheus-python/.
- Serves the Prometheus web console on port 9091.
Service Management
- Starts the service after the network target is reached (After=network.target).
- The WantedBy=multi-user.target line in the [Install] section enables the service to be started automatically on boot.
Start and enable the service:
sudo systemctl daemon-reload
sudo systemctl start prometheus-python
sudo systemctl enable prometheus-python
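Once the service is up, you can confirm the instance is healthy on its custom port (9091, set via --web.listen-address):
sudo systemctl status prometheus-python
curl -s http://localhost:9091/-/healthy
The health endpoint returns a short message indicating the server is healthy.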
Grafana setup
Install Grafana for the python-devops team:
- Download and Install Grafana
wget https://dl.grafana.com/oss/release/grafana-11.1.3.linux-amd64.tar.gz
tar -zxvf grafana-11.1.3.linux-amd64.tar.gz
sudo mv grafana-v11.1.3 /opt/grafana-python
rm -rf grafana-11.1.3.linux-amd64.tar.gz
- Create a system user for Grafana
sudo useradd -rs /bin/false grafana-python
- Create necessary directories and set permissions
sudo mkdir /etc/grafana-python /var/lib/grafana-python
sudo chown grafana-python:grafana-python /etc/grafana-python /var/lib/grafana-python
- Create a configuration file
sudo nano /etc/grafana-python/grafana.ini
- Add the following content to the configuration file:
[server]
http_port = 3001
[paths]
data = /var/lib/grafana-python
logs = /var/log/grafana-python
plugins = /var/lib/grafana-python/plugins
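Note that the logs path above is not created by the earlier steps, and Grafana needs to be able to write there. Assuming you keep the paths from this grafana.ini, create the directory and hand it to the grafana-python user first:
sudo mkdir /var/log/grafana-python
sudo chown grafana-python:grafana-python /var/log/grafana-python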
- Create a systemd service file:
sudo nano /etc/systemd/system/grafana-python.service
- Add the following content:
[Unit]
Description=Grafana for Python DevOps Team
After=network.target
[Service]
User=grafana-python
Group=grafana-python
Type=simple
ExecStart=/opt/grafana-python/bin/grafana-server \
-config /etc/grafana-python/grafana.ini \
-homepath /opt/grafana-python
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
- Start and enable the service:
sudo systemctl daemon-reload
sudo systemctl start grafana-python
sudo systemctl enable grafana-python
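To confirm Grafana is listening on the custom port from grafana.ini, you can query its health endpoint:
curl -s http://localhost:3001/api/health
It returns a small JSON payload reporting the database status and version when Grafana is ready.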
Now that Node Exporter is running, Prometheus needs to scrape its metrics. If you did not add the scrape job to your prometheus.yml file earlier, add it now:
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
Restart Prometheus after adding these changes:
sudo systemctl restart prometheus-python
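You can confirm that Prometheus has picked up the Node Exporter target via its HTTP API (port 9091 in this setup):
curl -s http://localhost:9091/api/v1/targets | grep -o '"health":"[a-z]*"'
Each scraped target should report "health":"up".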
One of the great advantages of using Node Exporter is the availability of pre-built dashboards. We will import a comprehensive dashboard for our Node Exporter metrics.
- Sign in to your Grafana URL, e.g., https://api-python.boilerplate.hng.tech:3001
- Click on Dashboards, create a new dashboard via Import, and enter 1860 (the Node Exporter Full dashboard) as the dashboard ID
- Select your Prometheus data source in the dropdown. Click "Import" to finalize.
You should now see a detailed dashboard with various panels showing CPU, memory, disk, and network metrics.
Now that we have our metrics visualized, let's set up some alerts to notify us when things go wrong.
Creating Contact Points and Custom Notification Templates in Grafana
Contact points in Grafana are the destinations where your alerts will be sent. These can be various communication channels such as email, Slack, PagerDuty, webhooks, and more. Our communication channel is Slack, so we will create a Slack app and webhook URL.
- Follow these steps to create a Slack app bot and webhook URL
- Navigate to Contact points, set the integration to Slack, and include the Slack webhook URL
- In the optional Slack settings, set up a custom title and message
Custom title
The custom title template creates a concise, informative title for each alert:
{{ define "alerts.title" -}}
{{ if .Alerts.Firing -}}
{{ range .Alerts.Firing }}
Anchor-Python Alert: {{ .Labels.alertname }} - Severity: {{ index .Labels "severity" | toUpper }}
{{ end }}
{{- end }}
{{ if .Alerts.Resolved -}}
{{ range .Alerts.Resolved }}
Anchor-Python Alert: {{ .Labels.alertname }} - Severity: {{ index .Labels "severity" | toUpper }}
{{ end }}
{{- end }}
{{- end }}
This template creates a separate title for each alert, includes the alert name and severity, and distinguishes between firing and resolved alerts.
Custom message
The message template provides more detailed information about each alert:
{{ define "alerts.message" -}}
{{ if .Alerts.Firing -}}
🚨 {{ len .Alerts.Firing }} Alert(s) Firing 🚨
{{ range .Alerts.Firing }}
---
🔔 *Alert:* {{ .Labels.alertname }}
📊 *Severity:* {{ index .Labels "severity" | toUpper }}
📝 *Summary:* {{ index .Annotations "summary" }}
{{- if index .Annotations "description" }}
🔍 *Description:* {{ index .Annotations "description" }}
{{- end }}
⏰ *Started:* {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{- end }}
{{- if .Alerts.Resolved -}}
✅ {{ len .Alerts.Resolved }} Alert(s) Resolved
{{ range .Alerts.Resolved }}
---
🔔 *Alert:* {{ .Labels.alertname }}
📝 *Summary:* {{ index .Annotations "summary" }}
{{- if index .Annotations "description" }}
🔍 *Description:* {{ index .Annotations "description" }}
{{- end }}
⏰ *Started:* {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{- if .EndsAt.After .StartsAt }}
🏁 *Ended:* {{ .EndsAt.Format "2006-01-02 15:04:05" }}
{{- end }}
{{ end }}
{{- end }}
{{- end }}
This custom message shows the total number of firing/resolved alerts; provides detailed information for each alert, including name, severity, summary, and timing; uses emojis and formatting for improved readability; and separates alerts with horizontal lines for clarity.
Use the "Test" button to send a sample notification.
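Independently of Grafana's Test button, you can verify the webhook itself with a plain curl call (the webhook URL below is a placeholder):
curl -X POST -H 'Content-type: application/json' --data '{"text":"Test notification from Grafana setup"}' https://hooks.slack.com/services/XXX/YYY/ZZZ
Slack responds with ok when the webhook accepts the message.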
Navigate to Alerting > Alert rules. Click "New alert rule"
We set up alert rules for different system metrics:
CPU usage
Query:
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle", instance="localhost:9100"}[5m])))
This query computes the CPU usage percentage by subtracting the average idle CPU time rate from 1 and multiplying by 100 to convert it into a percentage.
The reduce function captures the last value of the computed CPU usage. Threshold triggers an alert when the CPU usage exceeds 80%. Math triggers an additional alert when the CPU usage exceeds 90%.
We also set up custom labels for severity that are sent out when the Threshold alert condition is triggered and when the Math alert condition is triggered.
{{ if (gt $values.B.Value 90) -}}
Critical
{{ else if (gt $values.B.Value 80) -}}
Warning
{{ else -}}
low
{{- end }}
The alert message contains Severity: Critical when the CPU usage is above 90%, and Severity: Warning when it is above 80%.
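You can evaluate the same expression by hand against the Prometheus HTTP API (port 9091 in this setup) to see the value the alert rule acts on; --data-urlencode handles the escaping of the query string:
curl -sG 'http://localhost:9091/api/v1/query' --data-urlencode 'query=100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle", instance="localhost:9100"}[5m])))'
The second element of the returned "value" array is the current CPU usage percentage.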
Memory usage
Query:
(1 - (node_memory_MemAvailable_bytes{instance="localhost:9100", job="node_exporter"} / node_memory_MemTotal_bytes{instance="localhost:9100", job="node_exporter"})) * 100
This query calculates the percentage of used memory by determining the proportion of memory that is not available (i.e., used), and then converting this fraction into a percentage.
The reduce function captures the last value of the computed memory usage. Threshold triggers an alert when the memory usage exceeds 80%. Math triggers an additional alert when the memory usage exceeds 90%.
We also set up custom labels for severity that are sent out when the Threshold alert condition is triggered and when the Math alert condition is triggered.
{{ if (gt $values.B.Value 90) -}}
Critical
{{ else if (gt $values.B.Value 80) -}}
Warning
{{ else -}}
low
{{- end }}
The alert message contains Severity: Critical when the memory usage is above 90%, and Severity: Warning when it is above 80%.
Disk usage
Query:
100 - ((node_filesystem_avail_bytes{instance="localhost:9100",job="node_exporter",mountpoint="/",fstype!="rootfs"} * 100) / node_filesystem_size_bytes{instance="localhost:9100",job="node_exporter",mountpoint="/",fstype!="rootfs"})
This query calculates the percentage of used space on the filesystem mounted at "/", excluding filesystems of type "rootfs". It determines the proportion of the filesystem that is occupied by subtracting the available space percentage from 100.
The reduce function captures the last value of the computed disk usage. Threshold triggers an alert when the disk usage exceeds 80%. Math triggers an additional alert when the disk usage exceeds 90%.
We also set up custom labels for severity that are sent out when the Threshold alert condition is triggered and when the Math alert condition is triggered.
{{ if (gt $values.B.Value 90) -}}
Critical
{{ else if (gt $values.B.Value 80) -}}
Warning
{{ else -}}
low
{{- end }}
The alert message contains Severity: Critical when the disk usage is above 90%, and Severity: Warning when it is above 80%.
Network throughput
Query:
irate(node_network_transmit_bytes_total{instance="localhost:9100",job="node_exporter"}[5m])*8
This query calculates the near-instantaneous rate of network data transmission in bits per second over a 5-minute period. It provides an understanding of the network throughput for the specified instance, which is helpful for monitoring network performance and detecting potential issues like network congestion.
The reduce function captures the last value of the computed network throughput. Threshold triggers an alert when the transmitted data rate exceeds 100 Mbit/s.
Network errors
Query:
increase(node_network_transmit_errs_total[1h]) + increase(node_network_receive_errs_total[1h])
This query calculates the total number of network transmission and reception errors over the past hour. It provides a comprehensive view of network reliability and helps in identifying and diagnosing network issues.
The reduce function captures the last value of the computed network errors. Threshold triggers an alert when the network errors exceed 2. Math triggers an additional alert when the network errors exceed 5.
We also set up custom labels for severity that are sent out when the Threshold alert condition is triggered and when the Math alert condition is triggered.
{{ if (gt $values.B.Value 5) -}}
Critical
{{ else if (gt $values.B.Value 2) -}}
Warning
{{ else -}}
low
{{- end }}
The alert message contains Severity: Critical when the network errors are above 5, and Severity: Warning when they are above 2.
System load
Query:
scalar(node_load1{instance="localhost:9100",job="node_exporter"}) * 100 /count(count(node_cpu_seconds_total{instance="localhost:9100",job="node_exporter"}) by (cpu))
This query calculates the 1-minute CPU load average as a percentage of the total CPU capacity. It provides an indication of how heavily the CPUs are being utilized relative to their total capacity.
The reduce function captures the last value of the computed system load. Threshold triggers an alert when the system load is above 80. Math triggers an additional alert when the system load is above 90.
We also set up custom labels for severity that are sent out when the Threshold alert condition is triggered and when the Math alert condition is triggered.
{{ if (gt $values.B.Value 90) -}}
Critical
{{ else if (gt $values.B.Value 80) -}}
Warning
{{ else -}}
low
{{- end }}
The alert message contains Severity: Critical when the system load is above 90, and Severity: Warning when it is above 80.
Grafana User Management and Access Control
To invite or create a user:
- Navigate to "Server Admin" > "Users"
- Click "Invite User" or "New User"
- Fill in required details:
- Name
- Username (if "New User")
- Password (if "New User")
- Set organization and role
- Click "Submit"
User roles:
- Admin: Full access to all organizations
- Editor: Can edit and create dashboards
- Viewer: Can view dashboards
Within an organization:
- Admin: Manage users and permissions within the organization
- Editor: Edit and create dashboards
- Viewer: View dashboards
To create a team:
- Go to "Configuration" > "Teams"
- Click "New Team"
- Add members and set permissions
To set folder permissions:
- Navigate to the desired folder
- Click the gear icon > "Permissions"
- Add users, teams, or roles
- Set appropriate access levels
To set dashboard permissions:
- Open the dashboard
- Click the gear icon > "Permissions"
- Add users, teams, or roles
- Set viewer, editor, or admin access