Server Monitoring And Alerts
This guide provides step-by-step instructions for setting up a comprehensive monitoring infrastructure using Node Exporter, Prometheus, and Grafana. This setup is designed to enable efficient system metric collection, storage, and visualization, with a focus on accommodating the needs of a Python DevOps team.
Key components covered in this documentation:
- Node Exporter: For collecting and exposing system metrics
- Prometheus: For scraping and storing time-series data
- Grafana: For visualizing metrics and setting up alerts
The guide is structured to walk you through the installation and configuration of each component, including:
- Detailed installation steps for each tool
- Creation of necessary system users and directories
- Configuration of systemd service files for automatic startup
- Basic setup of Prometheus to scrape Node Exporter metrics
- Initial Grafana configuration
Additionally, this document includes a section on Grafana user management and access control, ensuring that you can properly manage user access and permissions within your monitoring setup.
By following this guide, you'll establish a robust monitoring system capable of tracking system performance, visualizing key metrics, and setting up alerts for your DevOps environment.
Let's start with the installation of Node Exporter.
- Download and install Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xvfz node_exporter-1.8.2.linux-amd64.tar.gz
sudo mv node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin
rm -rf node_exporter-1.8.2.linux-amd64.tar.gz
- Create a system user for Node Exporter:
sudo useradd -rs /bin/false node_exporter
The -rs flags in the useradd command serve the following purposes:
- -r: adds a system account. System accounts are typically used for services and have user IDs below 1000.
- -s /bin/false: sets the user's login shell to /bin/false, preventing the user from logging in interactively.
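You can confirm the account was created as intended with a quick check (getent is available on most Linux systems):
getent passwd node_exporter
The returned entry should end with /bin/false, confirming the account has no interactive login shell.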
- Create a systemd service file
sudo nano /etc/systemd/system/node_exporter.service
Add the following contents to the service file:
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Start and enable the node_exporter.service systemd service:
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter
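To verify that Node Exporter is running and exposing metrics on its default port (9100), check the service and fetch a few metric lines:
sudo systemctl status node_exporter
curl -s http://localhost:9100/metrics | head -n 5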
Now that Node Exporter has been installed, we will set up Prometheus and Grafana for the python-devops team.
Prometheus setup
- Switch to the python-devops account:
sudo su - python-devops
- Install Prometheus for the Python DevOps team. Download and install Prometheus:
wget https://github.com/prometheus/prometheus/releases/download/v2.53.1/prometheus-2.53.1.linux-amd64.tar.gz
tar xvfz prometheus-2.53.1.linux-amd64.tar.gz
sudo mv prometheus-2.53.1.linux-amd64 /opt/prometheus-python
rm -rf prometheus-2.53.1.linux-amd64.tar.gz
- Create a system user for Prometheus
sudo useradd -rs /bin/false prometheus
- Create necessary directories and set permissions
sudo mkdir /etc/prometheus /var/lib/prometheus-python
sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus-python
sudo chown -R prometheus:prometheus /opt/prometheus-python/
The command sudo mkdir /etc/prometheus /var/lib/prometheus-python creates two directories:
/etc/prometheus: This is typically used for configuration files related to Prometheus. This is where your Prometheus configuration file (prometheus.yml) is placed.
/var/lib/prometheus-python: This directory is used for storing Prometheus data, such as time-series data.
The command sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus-python changes the ownership of the directories. This ensures that the prometheus user has the necessary permissions to read and write to these directories, which is essential for Prometheus to operate correctly when running under this specific user.
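As a quick sanity check, you can list the directories and confirm the owner and group:
ls -ld /etc/prometheus /var/lib/prometheus-python /opt/prometheus-python
Each entry should show prometheus prometheus as owner and group.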
- Finally, create a configuration file in the /etc/prometheus directory:
sudo nano /etc/prometheus/prometheus.yml
Add the following to the configuration file:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
This configuration tells Prometheus to scrape metrics from the Node Exporter instance on localhost:9100 every 15 seconds and store them in its TSDB (time-series database); the retention period is set via the --storage.tsdb.retention.time flag in the systemd service below.
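Before starting Prometheus, it can help to validate the configuration with promtool, which ships in the tarball we moved to /opt/prometheus-python:
/opt/prometheus-python/promtool check config /etc/prometheus/prometheus.yml
It should report SUCCESS if the file is valid.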
- Create a systemd service file for the prometheus-python.service:
sudo nano /etc/systemd/system/prometheus-python.service
Add the following content:
[Unit]
Description=Prometheus for Python DevOps Team
After=network.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus-python/prometheus \
--config.file /etc/prometheus/prometheus.yml \
--storage.tsdb.path /var/lib/prometheus-python/ \
--web.console.templates=/opt/prometheus-python/consoles \
--web.console.libraries=/opt/prometheus-python/console_libraries \
--web.listen-address=:9091 \
--storage.tsdb.retention.time=1y
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
This systemd service file configures a dedicated Prometheus instance for the Python DevOps team.
Service Description
- Starts a Prometheus instance specifically for the Python DevOps Team.
Service Configuration
- Runs as the prometheus user and group.
- Uses the prometheus.yml configuration file located in /etc/prometheus/.
- Stores metrics in /var/lib/prometheus-python/.
- Serves the Prometheus web console on port 9091.
Service Management
- Starts the service after the network target is reached (After=network.target).
- The WantedBy=multi-user.target line in the [Install] section enables the service to be started automatically on boot.
Start and enable the service:
sudo systemctl daemon-reload
sudo systemctl start prometheus-python
sudo systemctl enable prometheus-python
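Once the service is up, you can confirm the instance is healthy on its custom port (9091, set via --web.listen-address):
sudo systemctl status prometheus-python
curl -s http://localhost:9091/-/healthy
The health endpoint returns a short message indicating the server is healthy.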
Grafana setup
Install Grafana for the python-devops team:
- Download and Install Grafana
wget https://dl.grafana.com/oss/release/grafana-11.1.3.linux-amd64.tar.gz
tar -zxvf grafana-11.1.3.linux-amd64.tar.gz
sudo mv grafana-v11.1.3 /opt/grafana-python
rm -rf grafana-11.1.3.linux-amd64.tar.gz
- Create a system user for Grafana
sudo useradd -rs /bin/false grafana-python
- Create necessary directories and set permissions
sudo mkdir /etc/grafana-python /var/lib/grafana-python
sudo chown grafana-python:grafana-python /etc/grafana-python /var/lib/grafana-python
- Create a configuration file
sudo nano /etc/grafana-python/grafana.ini
- Add the following content to the configuration file:
[server]
http_port = 3001
[paths]
data = /var/lib/grafana-python
logs = /var/log/grafana-python
plugins = /var/lib/grafana-python/plugins
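Note that the logs path above is not created by the earlier steps, and Grafana needs to be able to write there. Assuming you keep the paths from this grafana.ini, create the directory and hand it to the grafana-python user first:
sudo mkdir /var/log/grafana-python
sudo chown grafana-python:grafana-python /var/log/grafana-python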
- Create a systemd service file:
sudo nano /etc/systemd/system/grafana-python.service
- Add the following content:
[Unit]
Description=Grafana for Python DevOps Team
After=network.target
[Service]
User=grafana-python
Group=grafana-python
Type=simple
ExecStart=/opt/grafana-python/bin/grafana-server \
-config /etc/grafana-python/grafana.ini \
-homepath /opt/grafana-python
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
- Start and enable the service:
sudo systemctl daemon-reload
sudo systemctl start grafana-python
sudo systemctl enable grafana-python
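To confirm Grafana is listening on the custom port from grafana.ini, you can query its health endpoint:
curl -s http://localhost:3001/api/health
It returns a small JSON payload reporting the database status and version when Grafana is ready.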
Now that Node Exporter is running, Prometheus needs to scrape its metrics. If you did not add the scrape job to your prometheus.yml file earlier, add it now:
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
Restart Prometheus after adding these changes:
sudo systemctl restart prometheus-python
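You can confirm that Prometheus has picked up the Node Exporter target via its HTTP API (port 9091 in this setup):
curl -s http://localhost:9091/api/v1/targets | grep -o '"health":"[a-z]*"'
Each scraped target should report "health":"up".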
One of the great advantages of using Node Exporter is the availability of pre-built dashboards. We will import a comprehensive dashboard for our Node Exporter metrics.
- Sign in to your Grafana URL, e.g., https://api-python.boilerplate.hng.tech:3001
- Click on Dashboards, create a new dashboard via Import, and enter 1860 (the Node Exporter Full dashboard) as the dashboard ID
- Select your Prometheus data source in the dropdown. Click "Import" to finalize.
You should now see a detailed dashboard with various panels showing CPU, memory, disk, and network metrics.
Now that we have our metrics visualized, let's set up some alerts to notify us when things go wrong.
Creating Contact Points and Custom Notification Templates in Grafana
Contact points in Grafana are the destinations where your alerts will be sent. These can be various communication channels such as email, Slack, PagerDuty, webhooks, and more. Our communication channel is Slack, so we will create a Slack app and webhook URL.
- Follow these steps to create a Slack app bot and webhook URL
- Navigate to Contact points, set the integration to Slack, and include the Slack webhook URL
- In the optional Slack settings, set up a custom title and message
Custom title
The custom title template creates a concise, informative title for each alert:
{{ define "alerts.title" -}}
{{ if .Alerts.Firing -}}
{{ range .Alerts.Firing }}
Anchor-Python Alert: {{ .Labels.alertname }} - Severity: {{ index .Labels "severity" | toUpper }}
{{ end }}
{{- end }}
{{ if .Alerts.Resolved -}}
{{ range .Alerts.Resolved }}
Anchor-Python Alert: {{ .Labels.alertname }} - Severity: {{ index .Labels "severity" | toUpper }}
{{ end }}
{{- end }}
{{- end }}
This template creates a separate title for each alert, includes the alert name and severity, and distinguishes between firing and resolved alerts.
Custom message
The message template provides more detailed information about each alert:
{{ define "alerts.message" -}}
{{ if .Alerts.Firing -}}
🚨 {{ len .Alerts.Firing }} Alert(s) Firing 🚨
{{ range .Alerts.Firing }}
---
🔔 *Alert:* {{ .Labels.alertname }}
📊 *Severity:* {{ index .Labels "severity" | toUpper }}
📝 *Summary:* {{ index .Annotations "summary" }}
{{- if index .Annotations "description" }}
🔍 *Description:* {{ index .Annotations "description" }}
{{- end }}
⏰ *Started:* {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{- end }}
{{- if .Alerts.Resolved -}}
✅ {{ len .Alerts.Resolved }} Alert(s) Resolved
{{ range .Alerts.Resolved }}
---
🔔 *Alert:* {{ .Labels.alertname }}
📝 *Summary:* {{ index .Annotations "summary" }}
{{- if index .Annotations "description" }}
🔍 *Description:* {{ index .Annotations "description" }}
{{- end }}
⏰ *Started:* {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{- if .EndsAt.After .StartsAt }}
🏁 *Ended:* {{ .EndsAt.Format "2006-01-02 15:04:05" }}
{{- end }}
{{ end }}
{{- end }}
{{- end }}
This custom message shows the total number of firing/resolved alerts; provides detailed information for each alert, including name, severity, summary, and timing; uses emojis and formatting for improved readability; and separates alerts with horizontal lines for clarity.
Use the "Test" button to send a sample notification.
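Independently of Grafana's Test button, you can verify the webhook itself with a plain curl call (the webhook URL below is a placeholder):
curl -X POST -H 'Content-type: application/json' --data '{"text":"Test notification from Grafana setup"}' https://hooks.slack.com/services/XXX/YYY/ZZZ
Slack responds with ok when the webhook accepts the message.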
Navigate to Alerting > Alert rules. Click "New alert rule"
We set up alert rules for different system metrics:
CPU usage
Query:
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle", instance="localhost:9100"}[5m])))
This query computes the CPU usage percentage by subtracting the average idle CPU time rate from 1 and multiplying by 100 to convert it into a percentage.
The reduce function captures the last value of the computed CPU usage. Threshold triggers an alert when the CPU usage exceeds 80%. Math triggers an additional alert when the CPU usage exceeds 90%.
We also set up custom labels for severity that are sent out when the Threshold alert condition is triggered and when the Math alert condition is triggered.
{{ if (gt $values.B.Value 90) -}}
Critical
{{ else if (gt $values.B.Value 80) -}}
Warning
{{ else -}}
low
{{- end }}
The alert message contains Severity: Critical when the CPU usage is above 90%, and Severity: Warning when it is above 80%.
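You can evaluate the same expression by hand against the Prometheus HTTP API (port 9091 in this setup) to see the value the alert rule acts on; --data-urlencode handles the escaping of the query string:
curl -sG 'http://localhost:9091/api/v1/query' --data-urlencode 'query=100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle", instance="localhost:9100"}[5m])))'
The second element of the returned "value" array is the current CPU usage percentage.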
Memory usage
Query:
(1 - (node_memory_MemAvailable_bytes{instance="localhost:9100", job="node_exporter"} / node_memory_MemTotal_bytes{instance="localhost:9100", job="node_exporter"})) * 100
This query calculates the percentage of used memory by determining the proportion of memory that is not available (i.e., used), and then converting this fraction into a percentage.
The reduce function captures the last value of the computed memory usage. Threshold triggers an alert when the memory usage exceeds 80%. Math triggers an additional alert when the memory usage exceeds 90%.
We also set up custom labels for severity that are sent out when the Threshold alert condition is triggered and when the Math alert condition is triggered.
{{ if (gt $values.B.Value 90) -}}
Critical
{{ else if (gt $values.B.Value 80) -}}
Warning
{{ else -}}
low
{{- end }}
The alert message contains Severity: Critical when the memory usage is above 90%, and Severity: Warning when it is above 80%.
Disk usage
Query:
100 - ((node_filesystem_avail_bytes{instance="localhost:9100",job="node_exporter",mountpoint="/",fstype!="rootfs"} * 100) / node_filesystem_size_bytes{instance="localhost:9100",job="node_exporter",mountpoint="/",fstype!="rootfs"})
This query calculates the percentage of used space on the filesystem mounted at "/", excluding filesystems of type "rootfs". It determines the proportion of the filesystem that is occupied by subtracting the available space percentage from 100.
The reduce function captures the last value of the computed disk usage. Threshold triggers an alert when the disk usage exceeds 80%. Math triggers an additional alert when the disk usage exceeds 90%.
We also set up custom labels for severity that are sent out when the Threshold alert condition is triggered and when the Math alert condition is triggered.
{{ if (gt $values.B.Value 90) -}}
Critical
{{ else if (gt $values.B.Value 80) -}}
Warning
{{ else -}}
low
{{- end }}
The alert message contains Severity: Critical when the disk usage is above 90%, and Severity: Warning when it is above 80%.
Network throughput
Query:
irate(node_network_transmit_bytes_total{instance="localhost:9100",job="node_exporter"}[5m])*8
This query calculates the near-instantaneous rate of network data transmission in bits per second over a 5-minute period. It provides an understanding of the network throughput for the specified instance, which is helpful for monitoring network performance and detecting potential issues like network congestion.
The reduce function captures the last value of the computed network throughput. Threshold triggers an alert when the transmitted data rate exceeds 100 Mbit/s.
Network errors
Query:
increase(node_network_transmit_errs_total[1h]) + increase(node_network_receive_errs_total[1h])
This query calculates the total number of network transmission and reception errors over the past hour. It provides a comprehensive view of network reliability and helps in identifying and diagnosing network issues.
The reduce function captures the last value of the computed network errors. Threshold triggers an alert when the network errors exceed 2. Math triggers an additional alert when the network errors exceed 5.
We also set up custom labels for severity that are sent out when the Threshold alert condition is triggered and when the Math alert condition is triggered.
{{ if (gt $values.B.Value 5) -}}
Critical
{{ else if (gt $values.B.Value 2) -}}
Warning
{{ else -}}
low
{{- end }}
The alert message contains Severity: Critical when the network errors are above 5, and Severity: Warning when they are above 2.
System load
Query:
scalar(node_load1{instance="localhost:9100",job="node_exporter"}) * 100 /count(count(node_cpu_seconds_total{instance="localhost:9100",job="node_exporter"}) by (cpu))
This query calculates the 1-minute CPU load average as a percentage of the total CPU capacity. It provides an indication of how heavily the CPUs are being utilized relative to their total capacity.
The reduce function captures the last value of the computed system load. Threshold triggers an alert when the system load is above 80. Math triggers an additional alert when the system load is above 90.
We also set up custom labels for severity that are sent out when the Threshold alert condition is triggered and when the Math alert condition is triggered.
{{ if (gt $values.B.Value 90) -}}
Critical
{{ else if (gt $values.B.Value 80) -}}
Warning
{{ else -}}
low
{{- end }}
The alert message contains Severity: Critical when the system load is above 90, and Severity: Warning when it is above 80.
Grafana User Management and Access Control
To invite or create a user:
- Navigate to "Server Admin" > "Users"
- Click "Invite User" or "New User"
- Fill in required details:
- Name
- Username (if "New User")
- Password (if "New User")
- Set organization and role
- Click "Submit"
User roles:
- Admin: Full access to all organizations
- Editor: Can edit and create dashboards
- Viewer: Can view dashboards
Within an organization:
- Admin: Manage users and permissions within the organization
- Editor: Edit and create dashboards
- Viewer: View dashboards
To create a team:
- Go to "Configuration" > "Teams"
- Click "New Team"
- Add members and set permissions
To set folder permissions:
- Navigate to the desired folder
- Click the gear icon > "Permissions"
- Add users, teams, or roles
- Set appropriate access levels
To set dashboard permissions:
- Open the dashboard
- Click the gear icon > "Permissions"
- Add users, teams, or roles
- Set viewer, editor, or admin access