Terraform Kubernetes Kube-Prometheus Stack

Introduction

This module deploys and configures the Kube-Prometheus Stack inside a Kubernetes Cluster.

Requirements

| Name | Version |
|------|---------|
| terraform | >= 1.0 |
| helm | >= 2.0.0 |

Providers

| Name | Version |
|------|---------|
| helm | >= 2.0.0 |
| kubernetes | n/a |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| chart_version | Version of the Helm chart | any | n/a | yes |
| helm_namespace | The namespace Helm will install the chart under | any | n/a | yes |
| alertmanager_replicas | Number of replicas for Alertmanager | number | 1 | no |
| cluster_domain | Cluster domain for DestinationRules | string | "cluster.local" | no |
| destinationrules_labels | Labels applied to DestinationRules | map(string) | {} | no |
| destinationrules_mode | DestinationRule TLS mode | string | "DISABLE" | no |
| enable_destinationrules | Creates DestinationRules for Prometheus, Alertmanager, Grafana, and Node Exporters | bool | false | no |
| enable_prometheusrules | Adds PrometheusRules for alerts | bool | true | no |
| helm_release | The name of the Helm release | string | "kube-prometheus-stack" | no |
| helm_repository | The repository where the Helm chart is stored | string | "https://prometheus-community.github.io/helm-charts" | no |
| helm_repository_password | The password of the repository where the Helm chart is stored | string | "" | no |
| helm_repository_username | The username of the repository where the Helm chart is stored | string | "" | no |
| prometheus_pvc_name | Used for the storage alert. Set if using a non-default helm_release | string | "prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0" | no |
| values | Values to be passed to the Helm chart | string | "" | no |

Outputs

| Name | Description |
|------|-------------|
| helm_namespace | n/a |
| helm_release | The name of the Helm release. For use by external ServiceMonitors |
| status | n/a |
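
As a rough illustration of how the helm_release output might be consumed by an external ServiceMonitor, the sketch below labels a ServiceMonitor with the release name. It assumes the upstream chart's default behaviour of selecting ServiceMonitors by a `release` label has not been overridden in values. The resource name, namespace, `app` label, and port name are hypothetical.

```hcl
# Sketch only: a ServiceMonitor for a hypothetical application, labelled so
# that the Prometheus instance from this module's release can select it.
resource "kubernetes_manifest" "example_servicemonitor" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "ServiceMonitor"
    metadata = {
      name      = "example-app" # hypothetical
      namespace = "example"     # hypothetical
      labels = {
        # Assumption: Prometheus selects ServiceMonitors on the "release" label.
        release = module.helm_kube_prometheus_stack.helm_release
      }
    }
    spec = {
      selector = {
        matchLabels = {
          app = "example-app" # hypothetical Service label
        }
      }
      endpoints = [
        { port = "metrics" } # hypothetical named Service port
      ]
    }
  }
}
```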

Usage

module "helm_kube_prometheus_stack" {
  source = "git::https://github.com/canada-ca-terraform-modules/terraform-kubernetes-kube-prometheus-stack?ref=v3.3.0"

  chart_version = "43.3.0"
  depends_on = [
    module.namespace_monitoring,
  ]

  helm_namespace  = module.namespace_monitoring.name
  helm_release    = "kube-prometheus-stack"
  helm_repository = "https://prometheus-community.github.io/helm-charts"

  enable_destinationrules = true

  values = <<EOF

EOF
}
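
The usage above leaves values empty. A fuller call might look like the following sketch, which sets some of the optional inputs from the Inputs table and builds values with `yamlencode`. The label, retention period, and replica count are illustrative assumptions. Setting the replica count in both the variable and the Helm values is also an assumption, based on the v3.3.0 changelog entry describing the variable in terms of DestinationRules.

```hcl
# Sketch only: the same module call with optional inputs populated.
module "helm_kube_prometheus_stack" {
  source = "git::https://github.com/canada-ca-terraform-modules/terraform-kubernetes-kube-prometheus-stack?ref=v3.3.0"

  chart_version  = "43.3.0"
  helm_namespace = module.namespace_monitoring.name

  # Istio DestinationRules for the stack's components.
  enable_destinationrules = true
  destinationrules_mode   = "ISTIO_MUTUAL" # an Istio TLS mode; the default is "DISABLE"
  destinationrules_labels = {
    team = "platform" # hypothetical label
  }

  alertmanager_replicas = 2

  # "values" is a string, so a map can be rendered to YAML with yamlencode.
  values = yamlencode({
    alertmanager = {
      alertmanagerSpec = {
        replicas = 2 # assumption: kept in step with alertmanager_replicas
      }
    }
    prometheus = {
      prometheusSpec = {
        retention = "15d" # illustrative
      }
    }
  })

  depends_on = [
    module.namespace_monitoring,
  ]
}
```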

Notes

To upgrade an existing Helm release created by the previous module (terraform-kubernetes-prometheus) rather than reinstalling into a new Helm release, set helm_release to "prometheus-operator". This preserves the Helm release history and some temporary data, but may leave some resource names and labels inconsistent.
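
A minimal sketch of an in-place upgrade, assuming the existing release is named "prometheus-operator" as described above. The commented-out prometheus_pvc_name line is a placeholder: the Inputs table says to set it when using a non-default helm_release, but the actual claim name has to be read from the cluster.

```hcl
# Sketch only: reuse the release name created by the previous module.
module "helm_kube_prometheus_stack" {
  source = "git::https://github.com/canada-ca-terraform-modules/terraform-kubernetes-kube-prometheus-stack?ref=v3.3.0"

  chart_version  = "43.3.0"
  helm_namespace = module.namespace_monitoring.name

  # Upgrade the existing release instead of installing a new one.
  helm_release = "prometheus-operator"

  # Placeholder: set to the name of the existing Prometheus PVC so the
  # storage alert targets the right claim.
  # prometheus_pvc_name = "<existing Prometheus PVC name>"
}
```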

It is alternatively possible to reinstall into a new release while persisting existing data in Persistent Volumes from the previous module. This process involves downtime and does not guarantee data compatibility. A guide is available here. Note that there are further steps if multiple components (e.g. both Prometheus and Grafana) were configured with Persistent Volume storage. Their Persistent Volumes will need to be given different labels, and the components' volumeClaimTemplates (defined in Helm values) will need to be given corresponding selectors.
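
The selector step can be sketched as follows for two components that both use volumeClaimTemplates, here Prometheus and Alertmanager. The label key `example.com/component`, the storage sizes, and the access mode are hypothetical; the Persistent Volumes would need matching labels applied beforehand. Grafana's persistence is configured through its own subchart values and is not shown.

```hcl
# Sketch only: give each component's volumeClaimTemplate a distinct selector
# so that each claim binds to the intended pre-existing Persistent Volume.
locals {
  kube_prometheus_stack_values = yamlencode({
    prometheus = {
      prometheusSpec = {
        storageSpec = {
          volumeClaimTemplate = {
            spec = {
              accessModes = ["ReadWriteOnce"]
              resources   = { requests = { storage = "50Gi" } } # hypothetical size
              selector = {
                matchLabels = { "example.com/component" = "prometheus" }
              }
            }
          }
        }
      }
    }
    alertmanager = {
      alertmanagerSpec = {
        storage = {
          volumeClaimTemplate = {
            spec = {
              accessModes = ["ReadWriteOnce"]
              resources   = { requests = { storage = "5Gi" } } # hypothetical size
              selector = {
                matchLabels = { "example.com/component" = "alertmanager" }
              }
            }
          }
        }
      }
    }
  })
}
```

The resulting string can then be passed to the module as `values = local.kube_prometheus_stack_values`.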

History

| Date | Release | Change |
|------|---------|--------|
| 2021-03-26 | v1.0.0 | 1st release |
| 2021-07-05 | v1.1.0 | 1st set of general project alerts |
| 2021-09-07 | v1.1.1 | CompletedJobsNotCleared scope set to project |
| 2022-03-16 | v2.0.0 | Convert DestinationRules and PrometheusRules to kubernetes_manifests. Updates for Terraform v1 and nomenclature |
| 2022-07-28 | v2.0.1 | PrometheusRule severity label updates |
| 2022-08-10 | v2.0.2 | Refactor the threshold for the VeleroHourlyBackupPartialFailure & VeleroHourlyBackupFailure alert |
| 2022-08-10 | v2.0.3 | Create the NodeDiskMayFillIn60Hours alert |
| 2022-08-10 | v2.0.4 | Delete the ManyAlertsFiring & ManyManyAlertsFiring alerts |
| 2022-08-19 | v2.0.5 | Create the VeleroBackupTakingLongTime alert |
| 2022-08-22 | v2.0.6 | Fix the VeleroBackupTakingLongTime alert severity level |
| 2022-08-31 | v2.0.7 | Update nodepool pod capacity alerts and remove unused recording rule |
| 2022-09-02 | v2.0.8 | Update threshold for when to expect a backup for the VeleroBackupTakingLongTime alert |
| 2022-11-04 | v2.1.0 | Add several alerts and associated test cases regarding cert manager certificates |
| 2022-11-08 | v2.1.1 | Adjust ContainerWaiting alert duration to align with PodNotReady |
| 2022-11-16 | v2.1.2 | Fix node and nodepool pod capacity, NodePodsFull, and NodeReachingPodCapacity alerts |
| 2022-11-24 | v2.2.0 | Add alert: PrometheusDiskMayFillIn60Hours |
| 2022-12-06 | v2.3.0 | Add alert: NodeReadinessFlapping |
| 2022-12-15 | v2.3.1 | Fix the NodeUnschedulable alert severity level |
| 2023-01-04 | v3.0.0 | Refactor general cluster and namespace alerts. enable_prometheusrules false->true. Removes variables: prometheusrules_labels, cluster_rules_name, namespace_rules_name, cert_manager_rules_name |
| 2023-01-09 | v3.1.0 | Add runbook links to Prometheus rules |
| 2023-01-11 | v3.1.1 | Fix ManyContainerRestarts alert to account for multiple metrics sources |
| 2023-02-01 | v3.2.0 | Node clock alerts and README update |
| 2023-02-03 | v3.2.1 | Specify sensitive variables |
| 2023-02-08 | v3.3.0 | Add ability to add DestinationRule for Alertmanager replicas |
| 2023-02-16 | v3.4.0 | Add rules for CoreDNS alerts |
| 2023-03-10 | v3.4.1 | Fix syntax error in CoreDNS alert rules |
| 2023-03-14 | v3.5.0 | Add rule for ContainerImagePullProblem, refactor container alert unit tests |
| 2023-03-15 | v3.6.0 | Add DestinationRule for Thanos Sidecar |
| 2023-03-28 | v3.7.0 | Add generic PVC alerts |
| 2023-04-05 | v3.8.0 | Add "cluster" in Prometheus rule aggregations to make compatible with Thanos. Add Prometheus heartbeat recording rule |
| 2023-04-19 | v3.8.1 | Fix CoreDNSDown alert |
| 2023-04-21 | v3.8.2 | Ensure Prometheus heartbeat recording rule is evaluated by Prometheus |
| 2023-05-04 | v3.8.3 | Fix ContainerImagePullProblem flapping |
| 2023-06-08 | v3.9.0 | Ignore terminated pods in pod capacity alerts |
| 2023-06-19 | v3.9.1 | Fix PersistentVolume status alerts |
| 2023-12-07 | v3.9.2 | Adjust node alerts for clock synchronization |
| 2024-02-29 | v3.9.3 | Adjust Node and PVC storage alerts |
| 2024-04-15 | v3.9.4 | Adjust Node alerts, report agentpool, standardize node label |
| 2024-05-31 | v3.9.5 | Update container alerts |
| 2024-09-09 | v3.9.6 | Debounce ContainerCrashLooping |
| 2024-12-03 | v3.9.7 | Add NodeDiskFull and fix/refactor some node alerts |

Upgrading

From v1.x to v2.x

  1. Note that in Usage the dependencies array has been replaced by the depends_on array.

  2. If enable_destinationrules was true in v1.x, locate the DestinationRules that were created in helm_namespace. There should be 4, corresponding to Prometheus, Alertmanager, Grafana, and the Prometheus Node Exporter. Delete them prior to the upgrade. If enable_destinationrules remains true, they will be recreated with minimal downtime.

  3. If enable_prometheusrules was true in v1.x, locate the PrometheusRule definitions that were created in helm_namespace. There should be 2: general-platform-alerts and general-project-alerts. Delete them prior to the upgrade. If enable_prometheusrules remains true, they will be recreated. This may resolve any presently firing alerts. If it does, they will fire again once their conditions are met.

    • The default names for these PrometheusRule resources are now general-cluster-alerts and general-namespace-alerts. The scopes have changed from platform to cluster and from project to namespace. Adjust Alertmanager routing criteria accordingly.
    • The severities for these rules have been adjusted from minor/major/urgent to debug/minor/major. Adjust Alertmanager routing criteria accordingly.
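
Step 1 can be sketched as below. The v1.x form shown in the comment is an assumption about how the dependencies input was typically populated and may not match a given configuration. The v2.x form uses Terraform's built-in depends_on meta-argument, as in Usage.

```hcl
# Sketch only: replacing the v1.x "dependencies" input with depends_on.
module "helm_kube_prometheus_stack" {
  source = "git::https://github.com/canada-ca-terraform-modules/terraform-kubernetes-kube-prometheus-stack?ref=v3.3.0"

  chart_version  = "43.3.0"
  helm_namespace = module.namespace_monitoring.name

  # v1.x (assumed shape, no longer accepted):
  # dependencies = [
  #   module.namespace_monitoring.depended_on,
  # ]

  # v2.x and later: Terraform's native module-level dependency declaration.
  depends_on = [
    module.namespace_monitoring,
  ]
}
```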

Previous Module

This module replaces terraform-kubernetes-prometheus. The previous module used the custom chart prometheus-operator, which used the now-deprecated upstream chart prometheus-operator as a sub-chart and added DestinationRules.

This new module uses the new upstream chart kube-prometheus-stack directly. DestinationRules, as well as a set of general alerts, can be added through the module.

To migrate from the old custom chart to the new upstream chart, the following changes should be made to Helm values:

  1. Remove the top-level prometheus-operator: and realign indentation, as you are no longer applying values to a subchart.
  2. Remove any destinationRule: specification and its contents, as DestinationRules are now managed through the module's Terraform variables (for example, enable_destinationrules).
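
The two steps above might look like this for a small set of values. The retention setting is illustrative, and both the placement and the contents of the destinationRule block in the old values are hypothetical, since they depend on how the old custom chart was configured.

```hcl
# Sketch only: Helm values before and after migrating to the upstream chart.
locals {
  # Before: values were nested under the subchart key of the old custom chart.
  old_values = <<-EOT
    prometheus-operator:
      prometheus:
        prometheusSpec:
          retention: 15d
    destinationRule:   # hypothetical placement and contents
      enabled: true
  EOT

  # After: the top-level key is removed, the remaining values are realigned,
  # and the destinationRule block is dropped in favour of module variables.
  new_values = <<-EOT
    prometheus:
      prometheusSpec:
        retention: 15d
  EOT
}
```

`local.new_values` can then be passed to the module's values input, with enable_destinationrules set as needed.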

The upstream prometheus-operator chart was renamed to kube-prometheus-stack to reflect that additional components beyond the Prometheus Operator are installed.