94 add app plane observability stack #97
Conversation
- bootstrap - org - app plane comment out actions
remove flows
thanos endpoint
fix namespace
labeldrop
add vmauth dashboard
This reverts commit 4cd6ac6.
Walkthrough

The changes remove several GitHub Actions workflows for applying and planning infrastructure on GCP and AWS, while introducing a large set of new configuration files for Argo CD, Kubernetes observability resources, and alerting rules. Numerous new ApplicationSet and Application manifests, Grafana dashboards, and VMRule alerts have been added to enhance monitoring. In addition, Terraform configurations for both AWS and GCP environments have been updated with new provider lock files, backend settings, and variable definitions. A new shell script has been added for registering clusters with the observability stack, and supporting documentation for the Observability Stack Control Plane has been introduced.

Changes
Sequence Diagram(s)

sequenceDiagram
participant U as User
participant S as add_cluster.sh
participant CP as Cloud Provider API (GCP/AWS)
participant A as Argo CD Server
participant VM as VMUser Secret Service
U->>S: Execute add_cluster.sh
S->>S: Check for required tools (gum, etc.)
S->>U: Prompt for cloud platform & env variables
S->>CP: Verify/create observability node pool
CP-->>S: Node pool verified/created
S->>A: Log in to Argo CD and register cluster
A-->>S: Cluster credentials applied
S->>VM: Check for vmuser secret creation
VM-->>S: Secret available
S->>U: Display completion message ("Observability stack setup complete")
sequenceDiagram
participant U as User
participant T as Terraform CLI
participant P as Provider API (AWS/GCP)
participant R as Resource Modules
U->>T: Run Terraform deployment
T->>P: Initialize providers and backend (S3/GCS)
T->>R: Create organizational units, accounts, and buckets
R-->>T: Resource creation confirmed
T->>T: Generate outputs and lock files
T->>U: Display deployment results and outputs
Actionable comments posted: 30
🧹 Nitpick comments (61)
tofu/gcp/observability_stack/control_plane/k8s/providers.tf (2)
3-6: Consider using a more flexible version constraint for the google provider.

Using a fixed version (5.45.0) may prevent you from getting important security updates. Consider using a more flexible constraint that allows patch updates while maintaining stability.

      google = {
        source  = "hashicorp/google"
    -   version = "5.45.0"
    +   version = "~> 5.45.0"
      }
7-22: Consider using more specific version constraints.

Using >= without an upper bound could lead to unexpected breaking changes when major versions are released. Consider using the ~> operator to allow only minor and patch updates.

      github = {
        source  = "integrations/github"
    -   version = ">= 6.1"
    +   version = "~> 6.1"
      }
      helm = {
        source  = "hashicorp/helm"
    -   version = ">= 2.12"
    +   version = "~> 2.12"
      }
      kubernetes = {
        source  = "hashicorp/kubernetes"
    -   version = ">= 2.27"
    +   version = "~> 2.27"
      }
      tls = {
        source  = "hashicorp/tls"
    -   version = ">= 4.0"
    +   version = "~> 4.0"
      }

tofu/gcp/observability_stack/control_plane/k8s/values/argocd-dev.yaml (2)
28-29: Review automated sync configuration.

Both applications have automated sync enabled without any safety measures. This could be risky because:
- Any changes in the source paths will be automatically applied
- No pruning policy is specified
- No retry strategy is defined
- No sync windows are configured

Consider adding safety configurations:

      syncPolicy:
    -   automated: {}
    +   automated:
    +     prune: true
    +     selfHeal: true
    +   syncOptions:
    +     - CreateNamespace=true
    +     - PruneLast=true
    +   retry:
    +     limit: 5
    +     backoff:
    +       duration: 5s
    +       factor: 2
    +       maxDuration: 3m

Also applies to: 43-44
26-28: Review architecture: both stacks target the same cluster.

Both the control plane and app plane observability stacks are targeting the same cluster (kubernetes.default.svc). This might not be ideal for:
- Resource isolation
- Blast radius containment
- Performance impact

Consider:
- Separating the control plane and app plane into different clusters
- If they must be in the same cluster, using different namespaces and resource quotas
Also applies to: 41-43
argocd/app_plane/README (3)
19-35: Add validation for required environment variables.

Consider adding validation checks to ensure all required environment variables are set before proceeding with the setup. Add this validation script at the beginning:

    required_vars=("ARGOCD_SERVER" "PROJECT" "REGION" "CLUSTER" "APP_PLANE_CTX" "PAGERDUTY_API_KEY" "CTRL_PLANE_CTX")
    for var in "${required_vars[@]}"; do
      if [ -z "${!var}" ]; then
        echo "Error: $var is not set"
        exit 1
      fi
    done
56-61: Consider using a secrets management solution.

Storing API keys directly in Kubernetes secrets might not be the most secure approach. Consider using a dedicated secrets management solution like HashiCorp Vault or a cloud provider's secret management service.
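As a rough illustration of that direction (a sketch only: it assumes the External Secrets Operator is installed and that a ClusterSecretStore named gcp-secret-manager exists, neither of which is part of this PR), the PagerDuty key could be synced into the observability namespace instead of being created by hand:

    apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
    metadata:
      name: pagerduty-api-key
      namespace: observability
    spec:
      refreshInterval: 1h
      secretStoreRef:
        kind: ClusterSecretStore
        name: gcp-secret-manager          # assumed store, configured separately
      target:
        name: pagerduty-api-key           # Kubernetes Secret consumed by the alerting stack
      data:
        - secretKey: api-key
          remoteRef:
            key: observability-pagerduty-api-key   # assumed path in the backing secret manager

The same pattern works with HashiCorp Vault or AWS Secrets Manager by swapping the secret store backend.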
1-12: Add prerequisites section to the documentation.

The documentation would benefit from a prerequisites section that lists required tools, permissions, and access levels needed before starting the setup process.
Consider adding:
- Required tools (kubectl, gcloud/aws CLI, argocd CLI)
- Required permissions
- Network requirements
- Minimum resource requirements
tofu/aws/1-bootstrap/outputs.tf (1)
1-3: Add a description and consider the sensitive flag for the output.

The output would benefit from a description field explaining its purpose and usage. Also, consider whether this ID should be marked as sensitive.

    output "ou_id" {
  +   description = "The ID of the AWS organizational unit"
  +   sensitive   = true
      value = data.aws_organizations_organizational_unit.ou.id
    }

argocd/ctrl_plane/dev/manifests/grafana-managed-certificate.yaml (1)
1-7: Consider platform portability and domain verification.

The configuration uses the GKE-specific API (networking.gke.io/v1), which creates platform lock-in.

Recommendations:
- Document the GKE dependency in the project README
- Consider a platform-agnostic solution like cert-manager if multi-cloud is a future possibility
- Verify that the domain grafana.observability.dev.internal.falkordb.cloud is:
- Correctly configured in DNS
- Accessible from intended clients
- Follows your organization's naming conventions
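For comparison, a cert-manager based alternative would look roughly like the sketch below; the issuer name is an assumption and would have to exist in the cluster, and the Ingress would then reference the resulting Secret instead of the ManagedCertificate:

    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: grafana-tls
      namespace: observability
    spec:
      secretName: grafana-tls                 # TLS Secret the Ingress would reference
      dnsNames:
        - grafana.observability.dev.internal.falkordb.cloud
      issuerRef:
        kind: ClusterIssuer
        name: letsencrypt                     # assumed issuer name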
✅ Verification successful
🌐 Web query:
What are the alternatives to GKE ManagedCertificates for Kubernetes TLS certification management?
💡 Result:
When managing TLS certificates in Kubernetes environments outside of GKE ManagedCertificates, several robust alternatives exist with different integration models and use cases:
Core Alternatives
1. Cert-Manager with ACME Issuers (Let’s Encrypt)
- Automates TLS certificate issuance/renewal via ACME protocols[2][7]
- Supports HTTP-01 and DNS-01 challenges for domain validation[7]
- Integrates with multiple certificate authorities:
- Let’s Encrypt (free public CA)[7]
- Venafi
- HashiCorp Vault
- EJBCA enterprise CA[5]
- Works across all major cloud providers (GKE, AKS, EKS)[6]
2. Native Kubernetes Certificate API
- Uses the certificates.k8s.io API for internal PKI[8]
- Requires cluster CA configuration
- Best for:
- Service-to-service mTLS
- Internal cluster communications
- Short-lived certificates
3. Google Cloud Certificate Manager
- Regional certificate management service[4][9]
- Supports:
- Google-managed certificates
- Certificate maps (grouping multiple certs)
- Pre-shared certificates
- Requires Gateway API (gke-l7-gxlb class)[4]
GCP-Specific Options
Method | Use Case | Limitations
Manual SslCertificate Resources[4][9] | Legacy systems/edge cases | No automatic renewal
MultiClusterIngress Certificates[9] | Multi-cluster deployments | No wildcard support
Pre-shared Certificates[9] | Bring-your-own-certificate | Manual rotation

Advanced Integrations
EJBCA Enterprise PKI
- Provides X.509 certificate lifecycle management[5]
- Features:
- Custom certificate profiles
- Hardware Security Module (HSM) integration
- Audit trails
- Integration path:
- Deploy EJBCA issuer for cert-manager
- Configure RA certificate workflow
- Issue certs via Kubernetes manifests[5]
Hashicorp Vault
- Secrets management with PKI engine
- Dynamic credential issuance
- Often paired with cert-manager as issuer
Each solution has tradeoffs in automation level, certificate lifetime management, and integration complexity[2][5][7]. For GKE environments requiring Google infrastructure integration, Certificate Manager offers tight load balancer integration[4], while cert-manager provides maximum flexibility across environments[2][7].
Citations:
- 1: https://github.com/GoogleCloudPlatform/gke-managed-certs
- 2: https://kubeops.net/blog/automate-tls-certificate-management-in-kubernetes-with-cert-manager
- 3: https://discuss.dgraph.io/t/better-practice-google-tls-ingress-to-non-encrypted-backend-vs-tls-to-dgraph/10695
- 4: https://www.googlecloudcommunity.com/gc/Google-Kubernetes-Engine-GKE/How-do-you-use-GKE-managed-certificates-with-K8s-Gateway/m-p/547671
- 5: https://www.ejbca.org/case/automate-tls-certificate-management-in-kubernetes-with-certmanager-and-ejbca/
- 6: https://cert-manager.io/v1.1-docs/installation/kubernetes/
- 7: https://cert-manager.io/docs/tutorials/getting-started-with-cert-manager-on-google-kubernetes-engine-using-lets-encrypt-for-ingress-ssl/
- 8: https://kubernetes.io/docs/tasks/tls/managing-tls-in-a-cluster/
- 9: https://www.googlecloudcommunity.com/gc/Google-Kubernetes-Engine-GKE/MultiClusterIngress-TLS/m-p/634818
- 10: https://cloud.google.com/kubernetes-engine/docs/how-to/managed-certs?hl=en
Consider platform portability and domain verification
The web search confirms that several robust alternatives exist—such as using cert-manager with ACME issuers—for managing TLS certificates in Kubernetes, which can help avoid platform lock-in. The GKE-specific API (networking.gke.io/v1) inherently ties the configuration to GKE. Given that:
- The configuration is correct for a GKE-managed certificate.
- Alternatives (like cert-manager, native Kubernetes Certificate API, or Google Cloud Certificate Manager) provide options for multi-cloud portability.
- Documentation and domain verification (DNS configuration, client accessibility, adherence to naming conventions) remain essential.
Recommendations:
- Document the GKE dependency in the project README.
- Consider transitioning to a platform-agnostic solution (e.g., cert-manager) if multi-cloud support becomes a priority.
- Verify that the domain grafana.observability.dev.internal.falkordb.cloud is correctly configured in DNS, accessible to intended clients, and follows organizational naming conventions.

argocd/ctrl_plane/dev/kustomize/vmuser/kustomization.yaml (1)
1-8: Consider adding common labels and annotations.

The kustomization could benefit from additional Kustomize features:

      apiVersion: kustomize.config.k8s.io/v1beta1
      kind: Kustomization
      metadata:
        name: vmuser-kustomization
      namespace: observability
      resources:
        - vmuser.yaml
        - string-secret.yaml
    + commonLabels:
    +   app.kubernetes.io/name: vmuser
    +   app.kubernetes.io/part-of: observability-stack
    + commonAnnotations:
    +   maintainer: team-name

Also, add a newline at the end of the file to comply with POSIX standards.
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 8-8: no new line character at the end of file
(new-line-at-end-of-file)
argocd/ctrl_plane/dev/kustomize/vmuser/vmuser.yaml (1)
4-11: Consider more descriptive naming.

Using "vmuser" for multiple resources (name, username, secret) could be confusing. Consider using more descriptive names that indicate the purpose of this user (e.g., "metrics-collector", "vm-scraper").
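A minimal sketch of the rename (field values are illustrative; keep the target refs and password handling that the existing vmuser.yaml already defines):

    apiVersion: operator.victoriametrics.com/v1beta1
    kind: VMUser
    metadata:
      name: metrics-writer            # instead of the generic "vmuser"
      namespace: observability
    spec:
      username: metrics-writer
      generatePassword: true          # operator stores the generated password in a Secret
      targetRefs:
        - static:
            url: http://vmsingle-vm.observability.svc:8429   # placeholder; reuse the existing target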
argocd/ctrl_plane/dev/manifests/grafana-datasource.yaml (2)
16-17
: Review security implications of skipping TLS verification.

Setting tlsSkipVerify: true disables certificate validation, which could expose the system to man-in-the-middle attacks. Consider implementing proper TLS certificates for secure communication.
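If the endpoint is issued a certificate from a trusted or internal CA, the datasource can verify it instead of skipping validation. A sketch, assuming the CA bundle can be injected into the datasource (exact field support depends on the grafana-operator version in use):

    datasource:
      name: VictoriaMetrics
      type: prometheus
      url: https://vmauth.observability.svc:8427   # placeholder; keep the existing URL
      jsonData:
        tlsSkipVerify: false
        tlsAuthWithCACert: true
      secureJsonData:
        tlsCACert: ${CA_CERT}                      # injected from a Secret rather than inlined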
20-20
: Add newline at end of file.Add a newline character at the end of the file to comply with POSIX standards and avoid potential issues with text processing tools.
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 20-20: no new line character at the end of file
(new-line-at-end-of-file)
argocd/app_plane/manifests/grafana-datasource.yaml (2)
16-17
: Review security implications of skipping TLS verification.Setting
tlsSkipVerify: true
disables certificate validation, which could expose the system to man-in-the-middle attacks. Consider implementing proper TLS certificates for secure communication.
20-20
: Add newline at end of file.Add a newline character at the end of the file to comply with POSIX standards and avoid potential issues with text processing tools.
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 20-20: no new line character at the end of file
(new-line-at-end-of-file)
tofu/aws/1-bootstrap/variables.tf (2)
18-21: Add validation rules for email format.

Consider adding validation rules to ensure the email format is correct:

    variable "account_email" {
      type        = string
      description = "The email of the account to create."
  +   validation {
  +     condition     = can(regex("^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$", var.account_email))
  +     error_message = "The account_email must be a valid email address."
  +   }
    }

23-26: Add validation for AWS regions.

Consider adding validation rules to ensure only valid AWS regions are provided:

    variable "region" {
      type        = string
      description = "The region to create the bucket in."
  +   validation {
  +     condition     = can(regex("^[a-z]{2}(-[a-z]+)+-\\d$", var.region))
  +     error_message = "The region must be a valid AWS region (e.g., us-west-2, eu-central-1)."
  +   }
    }

argocd/ctrl_plane/dev/certmanager.yaml (1)
1-25
: LGTM! Well-configured cert-manager application.The configuration is well-structured with appropriate node selection and CRD installation settings.
Add a newline at the end of the file:
- CreateNamespace=true +
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 25-25: no new line character at the end of file
(new-line-at-end-of-file)
tofu/aws/2-org/main.tf (1)
1-20: LGTM! Well-structured AWS Organizations configuration.

The configuration correctly sets up the organizational unit and account with appropriate settings and lifecycle management.

The depends_on block is redundant, as Terraform already infers this dependency from the parent_id reference. Consider removing it:

      role_name         = "OrganizationAccountAccessRole"
      close_on_deletion = false

  -   depends_on = [ aws_organizations_organizational_unit.workloads ]
  -
      lifecycle {
        ignore_changes = [role_name, name]
      }

argocd/ctrl_plane/dev/kubernetes-secret-generator.yaml (1)
1-25: LGTM! Well-configured secret generator application.

The configuration is well-structured with appropriate node selection and sync options. Fix the indentation and add a trailing newline:

      helm:
        valuesObject:
    -     nodeSelector:
    -       node_pool: observability-resources
    +   nodeSelector:
    +     node_pool: observability-resources
      destination:
        server: https://kubernetes.default.svc
        namespace: observability
      syncPolicy:
        syncOptions:
          - CreateNamespace=true
          - RespectIgnoreDifferences=true
        automated:
          prune: true
          selfHeal: true

🧰 Tools
🪛 YAMLlint (1.35.1)
[warning] 14-14: wrong indentation: expected 8 but found 10
(indentation)
[error] 25-25: no new line character at the end of file
(new-line-at-end-of-file)
tofu/gcp/observability_stack/control_plane/k8s/variables.tf (2)
1-19: Add descriptions for required variables.

Each variable should have a description block explaining its purpose and any constraints.

    variable "project_id" {
      type = string
  +   description = "The GCP project ID where resources will be created"
    }
    variable "region" {
      type = string
  +   description = "The GCP region where resources will be created"
    }
    variable "cluster_endpoint" {
      type = string
  +   description = "The endpoint URL of the Kubernetes cluster"
    }
    variable "cluster_ca_certificate" {
      type = string
  +   description = "The CA certificate for cluster authentication"
  +   sensitive   = true
    }
    variable "cluster_name" {
      type = string
  +   description = "The name of the Kubernetes cluster"
    }

21-39: Add descriptions for variables with defaults.

Variables with default values should also have descriptions for clarity.

    variable "github_organization" {
      type    = string
      default = "FalkorDB"
  +   description = "The GitHub organization name where the repository is located"
    }
    variable "github_repository" {
      type    = string
      default = "falkordb-observability-cluster"
  +   description = "The name of the GitHub repository containing observability configurations"
    }
    variable "environment" {
      type    = string
      default = "production"
  +   description = "The deployment environment (must be either 'production' or 'development')"
      validation {
        condition     = var.environment == "production" || var.environment == "development"
        error_message = "Environment must be either 'production' or 'development'"
      }
    }

observability/rules/node-network.rules.yml (1)
14-17: Enhance alert annotations with troubleshooting steps.

While the current annotations are good, consider adding basic troubleshooting steps to help operators respond quickly.

      annotations:
        description: Network interface "{{ $labels.device }}" changing its up status often on node-exporter {{ $labels.namespace }}/{{ $labels.pod }}
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/general/nodenetworkinterfaceflapping
        summary: Network interface is often changing its status
    +   troubleshooting: "1. Check physical network connectivity\n2. Verify network interface configuration\n3. Review system logs for network-related errors"
observability/rules/containercpuusagesecondstotal.rules.yml (1)
15-19: Consider adjusting the rate calculation window.

The current 5-minute window for irate calculation might mask short-term CPU spikes. Consider using a shorter window (e.g., 2m) for more granular monitoring, especially in environments with rapid workload changes.

      sum by (namespace,pod,container,cluster) (
    -   irate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}[5m])
    +   irate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}[2m])
      ) * on (namespace,pod,cluster) group_left(node) topk by (namespace,pod,cluster) (
        1, max by (namespace,pod,node,cluster) (kube_pod_info{node!=""})
16-16: Consider a stricter regex pattern for name generation.

The current regex h?c-[A-Za-z0-9]+ allows potentially unexpected cluster names. Consider using a more specific pattern that matches your exact cluster naming convention.

tofu/gcp/observability_stack/control_plane/k8s/values/argocd-prod.yaml (1)
2-5
: Use semantic versioning for image tagsInstead of using a specific version, consider using semantic versioning to ensure consistent updates and rollbacks.
global: image: - tag: "v2.13.3" + tag: "~2.13" logging: format: jsonobservability/rules/containercpulimits.rules.yml (2)
13-20
: Add annotations to document metric purpose and usageThe rule lacks documentation about what the metric represents and how it should be used.
- annotations: {} + annotations: + description: "Active CPU resource limits for containers in running/pending pods" + summary: "Container CPU Limits by Pod"
13-33
: Consider adding alert rules for resource limit monitoringWhile recording rules are useful, adding alert rules for high resource utilization would enhance monitoring capabilities.
Would you like me to generate alert rules for monitoring CPU limit utilization?
observability/rules/containermemoryrequests.rules.yml (1)
13-20
: Add documentation for the monitoring rule.Consider adding annotations to describe the purpose and expected behavior of this metric.
- annotations: {} + annotations: + description: "Active memory requests for pods in Pending/Running state" + summary: "Container memory requests by pod" expr: |-argocd/app_plane/manifests/grafana.yaml (1)
22-31
: Consider using a more specific security context for init container.While running as root is needed for chown, consider dropping unnecessary capabilities and adding securityContext restrictions.
securityContext: runAsUser: 0 runAsGroup: 0 + capabilities: + drop: ["ALL"] + add: ["CHOWN"] + allowPrivilegeEscalation: false initContainers: - name: init-grafana-data image: busybox + securityContext: + readOnlyRootFilesystem: trueobservability/grafana/base/grafana-dashboards.yaml (1)
1-72
: Consider using Kustomize patches to reduce repetition.The dashboard configurations are very similar, differing only in name and ConfigMap reference. Consider using Kustomize patches to maintain these configurations more efficiently.
Create a base dashboard template and use patches:
# base/dashboard.yaml apiVersion: grafana.integreatly.org/v1beta1 kind: GrafanaDashboard metadata: name: base-dashboard spec: instanceSelector: matchLabels: dashboards: "grafana" configMapRef: key: json # patches/dashboards.yaml - op: replace path: /metadata/name value: cluster-overview - op: replace path: /spec/configMapRef/name value: dashboards-k8s-views-cluster🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 72-72: no new line character at the end of file
(new-line-at-end-of-file)
observability/rules/prometheus-node-recording.rules.yml (2)
13-16
: Add documentation for CPU usage recording rule.Consider adding annotations to describe the metric calculation and its significance.
- annotations: {} + annotations: + description: "Sum of non-idle CPU usage rate over 3m window" + summary: "Node CPU usage rate" expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[3m])) BY (instance,cluster)
26-28
: Consider adjusting the CPU ratio calculation window.The 5m window might be too long for detecting rapid changes in CPU usage. Consider using a shorter window (e.g., 2m) for more responsive monitoring.
- expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[5m])) WITHOUT (cpu, mode) / ON (instance,cluster) GROUP_LEFT() count(sum(node_cpu_seconds_total) BY (instance,cpu,cluster)) BY (instance,cluster) + expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[2m])) WITHOUT (cpu, mode) / ON (instance,cluster) GROUP_LEFT() count(sum(node_cpu_seconds_total) BY (instance,cpu,cluster)) BY (instance,cluster)argocd/ctrl_plane/dev/vmusers.yaml (1)
52-56
: Consider adding sync wave annotations for controlled rollout.The automated sync with pruning and self-healing could potentially cause disruption if multiple resources are synced simultaneously. Consider adding sync wave annotations to control the rollout order.
syncPolicy: syncOptions: - CreateNamespace=true + annotations: + argocd.argoproj.io/sync-wave: "1" automated: prune: true selfHeal: trueobservability/rules/node.rules.yml (2)
15-20
: Consider adding a limit to the topk function.The
topk
function without an explicit limit could potentially return a large number of results. Consider adding a specific limit value.- topk by (namespace,pod,cluster) (1, + topk(10) by (namespace,pod,cluster) (
44-51
: Add recording rule for individual CPU mode metrics.The current rule aggregates all non-idle CPU modes. Consider adding separate recording rules for individual modes to aid in troubleshooting.
+ - annotations: {} + expr: |- + rate(node_cpu_seconds_total{mode!="idle",job="node-exporter"}[5m]) + labels: {} + record: node:node_cpu_mode:rate5margocd/ctrl_plane/dev/manifests/grafana.yaml (1)
13-14
: Consider adding storage limits.Only storage requests are specified. Consider adding storage limits to prevent unbounded growth.
requests: storage: 10Gi + limits: + storage: 20Giobservability/rules/podowner.rules.yml (2)
15-30
: Simplify nested label_replace operations.The deployment rule uses nested
label_replace
operations which could impact performance. Consider simplifying the expression.expr: |- max by (namespace,workload,pod,cluster) ( - label_replace( - label_replace( - kube_pod_owner{job="kube-state-metrics", owner_kind="ReplicaSet"}, - "replicaset", "$1", "owner_name", "(.*)" - ) * on (replicaset,namespace,cluster) group_left(owner_name) topk by (replicaset,namespace,cluster) ( - 1, max by (replicaset,namespace,owner_name,cluster) ( - kube_replicaset_owner{job="kube-state-metrics"} - ) - ), - "workload", "$1", "owner_name", "(.*)" - ) + label_replace( + kube_pod_owner{job="kube-state-metrics", owner_kind="ReplicaSet"} + * on(owner_name,namespace,cluster) group_left(deployment) + label_replace( + kube_replicaset_owner{job="kube-state-metrics"}, + "deployment", "$1", "owner_name", "(.*)" + ), + "workload", "$1", "deployment", "(.*)" + )
29-29
: Add descriptions to workload type labels.Consider adding descriptions in the annotations to explain the significance of each workload type label.
labels: workload_type: deployment + annotations: + description: "Pods owned by Deployment workloads"Also applies to: 40-40, 51-51, 62-62
observability/rules/general.rules.yml (1)
18-19
: Consider lowering the TargetDown threshold.The current threshold of 10% might be too high for critical services. A lower threshold (e.g., 5%) would provide earlier detection of potential issues.
- expr: 100 * (count(up == 0) BY (job,namespace,service,cluster) / count(up) BY (job,namespace,service,cluster)) > 10 + expr: 100 * (count(up == 0) BY (job,namespace,service,cluster) / count(up) BY (job,namespace,service,cluster)) > 5observability/rules/vmoperator.rules.yml (1)
25-25
: Consider adding error rate thresholds.The current expressions trigger on any error (> 0), which might lead to alert fatigue. Consider adding thresholds for acceptable error rates.
- ) by (cluster) > 0 + ) by (cluster) > 0.05 # 5% error rate thresholdAlso applies to: 42-42
observability/rules/kube-state-metrics.rules.yml (1)
25-25
: Consider adjusting alert severities.Not all alerts may warrant critical severity. Consider:
- Keep KubeStateMetricsShardingMismatch and KubeStateMetricsShardsMissing as critical
- Change ListErrors and WatchErrors to warning as they might be transient
- severity: critical + severity: warningAlso applies to: 38-38, 47-47, 60-60
argocd/app_plane/victoriametrics.yaml (1)
136-136
: Add newline at end of file.Add a newline character at the end of the file to comply with YAML standards.
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 136-136: no new line character at the end of file
(new-line-at-end-of-file)
argocd/ctrl_plane/dev/victoriametrics.yaml (1)
78-79: Validate PagerDuty API endpoint.

The PagerDuty URL is hardcoded. Consider making it configurable via a values file to support different environments or API versions.
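A sketch of what pulling the endpoint into values could look like; the key names below are assumptions about the chart layout, not the actual files in this PR:

    # values override (assumed structure)
    pagerduty:
      url: https://events.pagerduty.com/v2/enqueue

    # referenced from the Alertmanager receiver configuration
    receivers:
      - name: pagerduty
        pagerduty_configs:
          - url: "{{ .Values.pagerduty.url }}"
            send_resolved: true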
observability/rules/vm-health.rules.yml (2)
72-72
: Consider adjusting the log error threshold.The current expression
sum(increase(vm_log_messages_total{level="error"}[5m])) without (app_version, location) > 0
will trigger on any error log. This might be too sensitive and could lead to alert fatigue.Consider adjusting the threshold to a more reasonable value:
- expr: sum(increase(vm_log_messages_total{level="error"}[5m])) without (app_version, location) > 0 + expr: sum(increase(vm_log_messages_total{level="error"}[5m])) without (app_version, location) > 10
45-45
: Consider making memory threshold configurable.The memory usage threshold is hardcoded to 80%. Different environments might need different thresholds based on their specific requirements.
Consider making the threshold configurable through a parameter:
- expr: (min_over_time(process_resident_memory_anon_bytes[10m]) / vm_available_memory_bytes) > 0.8 + expr: (min_over_time(process_resident_memory_anon_bytes[10m]) / vm_available_memory_bytes) > {{ .Values.memoryThreshold | default 0.8 }}scripts/add_cluster.sh (1)
71-76
: Consider making node pool/group specifications configurable.The node specifications (machine type, disk size, etc.) are hardcoded. Different environments might need different specifications.
Consider accepting these values as parameters or reading from a configuration file:
+# Add at the beginning of the script +DEFAULT_MACHINE_TYPE_GCP="e2-standard-2" +DEFAULT_MACHINE_TYPE_AWS="m5.large" +DEFAULT_DISK_SIZE="50" +DEFAULT_MAX_NODES="10" # Use variables in node pool/group creation - --machine-type=e2-standard-2 \ - --disk-size=50 \ + --machine-type=${MACHINE_TYPE:-$DEFAULT_MACHINE_TYPE_GCP} \ + --disk-size=${DISK_SIZE:-$DEFAULT_DISK_SIZE} \ - --instance-types m5.large \ - --disk-size 50 \ + --instance-types ${MACHINE_TYPE:-$DEFAULT_MACHINE_TYPE_AWS} \ + --disk-size ${DISK_SIZE:-$DEFAULT_DISK_SIZE} \Also applies to: 92-96
observability/rules/alertmanager.rules.yml (1)
54-68
: Consider renaming duplicate alerts.Two alerts named
AlertmanagerClusterFailedToSendAlerts
with different severities might cause confusion.Consider renaming the alerts to better reflect their purpose:
- - alert: AlertmanagerClusterFailedToSendAlerts + - alert: AlertmanagerClusterFailedToSendCriticalAlerts # ... (critical severity alert) - - alert: AlertmanagerClusterFailedToSendAlerts + - alert: AlertmanagerClusterFailedToSendNonCriticalAlerts # ... (warning severity alert)Also applies to: 69-83
observability/rules/kubernetes-apps.rules.yml (2)
18-18
: Optimize namespace matching expressions.The expression
namespace=~".*"
matches all namespaces and is redundant. Removing it would make the queries more efficient.Remove redundant namespace matching:
- node_cpu_seconds_total{job="node-exporter",mode="idle"} + node_cpu_seconds_total{job="node-exporter",mode="idle"} - kube_pod_status_phase{job="kube-state-metrics", namespace=~".*", phase=~"Pending|Unknown|Failed"} + kube_pod_status_phase{job="kube-state-metrics", phase=~"Pending|Unknown|Failed"}Apply similar changes to other expressions.
Also applies to: 30-30, 44-46, 57-59, 74-74, 86-88, 103-105, 122-125
205-205
: Consider adjusting job failure thresholds.The job failure alerts might be too sensitive for some environments, especially for jobs that are designed to fail occasionally.
Consider adding failure count thresholds:
- expr: kube_job_failed{job="kube-state-metrics", namespace=~".*"} > 0 + expr: kube_job_failed{job="kube-state-metrics"} > 3Also applies to: 213-213
observability/rules/node-exporter.rules.yml (2)
100-100
: Consider renaming duplicate alerts.Multiple alerts share the same name but have different severities and thresholds. This could cause confusion when referencing alerts.
Consider renaming alerts to better reflect their severity:
- - alert: NodeFilesystemSpaceFillingUp + - alert: NodeFilesystemSpaceFillingUpWarning # ... (warning severity alert) - - alert: NodeFilesystemSpaceFillingUp + - alert: NodeFilesystemSpaceFillingUpCritical # ... (critical severity alert)Apply similar changes to other duplicate alerts.
Also applies to: 116-116, 146-146, 206-206, 303-303, 315-315
333-333
: Consider making resource thresholds configurable.Resource usage thresholds are hardcoded, which might not be suitable for all environments.
Consider making thresholds configurable through parameters:
- expr: sum without(mode) (avg without (cpu) (rate(node_cpu_seconds_total{job="node-exporter", mode!~"idle|iowait"}[2m]))) * 100 > 90 + expr: sum without(mode) (avg without (cpu) (rate(node_cpu_seconds_total{job="node-exporter", mode!~"idle|iowait"}[2m]))) * 100 > {{ .Values.cpuThreshold | default 90 }} - expr: 100 - (node_memory_MemAvailable_bytes{job="node-exporter"} / node_memory_MemTotal_bytes{job="node-exporter"} * 100) > 90 + expr: 100 - (node_memory_MemAvailable_bytes{job="node-exporter"} / node_memory_MemTotal_bytes{job="node-exporter"} * 100) > {{ .Values.memoryThreshold | default 90 }} - expr: rate(node_disk_io_time_weighted_seconds_total{job="node-exporter", device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|md.+|dasd.+)"}[5m]) > 10 + expr: rate(node_disk_io_time_weighted_seconds_total{job="node-exporter", device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|md.+|dasd.+)"}[5m]) > {{ .Values.diskIOThreshold | default 10 }}Also applies to: 367-367, 378-378
argocd/ctrl_plane/dev/alloy.yaml (3)
1-13
: Consider adding additional metadata labels for better organization.While the basic metadata is present, consider adding labels like
app.kubernetes.io/name
,app.kubernetes.io/part-of
, etc., to follow Kubernetes recommended label practices.metadata: name: grafana-alloy-ctrl-plane namespace: argocd + labels: + app.kubernetes.io/name: grafana-alloy + app.kubernetes.io/part-of: observability + app.kubernetes.io/component: metrics
28-278
: LGTM! Secure configuration with proper TLS settings.The configuration implements secure communication with proper TLS verification and token-based authentication.
Consider splitting the metric filtering regex for better maintainability.
The long regex pattern for filtering metrics could be split into multiple rules or moved to a ConfigMap for easier maintenance.
279-293
: Consider adding pod anti-affinity for high availability.While node selection is configured, consider adding pod anti-affinity to ensure high availability by preventing multiple pods from being scheduled on the same node.
controller: type: statefulset nodeSelector: node_pool: observability-resources + affinity: + podAntiAffinity: + preferredDuringSchedulingIgnoredDuringExecution: + - weight: 100 + podAffinityTerm: + labelSelector: + matchLabels: + app.kubernetes.io/name: grafana-alloy + topologyKey: kubernetes.io/hostnameobservability/grafana/dashboards/falkordb-cloud.json (1)
1-24
: Consider enhancing dashboard documentation.While the dashboard is well-structured, consider adding:
- A more detailed description explaining the purpose of each panel
- Links to relevant documentation or runbooks
- Tags for better categorization
"description": "FalkorDB Dashboard", +"description": "Dashboard for monitoring FalkorDB instances. Shows key metrics including memory usage, client connections, and command performance.\n\nMetrics:\n- Uptime\n- Client Connections\n- Memory Usage\n- Command Performance\n\nFor more information, see: <link-to-docs>", "tags": [ - "prometheus" + "prometheus", + "falkordb", + "database", + "performance" ],observability/grafana/dashboards/pod-overview.json (1)
514-515
: Consider using recording rules for expensive queriesThe CPU and memory usage queries involve multiple aggregations and calculations that could be expensive, especially with high pod counts:
sum(rate(container_cpu_usage_seconds_total{...}[$__rate_interval])) / sum(kube_pod_container_resource_requests{...})
Consider creating recording rules for frequently used calculations to reduce query load. Example recording rule:
groups: - name: pod_metrics rules: - record: pod:container_cpu_usage:rate expr: sum(rate(container_cpu_usage_seconds_total{namespace="$namespace", pod=~"$pod", image!="", cluster="$cluster"}[$__rate_interval]))Also applies to: 587-588, 664-665
observability/grafana/dashboards/vmauth.json (1)
1166-1167
: Add documentation links for troubleshooting metricsThe resource usage and file descriptor panels include important operational metrics but lack direct links to troubleshooting documentation.
Add links to relevant documentation in the panel descriptions:
"description": "Percentage of used memory (resident).\nThe application's performance will significantly degrade when memory usage is close to 100%.\n + See troubleshooting guide: https://docs.victoriametrics.com/vmauth.html#troubleshooting"
Also applies to: 1888-1889
observability/grafana/dashboards/namespace-overview.json (2)
109-109
: Consider adding error handling to Prometheus queries.The PromQL queries should include error handling for edge cases. For example:
- Add
or vector(0)
to handle cases where no data is returned- Use
max_over_time
oravg_over_time
instead of raw metrics for more stable visualizationsExample fix for one query:
-expr: "sum(kube_pod_info{namespace=~\"$namespace\", cluster=\"$cluster\"})" +expr: "sum(kube_pod_info{namespace=~\"$namespace\", cluster=\"$cluster\"}) or vector(0)"Also applies to: 183-183, 285-285, 315-315, 325-325, 335-335, 345-345, 355-355, 365-365, 375-375, 385-385
2603-2605
: Consider adjusting refresh interval based on data resolution.The dashboard is set to refresh every 30s, but the resolution variable defaults to 5m. Consider:
- Aligning the refresh interval with the resolution
- Adding a warning when refresh interval is shorter than resolution
observability/grafana/base/kustomization.yaml (1)
1-5
: Fix formatting issues in the Kustomization file.The file has the following formatting issues:
- Line 4: Remove trailing spaces
- Line 5: Add newline at end of file
Apply this diff to fix the formatting:
apiVersion: kustomize.config.k8s.io/v1beta1 kind: Kustomization -resources: - - grafana-dashboards.yaml \ No newline at end of file +resources: + - grafana-dashboards.yaml +🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 4-4: trailing spaces
(trailing-spaces)
[error] 5-5: no new line character at the end of file
(new-line-at-end-of-file)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (99)
- .github/workflows/gcp-full-infra-test-apply.yaml (0 hunks)
- .github/workflows/gcp-full-infra-test-plan.yaml (0 hunks)
- .github/workflows/testing-apply-aws-infra.yaml (0 hunks)
- .github/workflows/testing-plan-aws-infra.yaml (0 hunks)
- argocd/app_plane/README (1 hunks)
- argocd/app_plane/alloy.yaml (1 hunks)
- argocd/app_plane/applicationset.yaml (1 hunks)
- argocd/app_plane/grafana-dashboards.yaml (1 hunks)
- argocd/app_plane/grafana-operator.yaml (1 hunks)
- argocd/app_plane/ksm.yaml (1 hunks)
- argocd/app_plane/manifests/grafana-datasource.yaml (1 hunks)
- argocd/app_plane/manifests/grafana.yaml (1 hunks)
- argocd/app_plane/manifests/podmonitor.yaml (1 hunks)
- argocd/app_plane/victoriametrics.yaml (1 hunks)
- argocd/app_plane/vmrules.yaml (1 hunks)
- argocd/ctrl_plane/dev/alloy.yaml (1 hunks)
- argocd/ctrl_plane/dev/certmanager.yaml (1 hunks)
- argocd/ctrl_plane/dev/grafana-dashboards.yaml (1 hunks)
- argocd/ctrl_plane/dev/grafana-operator.yaml (1 hunks)
- argocd/ctrl_plane/dev/ksm.yaml (1 hunks)
- argocd/ctrl_plane/dev/kubernetes-secret-generator.yaml (1 hunks)
- argocd/ctrl_plane/dev/kustomize/vmuser/kustomization.yaml (1 hunks)
- argocd/ctrl_plane/dev/kustomize/vmuser/string-secret.yaml (1 hunks)
- argocd/ctrl_plane/dev/kustomize/vmuser/vmuser.yaml (1 hunks)
- argocd/ctrl_plane/dev/manifests.yaml (1 hunks)
- argocd/ctrl_plane/dev/manifests/grafana-datasource.yaml (1 hunks)
- argocd/ctrl_plane/dev/manifests/grafana-managed-certificate.yaml (1 hunks)
- argocd/ctrl_plane/dev/manifests/grafana.yaml (1 hunks)
- argocd/ctrl_plane/dev/manifests/letsencrypt-issuer.yaml (1 hunks)
- argocd/ctrl_plane/dev/manifests/vmauth-tls.yaml (1 hunks)
- argocd/ctrl_plane/dev/victoriametrics.yaml (1 hunks)
- argocd/ctrl_plane/dev/vmrules.yaml (1 hunks)
- argocd/ctrl_plane/dev/vmusers.yaml (1 hunks)
- observability/grafana/base/grafana-dashboards.yaml (1 hunks)
- observability/grafana/base/kustomization.yaml (1 hunks)
- observability/grafana/dashboards/falkordb-cloud.json (1 hunks)
- observability/grafana/dashboards/namespace-overview.json (1 hunks)
- observability/grafana/dashboards/pod-overview.json (1 hunks)
- observability/grafana/dashboards/prometheus-dashboard.json (1 hunks)
- observability/grafana/dashboards/vmauth.json (1 hunks)
- observability/grafana/kustomization.yaml (1 hunks)
- observability/rules/alertmanager.rules.yml (1 hunks)
- observability/rules/containercpulimits.rules.yml (1 hunks)
- observability/rules/containercpurequests.rules.yml (1 hunks)
- observability/rules/containercpuusagesecondstotal.rules.yml (1 hunks)
- observability/rules/containermemorycache.rules.yml (1 hunks)
- observability/rules/containermemorylimits.rules.yml (1 hunks)
- observability/rules/containermemoryrequests.rules.yml (1 hunks)
- observability/rules/containermemoryrss.rules.yml (1 hunks)
- observability/rules/containermemoryswap.rules.yml (1 hunks)
- observability/rules/containermemoryworkingsetbytes.rules.yml (1 hunks)
- observability/rules/general.rules.yml (1 hunks)
- observability/rules/kube-state-metrics.rules.yml (1 hunks)
- observability/rules/kubernetes-apps.rules.yml (1 hunks)
- observability/rules/kubernetes-resources.rules.yml (1 hunks)
- observability/rules/kubernetes-storage.rules.yml (1 hunks)
- observability/rules/node-exporter.rules.yml (1 hunks)
- observability/rules/node-network.rules.yml (1 hunks)
- observability/rules/node.rules.yml (1 hunks)
- observability/rules/podowner.rules.yml (1 hunks)
- observability/rules/prometheus-general.rules.yml (1 hunks)
- observability/rules/prometheus-node-recording.rules.yml (1 hunks)
- observability/rules/vm-health.rules.yml (1 hunks)
- observability/rules/vmoperator.rules.yml (1 hunks)
- observability/rules/vmsingle.rules.yml (1 hunks)
- scripts/add_cluster.sh (1 hunks)
- tofu/aws/.terraform.lock.hcl (0 hunks)
- tofu/aws/1-bootstrap/.terraform.lock.hcl (1 hunks)
- tofu/aws/1-bootstrap/backend.tf (1 hunks)
- tofu/aws/1-bootstrap/main.tf (1 hunks)
- tofu/aws/1-bootstrap/outputs.tf (1 hunks)
- tofu/aws/1-bootstrap/providers.tf (1 hunks)
- tofu/aws/1-bootstrap/variables.tf (1 hunks)
- tofu/aws/2-org/.terraform.lock.hcl (1 hunks)
- tofu/aws/2-org/backend.tf (1 hunks)
- tofu/aws/2-org/main.tf (1 hunks)
- tofu/aws/2-org/variables.tf (1 hunks)
- tofu/aws/3-application_plane/.terraform.lock.hcl (1 hunks)
- tofu/aws/3-application_plane/backend.tf (1 hunks)
- tofu/aws/3-application_plane/main.tf (1 hunks)
- tofu/aws/3-application_plane/variables.tf (1 hunks)
- tofu/aws/main.tf (0 hunks)
- tofu/aws/outputs.tf (0 hunks)
- tofu/aws/variables.tf (0 hunks)
- tofu/gcp/bootstrap/seed_project/main.tf (1 hunks)
- tofu/gcp/observability_stack/control_plane/README (1 hunks)
- tofu/gcp/observability_stack/control_plane/infra/.terraform.lock.hcl (1 hunks)
- tofu/gcp/observability_stack/control_plane/infra/backend.tf (1 hunks)
- tofu/gcp/observability_stack/control_plane/infra/main.tf (1 hunks)
- tofu/gcp/observability_stack/control_plane/infra/outputs.tf (1 hunks)
- tofu/gcp/observability_stack/control_plane/infra/providers.tf (1 hunks)
- tofu/gcp/observability_stack/control_plane/infra/variables.tf (1 hunks)
- tofu/gcp/observability_stack/control_plane/k8s/.terraform.lock.hcl (1 hunks)
- tofu/gcp/observability_stack/control_plane/k8s/backend.tf (1 hunks)
- tofu/gcp/observability_stack/control_plane/k8s/main.tf (1 hunks)
- tofu/gcp/observability_stack/control_plane/k8s/providers.tf (1 hunks)
- tofu/gcp/observability_stack/control_plane/k8s/values/argocd-dev.yaml (1 hunks)
- tofu/gcp/observability_stack/control_plane/k8s/values/argocd-prod.yaml (1 hunks)
- tofu/gcp/observability_stack/control_plane/k8s/variables.tf (1 hunks)
💤 Files with no reviewable changes (8)
- .github/workflows/testing-plan-aws-infra.yaml
- .github/workflows/gcp-full-infra-test-plan.yaml
- .github/workflows/gcp-full-infra-test-apply.yaml
- tofu/aws/outputs.tf
- .github/workflows/testing-apply-aws-infra.yaml
- tofu/aws/variables.tf
- tofu/aws/main.tf
- tofu/aws/.terraform.lock.hcl
✅ Files skipped from review due to trivial changes (12)
- tofu/aws/2-org/backend.tf
- tofu/gcp/bootstrap/seed_project/main.tf
- tofu/gcp/observability_stack/control_plane/README
- tofu/aws/2-org/.terraform.lock.hcl
- tofu/aws/3-application_plane/backend.tf
- tofu/gcp/observability_stack/control_plane/k8s/backend.tf
- tofu/aws/1-bootstrap/.terraform.lock.hcl
- tofu/aws/1-bootstrap/providers.tf
- tofu/gcp/observability_stack/control_plane/infra/providers.tf
- tofu/aws/3-application_plane/.terraform.lock.hcl
- tofu/gcp/observability_stack/control_plane/infra/.terraform.lock.hcl
- tofu/gcp/observability_stack/control_plane/k8s/.terraform.lock.hcl
🧰 Additional context used
🪛 YAMLlint (1.35.1)
argocd/ctrl_plane/dev/manifests/vmauth-tls.yaml
[error] 9-9: no new line character at the end of file
(new-line-at-end-of-file)
argocd/ctrl_plane/dev/kubernetes-secret-generator.yaml
[warning] 14-14: wrong indentation: expected 8 but found 10
(indentation)
[error] 25-25: no new line character at the end of file
(new-line-at-end-of-file)
argocd/app_plane/manifests/grafana-datasource.yaml
[error] 20-20: no new line character at the end of file
(new-line-at-end-of-file)
argocd/ctrl_plane/dev/kustomize/vmuser/kustomization.yaml
[error] 8-8: no new line character at the end of file
(new-line-at-end-of-file)
argocd/ctrl_plane/dev/manifests/grafana-datasource.yaml
[error] 20-20: no new line character at the end of file
(new-line-at-end-of-file)
argocd/ctrl_plane/dev/certmanager.yaml
[error] 25-25: no new line character at the end of file
(new-line-at-end-of-file)
argocd/app_plane/victoriametrics.yaml
[error] 136-136: no new line character at the end of file
(new-line-at-end-of-file)
observability/grafana/base/grafana-dashboards.yaml
[error] 72-72: no new line character at the end of file
(new-line-at-end-of-file)
argocd/app_plane/alloy.yaml
[error] 38-38: trailing spaces
(trailing-spaces)
[error] 43-43: trailing spaces
(trailing-spaces)
observability/grafana/base/kustomization.yaml
[error] 4-4: trailing spaces
(trailing-spaces)
[error] 5-5: no new line character at the end of file
(new-line-at-end-of-file)
🪛 Checkov (3.2.334)
tofu/aws/3-application_plane/main.tf
[HIGH] 59-70: Ensure Terraform module sources use a tag with a version number
(CKV_TF_2)
[HIGH] 95-106: Ensure Terraform module sources use a tag with a version number
(CKV_TF_2)
tofu/gcp/observability_stack/control_plane/infra/main.tf
[HIGH] 191-225: Ensure public access prevention is enforced on Cloud Storage bucket
(CKV_GCP_114)
🪛 Shellcheck (0.10.0)
scripts/add_cluster.sh
[warning] 23-23: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 24-24: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 25-25: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 28-28: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 29-29: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 30-30: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 31-31: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 33-33: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 34-34: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 35-35: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 36-36: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 37-37: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 107-107: Quote this to prevent word splitting.
(SC2046)
[warning] 148-148: Quote this to prevent word splitting.
(SC2046)
🔇 Additional comments (45)
tofu/gcp/observability_stack/control_plane/k8s/values/argocd-dev.yaml (1)
10-14: Consider repository access configuration.

The repository is configured using HTTPS without explicit authentication. If this repository becomes private in the future, you'll need to add appropriate credentials.

✅ Verification successful

🏁 Script executed:

    #!/bin/bash
    # Check if the repository is public and accessible
    curl -s -o /dev/null -w "%{http_code}" https://github.com/FalkorDB/falkordb-dbaas

Length of output: 87
Repository access is configured correctly for public access.
The repository returns a 200 status code, confirming it is publicly accessible. No immediate action is needed unless it becomes private; then, update to include authentication credentials.
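If the repository does become private, Argo CD can pick the credentials up from a declarative repository Secret; a minimal sketch (the token value is obviously a placeholder):

    apiVersion: v1
    kind: Secret
    metadata:
      name: falkordb-dbaas-repo
      namespace: argocd
      labels:
        argocd.argoproj.io/secret-type: repository   # tells Argo CD to treat this Secret as repo credentials
    stringData:
      type: git
      url: https://github.com/FalkorDB/falkordb-dbaas
      username: git
      password: <personal-access-token>              # placeholder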
tofu/gcp/observability_stack/control_plane/infra/outputs.tf (1)
1-17
: LGTM! Well-structured outputs with proper security considerations.The outputs are well-defined with clear names and proper handling of sensitive data. The cluster endpoint and CA certificate are correctly marked as sensitive.
argocd/app_plane/manifests/podmonitor.yaml (1)
15-16: Consider restricting namespace scope.

Monitoring pods across all namespaces (any: true) might be too broad. Consider limiting the scope to specific namespaces where monitoring is required.
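A sketch of a narrower selector; the namespace names are placeholders and the label selector should stay whatever the manifest already uses:

    spec:
      namespaceSelector:
        matchNames:
          - falkordb                  # placeholder namespaces to scrape
          - falkordb-monitoring
      selector:
        matchLabels:
          app: falkordb               # placeholder; keep the existing label selector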
argocd/ctrl_plane/dev/manifests/letsencrypt-issuer.yaml (1)

7-9: Verify staging environment usage.

The issuer is configured to use Let's Encrypt's staging environment. This is appropriate for development but will need to be updated to production for live environments.
Make sure to update the server URL to https://acme-v02.api.letsencrypt.org/directory for production use.
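For reference, a production issuer would look roughly like this (issuer name, e-mail address, and solver are placeholders; keep whatever the dev manifest already configures):

    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: letsencrypt-prod
    spec:
      acme:
        server: https://acme-v02.api.letsencrypt.org/directory
        email: platform@example.com              # placeholder contact address
        privateKeySecretRef:
          name: letsencrypt-prod-account-key     # Secret for the ACME account key
        solvers:
          - http01:
              ingress:
                class: gce                       # placeholder; match the existing solver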
tofu/aws/2-org/variables.tf (1)

1-19: LGTM! Well-structured variable definitions.

The variables are properly defined with clear types and descriptions, following Terraform best practices.
argocd/ctrl_plane/dev/vmrules.yaml (1)
10-10
: Update targetRevision after merge.The
targetRevision
is currently set to the feature branch94-add-app-plane-observability-stack
. Remember to update this to the main branch after merging the PR to ensure continuous synchronization.argocd/ctrl_plane/dev/grafana-dashboards.yaml (1)
11-11
: Update targetRevision after merge.The
targetRevision
is currently set to the feature branch94-add-app-plane-observability-stack
. Remember to update this to the main branch after merging the PR to ensure continuous synchronization.observability/rules/prometheus-general.rules.yml (1)
1-21
: LGTM! Well-structured monitoring rules.The rules are well-designed with clear naming and appropriate label exclusions. They provide essential metrics for monitoring instance status.
argocd/ctrl_plane/dev/ksm.yaml (1)
1-26
: LGTM! Well-configured kube-state-metrics application.The configuration is well-structured with appropriate name override, node selection, and sync options.
argocd/ctrl_plane/dev/grafana-operator.yaml (1)
1-27
: LGTM! Well-structured Argo CD Application configuration.The configuration includes all necessary components:
- Proper node selection for observability workloads
- Automated sync with pruning and self-healing
- Server-side apply for better resource management
argocd/ctrl_plane/dev/manifests.yaml (2)
1-29
: LGTM! Well-configured observability stack Application.The configuration includes essential features:
- Proper namespace management
- Automated sync with pruning
- Appropriate secret handling for TLS data
10-10
: Update branch reference post-merge.The
targetRevision
is set to the feature branch94-add-app-plane-observability-stack
. This should be updated to the main branch after the PR is merged.✅ Verification successful
🏁 Script executed:
#!/bin/bash # Check if the branch exists and its status git ls-remote --heads https://github.com/FalkorDB/falkordb-dbaas.git 94-add-app-plane-observability-stackLength of output: 196
Action: Update the targetRevision Post-Merge
The verification confirms that the branch
94-add-app-plane-observability-stack
exists in the remote repository. However, since this branch is currently being used for the observability stack and the intent is to switch to a more permanent branch (such asmain
) for production post-merge, please update thetargetRevision
accordingly once the PR is merged.
- Location:
argocd/ctrl_plane/dev/manifests.yaml
(Line 10)- Recommendation: Update
"targetRevision: 94-add-app-plane-observability-stack"
to the main branch after the merge to ensure the deployment remains pointed to the stable branch.tofu/aws/3-application_plane/variables.tf (1)
1-27
: LGTM! Well-structured Terraform variables.The variables are properly defined with:
- Clear descriptions
- Appropriate types
- Sensible defaults where applicable
observability/rules/containermemoryrss.rules.yml (1)
1-21
: LGTM! Well-crafted container memory RSS monitoring rule.The rule effectively:
- Captures container memory RSS metrics
- Correlates with pod and node information
- Uses appropriate aggregations and filters
observability/rules/containermemoryswap.rules.yml (1)
1-21
: LGTM! Well-structured container memory swap monitoring rule.The rule maintains consistency with other memory metrics and properly:
- Captures container swap metrics
- Correlates with pod and node information
- Uses appropriate aggregations
observability/rules/containermemorycache.rules.yml (1)
1-21
: LGTM! Well-structured monitoring rule for container memory cache.The implementation correctly:
- Collects memory cache metrics from cAdvisor
- Filters relevant containers
- Joins with pod info for node correlation
- Records metrics with appropriate naming
observability/rules/containermemoryworkingsetbytes.rules.yml (1)
1-21
: LGTM! Well-structured monitoring rule for container memory working set.The implementation follows the same pattern as the memory cache rule, maintaining consistency in the monitoring stack.
argocd/app_plane/vmrules.yaml (1)
21-21
: Update targetRevision after merge.The
targetRevision
is currently set to the feature branch94-add-app-plane-observability-stack
. This should be updated tomain
or a specific tag after merging.argocd/app_plane/ksm.yaml (1)
26-27
: Verify observability node pool exists.Ensure that nodes with label
node_pool: observability
exist in the target clusters.❌ Verification inconclusive
🏁 Script executed:
#!/bin/bash # Check if nodes with the observability label exist kubectl get nodes --selector=node_pool=observability -o nameLength of output: 160
Manual Verification Required for "observability" Node Pool
The shell command failed because
kubectl
is not available in the current environment. Please manually verify that nodes with the labelnode_pool: observability
exist in the target cluster.observability/grafana/kustomization.yaml (2)
9-10
: Verify base kustomization exists.Ensure the
base
directory exists and contains valid Kustomize resources.✅ Verification successful
🏁 Script executed:
#!/bin/bash # Check base directory contents ls -la observability/grafana/base/ cat observability/grafana/base/kustomization.yamlLength of output: 445
Base kustomization verified.
Theobservability/grafana/base
directory exists and itskustomization.yaml
file includes the required resources (grafana-dashboards.yaml
).
18-34
: Verify dashboard JSON files exist and are valid.Ensure all referenced dashboard JSON files exist and are valid Grafana dashboard configurations.
✅ Verification successful
🏁 Script executed:
#!/bin/bash # Check if all dashboard files exist and are valid JSON for dashboard in cluster-overview namespace-overview pod-overview falkordb-cloud prometheus-dashboard vmauth; do echo "Checking $dashboard..." if [ -f "observability/grafana/dashboards/$dashboard.json" ]; then jq '.' "observability/grafana/dashboards/$dashboard.json" > /dev/null echo "✓ Valid JSON" else echo "✗ File not found" fi doneLength of output: 2017
Dashboard JSON Files Verified
All referenced dashboard JSON files exist in the expected location and their content is valid JSON.
argocd/app_plane/grafana-operator.yaml (1)
33-34
: LGTM! Good use of ServerSideApply.The use of
ServerSideApply=true
is appropriate for the Grafana operator as it helps manage complex resources with multiple controllers.argocd/app_plane/grafana-dashboards.yaml (1)
16-16
: Validate cluster name regex patternThe current regex pattern
h?c-[A-Za-z0-9]+
might be too permissive for cluster names.❌ Verification inconclusive
🏁 Script executed:
#!/bin/bash # Check for any unexpected cluster name patterns fd -t f -e yaml -e yml . | xargs grep -l "name:" | xargs grep "name:" | grep -E "h?c-[A-Za-z0-9]+"Length of output: 96
I've rechecked the repository using an automated search, but the initial grep command returned no hits—even though the file
argocd/app_plane/grafana-dashboards.yaml
does include a line with the regex (see below):name: '{{ regexFind "h?c-[A-Za-z0-9]+" .name }}-grafana-dashb'To double-check this specific file, please run the following script to inspect the snippet and confirm the regex appears as expected:
If the expected line is present, please manually verify whether the regex pattern
h?c-[A-Za-z0-9]+
meets the intended cluster name specification. The concern is that because the "h" is optional (h?
), the pattern could match names starting withc-
(without an initial "h"), which might be too permissive.Once you've confirmed the context and intended naming convention, you may adjust the pattern if necessary. Let me know if further automated search is needed.
🏁 Script executed:
#!/bin/bash # Print lines 1-40 to ensure the regex line is present in the file sed -n '1,40p' argocd/app_plane/grafana-dashboards.yamlLength of output: 1166
Review Regex Pattern in Grafana Dashboards Template
We confirmed that the file
argocd/app_plane/grafana-dashboards.yaml
includes the regex pattern"h?c-[A-Za-z0-9]+"
exactly as shown. Note that theh?
makes the initial "h" optional, so the pattern will match both "hc-" and "c-" prefixes. Please verify if this permissiveness aligns with our cluster naming convention. If cluster names should strictly start withhc-
, then the regex should be updated (e.g., to"hc-[A-Za-z0-9]+"
).observability/rules/containercpurequests.rules.yml (1)
1-34
: Same improvements needed as containercpulimits.rules.ymlobservability/rules/containermemorylimits.rules.yml (1)
1-34
: Same improvements needed as containercpulimits.rules.ymltofu/aws/1-bootstrap/main.tf (1)
47-58
: Review the object lock retention period.The 14-day retention period in COMPLIANCE mode is quite restrictive. Consider if this aligns with your backup and state management requirements.
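If a configurable window is preferred, a rough sketch of wiring the retention through a variable could look like this (the variable and bucket resource names are assumptions for illustration, not the module's current interface):
variable "state_lock_retention_days" {
  description = "COMPLIANCE-mode object lock retention for the state bucket (assumed variable)"
  type        = number
  default     = 14
}

resource "aws_s3_bucket_object_lock_configuration" "state" {
  bucket = aws_s3_bucket.state.id # assumed bucket resource name

  rule {
    default_retention {
      mode = "COMPLIANCE"
      days = var.state_lock_retention_days
    }
  }
}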
argocd/ctrl_plane/dev/vmusers.yaml (1)
21-21
: Update the targetRevision to use a stable reference. The current targetRevision points to the feature branch 94-add-app-plane-observability-stack. For production stability, consider using a tag, release, or main branch instead.
argocd/ctrl_plane/dev/manifests/grafana.yaml (1)
55-56
: Verify GKE ingress class annotation.The
kubernetes.io/ingress.class
annotation is marked as updated. Verify this is the correct annotation for your GKE version.✅ Verification successful
🌐 Web query:
What is the correct ingress class annotation for GKE?
💡 Result:
The correct ingress class annotations for Google Kubernetes Engine (GKE) are
kubernetes.io/ingress.class: "gce"
for external Application Load Balancers andkubernetes.io/ingress.class: "gce-internal"
for internal Application Load Balancers[1][3][8]. Here's a detailed breakdown:Key Annotations and Usage
Annotation Load Balancer Type Behavior kubernetes.io/ingress.class: "gce"
External Application Load Balancer Routes internet-facing traffic globally[1][4][8] kubernetes.io/ingress.class: "gce-internal"
Internal Application Load Balancer Handles private VPC traffic within a region[3][8][9] Implementation Notes
Annotation vs Field
- You must use annotations (
metadata.annotations
) rather thanspec.ingressClassName
[1][3][8]- Example syntax:
metadata: annotations: kubernetes.io/ingress.class: "gce" # or "gce-internal"Default Behavior
If no annotation is specified, GKE defaults to creating an external Application Load Balancer (gce
)[1][8].Deprecation Context
While Kubernetes deprecatedkubernetes.io/ingress.class
in favor ofingressClassName
, GKE still requires the annotation method[3][7][8]. Thespec.ingressClassName
field will be ignored by GKE's controller[8].Verification and Troubleshooting
- Check existing ingress classes:
kubectl get ingressclasses- Validate YAML indentation (common error source)[3]
- For internal load balancers, ensure regional deployment alignment[9]
Comparison Table: Annotation vs Kubernetes Standard
Feature GKE Implementation Kubernetes Standard Ingress class specification Annotation required ( kubernetes.io/ingress.class
)Field-based ( spec.ingressClassName
)[7][8]Default class handling Uses annotation absence as implicit gce
Requires explicit default class marking[7][8] Controller compatibility Works only with GCE/GCE-Internal controllers Supports multiple controllers via class matching[7] For production configurations, combine these annotations with other GKE-specific settings like static IP assignment (
kubernetes.io/ingress.global-static-ip-name
)[4][6] and health check configurations viaBackendConfig
[6][8].Citations:
- 1: https://cloud.google.com/kubernetes-engine/docs/how-to/load-balance-ingress
- 2: https://cert-manager.io/docs/tutorials/getting-started-with-cert-manager-on-google-kubernetes-engine-using-lets-encrypt-for-ingress-ssl/
- 3: https://www.googlecloudcommunity.com/gc/Google-Kubernetes-Engine-GKE/gce-internal-ingress-fails-with-quot-Missing-one-or-more/m-p/697624
- 4: https://devopscube.com/setup-ingress-gke-ingress-controller/
- 5: https://discuss.kubernetes.io/t/gke-ingress-controller-annotations-for-proxy-body-size-buffer-size-and-sever-snippets/18055
- 6: https://engineering.sada.com/google-kubernetes-engine-ingress-configuration-for-production-environments-9677a303166a?gi=13aa784068a3
- 7: https://kubernetes.io/docs/concepts/services-networking/ingress/index.html
- 8: https://cloud.google.com/kubernetes-engine/docs/concepts/ingress?hl=en
- 9: https://www.googlecloudcommunity.com/gc/Infrastructure-Compute-Storage/creating-a-region-application-load-balancer-using-ingress/m-p/827599/highlight/true
- 10: GKE Ingress controller ignoring ingress.class annotation kubernetes/ingress-gce#476
GKE ingress class annotation is correctly set to "gce".
Your configuration using
kubernetes.io/ingress.class: "gce"
is valid for external Application Load Balancers in GKE.
- The web query confirms that GKE requires
kubernetes.io/ingress.class: "gce"
for external load balancers.- No change is needed unless you intend to use internal load balancing, in which case you would use
"gce-internal"
.tofu/gcp/observability_stack/control_plane/k8s/main.tf (2)
54-58
: Review public repository visibility.The GitHub repository is set to public visibility. Verify if this is intentional and aligns with security requirements.
79-79
: Pin Argo CD chart version to latest stable.The current version 7.7.15 might be outdated. Consider updating to the latest stable version.
❌ Verification inconclusive
🌐 Web query:
What is the latest stable version of Argo CD Helm chart?
💡 Result:
The latest stable version of Argo CD itself as per recent releases is v2.14.2 (released on GitHub)[2]. However, the Argo CD Helm chart is maintained separately by the community and not explicitly listed in these search results[4][7].
For up-to-date Helm chart versions:
1️⃣ Check Argo Helm Charts repository (source)[9]
2️⃣ Use helm search repo argo-cd
after adding their repo:
helm repo add argo https://argoproj.github.io/argo-helm
The search results confirm that Argo CD's official documentation references community-maintained Helm charts[4][7], but no specific chart versions are mentioned in these sources[1-10]. Always verify via the
helm
CLI or Artifact Hub for current versions.
Citations:
- 1: https://argo-cd.readthedocs.io/en/stable/user-guide/helm/
- 2: https://github.com/argoproj/argo-cd/releases
- 3: https://www.cncf.io/blog/2024/11/05/mastering-argo-cd-image-updater-with-helm-a-complete-configuration-guide/
- 4: https://argo-cd.readthedocs.io/en/stable/operator-manual/installation/
- 5: https://argo-cd.readthedocs.io/en/release-2.12/user-guide/commands/argocd_app_history/
- 6: https://www.redhat.com/fr/blog/argo-cd-and-release-management-with-helm-charts-and-applicationsets
- 7: https://spacelift.io/blog/argocd-helm-chart
- 8: https://www.env0.com/blog/argocd-with-helm-charts
- 9: https://argoproj.github.io/argo-helm/
- 10: https://argo-cd.readthedocs.io/en/release-2.8/user-guide/helm/
Action Required: Update the Argo CD Helm Chart Version Pin
The current pin "7.7.15" appears outdated. The web query shows that while the latest stable release of Argo CD itself is v2.14.2, the Helm chart is maintained separately and its stable version isn't clearly defined in the search results. Please verify and update the version by checking the official Argo Helm Charts repository or by running:
helm repo add argo https://argoproj.github.io/argo-helm
helm search repo argo-cd
Then update the pin in tofu/gcp/observability_stack/control_plane/k8s/main.tf at line 79 accordingly.
tofu/aws/3-application_plane/main.tf (1)
108-120
: Consider enabling CloudTrail module.The CloudTrail module is commented out. Consider enabling it to ensure comprehensive audit logging.
Before enabling, verify the CloudTrail configuration:
❌ Verification inconclusive
🏁 Script executed:
#!/bin/bash # Check if CloudTrail is already configured in the account aws cloudtrail list-trails --region us-east-1Length of output: 141
Manual Verification Required: AWS CLI Not Available
The AWS CLI command returned a "command not found" error, so we couldn't verify the CloudTrail configuration automatically. Please run the command in an environment with AWS CLI available (or check via the AWS Console) to confirm whether any CloudTrail trails are configured. If no trails are found, enabling the CloudTrail module is recommended to ensure comprehensive audit logging.
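If the module stays disabled, a hedged sketch of a minimal organization trail using plain resources (resource and bucket names are illustrative assumptions; the commented-out module may expose a different interface) would be:
resource "aws_cloudtrail" "org_audit" {
  name                          = "org-audit-trail"           # assumed name
  s3_bucket_name                = aws_s3_bucket.audit_logs.id # assumed log bucket
  is_organization_trail         = true
  is_multi_region_trail         = true
  include_global_service_events = true
  enable_log_file_validation    = true
}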
argocd/app_plane/victoriametrics.yaml (1)
80-84
: Verify PagerDuty integration configuration.Ensure that the PagerDuty service key and endpoint are correctly configured.
❌ Verification inconclusive
🏁 Script executed:
#!/bin/bash # Check if the PagerDuty secret exists kubectl get secret -n observability pagerduty-service-key # Verify PagerDuty endpoint is accessible curl -I https://events.pagerduty.com/generic/2010-04-15/create_event.jsonLength of output: 768
PagerDuty Configuration Assessment
- The PagerDuty endpoint correctly points to the expected API URL. The HTTP/1.1 405 response is expected when making a GET/HEAD request, since this endpoint only accepts POST requests with the appropriate payload.
- The check for the PagerDuty secret could not be completed in this environment (kubectl was not available). Please verify that the secret (pagerduty-service-key) exists in the observability namespace on your Kubernetes cluster.
argocd/ctrl_plane/dev/victoriametrics.yaml (3)
57-59
: Verify PagerDuty secret configuration.The alertmanager configuration references a secret
pagerduty-service-key
but doesn't specify how this secret should be created or managed.Please ensure that:
- The secret is created before deploying this application
- The secret contains the required key
api-key
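For reference, a minimal sketch of creating the secret by hand before the Application syncs (the observability namespace is assumed from the app-plane check above; the key value is a placeholder):
kubectl create secret generic pagerduty-service-key \
  --namespace observability \
  --from-literal=api-key='<pagerduty-integration-key>'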
99-100
: Verify domain configuration.The VMAuth ingress is configured with a specific domain
vmauth.observability.dev.internal.falkordb.cloud
. Ensure this domain is properly configured in your DNS and matches your environment.
39-46
: Review storage configuration.The current configuration specifies:
- 200Gi storage request
- 365 days retention period
Please verify if these values align with your data retention requirements and storage capacity planning.
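If retention needs trimming, a hedged values sketch for the vmsingle section (field names assume the victoria-metrics-k8s-stack chart layout; confirm against the chart before applying):
vmsingle:
  spec:
    retentionPeriod: "180d"   # illustrative: shorter than the current 365 days
    storage:
      resources:
        requests:
          storage: 200Gi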
tofu/gcp/observability_stack/control_plane/infra/main.tf (2)
2-3
: Address the TODO comment regarding IP range.The comment suggests changing the subnet range to /24. Please clarify the rationale and implement the change if needed.
179-182
: Review autoscaling configuration.The public node pool has a high maximum node count (220). Please verify if this limit is intentional and aligns with your scaling requirements and quotas.
observability/rules/kubernetes-storage.rules.yml (1)
1-106
: Well-structured monitoring rules!The storage monitoring rules are comprehensive, covering:
- PV space utilization
- Inode usage
- PV provisioning errors
The thresholds and durations are reasonable, and each alert includes clear descriptions and runbook URLs.
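For context, a trimmed-down sketch of one such PV space alert (the threshold and duration here are illustrative, not the file's exact values):
- alert: KubePersistentVolumeFillingUp
  expr: |
    kubelet_volume_stats_available_bytes{job="kubelet"}
      /
    kubelet_volume_stats_capacity_bytes{job="kubelet"} < 0.03
  for: 1m
  labels:
    severity: critical
  annotations:
    description: PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} has less than 3% free space.
    runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup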
observability/rules/vmsingle.rules.yml (1)
1-122
: Well-structured monitoring rules!The VictoriaMetrics monitoring rules are comprehensive, covering:
- Disk space prediction and utilization
- API request errors
- Time series churn rate
- Insert performance
The thresholds and durations are reasonable, and each alert includes clear descriptions.
observability/rules/kubernetes-resources.rules.yml (1)
1-115
: Well-structured monitoring rules!The resource monitoring rules are comprehensive, covering:
- CPU and memory overcommitment
- Resource quota utilization
- CPU throttling
The thresholds and durations are reasonable, and each alert includes clear descriptions and runbook URLs.
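As an example of the overcommitment check described above, a simplified sketch of a CPU overcommit expression (this is an illustration of the idea, not the file's exact rule):
# Fires when total CPU requests exceed what the cluster could still serve with its largest node gone
sum(kube_pod_container_resource_requests{resource="cpu"})
  >
sum(kube_node_status_allocatable{resource="cpu"}) - max(kube_node_status_allocatable{resource="cpu"})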
observability/grafana/dashboards/prometheus-dashboard.json (1)
1-1428
: LGTM! Well-structured Prometheus dashboard with comprehensive metrics coverage.The dashboard includes:
- Resource usage metrics (CPU, Memory)
- Network performance metrics
- Storage metrics
- Proper templating for filtering
- Reasonable refresh interval (30s)
observability/grafana/dashboards/pod-overview.json (1)
24-25
: Verify dashboard refresh rate impact on PrometheusThe dashboard is configured with a 30-second refresh rate. While this provides near real-time monitoring, it could potentially impact Prometheus performance with high-cardinality metrics, especially for queries with long time ranges.
Run the following script to analyze the query complexity and cardinality:
Also applies to: 2318-2319
❌ Verification inconclusive
🏁 Script executed:
#!/bin/bash # Check metric cardinality for dashboard queries rg -A 1 'expr:' observability/grafana/dashboards/pod-overview.json | grep -v 'expr:' | sort -u | while read -r query; do echo "Analyzing query: $query" curl -s -g "http://localhost:9090/api/v1/series" --data-urlencode "match[]=$query" | jq -r '.data | length' doneLength of output: 108
30-Second Refresh Rate and Query Cardinality Analysis – Manual Verification Needed
The script did not produce any output, so we couldn’t automatically infer whether high-cardinality queries are present. Please manually verify the dashboard’s Prometheus query configurations (e.g., search for all keys like
"query":
in the JSON) and confirm that the 30-second refresh rate does not negatively impact Prometheus performance.
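If CI access is unavailable, a hedged manual check (assuming the metrics endpoint is port-forwarded to localhost:9090; the match[] selector below is only an example, not a query taken from the dashboard):
# List the dashboard's unique PromQL expressions
jq -r '.. | .expr? // empty' observability/grafana/dashboards/pod-overview.json | sort -u
# Count how many series a representative selector touches
curl -s 'http://localhost:9090/api/v1/series' \
  --data-urlencode 'match[]=container_memory_working_set_bytes{container!=""}' | jq '.data | length'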
observability/grafana/dashboards/vmauth.json (1)
58-59
: LGTM! Dashboard version and description are properly configuredThe dashboard is correctly versioned and includes a clear description specifying the minimum vmauth version requirement (v1.80.0).
Also applies to: 2571-2572
observability/grafana/dashboards/namespace-overview.json (2)
1-24
: LGTM! Well-structured dashboard configuration.The dashboard configuration includes proper annotations, description, and metadata. The description provides clear information about the dashboard's purpose and includes a reference to the source repository.
2610-2743
: LGTM! Well-designed templating configuration.The templating configuration is comprehensive and includes:
- Datasource selection
- Cluster filtering
- Namespace multi-select with "All" option
- Resolution options
- Created-by filter for workload selection
server:
  extraArgs:
    - --insecure
Security Risk: Server configured with insecure flag.
The --insecure flag runs the Argo CD API server without TLS, which poses a security risk. While this might be acceptable in a development environment, it should be documented and not used in production.
Consider:
- Adding a comment explaining why insecure mode is necessary
- Using proper TLS certificates instead
- If this is strictly for development, add a warning comment about not using this in production
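If the flag has to stay for dev, one hedged way to capture that intent directly in the values (the comment text is a suggestion, not existing content) is:
server:
  extraArgs:
    # DEV ONLY: TLS is terminated upstream; do not ship this flag to production.
    - --insecure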
targetRevision: 94-add-app-plane-observability-stack
path: argocd/ctrl_plane/dev
🛠️ Refactor suggestion
Avoid using feature branches for targetRevision.
Both applications are using a feature branch 94-add-app-plane-observability-stack
as the target revision. This is risky as:
- The branch will likely be deleted after merge
- It makes the configuration temporary and environment-specific
Consider using a stable branch (like main
) or a version tag instead:
- targetRevision: 94-add-app-plane-observability-stack
+ targetRevision: main # or a specific version tag
Also applies to: 39-40
global:
  image:
    tag: "v2.13.3"
  logging:
    format: json
💡 Verification agent
🧩 Analysis chain
Verify Argo CD version compatibility and security.
Let's check if v2.13.3 is the latest stable version and if there are any security advisories.
🌐 Web query:
What is the latest stable version of Argo CD and are there any known security issues with version 2.13.3?
💡 Result:
The latest stable version of Argo CD as of February 2025 is v2.14, with the most recent patch being v2.14 (released on February 3)[4]. This release follows Argo CD's quarterly minor release cadence[4].
For version v2.x.y, there are critical security concerns:
Known Security Issues in Older Versions
1️⃣ CVE-2020-8827
- Risk: High-severity brute-force login vulnerability affecting local accounts[8].
- Fix: Patched in v1.x.y (backported to supported releases)[8].
Critical Vulnerabilities Impacting Pre-vX.Y.Z Versions
- GHSA-x32m-mvfj-52xv
- Bypasses brute-force protection via DoS + memory storage flaws[5].
- Fixed in vX.Y.Z+ (e.g., vX.Y.Z requires upgrading to >=vX.Y.W)[5].
Recommended Actions
# Upgrade to latest supported release
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/vX.Y.Z/manifests/install.yaml
- Supported versions receive backported fixes[4][6].
- Unsupported versions (<vX.Y.Z) lack security updates[4].
Always verify deployments against Argo CD's security advisories.
Citations:
- 1: https://blog.argoproj.io/new-release-for-argo-cd-image-updater-0-15-d41d20a4e60e?gi=63aa42657411
- 2: https://cycode.com/blog/revealing-argo-cd-critical-vulnerability/
- 3: https://codefresh.io/blog/argo-cd-and-codefresh-gitops-security-updates-3-18-2024-preventing-brute-force-and-denial-of-service/
- 4: https://endoflife.date/argo-cd
- 5: Golang Vulnerabilities in Latest Release (2.13.3) of argocd docker image argoproj/argo-cd#21761
- 6: https://argo-cd.readthedocs.io/en/latest/operator-manual/upgrading/2.14-3.0/
- 7: https://argo-cd.readthedocs.io/en/stable/
- 8: https://argo-cd.readthedocs.io/en/stable/security_considerations/
- 9: https://github.com/argoproj/argo-cd/issues
- 10: https://github.com/argoproj/argo-cd/releases
Action Required: Update Argo CD Version
The configuration in tofu/gcp/observability_stack/control_plane/k8s/values/argocd-dev.yaml
is currently set to use the Argo CD image tag v2.13.3
. However, as of February 2025, the latest stable version is v2.14
, which includes important security patches and other fixes (e.g., addressing vulnerabilities like CVE-2020-8827 and GHSA-x32m-mvfj-52xv).
- Review the upgrade path: Verify if the deployment environment can be updated to
v2.14
without breaking compatibility. - Security evaluation: Confirm that the current version (
v2.13.3
) either has the necessary backported fixes or plan the upgrade to mitigate potential vulnerabilities.
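A hedged sketch of the corresponding values change (the tag is illustrative; confirm which chart version ships it before bumping):
global:
  image:
    tag: "v2.14.2"  # illustrative target, pending chart-compatibility check
  logging:
    format: json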
variable "project_id" { | ||
type = string | ||
} | ||
|
||
variable "region" { | ||
type = string | ||
} |
🛠️ Refactor suggestion
Enhance variable definitions with descriptions and validation.
The variables would benefit from descriptions and validation rules, especially for the region variable.
variable "project_id" {
+ description = "The GCP project ID where resources will be created"
type = string
}
variable "region" {
+ description = "The GCP region where resources will be created"
type = string
+ validation {
+ condition = can(regex("^[a-z]+-[a-z]+-[0-9]+$", var.region))
+ error_message = "Region must be a valid GCP region (e.g., us-central1)."
+ }
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
variable "project_id" { | |
type = string | |
} | |
variable "region" { | |
type = string | |
} | |
variable "project_id" { | |
description = "The GCP project ID where resources will be created" | |
type = string | |
} | |
variable "region" { | |
description = "The GCP region where resources will be created" | |
type = string | |
validation { | |
condition = can(regex("^[a-z]+-[a-z]+-[0-9]+$", var.region)) | |
error_message = "Region must be a valid GCP region (e.g., us-central1)." | |
} | |
} |
terraform {
  backend "gcs" {
    prefix = "observability_stack_control_plane_infra"
  }
}
🛠️ Refactor suggestion
Add required backend configuration settings.
The GCS backend configuration is missing required settings such as bucket name. Also, consider adding helpful comments.
terraform {
backend "gcs" {
+ # The name of the GCS bucket to store Terraform state
+ bucket = "YOUR_TERRAFORM_STATE_BUCKET"
prefix = "observability_stack_control_plane_infra"
}
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
terraform {
  backend "gcs" {
    # The name of the GCS bucket to store Terraform state
    bucket = "YOUR_TERRAFORM_STATE_BUCKET"
    prefix = "observability_stack_control_plane_infra"
  }
}
# Login to ArgoCD
gum spin --spinner dot --title "Logging in to ArgoCD..." --show-error -- \
  argocd login $ARGOCD_SERVER --username admin --password $(kubectl get secret argocd-initial-admin-secret -n argocd -o jsonpath="{.data.password}" --context $CTRL_PLANE_CTX | base64 --decode) --insecure --plaintext
Quote command substitutions to prevent word splitting.
The command substitutions in kubectl get secret
commands are not quoted, which could lead to word splitting issues.
Add quotes around command substitutions:
-argocd login $ARGOCD_SERVER --username admin --password $(kubectl get secret argocd-initial-admin-secret -n argocd -o jsonpath="{.data.password}" --context $CTRL_PLANE_CTX | base64 --decode) --insecure --plaintext
+argocd login $ARGOCD_SERVER --username admin --password "$(kubectl get secret argocd-initial-admin-secret -n argocd -o jsonpath="{.data.password}" --context $CTRL_PLANE_CTX | base64 --decode)" --insecure --plaintext
- --from-literal=password=$(kubectl get secret $CLUSTER-vmuser -n observability -o jsonpath="{.data.password}" --context $CTRL_PLANE_CTX | base64 --decode) \
+ --from-literal=password="$(kubectl get secret $CLUSTER-vmuser -n observability -o jsonpath="{.data.password}" --context $CTRL_PLANE_CTX | base64 --decode)" \
Also applies to: 148-148
🧰 Tools
🪛 Shellcheck (0.10.0)
[warning] 107-107: Quote this to prevent word splitting.
(SC2046)
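One readable way to apply the quoting fix is to capture the password first; a sketch reusing the same commands as above:
admin_password="$(kubectl get secret argocd-initial-admin-secret -n argocd \
  -o jsonpath='{.data.password}' --context "$CTRL_PLANE_CTX" | base64 --decode)"
gum spin --spinner dot --title "Logging in to ArgoCD..." --show-error -- \
  argocd login "$ARGOCD_SERVER" --username admin --password "$admin_password" --insecure --plaintext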
expr: |-
  # Without max_over_time, failed scrapes could create false negatives, see
  # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
  max_over_time(alertmanager_config_last_reload_successful{job="vm-alertmanager",namespace="default"}[5m]) == 0
🛠️ Refactor suggestion
Remove hardcoded namespace from expressions.
The expressions contain hardcoded references to the "default" namespace, which might not be correct for all deployments.
Replace hardcoded namespace with a variable:
- max_over_time(alertmanager_config_last_reload_successful{job="vm-alertmanager",namespace="default"}[5m]) == 0
+ max_over_time(alertmanager_config_last_reload_successful{job="vm-alertmanager"}[5m]) == 0
Apply similar changes to other expressions.
Also applies to: 33-35, 46-48, 61-63, 76-78, 91-91, 105-105, 124-124
prometheus.remote_write "ctrl_plane" {
  endpoint {
    url = "https://vmauth.observability.dev.internal.falkordb.cloud/prometheus/api/v1/write"

    basic_auth {
      username      = "{{ regexFind "h?c-[A-Za-z0-9]+" .name }}-user"
      password_file = "/var/run/secrets/victoriametrics/vmuser/password"
    }

    tls_config {
      insecure_skip_verify = true
    }
  }
Security concern: TLS verification is disabled.
The insecure_skip_verify = true
setting disables TLS certificate verification, which could expose the system to man-in-the-middle attacks. Consider properly configuring TLS with valid certificates.
tls_config {
- insecure_skip_verify = true
+ insecure_skip_verify = false
+ ca_file = "/path/to/ca.crt"
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
prometheus.remote_write "ctrl_plane" {
  endpoint {
    url = "https://vmauth.observability.dev.internal.falkordb.cloud/prometheus/api/v1/write"

    basic_auth {
      username      = "{{ regexFind "h?c-[A-Za-z0-9]+" .name }}-user"
      password_file = "/var/run/secrets/victoriametrics/vmuser/password"
    }

    tls_config {
      insecure_skip_verify = false
      ca_file              = "/path/to/ca.crt"
    }
  }
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 38-38: trailing spaces
(trailing-spaces)
[error] 43-43: trailing spaces
(trailing-spaces)
"expr": "sum(increase(container_oom_events_total{namespace=\"${namespace}\", pod=\"${pod}\", container!=\"\", cluster=\"$cluster\"}[$__rate_interval])) by (container)", | ||
"interval": "", |
🛠️ Refactor suggestion
Add alert thresholds for critical metrics
The container restart and OOM event panels track critical pod health indicators but lack alert thresholds.
Add alert thresholds to highlight when these metrics exceed acceptable values:
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
+ {
+ "color": "yellow",
+ "value": 1
+ },
{
"color": "red",
- "value": 80
+ "value": 3
}
]
}
Also applies to: 1862-1863
"expr": "sum(rate(vmauth_http_request_errors_total{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (reason)", | ||
"hide": false, |
🛠️ Refactor suggestion
Add severity-based thresholds for log errors
The log error panel tracks all non-info level messages but doesn't differentiate between warning and error severities.
Add color-coded thresholds based on log level severity:
{
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
+ {
+ "color": "yellow",
+ "value": 1,
+ "label": "warnings"
+ },
{
"color": "red",
+ "value": 1,
+ "label": "errors"
}
]
}
},
+ "overrides": [
+ {
+ "matcher": {
+ "id": "byName",
+ "options": "level=\"warning\""
+ },
+ "properties": [
+ {
+ "id": "color",
+ "value": {
+ "fixedColor": "yellow",
+ "mode": "fixed"
+ }
+ }
+ ]
+ }
+ ]
}
}
Also applies to: 2447-2448
User description
fix #94
PR Type
Enhancement, Documentation, Configuration changes
Description
Introduced a comprehensive observability stack for both GCP and AWS platforms.
Added Terraform configurations for infrastructure, Kubernetes control plane, and application plane observability.
Defined multiple Grafana dashboards for monitoring Kubernetes clusters, namespaces, and specific applications like FalkorDB.
Added VictoriaMetrics rules for Kubernetes application monitoring, pod owner relabeling, and general observability.
Configured ArgoCD applications for deploying observability components such as VictoriaMetrics and Grafana Operator.
Included scripts for automating cluster addition to the observability stack.
Added documentation for setting up the application plane observability stack.
Introduced Terraform backend configurations and lock files for consistent dependency management.
Changes walkthrough 📝
23 files
.terraform.lock.hcl
Add Terraform lock file for Kubernetes control plane.
tofu/gcp/observability_stack/control_plane/k8s/.terraform.lock.hcl
control plane.
reproducibility.
.terraform.lock.hcl
Add Terraform lock file for infrastructure setup.
tofu/gcp/observability_stack/control_plane/infra/.terraform.lock.hcl
.terraform.lock.hcl
Add Terraform lock file for AWS bootstrap.
tofu/aws/1-bootstrap/.terraform.lock.hcl
.terraform.lock.hcl
Add Terraform lock file for AWS application plane.
tofu/aws/3-application_plane/.terraform.lock.hcl
.terraform.lock.hcl
Add Terraform lock file for AWS organization.
tofu/aws/2-org/.terraform.lock.hcl
variables.tf
Add variables for GCP Kubernetes control plane.
tofu/gcp/observability_stack/control_plane/k8s/variables.tf
variables.tf
Add variables for AWS application plane.
tofu/aws/3-application_plane/variables.tf
providers.tf
Define required providers for Kubernetes control plane.
tofu/gcp/observability_stack/control_plane/k8s/providers.tf
providers.
variables.tf
Add variables for AWS bootstrap configuration.
tofu/aws/1-bootstrap/variables.tf
variables.tf
Add variables for AWS organization configuration.
tofu/aws/2-org/variables.tf
outputs.tf
Add outputs for GCP infrastructure resources.
tofu/gcp/observability_stack/control_plane/infra/outputs.tf
bucket.
providers.tf
Define required providers for GCP infrastructure.
tofu/gcp/observability_stack/control_plane/infra/providers.tf
providers.tf
Define required providers for AWS bootstrap.
tofu/aws/1-bootstrap/providers.tf
backend.tf
Configure Terraform backend for AWS bootstrap.
tofu/aws/1-bootstrap/backend.tf
variables.tf
Add variables for GCP infrastructure configuration.
tofu/gcp/observability_stack/control_plane/infra/variables.tf
backend.tf
Configure Terraform backend for Kubernetes control plane.
tofu/gcp/observability_stack/control_plane/k8s/backend.tf
backend.tf
Configure Terraform backend for GCP infrastructure.
tofu/gcp/observability_stack/control_plane/infra/backend.tf
outputs.tf
Add output for AWS organizational unit ID.
tofu/aws/1-bootstrap/outputs.tf
backend.tf
Configure Terraform backend for AWS application plane.
tofu/aws/3-application_plane/backend.tf
backend.tf
Configure Terraform backend for AWS organization module.
tofu/aws/2-org/backend.tf
kustomization.yaml
Add Kustomize configuration for Grafana dashboards.
observability/grafana/kustomization.yaml
victoriametrics.yaml
Added VictoriaMetrics Deployment Configuration via ArgoCD
argocd/ctrl_plane/dev/victoriametrics.yaml
stack.
alertmanager, and vmauth.
observability resources.
grafana-operator.yaml
Added Grafana Operator Deployment Configuration via ArgoCD
argocd/app_plane/grafana-operator.yaml
Operator.
policies.
13 files
main.tf
Define GCP infrastructure for observability stack.
tofu/gcp/observability_stack/control_plane/infra/main.tf
main.tf
Define AWS application plane observability resources.
tofu/aws/3-application_plane/main.tf
main.tf
Configure Kubernetes and Helm for GCP control plane.
tofu/gcp/observability_stack/control_plane/k8s/main.tf
main.tf
Define AWS bootstrap resources for Terraform state.
tofu/aws/1-bootstrap/main.tf
state.
main.tf
Define AWS organization resources for workloads.
tofu/aws/2-org/main.tf
add_cluster.sh
Add script for automated cluster addition.
scripts/add_cluster.sh
prometheus-dashboard.json
Add Grafana dashboard for Prometheus metrics.
observability/grafana/dashboards/prometheus-dashboard.json
alertmanager.rules.yml
Add Alertmanager rules for observability.
observability/rules/alertmanager.rules.yml
failures.
cluster-overview.json
Added Grafana dashboard for Kubernetes cluster observability.
observability/grafana/dashboards/cluster-overview.json
monitoring.
utilization.
transformations.
resolution.
kubernetes-apps.rules.yml
Added VictoriaMetrics rules for Kubernetes application monitoring.
observability/rules/kubernetes-apps.rules.yml
monitoring.
HPA issues.
severity levels.
podowner.rules.yml
Added VictoriaMetrics rules for Kubernetes pod owner relabeling.
observability/rules/podowner.rules.yml
jobs.
information.
namespace-overview.json
Added Kubernetes Namespace Overview Grafana Dashboard
observability/grafana/dashboards/namespace-overview.json
namespace observability.
namespaces.
namespaces and clusters.
stat panels.
falkordb-cloud.json
Added FalkorDB Cloud Grafana Dashboard
observability/grafana/dashboards/falkordb-cloud.json
observability.
network I/O.
namespaces and pods.
1 files
main.tf
Adjust formatting in GCP bootstrap module.
tofu/gcp/bootstrap/seed_project/main.tf
1 files
README
Add README for application plane observability stack.
argocd/app_plane/README
64 files
Summary by CodeRabbit
New Features
Documentation
Chores