-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add observability #98
Conversation
- bootstrap - org - app plane comment out actions
remove flows
thanos endpoint
fix namespace
labeldrop
Qodo Merge was enabled for this repository. To continue using it, please link your Git account with your Qodo account here. PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
Qodo Merge was enabled for this repository. To continue using it, please link your Git account with your Qodo account here. PR Code Suggestions ✨Explore these optional code suggestions:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 31
🧹 Nitpick comments (61)
observability/rules/podowner.rules.yml (2)
13-30
: Consider documenting the ReplicaSet selection logic.The rule correctly maps pods to their Deployment owners through ReplicaSets, but the
topk
operation's role in selecting the most recent ReplicaSet could benefit from documentation.Add a comment explaining the ReplicaSet selection logic:
- annotations: {} + # Uses topk to select the most recent ReplicaSet for each Deployment, + # ensuring accurate pod ownership during rolling updates expr: |-
1-64
: Consider documenting metric usage patterns.These pod ownership metrics (
namespace_workload_pod:kube_pod_owner:relabel
) can be valuable for:
- Tracking workload distribution across namespaces
- Monitoring pod ownership changes during deployments
- Analyzing workload patterns across clusters
Consider documenting these use cases in a README or in the related Grafana dashboard configurations.
argocd/ctrl_plane/dev/victoriametrics.yaml (1)
37-53
: Consider adjusting storage and retention settings.The VMSingle configuration has some points to consider:
- 365 days retention with 200Gi storage might not be sufficient depending on your metrics volume
- Internal load balancer is correctly configured for GCP
Monitor your storage usage and metrics ingestion rate to ensure the 200Gi storage allocation is sufficient for the 365-day retention period. Consider implementing a monitoring alert for storage capacity.
tofu/gcp/observability_stack/control_plane/k8s/variables.tf (2)
1-7
: Add descriptions and validation rules for GCP variables.The GCP-specific variables lack descriptions and validation rules. Consider adding:
- Description blocks to document the purpose and expected values
- Validation rules to ensure valid GCP project ID format and supported regions
variable "project_id" { type = string + description = "The GCP project ID where resources will be created" + validation { + condition = can(regex("^[a-z][a-z0-9-]{4,28}[a-z0-9]$", var.project_id)) + error_message = "Project ID must be between 6 and 30 characters, start with a letter, and contain only lowercase letters, numbers, and hyphens" + } } variable "region" { type = string + description = "The GCP region where resources will be created (e.g., us-central1)" + validation { + condition = can(regex("^[a-z]+-[a-z]+\\d+$", var.region)) + error_message = "Region must be a valid GCP region (e.g., us-central1)" + } }
9-19
: Add descriptions and validation rules for Kubernetes cluster variables.The Kubernetes cluster variables lack descriptions and validation rules. Consider adding:
- Description blocks to document the purpose and expected values
- Validation rules to ensure valid formats
variable "cluster_endpoint" { type = string + description = "The endpoint URL of the GKE cluster" + validation { + condition = can(regex("^https://", var.cluster_endpoint)) + error_message = "Cluster endpoint must be a valid HTTPS URL" + } } variable "cluster_ca_certificate" { type = string + description = "The base64 encoded public certificate for the cluster's certificate authority" + validation { + condition = can(base64decode(var.cluster_ca_certificate)) + error_message = "Cluster CA certificate must be base64 encoded" + } } variable "cluster_name" { type = string + description = "The name of the GKE cluster" + validation { + condition = can(regex("^[a-z0-9-]+$", var.cluster_name)) + error_message = "Cluster name must contain only lowercase letters, numbers, and hyphens" + } }argocd/ctrl_plane/dev/alloy.yaml (2)
283-288
: Review resource allocation.The resource requests and limits look reasonable for a control plane deployment:
- Requests: 500m CPU, 2Gi memory
- Limits: 2 CPU, 6Gi memory
However, consider the following:
- The memory limit is 3x the request, which might indicate uncertainty about actual usage.
- Monitor actual resource usage to fine-tune these values.
254-258
: Consider maintaining the metric allowlist in a separate ConfigMap.The regex for allowed metrics is quite long and might be difficult to maintain. Consider moving it to a separate ConfigMap for better maintainability.
tofu/gcp/observability_stack/control_plane/infra/variables.tf (1)
1-7
: Add descriptions and validation rules for variables.While the variables are correctly defined, they would benefit from:
- Descriptions to document their purpose
- Validation rules to ensure proper values
Apply this diff to improve the variable definitions:
variable "project_id" { type = string + description = "The GCP project ID where resources will be created" + validation { + condition = can(regex("^[a-z][a-z0-9-]{4,28}[a-z0-9]$", var.project_id)) + error_message = "Project ID must be between 6 and 30 characters, start with a letter, and contain only lowercase letters, numbers, and hyphens." + } } variable "region" { type = string + description = "The GCP region where resources will be created" + validation { + condition = can(regex("^[a-z]+-[a-z]+[0-9]$", var.region)) + error_message = "Region must be a valid GCP region (e.g., us-central1, europe-west1)." + } }observability/grafana/base/kustomization.yaml (1)
1-5
: Fix YAML formatting issues.The Kustomization is correctly defined but has formatting issues:
- Line 4: Remove trailing spaces
- Add a newline at the end of the file
Apply this diff to fix the formatting:
apiVersion: kustomize.config.k8s.io/v1beta1 kind: Kustomization -resources: - - grafana-dashboards.yaml \ No newline at end of file +resources: + - grafana-dashboards.yaml +🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 4-4: trailing spaces
(trailing-spaces)
[error] 5-5: no new line character at the end of file
(new-line-at-end-of-file)
argocd/ctrl_plane/dev/manifests/grafana-datasource.yaml (1)
20-20
: Add newline at end of file.Add a newline character at the end of the file to comply with YAML best practices.
url: http://vmsingle-vm.observability.svc.cluster.local:8429 +
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 20-20: no new line character at the end of file
(new-line-at-end-of-file)
argocd/app_plane/dev/manifests/grafana-datasource.yaml (1)
20-20
: Add newline at end of file.Add a newline character at the end of the file to comply with YAML best practices.
url: http://vmsingle-vm-victoria-metrics-k8s-stack.observability.svc.cluster.local:8429 +
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 20-20: no new line character at the end of file
(new-line-at-end-of-file)
tofu/aws/1-bootstrap/variables.tf (1)
18-21
: Add validation for email format.Consider adding a validation block to ensure the email format is correct.
variable "account_email" { type = string description = "The email of the account to create." + validation { + condition = can(regex("^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$", var.account_email)) + error_message = "The account_email must be a valid email address." + } }observability/rules/prometheus-general.rules.yml (2)
13-16
: Enhance rule annotations for better observability.Consider adding helpful annotations to provide context about the rule's purpose and potential actions.
- - annotations: {} + - annotations: + description: "Count of instances that are currently up" + summary: "Total count of healthy instances" expr: count without(instance, pod, node) (up == 1) labels: {} record: count:up1
17-20
: Add annotations for down instances rule.Similarly, enhance the down instances rule with descriptive annotations.
- - annotations: {} + - annotations: + description: "Count of instances that are currently down" + summary: "Total count of unhealthy instances" expr: count without(instance, pod, node) (up == 0) labels: {} record: count:up0argocd/ctrl_plane/dev/certmanager.yaml (1)
22-25
: Consider enabling prune option in automated sync policyThe current sync policy might leave orphaned resources when configurations are removed. Consider adding
prune: true
to the automated sync policy for better cleanup:syncPolicy: - automated: {} + automated: + prune: true syncOptions: - CreateNamespace=true🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 25-25: no new line character at the end of file
(new-line-at-end-of-file)
tofu/aws/2-org/main.tf (1)
12-13
: Consider customizing the IAM role name for enhanced securityUsing the default "OrganizationAccountAccessRole" name is predictable. Consider using a custom role name for better security posture.
tofu/aws/3-application_plane/variables.tf (1)
1-26
: LGTM! Well-defined Terraform variables with clear descriptions.The variables are well-structured with appropriate types and descriptions.
Consider adding validation rules for critical variables.
For better error handling and security, consider adding validation rules:
variable "cloudtrail_retention_days" { type = number description = "Number of days to retain CloudTrail logs" default = 90 + validation { + condition = var.cloudtrail_retention_days >= 90 + error_message = "CloudTrail retention days must be at least 90 days for compliance." + } } variable "workloads_ou_name" { type = string description = "Name of the OU for workloads" + validation { + condition = can(regex("^[a-zA-Z0-9-_]+$", var.workloads_ou_name)) + error_message = "OU name must contain only alphanumeric characters, hyphens, and underscores." + } }observability/rules/containermemorycache.rules.yml (1)
1-21
: LGTM! Well-structured VictoriaMetrics recording rules.The rules follow best practices:
- Consistent naming convention across rules
- Proper metric selection and filtering
- Efficient use of topk for aggregation
Consider adding alert thresholds for critical memory metrics.
While the recording rules are well-defined, consider adding alert rules to proactively notify when memory metrics exceed certain thresholds.
Example alert rule structure:
- alert: ContainerHighMemoryUsage expr: node_namespace_pod_container:container_memory_rss > 0.9 * container_spec_memory_limit_bytes for: 5m labels: severity: warning annotations: summary: High memory usage for container ({{ $labels.container }}) description: Container {{ $labels.container }} in pod {{ $labels.pod }} has high memory usageAlso applies to: 1-21, 1-21
observability/rules/containermemoryworkingsetbytes.rules.yml (1)
14-20
: Add documentation for the recorded metric.The rule would benefit from annotations explaining:
- The purpose of the recorded metric
- The units of measurement
- Expected value ranges
- Any overhead considerations
Add annotations to document the metric:
- annotations: {} + annotations: + description: "Container memory working set bytes by node, namespace, pod, and container" + summary: "Records the working set size of containers in bytes" + unit: "bytes" expr: |-observability/rules/node-network.rules.yml (1)
13-21
: Enhance alert context and severity handling.The alert is well-documented but could be improved by:
- Adding more context about the impact
- Considering different severity levels based on interface type
Enhance the alert with more context:
- alert: NodeNetworkInterfaceFlapping annotations: description: Network interface "{{ $labels.device }}" changing its up status often on node-exporter {{ $labels.namespace }}/{{ $labels.pod }} + impact: "Network instability may affect pod connectivity and service availability" runbook_url: https://runbooks.prometheus-operator.dev/runbooks/general/nodenetworkinterfaceflapping summary: Network interface is often changing its status expr: changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2 for: 2m labels: - severity: warning + severity: '{{ if $labels.device =~ "^(eth0|ens|eno)" }}critical{{ else }}warning{{ end }}'observability/rules/containercpuusagesecondstotal.rules.yml (1)
13-21
: Document metric and consider using rate instead of irate.The rule would benefit from:
- Documentation about the recorded metric
- Consider using
rate
instead ofirate
for more stable measurements over longer periodsAdd documentation and consider using rate:
- annotations: {} + annotations: + description: "Container CPU usage in seconds by node, namespace, pod, and container" + summary: "Records the rate of CPU usage over 5-minute windows" + unit: "seconds" expr: |- sum by (namespace,pod,container,cluster) ( - irate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}[5m]) + rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}[5m]) ) * on (namespace,pod,cluster) group_left(node) topk by (namespace,pod,cluster) (argocd/app_plane/dev/vmrules.yaml (1)
28-33
: Add health checks and resource management.Consider enhancing the sync policy with:
- Health check configuration
- Resource limits for the ApplicationSet controller
Add health checks and resource management:
syncPolicy: syncOptions: - CreateNamespace=true + - PruneLast=true + - RespectIgnoreDifferences=true automated: prune: true selfHeal: true + info: + - name: Health + value: | + Conditions: + - type: ResourcesUpToDate + status: "True" + - type: Synced + status: "True" + resources: + limits: + cpu: 200m + memory: 256Miargocd/app_plane/dev/applicationset.yaml (1)
28-33
: Add retry strategy and progress deadline.Consider enhancing the sync policy with:
- Retry strategy for failed synchronizations
- Progress deadline for sync operations
Add retry and deadline configurations:
syncPolicy: syncOptions: - CreateNamespace=true + - RetryMaxDuration=5m + - ProgressDeadlineSeconds=600 automated: prune: true selfHeal: true + allowEmpty: false + retry: + limit: 5 + backoff: + duration: 5s + factor: 2 + maxDuration: 3mobservability/grafana/kustomization.yaml (1)
2-5
: Consider enabling name suffix hash for dashboard versioning.While disabling the name suffix hash makes the ConfigMap names predictable, enabling it would help with versioning and rolling updates of dashboards. This is especially important in a production environment where dashboard changes need to be tracked and rolled back if needed.
generatorOptions: - disableNameSuffixHash: true + disableNameSuffixHash: false labels: grafana_dashboard: "1"observability/rules/containercpurequests.rules.yml (1)
1-34
: Consider consolidating CPU metrics recording rules.The implementation is correct but shares significant logic with
containercpulimits.rules.yml
. Consider using a single rule group with parameterized metrics to reduce duplication.Example consolidation:
groups: - name: k8s.rules.container_cpu rules: - record: cluster:namespace:pod_cpu:active:kube_pod_container_resource_${type} expr: | kube_pod_container_resource_${type}{resource="cpu",job="kube-state-metrics"} * on(namespace,pod,cluster) group_left() max by(namespace,pod,cluster) ( kube_pod_status_phase{phase=~"Pending|Running"} == 1 )tofu/aws/1-bootstrap/main.tf (1)
31-58
: Add resource tags for cost tracking.Add tags to the S3 bucket and related resources for better cost allocation and resource management.
resource "aws_s3_bucket" "terraform_state" { bucket = "tf-state-${random_id.suffix.hex}" provider = aws.seed-account + tags = { + Environment = "bootstrap" + Purpose = "terraform-state" + ManagedBy = "terraform" + } }observability/rules/containermemoryrequests.rules.yml (1)
13-20
: Add descriptive annotations to recording rules.Consider adding descriptive annotations to both recording rules to improve maintainability and documentation. This helps other team members understand the purpose and usage of these metrics.
- annotations: {} + annotations: + description: "Active memory requests for pods in Pending/Running state" + summary: "Pod memory requests" expr: |- kube_pod_container_resource_requests{resource="memory",job="kube-state-metrics"} * on (namespace,pod,cluster) group_left() max by (namespace,pod,cluster) ( (kube_pod_status_phase{phase=~"Pending|Running"} == 1) ) labels: {} record: cluster:namespace:pod_memory:active:kube_pod_container_resource_requests - annotations: {} + annotations: + description: "Total memory requests per namespace for active pods" + summary: "Namespace memory requests" expr: |- sum by (namespace,cluster) (Also applies to: 21-33
observability/grafana/base/grafana-dashboards.yaml (1)
72-72
: Add newline at end of file.Add a newline character at the end of the file to comply with POSIX standards.
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 72-72: no new line character at the end of file
(new-line-at-end-of-file)
observability/rules/prometheus-node-recording.rules.yml (1)
13-36
: Enhance metric observability with additional metadata.Consider the following improvements:
- Add descriptive annotations for each recording rule
- Include additional labels for better filtering (e.g.,
job
,instance_type
)- annotations: {} + annotations: + description: "Rate of CPU usage excluding idle, iowait, and steal time" + summary: "Node CPU usage rate" expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[3m])) BY (instance,cluster) - labels: {} + labels: + job: "node" record: instance:node_cpu:rate:sumobservability/rules/node.rules.yml (1)
13-20
: Add documentation for the recording rule.Consider adding a description annotation to explain what this metric represents and how it's used.
- annotations: {} + annotations: + description: "Records the mapping between pods and nodes for use in other recording rules"argocd/ctrl_plane/dev/manifests/grafana.yaml (1)
22-24
: Consider using non-root init container.While running the init container as root works, consider using a more restrictive security context by:
- Using a custom image with the correct user/group pre-configured
- Using fsGroup to set permissions instead
argocd/app_plane/dev/victoriametrics.yaml (1)
136-136
: Add newline at end of file.Add a newline character at the end of the file to comply with POSIX standards.
.webhooks[]?.clientConfig.caBundle" +
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 136-136: no new line character at the end of file
(new-line-at-end-of-file)
observability/rules/kubernetes-storage.rules.yml (1)
1-106
: LGTM! The alerts are well-defined and cover critical storage scenarios.The alerts effectively monitor Persistent Volume health with appropriate thresholds and predictive alerts. The exclusion of ReadOnlyMany PVCs and PVCs with excluded_from_alerts label shows thoughtful implementation.
Consider adding the following improvements:
- Add alerts for StorageClass provisioner health to detect issues with dynamic provisioning.
- Add alerts for volume snapshot operations if using CSI snapshots.
Example alert for StorageClass provisioner:
- alert: KubeStorageClassProvisionerDown annotations: description: Storage class {{ $labels.storage_class }} provisioner is down or not responding. runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestorageclassprovisionerdown summary: Storage class provisioner is down. expr: | kube_storageclass_info{provisioner!=""} unless on (provisioner) sum by (provisioner) (up{job=~".*csi.*"}) > 0 for: 5m labels: severity: criticalobservability/rules/vmsingle.rules.yml (1)
1-122
: LGTM! The alerts effectively monitor VictoriaMetrics health and performance.The alerts cover critical aspects like disk space, API health, churn rate, and insert performance with appropriate thresholds and durations.
Consider adding the following improvements:
- Add alerts for backup/snapshot operations if configured.
- Add alerts for query performance degradation.
Example alert for query performance:
- alert: TooHighQueryLatency annotations: dashboard: grafana.domain.com/d/wNf0q_kZk?viewPanel=35&var-instance={{ $labels.instance }} description: Query latency is higher than expected on instance {{ $labels.instance }}. summary: High query latency on {{ $labels.instance }} expr: | histogram_quantile(0.9, sum(rate(vm_request_duration_seconds_bucket{path="/select/0/prometheus/api/v1/query"}[5m])) by (le,instance)) > 1 for: 15m labels: severity: warningobservability/rules/kubernetes-resources.rules.yml (1)
1-115
: LGTM! The alerts provide comprehensive monitoring of Kubernetes resource usage.The alerts effectively monitor resource usage at both cluster and namespace levels with appropriate thresholds and durations.
Consider adding the following improvements:
- Add alerts for resource requests vs limits ratio.
- Add alerts for resource fragmentation.
Example alert for resource requests vs limits ratio:
- alert: KubeContainerResourceRatio annotations: description: Container {{ $labels.container }} in pod {{ $labels.pod }} has resource requests significantly lower than limits. runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecontainerresourceratio summary: Container resource requests vs limits ratio is too low. expr: | sum(kube_pod_container_resource_limits{resource="cpu"}) by (namespace,pod,container) / sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace,pod,container) > 4 for: 15m labels: severity: warningobservability/rules/vm-health.rules.yml (1)
1-110
: LGTM! The alerts provide comprehensive monitoring of VictoriaMetrics health.The alerts effectively monitor various aspects of VictoriaMetrics health including process metrics, resource usage, and operational metrics.
Consider adding the following improvements:
- Add alerts for network connectivity issues.
- Add alerts for cache efficiency.
Example alert for network connectivity:
- alert: NetworkConnectivityIssues annotations: description: Network connectivity issues detected for {{ $labels.instance }}. summary: Network connectivity issues on {{ $labels.instance }} expr: | rate(vm_http_request_errors_total{code=~"5..|408"}[5m]) / rate(vm_http_requests_total[5m]) > 0.05 for: 15m labels: severity: warningscripts/add_cluster.sh (1)
1-155
: Consider adding input validation and error recovery.While the script has good error handling, it could benefit from additional robustness.
- Add validation for required inputs:
validate_input() { local var_name="$1" local var_value="$2" if [ -z "$var_value" ]; then echo "Error: $var_name cannot be empty" exit 1 fi } validate_input "ARGOCD_SERVER" "$ARGOCD_SERVER" validate_input "PAGERDUTY_API_KEY" "$PAGERDUTY_API_KEY" # Add validation for other inputs
- Add cleanup function for temporary files:
cleanup() { rm -f /tmp/review.txt } trap cleanup EXIT
- Add retries for network operations:
retry() { local retries=3 local count=0 until "$@"; do exit=$? count=$((count + 1)) if [ $count -lt $retries ]; then echo "Retry attempt $count/$retries" sleep 5 else return $exit fi done return 0 }🧰 Tools
🪛 Shellcheck (0.10.0)
[warning] 23-23: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 24-24: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 25-25: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 28-28: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 29-29: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 30-30: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 31-31: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 33-33: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 34-34: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 35-35: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 36-36: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 37-37: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 107-107: Quote this to prevent word splitting.
(SC2046)
[warning] 148-148: Quote this to prevent word splitting.
(SC2046)
observability/rules/alertmanager.rules.yml (2)
21-21
: Consider parameterizing the namespace in metric queries.The namespace "default" is hardcoded in the expression. Consider using a variable or label matcher to make the rule more reusable across different namespaces.
-max_over_time(alertmanager_config_last_reload_successful{job="vm-alertmanager",namespace="default"}[5m]) == 0 +max_over_time(alertmanager_config_last_reload_successful{job="vm-alertmanager"}[5m]) == 0
33-35
: Consider adding failure threshold to description.The alert description could be more informative by including the expected number of members.
- description: Alertmanager {{ $labels.namespace }}/{{ $labels.pod}} has only found {{ $value }} members of the {{$labels.job}} cluster. + description: Alertmanager {{ $labels.namespace }}/{{ $labels.pod}} has only found {{ $value }} members of the {{$labels.job}} cluster (expected {{ $labels.expected_members }}).observability/rules/kubernetes-apps.rules.yml (2)
18-18
: Consider adjusting the expression for more accurate crash detection.The current expression might catch brief CrashLoopBackOff states. Consider using rate() to detect sustained crash loops.
-max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", job="kube-state-metrics", namespace=~".*"}[5m]) >= 1 +rate(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", job="kube-state-metrics", namespace=~".*"}[5m]) > 0
30-30
: Consider filtering out specific phases.The current expression includes all non-ready phases. Consider explicitly listing the phases you want to alert on.
- kube_pod_status_phase{job="kube-state-metrics", namespace=~".*", phase=~"Pending|Unknown|Failed"} + kube_pod_status_phase{job="kube-state-metrics", namespace=~".*", phase=~"Failed|Unknown"}observability/rules/node-exporter.rules.yml (2)
367-367
: Consider adding warning threshold for memory utilization.The current expression only alerts at 90% utilization. Consider adding a warning threshold at a lower percentage.
-100 - (node_memory_MemAvailable_bytes{job="node-exporter"} / node_memory_MemTotal_bytes{job="node-exporter"} * 100) > 90 +100 - (node_memory_MemAvailable_bytes{job="node-exporter"} / node_memory_MemTotal_bytes{job="node-exporter"} * 100) > 80
378-378
: Consider adding rate limiting to disk I/O alerts.The current expression might be too sensitive to brief I/O spikes. Consider using avg_over_time for more stable alerting.
-rate(node_disk_io_time_weighted_seconds_total{job="node-exporter", device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|md.+|dasd.+)"}[5m]) > 10 +avg_over_time(rate(node_disk_io_time_weighted_seconds_total{job="node-exporter", device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|md.+|dasd.+)"}[5m])[15m:]) > 10argocd/app_plane/dev/alloy.yaml (1)
289-294
: Review resource limits configuration.The current CPU limit (2 cores) to request (500m) ratio is 4:1, which might lead to CPU throttling. Consider adjusting the ratio to be closer to 2:1 or 3:1.
requests: cpu: 500m memory: 2Gi limits: - cpu: "2" + cpu: "1" memory: 6Giobservability/grafana/dashboards/falkordb-cloud.json (2)
98-98
: Consider using a more precise interval for uptime calculation.The current query uses
$__interval
which might be too coarse for accurate uptime tracking.- "expr": "max(max_over_time(___customer_redis_uptime_in_seconds{pod=~\"$pod\"}[$__interval]))", + "expr": "max(___customer_redis_uptime_in_seconds{pod=~\"$pod\"})",
975-975
: Optimize the command duration calculation.The current query uses two separate aggregations which might be inefficient. Consider using a single metric if available.
- "expr": "sum(irate(___customer_redis_commands_duration_seconds_total{pod =~ \"$pod\"}[1m])) by (cmd)\n /\nsum(irate(___customer_redis_commands_total{pod =~ \"$pod\"}[1m])) by (cmd)\n", + "expr": "rate(___customer_redis_commands_duration_seconds_total{pod =~ \"$pod\"}[1m]) / rate(___customer_redis_commands_total{pod =~ \"$pod\"}[1m])",observability/grafana/dashboards/prometheus-dashboard.json (4)
1418-1421
: Consider adjusting time range and refresh settings for better monitoring.The current configuration might need adjustments:
- The default time range of 15 minutes might be too short for trend analysis
- The 30-second refresh interval could be resource-intensive for larger clusters
Consider these changes:
"time": { - "from": "now-15m", + "from": "now-3h", "to": "now" }, "refresh": "30s",Also applies to: 1290-1291
369-370
: Enhance resource monitoring resilience.The CPU and memory monitoring queries could be more robust by handling cases where resource requests/limits are not set.
Consider adding fallback logic:
- "expr": "sum(rate(container_cpu_usage_seconds_total{namespace=\"$namespace\", pod=~\"$pod\", image!=\"\", cluster=\"$cluster\"}[$__rate_interval])) / sum(kube_pod_container_resource_requests{namespace=\"$namespace\", pod=~\"$pod\", resource=\"cpu\", job=~\"$job\", cluster=\"$cluster\"})", + "expr": "sum(rate(container_cpu_usage_seconds_total{namespace=\"$namespace\", pod=~\"$pod\", image!=\"\", cluster=\"$cluster\"}[$__rate_interval])) / (sum(kube_pod_container_resource_requests{namespace=\"$namespace\", pod=~\"$pod\", resource=\"cpu\", job=~\"$job\", cluster=\"$cluster\"}) > 0 or vector(1))",Also applies to: 513-514, 587-588
1884-2315
: Enhance network monitoring coverage.While the current network metrics are good, consider adding:
- TCP connection states
- Protocol-specific metrics (HTTP, gRPC)
- Connection tracking metrics
Add these metrics:
+ { + "expr": "sum(container_network_tcp_usage_total{namespace=\"$namespace\", pod=~\"$pod\", cluster=\"$cluster\"}) by (state)", + "legendFormat": "TCP {{ state }}", + "interval": "$resolution", + "refId": "C" + }
499-604
: Add volume performance metrics.The storage monitoring could be enhanced with performance metrics:
- Volume throughput
- Volume latency
- I/O operations
Add these metrics:
+ { + "expr": "rate(kubelet_volume_stats_read_bytes_total{namespace=\"$namespace\", persistentvolumeclaim=~\".*prom.*\", cluster=\"$cluster\"}[$__rate_interval])", + "legendFormat": "{{ persistentvolumeclaim }} - Read Throughput", + "refId": "C" + }observability/grafana/dashboards/pod-overview.json (4)
29-29
: Enhance dashboard navigation and filtering.Consider adding:
- Links to related dashboards (Nodes, Namespaces)
- Additional template variables (Node, Container)
Add dashboard links:
- "links": [], + "links": [ + { + "title": "Node Overview", + "url": "/d/k8s_views_nodes/kubernetes-views-nodes?var-datasource=${datasource}&var-cluster=${cluster}" + }, + { + "title": "Namespace Overview", + "url": "/d/k8s_views_namespaces/kubernetes-views-namespaces?var-datasource=${datasource}&var-cluster=${cluster}" + } + ],
44-158
: Add container state and pod condition metrics.Enhance the status monitoring with:
- Container state transitions
- Pod conditions (Ready, PodScheduled, etc.)
Add these metrics:
+ { + "expr": "kube_pod_container_state_started{namespace=\"$namespace\", pod=\"$pod\", cluster=\"$cluster\"}", + "legendFormat": "{{ container }} - Started", + "refId": "B" + }, + { + "expr": "kube_pod_status_condition{namespace=\"$namespace\", pod=\"$pod\", cluster=\"$cluster\"}", + "legendFormat": "{{ condition }} - {{ status }}", + "refId": "C" + }
447-746
: Add memory pressure and cgroup metrics.Consider enhancing resource monitoring with:
- Memory pressure metrics
- Cgroup metrics for better resource tracking
Add these metrics:
+ { + "expr": "container_memory_pressure_level{namespace=\"$namespace\", pod=\"$pod\", cluster=\"$cluster\"}", + "legendFormat": "{{ container }} - Memory Pressure", + "refId": "C" + }, + { + "expr": "container_memory_cgroup_usage_bytes{namespace=\"$namespace\", pod=\"$pod\", cluster=\"$cluster\"}", + "legendFormat": "{{ container }} - Cgroup Memory", + "refId": "D" + }
1884-2315
: Add interface-specific and DNS metrics.Consider enhancing network monitoring with:
- Interface-specific metrics
- DNS query metrics
Add these metrics:
+ { + "expr": "container_network_receive_bytes_total{namespace=\"$namespace\", pod=\"$pod\", cluster=\"$cluster\"} by (interface)", + "legendFormat": "{{ interface }} - Received", + "refId": "C" + }, + { + "expr": "coredns_dns_request_duration_seconds_count{namespace=\"$namespace\", pod=\"$pod\", cluster=\"$cluster\"}", + "legendFormat": "DNS Requests", + "refId": "D" + }observability/grafana/dashboards/vmauth.json (2)
58-58
: Consider adding version compatibility information.The description mentions "v1.80.0 or higher" but doesn't specify the maximum compatible version.
Add version compatibility range to the description:
- "description": "Overview for VictoriaMetrics vmauth v1.80.0 or higher", + "description": "Overview for VictoriaMetrics vmauth v1.80.0 - v2.x.x",
2461-2564
: Consider adding more granular templating variables.The current templating configuration could be enhanced for better filtering capabilities.
Add these templating variables:
+"adhoc_filters": { + "name": "Custom Filters", + "type": "custom", + "includeAll": true, + "multi": true +}, +"status": { + "name": "Status", + "type": "custom", + "includeAll": true, + "options": [ + "success", + "error", + "unauthorized" + ] +}observability/grafana/dashboards/namespace-overview.json (2)
24-24
: Update dashboard description for better maintainability.The description references an external GitHub repository which may become outdated.
- "description": "This is a modern 'Namespaces View' dashboard for your Kubernetes cluster(s). Made for kube-prometheus-stack and take advantage of the latest Grafana features. GitHub repository: https://github.com/dotdc/grafana-dashboards-kubernetes", + "description": "Modern 'Namespaces View' dashboard for Kubernetes cluster(s). Compatible with kube-prometheus-stack and latest Grafana features. Monitors namespace-level metrics including resource usage, pod status, and network activity.",
1252-1253
: Enhance OOM events monitoring query.The current query for OOM events could be improved to provide more context.
- "expr": "sum(increase(container_oom_events_total{namespace=~\"${namespace}\", cluster=\"$cluster\"}[$__rate_interval])) by (namespace, pod) > 0", + "expr": "sum(increase(container_oom_events_total{namespace=~\"${namespace}\", cluster=\"$cluster\"}[$__rate_interval])) by (namespace, pod, container) > 0",This change adds container-level granularity to help identify specific containers causing OOM events.
argocd/ctrl_plane/dev/kustomize/vmuser/kustomization.yaml (1)
8-8
: Add a newline at the end of the file.Add a newline character at the end of the file to comply with YAML style requirements.
- string-secret.yaml +
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 8-8: no new line character at the end of file
(new-line-at-end-of-file)
observability/rules/containermemoryrss.rules.yml (1)
1-21
: LGTM! Well-structured monitoring rule for container RSS memory.The implementation correctly captures container RSS memory metrics with proper label correlation and multi-cluster support.
Consider adding annotations to document:
- Purpose of the rule
- Expected values/thresholds
- Related dashboards/alerts
Example:
- annotations: {} + annotations: + description: "Records RSS memory usage per container, correlated with node information" + dashboard: "https://grafana/d/k8s-resources-pod"observability/rules/containermemoryswap.rules.yml (1)
1-21
: LGTM! Verify swap metric availability in your clusters.The implementation follows the same well-structured pattern as the RSS rule. However, swap metrics availability should be verified.
Please verify that swap metrics are available in your clusters by running:
Consider adding annotations to document:
- Purpose of the rule
- Expected values/thresholds
- Related dashboards/alerts
- Whether swap is expected/supported in your environment
Example:
- annotations: {} + annotations: + description: "Records swap memory usage per container, correlated with node information" + dashboard: "https://grafana/d/k8s-resources-pod" + note: "Requires swap to be enabled on nodes"❌ Verification inconclusive
🏁 Script executed:
#!/bin/bash # Description: Check if swap metrics are being collected # Query kubelet metrics to verify swap metric existence curl -s http://localhost:8001/api/v1/nodes/$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')/proxy/metrics/cadvisor | grep container_memory_swap # Check if swap is enabled on nodes kubectl get nodes -o json | jq -r '.items[] | .metadata.name' | while read node; do echo "Node: $node" kubectl debug node/$node -it --image=busybox -- swapon --show doneLength of output: 430
Swap Metric Verification Requires Manual Checks
The script did not produce output due to missingkubectl
in the current environment. Please verify manually in your cluster that swap metrics are collected and that the nodes have swap enabled.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (100)
.github/workflows/gcp-full-infra-test-apply.yaml
(0 hunks).github/workflows/gcp-full-infra-test-plan.yaml
(0 hunks).github/workflows/testing-apply-aws-infra.yaml
(0 hunks).github/workflows/testing-plan-aws-infra.yaml
(0 hunks)README.md
(1 hunks)argocd/app_plane/README
(1 hunks)argocd/app_plane/dev/alloy.yaml
(1 hunks)argocd/app_plane/dev/applicationset.yaml
(1 hunks)argocd/app_plane/dev/grafana-dashboards.yaml
(1 hunks)argocd/app_plane/dev/grafana-operator.yaml
(1 hunks)argocd/app_plane/dev/ksm.yaml
(1 hunks)argocd/app_plane/dev/manifests/grafana-datasource.yaml
(1 hunks)argocd/app_plane/dev/manifests/grafana.yaml
(1 hunks)argocd/app_plane/dev/manifests/podmonitor.yaml
(1 hunks)argocd/app_plane/dev/victoriametrics.yaml
(1 hunks)argocd/app_plane/dev/vmrules.yaml
(1 hunks)argocd/ctrl_plane/dev/alloy.yaml
(1 hunks)argocd/ctrl_plane/dev/certmanager.yaml
(1 hunks)argocd/ctrl_plane/dev/grafana-dashboards.yaml
(1 hunks)argocd/ctrl_plane/dev/grafana-operator.yaml
(1 hunks)argocd/ctrl_plane/dev/ksm.yaml
(1 hunks)argocd/ctrl_plane/dev/kubernetes-secret-generator.yaml
(1 hunks)argocd/ctrl_plane/dev/kustomize/vmuser/kustomization.yaml
(1 hunks)argocd/ctrl_plane/dev/kustomize/vmuser/string-secret.yaml
(1 hunks)argocd/ctrl_plane/dev/kustomize/vmuser/vmuser.yaml
(1 hunks)argocd/ctrl_plane/dev/manifests.yaml
(1 hunks)argocd/ctrl_plane/dev/manifests/grafana-datasource.yaml
(1 hunks)argocd/ctrl_plane/dev/manifests/grafana-managed-certificate.yaml
(1 hunks)argocd/ctrl_plane/dev/manifests/grafana.yaml
(1 hunks)argocd/ctrl_plane/dev/manifests/letsencrypt-issuer.yaml
(1 hunks)argocd/ctrl_plane/dev/manifests/vmauth-tls.yaml
(1 hunks)argocd/ctrl_plane/dev/victoriametrics.yaml
(1 hunks)argocd/ctrl_plane/dev/vmrules.yaml
(1 hunks)argocd/ctrl_plane/dev/vmusers.yaml
(1 hunks)observability/grafana/base/grafana-dashboards.yaml
(1 hunks)observability/grafana/base/kustomization.yaml
(1 hunks)observability/grafana/dashboards/falkordb-cloud.json
(1 hunks)observability/grafana/dashboards/namespace-overview.json
(1 hunks)observability/grafana/dashboards/pod-overview.json
(1 hunks)observability/grafana/dashboards/prometheus-dashboard.json
(1 hunks)observability/grafana/dashboards/vmauth.json
(1 hunks)observability/grafana/kustomization.yaml
(1 hunks)observability/rules/alertmanager.rules.yml
(1 hunks)observability/rules/containercpulimits.rules.yml
(1 hunks)observability/rules/containercpurequests.rules.yml
(1 hunks)observability/rules/containercpuusagesecondstotal.rules.yml
(1 hunks)observability/rules/containermemorycache.rules.yml
(1 hunks)observability/rules/containermemorylimits.rules.yml
(1 hunks)observability/rules/containermemoryrequests.rules.yml
(1 hunks)observability/rules/containermemoryrss.rules.yml
(1 hunks)observability/rules/containermemoryswap.rules.yml
(1 hunks)observability/rules/containermemoryworkingsetbytes.rules.yml
(1 hunks)observability/rules/general.rules.yml
(1 hunks)observability/rules/kube-state-metrics.rules.yml
(1 hunks)observability/rules/kubernetes-apps.rules.yml
(1 hunks)observability/rules/kubernetes-resources.rules.yml
(1 hunks)observability/rules/kubernetes-storage.rules.yml
(1 hunks)observability/rules/node-exporter.rules.yml
(1 hunks)observability/rules/node-network.rules.yml
(1 hunks)observability/rules/node.rules.yml
(1 hunks)observability/rules/podowner.rules.yml
(1 hunks)observability/rules/prometheus-general.rules.yml
(1 hunks)observability/rules/prometheus-node-recording.rules.yml
(1 hunks)observability/rules/vm-health.rules.yml
(1 hunks)observability/rules/vmoperator.rules.yml
(1 hunks)observability/rules/vmsingle.rules.yml
(1 hunks)scripts/add_cluster.sh
(1 hunks)tofu/aws/.terraform.lock.hcl
(0 hunks)tofu/aws/1-bootstrap/.terraform.lock.hcl
(1 hunks)tofu/aws/1-bootstrap/backend.tf
(1 hunks)tofu/aws/1-bootstrap/main.tf
(1 hunks)tofu/aws/1-bootstrap/outputs.tf
(1 hunks)tofu/aws/1-bootstrap/providers.tf
(1 hunks)tofu/aws/1-bootstrap/variables.tf
(1 hunks)tofu/aws/2-org/.terraform.lock.hcl
(1 hunks)tofu/aws/2-org/backend.tf
(1 hunks)tofu/aws/2-org/main.tf
(1 hunks)tofu/aws/2-org/variables.tf
(1 hunks)tofu/aws/3-application_plane/.terraform.lock.hcl
(1 hunks)tofu/aws/3-application_plane/backend.tf
(1 hunks)tofu/aws/3-application_plane/main.tf
(1 hunks)tofu/aws/3-application_plane/variables.tf
(1 hunks)tofu/aws/main.tf
(0 hunks)tofu/aws/outputs.tf
(0 hunks)tofu/aws/variables.tf
(0 hunks)tofu/gcp/bootstrap/seed_project/main.tf
(1 hunks)tofu/gcp/observability_stack/control_plane/README
(1 hunks)tofu/gcp/observability_stack/control_plane/infra/.terraform.lock.hcl
(1 hunks)tofu/gcp/observability_stack/control_plane/infra/backend.tf
(1 hunks)tofu/gcp/observability_stack/control_plane/infra/main.tf
(1 hunks)tofu/gcp/observability_stack/control_plane/infra/outputs.tf
(1 hunks)tofu/gcp/observability_stack/control_plane/infra/providers.tf
(1 hunks)tofu/gcp/observability_stack/control_plane/infra/variables.tf
(1 hunks)tofu/gcp/observability_stack/control_plane/k8s/.terraform.lock.hcl
(1 hunks)tofu/gcp/observability_stack/control_plane/k8s/backend.tf
(1 hunks)tofu/gcp/observability_stack/control_plane/k8s/main.tf
(1 hunks)tofu/gcp/observability_stack/control_plane/k8s/providers.tf
(1 hunks)tofu/gcp/observability_stack/control_plane/k8s/values/dev/argocd.yaml
(1 hunks)tofu/gcp/observability_stack/control_plane/k8s/values/prod/argocd.yaml
(1 hunks)tofu/gcp/observability_stack/control_plane/k8s/variables.tf
(1 hunks)
💤 Files with no reviewable changes (8)
- .github/workflows/gcp-full-infra-test-plan.yaml
- .github/workflows/testing-apply-aws-infra.yaml
- tofu/aws/.terraform.lock.hcl
- .github/workflows/testing-plan-aws-infra.yaml
- .github/workflows/gcp-full-infra-test-apply.yaml
- tofu/aws/variables.tf
- tofu/aws/outputs.tf
- tofu/aws/main.tf
✅ Files skipped from review due to trivial changes (13)
- tofu/aws/3-application_plane/backend.tf
- tofu/gcp/observability_stack/control_plane/infra/providers.tf
- tofu/aws/2-org/backend.tf
- tofu/gcp/bootstrap/seed_project/main.tf
- argocd/ctrl_plane/dev/manifests/grafana-managed-certificate.yaml
- tofu/gcp/observability_stack/control_plane/k8s/backend.tf
- tofu/gcp/observability_stack/control_plane/README
- tofu/aws/1-bootstrap/.terraform.lock.hcl
- argocd/app_plane/README
- tofu/aws/2-org/.terraform.lock.hcl
- tofu/aws/3-application_plane/.terraform.lock.hcl
- tofu/gcp/observability_stack/control_plane/infra/.terraform.lock.hcl
- tofu/gcp/observability_stack/control_plane/k8s/.terraform.lock.hcl
🧰 Additional context used
🪛 YAMLlint (1.35.1)
observability/grafana/base/kustomization.yaml
[error] 4-4: trailing spaces
(trailing-spaces)
[error] 5-5: no new line character at the end of file
(new-line-at-end-of-file)
argocd/app_plane/dev/manifests/grafana-datasource.yaml
[error] 20-20: no new line character at the end of file
(new-line-at-end-of-file)
argocd/ctrl_plane/dev/manifests/vmauth-tls.yaml
[error] 9-9: no new line character at the end of file
(new-line-at-end-of-file)
argocd/ctrl_plane/dev/kubernetes-secret-generator.yaml
[warning] 14-14: wrong indentation: expected 8 but found 10
(indentation)
[error] 25-25: no new line character at the end of file
(new-line-at-end-of-file)
argocd/app_plane/dev/victoriametrics.yaml
[error] 136-136: no new line character at the end of file
(new-line-at-end-of-file)
argocd/ctrl_plane/dev/certmanager.yaml
[error] 25-25: no new line character at the end of file
(new-line-at-end-of-file)
argocd/app_plane/dev/alloy.yaml
[error] 38-38: trailing spaces
(trailing-spaces)
[error] 43-43: trailing spaces
(trailing-spaces)
argocd/ctrl_plane/dev/manifests/grafana-datasource.yaml
[error] 20-20: no new line character at the end of file
(new-line-at-end-of-file)
observability/grafana/base/grafana-dashboards.yaml
[error] 72-72: no new line character at the end of file
(new-line-at-end-of-file)
argocd/ctrl_plane/dev/kustomize/vmuser/kustomization.yaml
[error] 8-8: no new line character at the end of file
(new-line-at-end-of-file)
🪛 LanguageTool
README.md
[style] ~21-~21: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ade.sh: Upgrade OpenTofu dependency. 4. tofu_new_workspace.sh: Create workspace for ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~22-~22: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...sh: Create workspace for deployment. 5. tofu_select_workspace.sh: Switch between wor...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~23-~23: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...space.sh: Switch between workspaces. 6. tofu_delete_workspace.sh: Delete workspace. ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~24-~24: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...lete_workspace.sh: Delete workspace. 7. tofu_list_workspace.sh: List available works...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[uncategorized] ~24-~24: Possible missing preposition found.
Context: ...kspace. 7. tofu_list_workspace.sh: List available workspaces. 8. tofu_plan.sh: Generate e...
(AI_HYDRA_LEO_MISSING_OF)
[style] ~25-~25: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...space.sh: List available workspaces. 8. tofu_plan.sh: Generate execution plan to be ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~26-~26: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...erate execution plan to be deployed. 9. tofu_plan_aws.sh: Generate execution plan to...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~27-~27: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... to be deployed for the AWS target. 10. tofu_plan_k8s.sh: Generate execution plan to...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~28-~28: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... to be deployed for the K8S target. 11. tofu_apply.sh: Deploy the plan to the cloud ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~29-~29: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...loy the plan to the cloud provider. 12. tofu_apply_aws.sh: Deploy the AWS target to ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~30-~30: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...e AWS target to the cloud provider. 13. tofu_apply_k8s.sh: Deploy the K8S target to ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~31-~31: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...e K8S target to the cloud provider. 14. tofu_destroy.sh: Delete the deployment from ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~32-~32: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...deployment from the cloud provider. 15. tofu_output.sh: Show deployment output. 16. ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~33-~33: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ..._output.sh: Show deployment output. 16. tofu_show.sh: Show the state configuration. ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~34-~34: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...w.sh: Show the state configuration. 17. tofu_test.sh: Run Tofu tests. 18. aws_update...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~38-~38: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... Port forward into the grafana gui. 21. kubectl_connect_prometheus.sh: Port forward int...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~39-~39: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...rt forward into the prometheus gui. 22. kubectl_connect_alertmanager.sh: Port forward i...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
🪛 Shellcheck (0.10.0)
scripts/add_cluster.sh
[warning] 23-23: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 24-24: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 25-25: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 28-28: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 29-29: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 30-30: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 31-31: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 33-33: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 34-34: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 35-35: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 36-36: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 37-37: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 107-107: Quote this to prevent word splitting.
(SC2046)
[warning] 148-148: Quote this to prevent word splitting.
(SC2046)
🪛 Checkov (3.2.334)
tofu/gcp/observability_stack/control_plane/infra/main.tf
[HIGH] 191-225: Ensure public access prevention is enforced on Cloud Storage bucket
(CKV_GCP_114)
tofu/aws/3-application_plane/main.tf
[HIGH] 59-70: Ensure Terraform module sources use a tag with a version number
(CKV_TF_2)
[HIGH] 95-106: Ensure Terraform module sources use a tag with a version number
(CKV_TF_2)
🔇 Additional comments (49)
observability/rules/podowner.rules.yml (2)
1-12
: LGTM! Well-structured VMRule configuration.The metadata structure follows Kubernetes best practices, with clear namespace segregation and consistent naming conventions.
31-63
: LGTM! Well-implemented workload type rules.The DaemonSet, StatefulSet, and Job rules follow a consistent pattern, making them maintainable and easy to understand. The rules correctly track pod ownership for all major Kubernetes workload types.
argocd/ctrl_plane/dev/victoriametrics.yaml (7)
16-29
: LGTM: Well-configured dashboard settings.The dashboard configuration is well structured with:
- Server-side apply enabled for better conflict handling
- Cross-namespace imports disabled for better security
- Proper instance selector for Grafana integration
101-124
: LGTM: Proper component disablement.Correctly disabled redundant components that are likely managed elsewhere in the infrastructure:
- Grafana (managed separately)
- Node exporter
- Various Kubernetes metric collectors
128-134
: LGTM: Well-configured sync policy.The sync policy is properly configured with:
- Automatic namespace creation
- Pruning enabled for clean state
- Self-healing for automatic recovery
135-153
: LGTM: Comprehensive ignore differences configuration.Properly configured ignore differences for:
- Operator validation secret
- Webhook configuration
- Grafana deployment checksum
These configurations will prevent ArgoCD from constantly trying to sync dynamic fields.
53-81
: Review PagerDuty integration security.The Alertmanager configuration includes PagerDuty integration with some security considerations:
- Secret management is properly configured
- Alert grouping and timing parameters are well defined
Ensure the PagerDuty service key is properly stored as a Kubernetes secret:
❌ Verification inconclusive
🏁 Script executed:
#!/bin/bash # Check if the PagerDuty secret exists in the cluster kubectl get secret pagerduty-service-key -n observabilityLength of output: 157
PagerDuty Secret Verification Requires Manual Confirmation
The test command couldn’t run as expected because the environment didn’t havekubectl
available. Please manually verify the PagerDuty secret in the cluster using an appropriate environment (or ensurekubectl
is installed) with the following command:kubectl get secret pagerduty-service-key -n observability
85-101
: Review VMAuth ingress security configuration.The VMAuth ingress configuration has important security implications:
- TLS is properly enforced with cert-manager
- HTTP is correctly disabled
- Internal domain is used
Verify the cert-manager issuer exists:
❌ Verification inconclusive
🏁 Script executed:
#!/bin/bash # Check if the specified cert-manager issuer exists kubectl get issuer letsencrypt -n observabilityLength of output: 147
Manual Verification Required: Missing
kubectl
Command in EnvironmentThe cert-manager issuer check couldn't run as expected because the environment lacks the
kubectl
command. Please run the verification script in an environment withkubectl
installed or manually verify that the issuer "letsencrypt" exists in the "observability" namespace.
9-11
: Verify the Helm chart version.The configuration uses victoria-metrics-k8s-stack chart version 0.35.1. Let's verify if this is the latest stable version and check for any known security issues.
✅ Verification successful
🌐 Web query:
What is the latest stable version of victoria-metrics-k8s-stack Helm chart? Are there any known security issues with version 0.35.1?
💡 Result:
The latest stable version of the
victoria-metrics-k8s-stack
Helm chart is 0.36.0, released on February 7, 2025[8]. This version includes updates to the VictoriaMetrics Operator (v0.53.0)[8].For version 0.35.1, there are no direct mentions of security issues in the provided sources:
- Version 0.35.x series (including 0.35.7) primarily addressed configuration improvements and dependency updates[8]
- Known issues with chart removal (reported in earlier versions) were related to Helm hook execution order rather than security vulnerabilities[2][5]
However, important operational considerations exist for all versions:
- Uninstallation might leave CRDs requiring manual cleanup[5]
- ArgoCD deployments require special sync configurations to prevent certificate regeneration[5]
- Namespace deletion workflows need careful ordering when using FluxCD[2][5]
No CVEs or explicit security advisories appear in the documentation or release notes for version 0.35.1[1-10].
Citations:
- 1: https://docs.victoriametrics.com/operator/setup/
- 2: bug: removal of VM k8s chart does not work well. VictoriaMetrics/helm-charts#1563
- 3: https://docs.victoriametrics.com/helm/
- 4: https://github.com/VictoriaMetrics/helm-charts/blob/master/charts/victoria-metrics-k8s-stack/README.md
- 5: https://docs.victoriametrics.com/helm/victoriametrics-k8s-stack/
- 6: https://newreleases.io/project/github/VictoriaMetrics/helm-charts/release/victoria-metrics-k8s-stack-0.23.3
- 7: https://docs.victoriametrics.com/helm/victoria-metrics-k8s-stack/release_guide/
- 8: https://github.com/VictoriaMetrics/helm-charts/releases
- 9: https://newreleases.io/project/github/VictoriaMetrics/helm-charts/release/victoria-metrics-k8s-stack-0.22.0
- 10: https://docs.victoriametrics.com/helm/victoria-metrics-k8s-stack/changelog/
Helm Chart Version Verification: Consider Upgrading
- The latest stable version of the
victoria-metrics-k8s-stack
Helm chart is 0.36.0 (released February 7, 2025).- The current configuration pins version 0.35.1, which does not have any known security issues.
- While version 0.35.1 is acceptable from a security standpoint, upgrading to 0.36.0 may provide additional improvements and updates.
argocd/ctrl_plane/dev/alloy.yaml (3)
1-13
: LGTM! Well-structured ApplicationSet configuration.The ApplicationSet is correctly configured with Go template support and appropriate cluster selector for control plane deployments.
297-302
: LGTM! Robust sync policy configuration.The sync policy is well configured with:
- Automatic namespace creation
- Server-side apply for better conflict handling
- Automated pruning and self-healing
20-22
: Verify the Helm chart version.The chart version is pinned to 0.11.0. Please verify if this is the latest stable version that meets your requirements.
✅ Verification successful
🌐 Web query:
What is the latest version of the Grafana Alloy Helm chart?
💡 Result:
The latest version of the Grafana Alloy Helm chart can be determined from multiple sources:
GitHub Releases ([Source 4]) show active development:
alloy-0.11.x
appears as part of dependency updates for other components like Tempo.- A more recent
k8s-monitoring-2.x
series (e.g.,2.0.x
) was released for Kubernetes Monitoring integration with Alloy ([Source 3]).Documentation ([Source 1][5][9]) consistently references configuration steps for recent versions without specifying exact numbers but implies compatibility with ongoing updates like clustering features ([Source 5]).
3 Release Notes ([Source 7]) detail breaking changes up to Alloy v1.x, though these refer to application versions rather than Helm charts specifically[^7].
Based on available data:
The most explicitly referenced stable release isalloy-0.x
, while newer integrations (like Kubernetes Monitoring) usek8s-monitoring-2.x
charts with built-in support for modern Alloy configurations1[^4]. For precise deployment versions today:helm search repo grafana/alloywould return current listings directly from repositories[^9].
Citations:
- 1: https://grafana.com/docs/alloy/latest/configure/kubernetes/
- 2: https://grafana.com/docs/grafana-cloud/monitor-infrastructure/kubernetes-monitoring/release-notes/
- 3: https://grafana.com/blog/2025/01/23/kubernetes-monitoring-helm-chart-2.0-a-simpler-more-predictable-experience/
- 4: https://github.com/grafana/helm-charts/releases
- 5: https://grafana.com/docs/alloy/latest/configure/clustering/
- 6: https://community.grafana.com/c/grafana-alloy/69
- 7: https://grafana.com/docs/alloy/latest/release-notes/
- 8: https://grafana.com/docs/beyla/latest/setup/helm-alloy/
- 9: https://grafana.com/docs/helm-charts/
- 10: https://github.com/grafana/alloy/blob/main/operations/helm/charts/alloy/values.yaml
Helm Chart Version Verification
Based on our review of available sources, the pinned version
0.11.0
is part of the stablealloy-0.x
series referenced by Grafana. While newer charts (such as thek8s-monitoring-2.x
series) exist for different integrations, for deployments using the standard Alloy chart, version0.11.0
remains appropriate.tofu/aws/1-bootstrap/outputs.tf (1)
1-3
: LGTM! Output variable properly defined.The output variable is correctly defined to expose the AWS Organizational Unit ID, which can be referenced by other Terraform modules.
argocd/ctrl_plane/dev/kustomize/vmuser/string-secret.yaml (1)
1-11
: Consider API stability implications.The configuration uses an alpha version API (
secretgenerator.mittwald.de/v1alpha1
). While the implementation looks correct, alpha APIs may have breaking changes.The password configuration with 16 characters and base64 encoding is appropriate for security requirements.
✅ Verification successful
🌐 Web query:
What is the latest stable version of mittwald's secret-generator operator?
💡 Result:
The latest stable version of mittwald's Kubernetes Secret Generator operator is v3.4.0, released on April 27, 2022[1][10]. This version includes several fixes and improvements such as PR closure prevention in stale actions and CRD controller encoding adjustments[1].
Key details:
- Helm chart deployment: Available via
mittwald/kubernetes-secret-generator
Helm repository[5][9]- Docker images: Tagged as
quay.io/mittwald/kubernetes-secret-generator:v3.4.0
[1]- Features: Supports annotation-based secret generation for passwords, SSH keys, and basic auth credentials with configurable encoding/length[5][9]
While older versions like v2.1.2 exist[7], v3.x remains the current stable branch according to both GitHub releases and Artifact Hub listings[1][9][10]. The operator continues to be maintained with security-focused random string generation using cryptographically secure methods[5][6].
Citations:
- 1: https://github.com/mittwald/kubernetes-secret-generator/releases
- 2: https://pkg.go.dev/github.com/mittwald/harbor-operator/pkg/apis/v1alpha2
- 3: https://support.tools/mittwald-kubernetes-secret-generator/
- 4: https://registry.terraform.io/providers/metio/k8s/latest/docs/data-sources/camel_apache_org_build_v1_manifest
- 5: https://github.com/mittwald/kubernetes-secret-generator
- 6: https://pkg.go.dev/github.com/mittwald/kubernetes-secret-generator
- 7: https://newreleases.io/project/github/mittwald/kubernetes-secret-generator/release/v2.1.2
- 8: https://registry.hub.docker.com/r/littlefox/k8s-secret-admin
- 9: https://artifacthub.io/packages/kcl/kcl-module/kubernetes-secret-generator
- 10: https://artifacthub.io/packages/helm/kubernetes-replicator/kubernetes-secret-generator
API Stability Consideration:
The configuration is correctly implemented, generating a base64-encoded password with a specified 16-character length. However, it relies on the alpha API (secretgenerator.mittwald.de/v1alpha1
), which inherently carries risks associated with potential breaking changes. The latest stable release of the operator is v3.4.0, yet this does not automatically translate into a stable CRD API. While the current setup is acceptable, please keep an eye on future releases—if a stable CRD version becomes available, consider updating the API version accordingly.tofu/gcp/observability_stack/control_plane/infra/outputs.tf (1)
1-17
: LGTM! Well-structured outputs with proper security considerations.The outputs are well-defined with:
- Appropriate sensitive flags for cluster credentials
- Clear naming conventions
- Essential cluster access information
argocd/app_plane/dev/manifests/podmonitor.yaml (1)
8-10
: Verify the empty bearerTokenSecret configuration.The
bearerTokenSecret
configuration has empty values for bothkey
andname
. If authentication is required for accessing the metrics endpoint, please ensure these values are properly configured.❌ Verification inconclusive
🏁 Script executed:
#!/bin/bash # Description: Check if any other PodMonitors in the cluster use bearerTokenSecret # to understand if this is a common pattern or if authentication is typically required. ast-grep --pattern $'kind: PodMonitor { $$$ spec { $$$ podMetricsEndpoints { $$$ bearerTokenSecret { $$$ } $$$ } $$$ } }'Length of output: 176
Below is an additional check to search for any usage of
bearerTokenSecret
in the repository. If other configurations are found, they might shed light on whether non-empty values are standard in our PodMonitor definitions:If this returns no other instances with values configured (or if you expect some, please verify manually), it suggests that this empty configuration might be intentional—perhaps because the monitored endpoint does not require authentication. However, if the endpoint does require secure access, proper values (for both the
key
andname
) must be provided.Manual Verification Needed for bearerTokenSecret Settings
- The argocd/app_plane/dev/manifests/podmonitor.yaml file contains an empty configuration for
bearerTokenSecret
.- No other instances with non-empty values were found via our search.
- Please confirm whether the monitored metrics endpoint is expected to have authentication. If authentication is needed, update the configuration with the proper secret details.
🏁 Script executed:
#!/bin/bash # Search for all occurrences of bearerTokenSecret in YAML files rg "bearerTokenSecret:" -n .Length of output: 104
Manual Verification Required: Confirm bearerTokenSecret Configuration
- The search indicates that the only occurrence of
bearerTokenSecret
in our repository is in the fileargocd/app_plane/dev/manifests/podmonitor.yaml
with empty values.- No evidence was found of alternative configurations or non-empty values elsewhere in the codebase.
- Please verify whether the metrics endpoint indeed requires bearer token authentication. If authentication is required, the secret’s
key
andname
should be correctly specified.argocd/ctrl_plane/dev/manifests/letsencrypt-issuer.yaml (1)
9-9
: Note: Using Let's Encrypt staging environment.The configuration is currently using the staging ACME server URL. This is good for testing but will issue untrusted certificates. Remember to switch to the production URL before deploying to production:
- server: https://acme-staging-v02.api.letsencrypt.org/directory + server: https://acme-v02.api.letsencrypt.org/directorytofu/aws/2-org/variables.tf (1)
1-19
: LGTM!The variables are well-defined with appropriate types and clear descriptions. The naming convention follows Terraform best practices.
argocd/ctrl_plane/dev/manifests/grafana-datasource.yaml (1)
16-17
: Review TLS verification settings.The configuration has
tlsSkipVerify: true
, which bypasses TLS certificate verification. This could pose a security risk. Consider implementing proper TLS certificates if this is a production environment.argocd/ctrl_plane/dev/vmrules.yaml (1)
1-22
: LGTM! Verify the Git repository access.The ArgoCD Application configuration for VictoriaMetrics rules is well-structured. The sync policy with automated pruning and self-healing will ensure consistent rule management.
Run the following script to verify the Git repository access:
✅ Verification successful
🏁 Script executed:
#!/bin/bash # Description: Verify Git repository access and branch existence. # Test: Check if the repository is accessible and the branch exists gh repo view FalkorDB/falkordb-dbaas gh api repos/FalkorDB/falkordb-dbaas/branches/dev --jq '.name'Length of output: 2649
Repository Git Access Verified
The Git repository "FalkorDB/falkordb-dbaas" is accessible and the branch "dev" exists as expected.
argocd/ctrl_plane/dev/grafana-dashboards.yaml (1)
1-21
: LGTM! Well-structured configuration for Grafana dashboards.The ArgoCD Application configuration is properly set up with server-side apply, which is the recommended approach for managing Grafana dashboards to handle potential conflicts.
argocd/app_plane/dev/manifests/grafana-datasource.yaml (1)
16-17
: Review TLS verification settings.The datasource is configured to skip TLS verification (
tlsSkipVerify: true
). This could pose security risks in production environments.Please confirm if this is intentional and if there are plans to implement proper TLS certificates.
argocd/ctrl_plane/dev/certmanager.yaml (1)
14-14
: Verify cert-manager version compatibilityThe specified version v1.17.0 seems unusual. Let's verify the latest stable version and its compatibility with your Kubernetes cluster.
✅ Verification successful
🌐 Web query:
What is the latest stable version of cert-manager?
💡 Result:
The latest stable version of cert-manager is v1.17.0, released on February 6, 202523. This version includes:
- Compliance improvements for RSA signatures
- Simplified password handling for PKCS#12/JKS keystores
- Dependency updates addressing CVEs like CVE-2024-45337/453382
- Memory optimizations and Prometheus metrics enhancements[^4]
Key supporting evidence:
- Official GitHub releases show v1.17.0 as the latest2
- Helm charts list this version as current on Artifact Hub3
- Recent dependency bumps confirm active maintenance24
For enterprise environments:
- Red Hat OpenShift supports up to v1.13.6 via their certified operator1
- Rancher recommends v1.7.x but allows direct upgrades to newer versions5
Citations:
- 1: https://github.com/cert-manager/cert-manager/releases
- 2: https://ranchermanager.docs.rancher.com/getting-started/installation-and-upgrade/resources/upgrade-cert-manager
- 3: https://docs.redhat.com/en/documentation/openshift_container_platform/4.13/html/security_and_compliance/cert-manager-operator-for-red-hat-openshift
- 4: https://cert-manager.io
- 5: https://artifacthub.io/packages/helm/cert-manager/cert-manager
- 6: https://github.com/cert-manager/cert-manager/actions/runs/7117703203
- 7: https://docs.openshift.com/container-platform/4.10/security/cert_manager_operator/cert-manager-operator-release-notes.html
- 8: https://cert-manager.io/v1.1-docs/installation/kubernetes/
- 9: https://cert-manager.io/docs/installation/helm/
Cert-manager version is correctly set
The latest stable version is indeed v1.17.0 according to the web query details. No changes are needed.
argocd/ctrl_plane/dev/kubernetes-secret-generator.yaml (1)
11-11
: Verify kubernetes-secret-generator version compatibilityLet's verify if version 3.4.0 is the latest stable version available.
✅ Verification successful
🌐 Web query:
What is the latest version of kubernetes-secret-generator Helm chart?
💡 Result:
The latest version of the
kubernetes-secret-generator
Helm chart found in search results is 3.4.0, as listed on Artifact Hub under thekubernetes-replicator
repository[6]. This information was current as of February 2025 based on available documentation.For deployment reference:
helm repo add kubernetes-replicator https://kubernetes-replicator.github.io/helm-charts helm install kubernetes-secret-generator kubernetes-replicator/kubernetes-secret-generator --version 3.4.0Note there's another implementation by mittwald (GitHub) that uses different versioning[3], but its documentation doesn't explicitly state a Helm chart version number in search results[3][6].
Citations:
- 1: https://github.com/SAP/secret-generator
- 2: https://gitlab.jonasled.de/jonasled/kubernetes-secret-generator/-/tree/v2.1.2
- 3: https://github.com/mittwald/kubernetes-secret-generator
- 4: https://helm.sh/docs/topics/charts/
- 5: https://kubernetes.io/docs/tasks/configmap-secret/managing-secret-using-kustomize/
- 6: https://artifacthub.io/packages/helm/kubernetes-replicator/kubernetes-secret-generator
- 7: https://artifacthub.io/packages/helm/kubernetes-replicator/kubernetes-secret-generator/3.1.0
- 8: https://kubernetes.io/docs/concepts/configuration/secret/
Version Check: No update required.
- The latest version of the
kubernetes-secret-generator
Helm chart is confirmed to be 3.4.0.- The YAML file at
argocd/ctrl_plane/dev/kubernetes-secret-generator.yaml
correctly specifiestargetRevision: 3.4.0
.argocd/ctrl_plane/dev/grafana-operator.yaml (2)
22-23
: LGTM! Good use of ServerSideApplyThe ServerSideApply option is a good choice for managing complex resources and reducing conflicts.
10-11
: Verify Helm repository URL format and versionThe repository URL
ghcr.io/grafana/helm-charts
might need to be prefixed withoci://
. Also, let's verify the latest stable version.✅ Verification successful
🌐 Web query:
What is the correct Helm repository URL format for ghcr.io and the latest version of grafana-operator?
💡 Result:
The correct Helm repository URL format for installing Grafana Operator from GitHub Container Registry (GHCR) using OCI registry format is:
OCI-based Helm chart URL:
oci://ghcr.io/grafana/helm-charts/grafana-operator
Latest verified version (as of July 2024):
v5.16.0
Example installation command combining both:
helm upgrade -i grafana-operator oci://ghcr.io/grafana/helm-charts/grafana-operator --version v5.16.0Key details from official sources:
- The OCI registry format requires using
oci://
protocol prefix[2][6][9]- GHCR path follows standard container registry patterns:
ghcr.io/{ORG}/{REPO}/{CHART}
[2][7][10]- Version pinning is recommended using
--version
flag[2][5][9]- CRDs must be updated separately before upgrades[2][5]
For non-OCI installations (traditional Helm repos), use:
helm repo add grafana https://grafana.github.io/helm-chartsBut this method doesn't apply to GHCR-hosted charts[3][8].
Citations:
- 1: https://paulyu.dev/article/publishing-helm-charts-to-ghcr/
- 2: https://grafana.github.io/grafana-operator/docs/installation/helm/
- 3: https://grafana.com/docs/grafana/latest/setup-grafana/installation/helm/
- 4: https://fluxcd.io/flux/components/source/helmrepositories/
- 5: https://github.com/grafana/grafana-operator/releases
- 6: https://grafana.com/blog/2024/04/24/how-to-use-the-grafana-operator-managing-a-grafana-cloud-stack-in-kubernetes/
- 7: https://trstringer.com/helm-charts-github-container-registry/
- 8: https://grafana.com/docs/agent/latest/operator/helm-getting-started/
- 9: https://grafana.github.io/grafana-operator/docs/quick-start/
- 10: https://niklasmtj.de/blog/use-ghcr-to-host-helm-charts/
Action Required: Update the Helm Repository URL
- The correct OCI-based URL for installing Grafana Operator is:
oci://ghcr.io/grafana/helm-charts/grafana-operator
- The latest version, v5.16.0, is correctly specified.
- Recommendation: Update the
repoURL
inargocd/ctrl_plane/dev/grafana-operator.yaml
fromghcr.io/grafana/helm-charts
to include the OCI prefix and chart name.argocd/ctrl_plane/dev/manifests.yaml (2)
9-11
: Consider using a stable branch for production deployments.The source configuration currently targets the
dev
branch. For production environments, consider using a stable branch, tag, or commit SHA to ensure consistent and reliable deployments.
1-29
: LGTM! Well-structured ArgoCD Application configuration.The configuration follows best practices:
- Automated sync with prune and self-heal for proper GitOps workflow
- Appropriate namespace creation and Secret handling
- Clear destination and source specifications
argocd/app_plane/dev/grafana-operator.yaml (1)
20-22
: Verify Grafana operator version and repository URL.The configuration looks good, but let's verify the latest stable version and ensure GHCR is the recommended source.
✅ Verification successful
🌐 Web query:
What is the latest stable version of the Grafana operator Helm chart? Is GHCR the recommended source?
💡 Result:
The latest stable version of the Grafana Operator Helm chart is v5.16.0, available through GitHub Container Registry (GHCR) as shown in installation commands across multiple sources[2][5][10]. This version includes support for Grafana 11.3.0, notification templates, and enhanced TLS secret management[10].
GHCR is explicitly recommended as the source for installation:
helm upgrade -i grafana-operator oci://ghcr.io/grafana/helm-charts/grafana-operator --version v5.16.0This OCI-based installation method is highlighted as standard in both quickstart guides[3][5] and Helm chart documentation[2][7]. The Grafana team maintains these charts directly in GHCR, with explicit instructions to use this registry instead of traditional Helm repos[2][7].
Key advantages of using GHCR:
- Direct access to officially maintained charts
- Version-pinned deployments via OCI artifacts
- Simplified Terraform integration using
oci://
URLs[2]For production environments, always reference specific versions like v5.16.0 rather than using floating tags like
latest
[2][10].Citations:
- 1: https://grafana.com/docs/agent/latest/operator/helm-getting-started/
- 2: https://grafana.github.io/grafana-operator/docs/installation/helm/
- 3: https://grafana.github.io/grafana-operator/docs/quick-start/
- 4: https://grafana.com/docs/grafana/latest/setup-grafana/installation/helm/
- 5: https://grafana.com/blog/2024/04/24/how-to-use-the-grafana-operator-managing-a-grafana-cloud-stack-in-kubernetes/
- 6: https://helmfile.readthedocs.io/en/latest/
- 7: https://github.com/grafana/grafana-operator/pkgs/container/helm-charts%2Fgrafana-operator
- 8: https://github.com/grafana/grafana-operator/actions/runs/7784001464/workflow
- 9: https://operatorhub.io/operator/grafana-operator
- 10: https://github.com/grafana/grafana-operator/releases
Grafana Operator configuration verified
The latest stable version is confirmed as v5.16.0 and GHCR remains the recommended source for the Helm chart. The configuration in
argocd/app_plane/dev/grafana-operator.yaml
is correct as-is.argocd/app_plane/dev/grafana-dashboards.yaml (2)
1-44
: Well-structured ApplicationSet configuration!The configuration follows best practices:
- Uses Go template with strict error handling
- Implements automated sync with pruning and self-healing
- Includes server-side apply for better conflict resolution
- Creates namespace if missing
16-16
: Verify the regex pattern for cluster name extraction.The regex pattern
h?c-[A-Za-z0-9]+
assumes a specific naming convention. Ensure this pattern matches all your cluster names to prevent application generation failures.❌ Verification inconclusive
🏁 Script executed:
#!/bin/bash # Description: Verify cluster name patterns match the regex. # Get cluster names and validate against the regex pattern kubectl get clusters -n argocd -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | \ grep -vE "h?c-[A-Za-z0-9]+" || echo "All cluster names match the pattern"Length of output: 300
Action Required: Re-run the Cluster Name Check in a Proper Environment
The current verification did not truly validate the regex because the
kubectl
command wasn’t executed (it returned "command not found"). As a result, the message “All cluster names match the pattern” was printed by the fallback and does not confirm that your actual cluster names conform to the pattern.
- Ensure that the script is executed in an environment where
kubectl
is available.- Verify the cluster naming convention (i.e., whether clusters always start with "hc-" or "c-") matches the intended regex definition.
- Consider testing with a locally prepared sample of cluster names if setting up a proper Kubernetes environment isn’t immediately feasible.
observability/rules/containercpulimits.rules.yml (1)
1-34
: Well-designed recording rules for CPU limits monitoring!The implementation follows PromQL best practices:
- Proper metric naming convention
- Efficient use of label matching and aggregation
- Correct filtering of pod phases
observability/rules/containermemorylimits.rules.yml (1)
1-34
: Consistent implementation with CPU metrics rules.The implementation follows the same pattern as CPU metrics rules.
tofu/aws/1-bootstrap/main.tf (1)
47-58
: Review object lock retention period.The 14-day COMPLIANCE mode lock might be too restrictive for rapid infrastructure changes. Consider:
- Using GOVERNANCE mode instead for more flexibility
- Adjusting the retention period based on your deployment cycles
✅ Verification successful
🌐 Web query:
What are the differences between S3 Object Lock COMPLIANCE and GOVERNANCE modes?
💡 Result:
Amazon S3 Object Lock provides two retention modes (COMPLIANCE and GOVERNANCE) to enforce Write-Once-Read-Many (WORM) protection with different levels of flexibility and security:
Key Differences
Retention Enforcement
- COMPLIANCE:
- No user (including root account) can modify/delete objects or adjust retention settings during the lock period[1][3][9]
- Retention period cannot be shortened once set[1][7]
- GOVERNANCE:
- Special permissions (
s3:BypassGovernanceRetention
) allow authorized users to modify/delete objects or adjust retention settings[1][4][7]- Requires explicit bypass header (
x-amz-bypass-governance-retention:true
) for modifications[4]Use Cases
COMPLIANCE GOVERNANCE Regulatory Requirements SEC 17a-4(f), FINRA 4511, CFTC 1.31 compliance[1][9] Internal data protection policies Flexibility Zero modification capability during retention[3][8] Controlled exceptions via permissions[7][9] Implementation Final production environments[7] Testing retention configurations[1][7] Security Model
- COMPLIANCE offers absolute immutability with no override options[3][9]
- GOVERNANCE maintains audit trails through required permissions and explicit bypass actions[4][7]
Both modes work with S3 Versioning and maintain protection through storage class transitions[1]. Legal holds can be applied independently in either mode to extend protection indefinitely[9].
Citations:
- 1: https://aws.amazon.com/s3/features/object-lock/
- 2: https://aws.amazon.com/fr/blogs/storage/protecting-data-with-amazon-s3-object-lock/
- 3: https://objectfirst.com/guides/immutability/s3-object-lock-for-ransomware-protection/
- 4: https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock-managing.html
- 5: https://forums.veeam.com/object-storage-as-backup-target-f52/compliance-vs-governance-mode-t93839.html
- 6: https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock-configure.html
- 7: https://n2ws.com/blog/aws-cloud-security-compliance/amazon-s3-object-lock-part-1
- 8: https://www.youtube.com/watch?v=TH2vngv-lC8
- 9: https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html
- 10: https://aws.amazon.com/awstv/watch/dd121646b7c/
Action: Reassess the S3 Object Lock retention mode.
Based on the differences:
COMPLIANCE Mode:
- Provides absolute immutability—no one can bypass or alter the lock during the retention period.
- Once set, the retention period cannot be shortened.
GOVERNANCE Mode:
- Allows authorized users (with the appropriate
s3:BypassGovernanceRetention
permission and header) to override the retention settings if necessary.- Offers more flexibility for rapid infrastructure changes while maintaining audit capabilities.
The 14-day COMPLIANCE lock is indeed very strict, which could hinder rapid changes. If your deployment cycles require occasional overrides or adjustments, switching to GOVERNANCE mode or revising the retention period might be more appropriate.
argocd/ctrl_plane/dev/vmusers.yaml (1)
16-16
: Verify regex pattern reliability and add validation.The regex pattern
h?c-[A-Za-z0-9]+
is used for name extraction. Consider:
- Adding a comment explaining the expected format
- Implementing validation to handle cases where the pattern doesn't match
Also applies to: 31-40
❌ Verification inconclusive
🏁 Script executed:
#!/bin/bash # Description: Check if cluster names match the expected pattern # Get all cluster names with app-plane role kubectl get clusters -l role=app-plane -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | while read -r name; do if ! [[ $name =~ ^h?c-[A-Za-z0-9]+$ ]]; then echo "Warning: Cluster name '$name' doesn't match expected pattern" fi doneLength of output: 207
Regex Validation Requires Manual Verification
The test script did not produce the expected output because the
kubectl
command was not found. This prevents us from verifying whether the regex pattern reliably matches the intended cluster names. Please re-run the script in an environment wherekubectl
is available and ensure that cluster names are correctly validated. Alternatively, perform manual verification to check if the regex extracts the desired value and handles unmatched cases appropriately.observability/rules/node.rules.yml (1)
1-58
: LGTM! Well-structured node monitoring rules.The recording rules follow best practices for Kubernetes node monitoring, including CPU utilization, memory availability, and node information collection.
argocd/ctrl_plane/dev/manifests/grafana.yaml (1)
52-72
: Verify TLS configuration for ingress.The ingress is configured with TLS, but ensure:
- The certificate is properly managed by GKE
- The secret 'core-cert' exists in the cluster
❌ Verification inconclusive
🏁 Script executed:
#!/bin/bash # Check if the managed certificate exists kubectl get managedcertificate grafana-managed-cert -n observability # Verify the TLS secret exists kubectl get secret core-cert -n observabilityLength of output: 314
TLS Ingress Configuration: Manual Verification Required
The shell script output indicates that the
kubectl
command wasn’t available in this environment, so we couldn’t automatically verify whether the managed certificate (grafana-managed-cert
) and the TLS secret (core-cert
) exist in theobservability
namespace. Please run these commands manually in an environment wherekubectl
is available to confirm that:
- The managed certificate
grafana-managed-cert
is properly configured and associated in GKE.- The secret
core-cert
exists in theobservability
namespace and holds the expected TLS certificate data.README.md (1)
18-40
: LGTM! Clear and well-organized documentation.The script descriptions are concise and informative, providing a good overview of available tools.
🧰 Tools
🪛 LanguageTool
[style] ~21-~21: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ade.sh: Upgrade OpenTofu dependency. 4. tofu_new_workspace.sh: Create workspace for ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~22-~22: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...sh: Create workspace for deployment. 5. tofu_select_workspace.sh: Switch between wor...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~23-~23: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...space.sh: Switch between workspaces. 6. tofu_delete_workspace.sh: Delete workspace. ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~24-~24: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...lete_workspace.sh: Delete workspace. 7. tofu_list_workspace.sh: List available works...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[uncategorized] ~24-~24: Possible missing preposition found.
Context: ...kspace. 7. tofu_list_workspace.sh: List available workspaces. 8. tofu_plan.sh: Generate e...(AI_HYDRA_LEO_MISSING_OF)
[style] ~25-~25: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...space.sh: List available workspaces. 8. tofu_plan.sh: Generate execution plan to be ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~26-~26: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...erate execution plan to be deployed. 9. tofu_plan_aws.sh: Generate execution plan to...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~27-~27: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... to be deployed for the AWS target. 10. tofu_plan_k8s.sh: Generate execution plan to...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~28-~28: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... to be deployed for the K8S target. 11. tofu_apply.sh: Deploy the plan to the cloud ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~29-~29: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...loy the plan to the cloud provider. 12. tofu_apply_aws.sh: Deploy the AWS target to ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~30-~30: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...e AWS target to the cloud provider. 13. tofu_apply_k8s.sh: Deploy the K8S target to ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~31-~31: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...e K8S target to the cloud provider. 14. tofu_destroy.sh: Delete the deployment from ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~32-~32: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...deployment from the cloud provider. 15. tofu_output.sh: Show deployment output. 16. ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~33-~33: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ..._output.sh: Show deployment output. 16. tofu_show.sh: Show the state configuration. ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~34-~34: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...w.sh: Show the state configuration. 17. tofu_test.sh: Run Tofu tests. 18. aws_update...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~38-~38: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... Port forward into the grafana gui. 21. kubectl_connect_prometheus.sh: Port forward int...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~39-~39: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...rt forward into the prometheus gui. 22. kubectl_connect_alertmanager.sh: Port forward i...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
observability/rules/general.rules.yml (1)
1-49
: LGTM! Well-documented alerting rules with appropriate thresholds.The alerting rules are well-structured with:
- Clear descriptions and runbook URLs
- Appropriate thresholds for target monitoring
- Good use of inhibition patterns for info-level alerts
observability/rules/vmoperator.rules.yml (1)
1-75
: LGTM! Well-structured alert rules for VictoriaMetrics operator monitoring.The alert rules are comprehensive and well-designed with:
- Appropriate thresholds and evaluation periods
- Clear descriptions and summaries
- Helpful dashboard links for troubleshooting
- Proper severity levels
observability/rules/kube-state-metrics.rules.yml (1)
1-61
: LGTM! Comprehensive alert rules for kube-state-metrics monitoring.The alert rules are well-implemented with:
- Appropriate error rate thresholds (1%)
- Reasonable evaluation periods
- Helpful runbook links for troubleshooting
- Critical severity for potential metric exposure issues
tofu/aws/3-application_plane/main.tf (1)
108-120
: Consider enabling the CloudTrail module.The CloudTrail module is commented out but appears ready for use. Consider enabling it to enhance audit logging capabilities.
Would you like me to help you enable and configure the CloudTrail module?
tofu/gcp/observability_stack/control_plane/infra/main.tf (2)
1-8
: Address TODO comment about IP range.The TODO comment suggests changing the subnet range to /24, but this might be too restrictive for the current setup. Please verify the IP range requirements.
Would you like me to help analyze the IP range requirements based on the node pools and services configuration?
128-130
: Set master_ipv4_cidr_block for GKE cluster.The master CIDR block is currently commented out. This should be set to ensure proper network isolation for the control plane.
Would you like me to help calculate an appropriate CIDR block that doesn't overlap with other ranges?
scripts/add_cluster.sh (1)
14-18
: LGTM! Robust error handling setup.The script implements proper error handling with line number reporting and interrupt handling.
observability/grafana/dashboards/vmauth.json (1)
1-25
: LGTM! Well-structured dashboard configuration.The dashboard is properly configured with:
- Clear description and version requirements
- Built-in annotations for alerts
- Appropriate metadata
observability/grafana/dashboards/namespace-overview.json (1)
2603-2609
: LGTM! Appropriate refresh rate and tagging.The dashboard configuration includes:
- 30-second refresh rate
- Relevant tags for Kubernetes and Prometheus
- Proper schema version
argocd/ctrl_plane/dev/kustomize/vmuser/kustomization.yaml (2)
1-4
: LGTM! Well-structured Kustomization configuration.The configuration follows the standard format with appropriate apiVersion and descriptive metadata name.
5-8
: Verify the referenced resources exist and follow security best practices.The namespace and resource references look good. Please ensure:
- Both
vmuser.yaml
andstring-secret.yaml
exist in the same directory- The StringSecret follows security best practices for credential management
✅ Verification successful
🏁 Script executed:
#!/bin/bash # Description: Verify the existence of referenced resources and check for potential security issues # Check if the referenced files exist fd -t f "vmuser.yaml|string-secret.yaml" "argocd/ctrl_plane/dev/kustomize/vmuser" # Check for potential security issues in the secret configuration rg -i "password:|secret:|key:" "argocd/ctrl_plane/dev/kustomize/vmuser/string-secret.yaml"Length of output: 281
Resource Verification Complete: Both files exist and show no obvious security issues.
- Confirmed that
vmuser.yaml
andstring-secret.yaml
are present inargocd/ctrl_plane/dev/kustomize/vmuser
- No plain-text sensitive keys (e.g.,
password
,secret
,key
) were found instring-secret.yaml
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 8-8: no new line character at the end of file
(new-line-at-end-of-file)
Footnotes
content: |- | ||
prometheus.remote_write "in_cluster" { | ||
endpoint { | ||
url = "http://vmsingle-vm.observability.svc.cluster.local:8429/prometheus/api/v1/write" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Verify VictoriaMetrics endpoint availability.
The remote write endpoint assumes VictoriaMetrics is deployed in the observability namespace. Ensure the service exists and is accessible.
🏁 Script executed:
#!/bin/bash
# Check if the VictoriaMetrics service exists in the observability namespace
kubectl get svc vmsingle-vm -n observability 2>/dev/null || echo "Service not found"
Length of output: 91
VictoriaMetrics service not found
The verification script clearly shows that the VictoriaMetrics service ("vmsingle-vm") is missing in the "observability" namespace, meaning that the remote write endpoint is currently unreachable. Please verify the deployment details or update the endpoint if the expected service name differs.
nodeSelector: | ||
node_pool: observability-resources |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Verify node pool existence.
Ensure the node pool with label node_pool: observability-resources
exists in your GKE/EKS clusters.
🏁 Script executed:
#!/bin/bash
# Check if nodes with the required label exist
kubectl get nodes -l node_pool=observability-resources 2>/dev/null || echo "No nodes found with required label"
Length of output: 135
Action Required: Node Pool "observability-resources" not found
The script output indicates that no nodes have the label node_pool=observability-resources
in the current cluster configuration. Please verify if:
- The node pool is intended to exist in this cluster.
- The pool has been correctly labeled.
- A configuration update is required either in the cluster or in the ArgoCD manifest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
README.md (1)
1-56
: 🛠️ Refactor suggestionAdd documentation for the observability stack.
Given this PR's focus on observability enhancements, please add a new section documenting:
- Observability stack architecture
- Components (VictoriaMetrics, Grafana, etc.)
- ArgoCD integration
- Monitoring and alerting setup
🧰 Tools
🪛 LanguageTool
[style] ~21-~21: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ade.sh: Upgrade OpenTofu dependency. 4. tofu_new_workspace.sh: Create workspace for ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~22-~22: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...sh: Create workspace for deployment. 5. tofu_select_workspace.sh: Switch between wor...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~23-~23: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...space.sh: Switch between workspaces. 6. tofu_delete_workspace.sh: Delete workspace. ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~24-~24: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...lete_workspace.sh: Delete workspace. 7. tofu_list_workspace.sh: List available works...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[uncategorized] ~24-~24: Possible missing preposition found.
Context: ...kspace. 7. tofu_list_workspace.sh: List available workspaces. 8. tofu_plan.sh: Generate e...(AI_HYDRA_LEO_MISSING_OF)
[style] ~25-~25: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...space.sh: List available workspaces. 8. tofu_plan.sh: Generate execution plan to be ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~26-~26: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...erate execution plan to be deployed. 9. tofu_plan_aws.sh: Generate execution plan to...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~27-~27: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... to be deployed for the AWS target. 10. tofu_plan_k8s.sh: Generate execution plan to...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~28-~28: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... to be deployed for the K8S target. 11. tofu_apply.sh: Deploy the plan to the cloud ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~29-~29: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...loy the plan to the cloud provider. 12. tofu_apply_aws.sh: Deploy the AWS target to ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~30-~30: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...e AWS target to the cloud provider. 13. tofu_apply_k8s.sh: Deploy the K8S target to ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~31-~31: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...e K8S target to the cloud provider. 14. tofu_destroy.sh: Delete the deployment from ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~32-~32: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...deployment from the cloud provider. 15. tofu_output.sh: Show deployment output. 16. ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~33-~33: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ..._output.sh: Show deployment output. 16. tofu_show.sh: Show the state configuration. ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~34-~34: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...w.sh: Show the state configuration. 17. tofu_test.sh: Run Tofu tests. 18. aws_update...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~38-~38: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... Port forward into the grafana gui. 21. kubectl_connect_prometheus.sh: Port forward int...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~39-~39: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...rt forward into the prometheus gui. 22. kubectl_connect_alertmanager.sh: Port forward i...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
🧹 Nitpick comments (1)
README.md (1)
18-18
: Enhance documentation for the newadd_cluster.sh
script.Given this script's importance in setting up observability, please provide more detailed documentation including:
- Required environment variables
- Prerequisites (e.g., gum CLI tool)
- Step-by-step usage instructions
- Expected outcomes
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
README.md
(1 hunks)
🧰 Additional context used
🪛 LanguageTool
README.md
[style] ~21-~21: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ade.sh: Upgrade OpenTofu dependency. 4. tofu_new_workspace.sh: Create workspace for ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~22-~22: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...sh: Create workspace for deployment. 5. tofu_select_workspace.sh: Switch between wor...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~23-~23: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...space.sh: Switch between workspaces. 6. tofu_delete_workspace.sh: Delete workspace. ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~24-~24: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...lete_workspace.sh: Delete workspace. 7. tofu_list_workspace.sh: List available works...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[uncategorized] ~24-~24: Possible missing preposition found.
Context: ...kspace. 7. tofu_list_workspace.sh: List available workspaces. 8. tofu_plan.sh: Generate e...
(AI_HYDRA_LEO_MISSING_OF)
[style] ~25-~25: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...space.sh: List available workspaces. 8. tofu_plan.sh: Generate execution plan to be ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~26-~26: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...erate execution plan to be deployed. 9. tofu_plan_aws.sh: Generate execution plan to...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~27-~27: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... to be deployed for the AWS target. 10. tofu_plan_k8s.sh: Generate execution plan to...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~28-~28: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... to be deployed for the K8S target. 11. tofu_apply.sh: Deploy the plan to the cloud ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~29-~29: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...loy the plan to the cloud provider. 12. tofu_apply_aws.sh: Deploy the AWS target to ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~30-~30: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...e AWS target to the cloud provider. 13. tofu_apply_k8s.sh: Deploy the K8S target to ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~31-~31: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...e K8S target to the cloud provider. 14. tofu_destroy.sh: Delete the deployment from ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~32-~32: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...deployment from the cloud provider. 15. tofu_output.sh: Show deployment output. 16. ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~33-~33: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ..._output.sh: Show deployment output. 16. tofu_show.sh: Show the state configuration. ...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~34-~34: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...w.sh: Show the state configuration. 17. tofu_test.sh: Run Tofu tests. 18. aws_update...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~38-~38: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... Port forward into the grafana gui. 21. kubectl_connect_prometheus.sh: Port forward int...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~39-~39: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...rt forward into the prometheus gui. 22. kubectl_connect_alertmanager.sh: Port forward i...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
🔇 Additional comments (1)
README.md (1)
19-40
: LGTM! Well-organized script listing.The reordering maintains a logical grouping of related scripts while preserving clear descriptions.
🧰 Tools
🪛 LanguageTool
[style] ~21-~21: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ade.sh: Upgrade OpenTofu dependency. 4. tofu_new_workspace.sh: Create workspace for ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~22-~22: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...sh: Create workspace for deployment. 5. tofu_select_workspace.sh: Switch between wor...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~23-~23: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...space.sh: Switch between workspaces. 6. tofu_delete_workspace.sh: Delete workspace. ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~24-~24: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...lete_workspace.sh: Delete workspace. 7. tofu_list_workspace.sh: List available works...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[uncategorized] ~24-~24: Possible missing preposition found.
Context: ...kspace. 7. tofu_list_workspace.sh: List available workspaces. 8. tofu_plan.sh: Generate e...(AI_HYDRA_LEO_MISSING_OF)
[style] ~25-~25: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...space.sh: List available workspaces. 8. tofu_plan.sh: Generate execution plan to be ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~26-~26: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...erate execution plan to be deployed. 9. tofu_plan_aws.sh: Generate execution plan to...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~27-~27: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... to be deployed for the AWS target. 10. tofu_plan_k8s.sh: Generate execution plan to...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~28-~28: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... to be deployed for the K8S target. 11. tofu_apply.sh: Deploy the plan to the cloud ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~29-~29: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...loy the plan to the cloud provider. 12. tofu_apply_aws.sh: Deploy the AWS target to ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~30-~30: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...e AWS target to the cloud provider. 13. tofu_apply_k8s.sh: Deploy the K8S target to ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~31-~31: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...e K8S target to the cloud provider. 14. tofu_destroy.sh: Delete the deployment from ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~32-~32: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...deployment from the cloud provider. 15. tofu_output.sh: Show deployment output. 16. ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~33-~33: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ..._output.sh: Show deployment output. 16. tofu_show.sh: Show the state configuration. ...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~34-~34: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...w.sh: Show the state configuration. 17. tofu_test.sh: Run Tofu tests. 18. aws_update...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~38-~38: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... Port forward into the grafana gui. 21. kubectl_connect_prometheus.sh: Port forward int...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~39-~39: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...rt forward into the prometheus gui. 22. kubectl_connect_alertmanager.sh: Port forward i...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
extract ip ranges to variables remove psc subnet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (1)
tofu/gcp/observability_stack/control_plane/infra/main.tf (1)
6-34
: Consider adding VPC flow logs for enhanced network visibility.Given this is an observability stack, enabling VPC flow logs would provide valuable network traffic insights.
module "vpc" { source = "terraform-google-modules/network/google" version = "~> 9.0" project_id = var.project_id network_name = "observability-stack-network" routing_mode = "REGIONAL" auto_create_subnetworks = false + # Enable flow logs for better network observability + enable_flow_logs = true + flow_logs_sampling = 0.5 + flow_logs_metadata = "INCLUDE_ALL_METADATA"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tofu/aws/3-application_plane/main.tf
(1 hunks)tofu/gcp/observability_stack/control_plane/infra/main.tf
(1 hunks)tofu/gcp/observability_stack/control_plane/infra/variables.tf
(1 hunks)
🧰 Additional context used
🪛 Checkov (3.2.334)
tofu/aws/3-application_plane/main.tf
[HIGH] 59-70: Ensure Terraform module sources use a tag with a version number
(CKV_TF_2)
[HIGH] 95-106: Ensure Terraform module sources use a tag with a version number
(CKV_TF_2)
🔇 Additional comments (8)
tofu/aws/3-application_plane/main.tf (5)
2-21
: LGTM! Well-structured organization and account management.The implementation follows AWS best practices with proper error handling using
try()
and a clear naming convention for S3 buckets.
23-25
: LGTM! Good choice for bucket name uniqueness.Using 4 bytes provides sufficient randomization while keeping bucket names reasonably short.
28-33
: LGTM! Secure provider configuration.The implementation follows AWS security best practices by using role assumption with the organization account access role.
59-70
: Add version tag to the S3 bucket module.The module source should include a version tag for better version control and reproducibility.
- source = "trussworks/s3-private-bucket/aws" + source = "trussworks/s3-private-bucket/aws?ref=v5.3.1"🧰 Tools
🪛 Checkov (3.2.334)
[HIGH] 59-70: Ensure Terraform module sources use a tag with a version number
(CKV_TF_2)
95-106
: Add version tag to the access logs S3 bucket module.The module source should include a version tag for better version control and reproducibility.
- source = "trussworks/s3-private-bucket/aws" + source = "trussworks/s3-private-bucket/aws?ref=v5.3.1"🧰 Tools
🪛 Checkov (3.2.334)
[HIGH] 95-106: Ensure Terraform module sources use a tag with a version number
(CKV_TF_2)
tofu/gcp/observability_stack/control_plane/infra/main.tf (3)
143-170
: Verify the high max node count in public node pool.The public node pool's max_node_count of 220 seems unusually high and could lead to significant costs if autoscaling reaches this limit.
Consider if this is intentionally sized for your workload requirements.
172-208
: LGTM! Storage bucket configuration looks secure and well-optimized.The storage bucket configuration includes:
- Enforced public access prevention
- Appropriate lifecycle rules for cost optimization
- Regional location matching the cluster
87-141
: Review GKE cluster security configurations.Several security-related concerns in the GKE cluster configuration:
- Legacy metadata endpoints are enabled (
disable_legacy_metadata_endpoints = false
)- Network policy is disabled (
network_policy = false
)- Basic security posture mode might not be sufficient for production
❌ Verification inconclusive
🏁 Script executed:
#!/bin/bash # Check if any existing GKE clusters in the organization use enhanced security settings gcloud container clusters list --format="table(name,securityPosture.mode,networkPolicy.enabled)" --filter="status=RUNNING"Length of output: 219
GKE Cluster Security Settings Require Manual Verification
The code snippet indicates that legacy metadata endpoints are enabled (
disable_legacy_metadata_endpoints = false
), network policy is disabled (network_policy = false
), and a basic security posture is set, which are valid security concerns. However, the initial gcloud command could not run (gcloud is not installed in this environment), so we could not automatically verify the runtime settings. Please manually verify the live GKE cluster configurations using an environment where gcloud is available.
variable "project_id" { | ||
type = string | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add description and validation for project_id variable.
The variable lacks a description and validation rules. This information is crucial for users of the module.
variable "project_id" {
type = string
+ description = "The GCP project ID where resources will be created"
+ validation {
+ condition = can(regex("^[a-z][a-z0-9-]{4,28}[a-z0-9]$", var.project_id))
+ error_message = "Project ID must be between 6 and 30 characters, start with a letter, and contain only lowercase letters, numbers, and hyphens."
+ }
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
variable "project_id" { | |
type = string | |
} | |
variable "project_id" { | |
type = string | |
description = "The GCP project ID where resources will be created" | |
validation { | |
condition = can(regex("^[a-z][a-z0-9-]{4,28}[a-z0-9]$", var.project_id)) | |
error_message = "Project ID must be between 6 and 30 characters, start with a letter, and contain only lowercase letters, numbers, and hyphens." | |
} | |
} |
variable "ip_range_subnet" { | ||
type = string | ||
} | ||
|
||
variable "ip_range_pods" { | ||
type = string | ||
} | ||
|
||
variable "ip_range_services" { | ||
type = string | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add descriptions and CIDR validation for IP range variables.
The IP range variables lack descriptions and CIDR validation rules.
variable "ip_range_subnet" {
type = string
+ description = "The primary IP range for the subnet in CIDR notation"
+ validation {
+ condition = can(cidrhost(var.ip_range_subnet, 0))
+ error_message = "The subnet IP range must be a valid CIDR notation."
+ }
}
variable "ip_range_pods" {
type = string
+ description = "The secondary IP range for Kubernetes pods in CIDR notation"
+ validation {
+ condition = can(cidrhost(var.ip_range_pods, 0))
+ error_message = "The pods IP range must be a valid CIDR notation."
+ }
}
variable "ip_range_services" {
type = string
+ description = "The secondary IP range for Kubernetes services in CIDR notation"
+ validation {
+ condition = can(cidrhost(var.ip_range_services, 0))
+ error_message = "The services IP range must be a valid CIDR notation."
+ }
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
variable "ip_range_subnet" { | |
type = string | |
} | |
variable "ip_range_pods" { | |
type = string | |
} | |
variable "ip_range_services" { | |
type = string | |
} | |
variable "ip_range_subnet" { | |
type = string | |
description = "The primary IP range for the subnet in CIDR notation" | |
validation { | |
condition = can(cidrhost(var.ip_range_subnet, 0)) | |
error_message = "The subnet IP range must be a valid CIDR notation." | |
} | |
} | |
variable "ip_range_pods" { | |
type = string | |
description = "The secondary IP range for Kubernetes pods in CIDR notation" | |
validation { | |
condition = can(cidrhost(var.ip_range_pods, 0)) | |
error_message = "The pods IP range must be a valid CIDR notation." | |
} | |
} | |
variable "ip_range_services" { | |
type = string | |
description = "The secondary IP range for Kubernetes services in CIDR notation" | |
validation { | |
condition = can(cidrhost(var.ip_range_services, 0)) | |
error_message = "The services IP range must be a valid CIDR notation." | |
} | |
} |
variable "region" { | ||
type = string | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add description and validation for region variable.
The region variable should include a description and validation to ensure a valid GCP region is provided.
variable "region" {
type = string
+ description = "The GCP region where resources will be created"
+ validation {
+ condition = can(regex("^[a-z]+-[a-z]+-[0-9]$", var.region))
+ error_message = "Region must be a valid GCP region (e.g., us-central1)."
+ }
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
variable "region" { | |
type = string | |
} | |
variable "region" { | |
type = string | |
description = "The GCP region where resources will be created" | |
validation { | |
condition = can(regex("^[a-z]+-[a-z]+-[0-9]$", var.region)) | |
error_message = "Region must be a valid GCP region (e.g., us-central1)." | |
} | |
} |
User description
fix #94
PR Type
Enhancement, Configuration changes, Dependencies, Other
Description
Introduced extensive enhancements to both GCP and AWS infrastructure, including VPC, GKE, S3, and Kubernetes observability configurations.
Added multiple Terraform lock files to ensure consistent provider versions and constraints across various modules.
Configured Terraform backends for both GCS and S3 to manage state files effectively.
Added numerous Grafana dashboards for monitoring Kubernetes clusters, namespaces, and specific applications like FalkorDB.
Introduced Prometheus alerting rules for Kubernetes applications, resources, and Alertmanager monitoring.
Added ArgoCD configurations for deploying VictoriaMetrics and other observability components.
Defined and updated variables and provider configurations for both GCP and AWS modules.
Included a script to automate cluster setup for GCP and AWS, streamlining deployment processes.
Removed outdated workflows and consolidated infrastructure configurations.
Changes walkthrough 📝
5 files
.terraform.lock.hcl
Add Terraform lock file for GCP Kubernetes control plane.
tofu/gcp/observability_stack/control_plane/k8s/.terraform.lock.hcl
Kubernetes, TLS, and GitHub.
.terraform.lock.hcl
Add Terraform lock file for GCP infrastructure.
tofu/gcp/observability_stack/control_plane/infra/.terraform.lock.hcl
.terraform.lock.hcl
Add Terraform lock file for AWS bootstrap.
tofu/aws/1-bootstrap/.terraform.lock.hcl
.terraform.lock.hcl
Add Terraform lock file for AWS application plane.
tofu/aws/3-application_plane/.terraform.lock.hcl
.terraform.lock.hcl
Add Terraform lock file for AWS organization.
tofu/aws/2-org/.terraform.lock.hcl
17 files
main.tf
Configure GCP infrastructure with VPC, GKE, and storage.
tofu/gcp/observability_stack/control_plane/infra/main.tf
ranges.
main.tf
Configure AWS application plane with S3 and policies.
tofu/aws/3-application_plane/main.tf
main.tf
Configure Kubernetes and Helm for GCP observability.
tofu/gcp/observability_stack/control_plane/k8s/main.tf
main.tf
Configure AWS bootstrap with S3 bucket and organizational unit.
tofu/aws/1-bootstrap/main.tf
main.tf
Configure AWS organization with workloads and accounts.
tofu/aws/2-org/main.tf
name.
outputs.tf
Add outputs for GCP infrastructure resources.
tofu/gcp/observability_stack/control_plane/infra/outputs.tf
outputs.tf
Add output for AWS organizational unit ID.
tofu/aws/1-bootstrap/outputs.tf
add_cluster.sh
Add script for automated cluster setup on GCP and AWS.
scripts/add_cluster.sh
prometheus-dashboard.json
Add Grafana dashboard for Prometheus monitoring.
observability/grafana/dashboards/prometheus-dashboard.json
kubernetes-apps.rules.yml
Add Prometheus alerting rules for Kubernetes applications.
observability/rules/kubernetes-apps.rules.yml
issues.
victoriametrics.yaml
Add ArgoCD ApplicationSet for VictoriaMetrics deployment.
argocd/app_plane/dev/victoriametrics.yaml
cluster-overview.json
Added Grafana dashboard for Kubernetes cluster monitoring.
observability/grafana/dashboards/cluster-overview.json
clusters.
Kubernetes resources.
selection.
bargauge panels.
alertmanager.rules.yml
Added Alertmanager monitoring rules for critical alerts.
observability/rules/alertmanager.rules.yml
inconsistencies, and notification failures.
crashlooping.
kubernetes-resources.rules.yml
Added Kubernetes resource monitoring alert rules.
observability/rules/kubernetes-resources.rules.yml
quotas.
podowner.rules.yml
Added pod owner mapping rules for Kubernetes workloads.
observability/rules/podowner.rules.yml
namespace-overview.json
Added Kubernetes Namespace Overview Grafana Dashboard.
observability/grafana/dashboards/namespace-overview.json
resolution.
falkordb-cloud.json
Added FalkorDB Monitoring Grafana Dashboard.
observability/grafana/dashboards/falkordb-cloud.json
statistics.
15 files
variables.tf
Add variables for GCP Kubernetes configuration.
tofu/gcp/observability_stack/control_plane/k8s/variables.tf
settings.
variables.tf
Add variables for AWS application plane configuration.
tofu/aws/3-application_plane/variables.tf
settings.
providers.tf
Add provider configurations for GCP Kubernetes.
tofu/gcp/observability_stack/control_plane/k8s/providers.tf
GitHub.
variables.tf
Add variables for AWS bootstrap configuration.
tofu/aws/1-bootstrap/variables.tf
variables.tf
Add variables for AWS organization configuration.
tofu/aws/2-org/variables.tf
application plane account.
providers.tf
Add provider configuration for GCP infrastructure.
tofu/gcp/observability_stack/control_plane/infra/providers.tf
providers.tf
Add provider configuration for AWS bootstrap.
tofu/aws/1-bootstrap/providers.tf
backend.tf
Add S3 backend configuration for AWS bootstrap.
tofu/aws/1-bootstrap/backend.tf
variables.tf
Add variables for GCP infrastructure configuration.
tofu/gcp/observability_stack/control_plane/infra/variables.tf
backend.tf
Add GCS backend configuration for GCP Kubernetes.
tofu/gcp/observability_stack/control_plane/k8s/backend.tf
backend.tf
Add GCS backend configuration for GCP infrastructure.
tofu/gcp/observability_stack/control_plane/infra/backend.tf
backend.tf
Add S3 backend configuration for AWS application plane.
tofu/aws/3-application_plane/backend.tf
backend.tf
Add S3 backend configuration for AWS organization.
tofu/aws/2-org/backend.tf
victoriametrics.yaml
Added ArgoCD Configuration for VictoriaMetrics Deployment.
argocd/ctrl_plane/dev/victoriametrics.yaml
balancer.
kustomization.yaml
Added Kustomize Configuration for Grafana Dashboards.
observability/grafana/kustomization.yaml
1 files
main.tf
Minor formatting adjustment in GCP bootstrap module.
tofu/gcp/bootstrap/seed_project/main.tf
65 files
Summary by CodeRabbit
New Features
Documentation
Chores