[Bug]: Jaeger es-index-cleanup fails on parallel execution #6497

sfudeus · 2025-01-07T09:40:04Z

What happened?

As a jaeger operator, I want to have the elasticsearch-indexes cleaned up stably and reliably, even if shared across multiple clusters.

Steps to reproduce

Deploy the index-cleanup CronJob in multiple clusters that share one Elasticsearch backend.
Because the Helm charts use the same CronJob definition everywhere, all jobs start at the same time.
Some job instances fail on their first run: they first gather a list of indices to delete, but the delete then fails because an index they want to clean up no longer exists.
The K8s Job itself succeeds, because it spawns another pod whose index listing no longer contains any delete candidates. Still, one pod has failed, which can trigger alerts.
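The list-then-delete race in the steps above can be reproduced without a real cluster. The following is a minimal sketch: `FakeCluster` and `run_race` are hypothetical names, and the in-memory class only imitates the strict "fail if any index is missing" behavior seen in the log; it is not the es-index-cleaner code.

```python
class FakeCluster:
    """In-memory stand-in for a shared Elasticsearch cluster (hypothetical)."""

    def __init__(self, indices):
        self.indices = set(indices)

    def list_indices(self):
        return sorted(self.indices)

    def delete_indices(self, names):
        # Imitate the strict behavior from the log: a missing index fails the call.
        missing = [name for name in names if name not in self.indices]
        if missing:
            raise LookupError(f"no such index [{missing[0]}]")
        self.indices.difference_update(names)


def run_race():
    """Two cleaners list candidates before either one deletes them."""
    cluster = FakeCluster(["jaeger-span-2024-10-25", "jaeger-service-2024-10-25"])
    candidates_a = cluster.list_indices()  # cleaner in cluster A
    candidates_b = cluster.list_indices()  # cleaner in cluster B, same snapshot
    cluster.delete_indices(candidates_a)   # A wins the race
    try:
        cluster.delete_indices(candidates_b)  # B works from a stale list
    except LookupError as err:
        return str(err)
    return None
```

Calling `run_race()` returns the error message of the losing cleaner, which corresponds to the failing pod described above.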

Expected behavior

I'd expect the cleanup job not to rely solely on the retry primitives of a K8s Job. Its initial run should cope with an index that has already been deleted, since deletion is an idempotent operation anyway.
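One way to picture the expected behavior is a response check that counts "index not found" as success. This is only a sketch under assumptions: `delete_succeeded` is a hypothetical helper, and it relies on the error body having the shape shown in the log output of this issue.

```python
import json


def delete_succeeded(status_code, body):
    """Return True if a delete response should count as success.

    A 404 whose error type is index_not_found_exception is treated as
    success, because the index is already gone (assumed body shape).
    """
    if 200 <= status_code < 300:
        return True
    if status_code != 404:
        return False
    try:
        error = json.loads(body).get("error", {})
    except (ValueError, AttributeError):
        # Not JSON, or not a JSON object: do not guess, report failure.
        return False
    return isinstance(error, dict) and error.get("type") == "index_not_found_exception"
```

Note that a 404 on a multi-index delete may mean the other indices in the request were not deleted either, so a real implementation might need to retry with the indices that still exist, or ask Elasticsearch to skip missing ones.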

Relevant log output

{"level":"info","ts":1730501701.360364,"caller":"es-index-cleaner/main.go:89","msg":"Indices before this date will be deleted","date":"2024-10-26T00:00:00Z"}
{"level":"info","ts":1730501701.360454,"caller":"es-index-cleaner/main.go:98","msg":"Queried indices","indices":[{"Index":"jaeger-service-2024-10-26","CreationTime":"2024-10-26T00:00:02.651Z","Aliases":{}},{"Index":"jaeger-service-2024-10-28","CreationTime":"2024-10-28T00:00:00.517Z","Aliases":{}},{"Index":"jaeger-span-2024-10-31","CreationTime":"2024-10-31T00:00:00.55Z","Aliases":{}},{"Index":"jaeger-service-2024-10-31","CreationTime":"2024-10-31T00:00:00.155Z","Aliases":{}},{"Index":"jaeger-span-2024-10-27","CreationTime":"2024-10-27T00:00:00.319Z","Aliases":{}},{"Index":"jaeger-span-2024-10-29","CreationTime":"2024-10-29T00:00:00.237Z","Aliases":{}},{"Index":"jaeger-service-2024-10-27","CreationTime":"2024-10-27T00:00:02.093Z","Aliases":{}},{"Index":"jaeger-service-2024-10-29","CreationTime":"2024-10-29T00:00:03.637Z","Aliases":{}},{"Index":"jaeger-span-2024-10-25","CreationTime":"2024-10-25T00:00:00.24Z","Aliases":{}},{"Index":"jaeger-span-2024-10-28","CreationTime":"2024-10-28T00:00:00.229Z","Aliases":{}},{"Index":"jaeger-span-2024-10-30","CreationTime":"2024-10-30T00:00:00.231Z","Aliases":{}},{"Index":"jaeger-span-2024-11-01","CreationTime":"2024-11-01T00:00:00.142Z","Aliases":{}},{"Index":"jaeger-service-2024-10-25","CreationTime":"2024-10-25T00:00:02.468Z","Aliases":{}},{"Index":"jaeger-service-2024-10-30","CreationTime":"2024-10-30T00:00:01.286Z","Aliases":{}},{"Index":"jaeger-service-2024-11-01","CreationTime":"2024-11-01T00:00:01.143Z","Aliases":{}},{"Index":"jaeger-span-2024-10-26","CreationTime":"2024-10-26T00:00:00.364Z","Aliases":{}}]}
{"level":"info","ts":1730501701.360637,"caller":"es-index-cleaner/main.go:105","msg":"Deleting indices","indices":[{"Index":"jaeger-span-2024-10-25","CreationTime":"2024-10-25T00:00:00.24Z","Aliases":{}},{"Index":"jaeger-service-2024-10-25","CreationTime":"2024-10-25T00:00:02.468Z","Aliases":{}}]}
Error: failed to delete indices: jaeger-span-2024-10-25,jaeger-service-2024-10-25,, request failed, status code: 404, body: {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [jaeger-service-2024-10-25]","index_uuid":"56pWYj7FRgiU-YimsljoYg","index":"jaeger-service-2024-10-25"}],"type":"index_not_found_exception","reason":"no such index [jaeger-service-2024-10-25]","index_uuid":"56pWYj7FRgiU-YimsljoYg","index":"jaeger-service-2024-10-25"},"status":404}
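For readers following the log, the cutoff date and the two delete candidates can be reconstructed as follows. This is a sketch, assuming the cutoff is "tomorrow's midnight in UTC minus the retention in days" and that indices are selected by creation time; both assumptions are consistent with the log lines above, but the helper names `cutoff` and `to_delete` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def cutoff(now, days):
    """Cutoff under the stated assumption; `now` must be timezone-aware."""
    now_utc = now.astimezone(timezone.utc)
    today_midnight = datetime(now_utc.year, now_utc.month, now_utc.day, tzinfo=timezone.utc)
    return today_midnight + timedelta(days=1) - timedelta(days=days)


def to_delete(indices, before):
    """Select (name, creation_time) pairs created strictly before the cutoff."""
    return [name for name, created in indices if created < before]


# Timestamp of the first log line and the retention argument "7" from the CronJob.
run_time = datetime.fromtimestamp(1730501701.360364, tz=timezone.utc)
before = cutoff(run_time, 7)

# A few of the indices from the "Queried indices" log line.
sample = [
    ("jaeger-span-2024-10-25", datetime(2024, 10, 25, 0, 0, 0, 240000, tzinfo=timezone.utc)),
    ("jaeger-service-2024-10-25", datetime(2024, 10, 25, 0, 0, 2, 468000, tzinfo=timezone.utc)),
    ("jaeger-span-2024-10-26", datetime(2024, 10, 26, 0, 0, 0, 364000, tzinfo=timezone.utc)),
]
candidates = to_delete(sample, before)
```

With these inputs, `before` equals 2024-10-26T00:00:00Z and `candidates` contains only the two 2024-10-25 indices, which matches the "Deleting indices" line.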

Screenshot

No response

Additional context

No response

Jaeger backend version

v1.64.0

SDK

No response

Pipeline

No response

Storage backend

Elasticsearch v7.17.26

Operating system

Linux

Deployment model

Kubernetes

Deployment configs

apiVersion: batch/v1
kind: CronJob
...
spec:
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      parallelism: 1
      template:
        metadata:
          annotations:
            prometheus.io/scrape: "false"
            sidecar.istio.io/inject: "false"
          creationTimestamp: null
          labels:
            app: jaeger
            app.kubernetes.io/component: cronjob-es-index-cleaner
            app.kubernetes.io/instance: jaeger
            app.kubernetes.io/managed-by: jaeger-operator
            app.kubernetes.io/name: jaeger-es-index-cleaner
            app.kubernetes.io/part-of: jaeger
        spec:
          containers:
          - args:
            - "7"
            - $my-es-hostname
            env:
            - name: ES_TLS_ENABLED
              value: "true"
            - name: ES_TLS_CA
              value: /es/certificates/ca-certificates.crt
            envFrom:
            - secretRef:
                name: jaeger-elasticsearch
            image: docker.io/jaegertracing/jaeger-es-index-cleaner:1.56.0
            imagePullPolicy: IfNotPresent
            name: jaeger-es-index-cleaner
            resources:
              limits:
                cpu: "1"
                ephemeral-storage: 250Mi
                memory: 1Gi
              requests:
                cpu: 250m
                ephemeral-storage: 250Mi
                memory: 256Mi
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /es/certificates/
              name: certificates
              readOnly: true
          dnsPolicy: ClusterFirst
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext: {}
          serviceAccount: jaeger
          serviceAccountName: jaeger
          terminationGracePeriodSeconds: 30
          volumes:
          - configMap:
              defaultMode: 420
              name: jaeger-cabundle
            name: certificates
  schedule: 55 23 * * *
  successfulJobsHistoryLimit: 3
  suspend: false

Fix for bug jaegertracing#6497. Add query param ignore_unavailable so that the index client does not error when deleting already deleted indexes. Signed-off-by: Shreyas Kirtane <[email protected]>

skirtan1 · 2025-01-07T20:50:05Z

ES has an optional flag to ignore missing indexes. I added an option to es-index-cleanup in #6502 so that this flag can be set. This should resolve the problem.
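To illustrate the flag itself: the Elasticsearch delete index API accepts an `ignore_unavailable` query parameter. The sketch below only builds such a request URL; `delete_indices_url` and the host name are hypothetical, and this is not the code from #6502.

```python
from urllib.parse import quote, urlencode


def delete_indices_url(base_url, indices, ignore_unavailable=True):
    """Build the URL for a DELETE request covering several indices."""
    # Index names are joined with commas, as in the failing request in the log.
    path = ",".join(quote(name, safe="") for name in indices)
    query = urlencode({"ignore_unavailable": str(ignore_unavailable).lower()})
    return f"{base_url.rstrip('/')}/{path}?{query}"


url = delete_indices_url(
    "https://es.example:9200",  # placeholder host, not from the issue
    ["jaeger-span-2024-10-25", "jaeger-service-2024-10-25"],
)
```

Sending a DELETE to a URL of this form asks Elasticsearch to skip indices that no longer exist instead of failing the request.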

sfudeus added the bug label Jan 7, 2025

dosubot bot added area/storage storage/elasticsearch labels Jan 7, 2025

skirtan1 linked a pull request Jan 7, 2025 that will close this issue

fix: Add flag to ignore missing index #6502

Open

4 tasks
