[Bug]: Jaeger es-index-cleanup fails on parallel execution #6497

Open
sfudeus opened this issue Jan 7, 2025 · 1 comment · May be fixed by #6502

sfudeus commented Jan 7, 2025

What happened?

As a Jaeger operator, I want the Elasticsearch indices to be cleaned up reliably, even when the Elasticsearch backend is shared across multiple clusters.

Steps to reproduce

  1. Deploy the index-cleanup cronjob across multiple clusters
  2. Because of the same cronjob definition in the helm charts, all jobs start at the same time
  3. Some job instances fail on their first run: each one first gathers a list of indices to delete, but the delete request then fails because an index on that list no longer exists
  4. The K8s Job itself succeeds, because it spawns another pod, which succeeds since the index listing no longer contains any delete candidates. The failed pod still exists, however, and can trigger alerts

Expected behavior

I'd expect the cleanup not to rely solely on the retry primitives of a K8s Job. The initial run of a cleanup job should itself cope with an index that has already been deleted, since deletion is an idempotent operation anyway.

Relevant log output

{"level":"info","ts":1730501701.360364,"caller":"es-index-cleaner/main.go:89","msg":"Indices before this date will be deleted","date":"2024-10-26T00:00:00Z"}
{"level":"info","ts":1730501701.360454,"caller":"es-index-cleaner/main.go:98","msg":"Queried indices","indices":[{"Index":"jaeger-service-2024-10-26","CreationTime":"2024-10-26T00:00:02.651Z","Aliases":{}},{"Index":"jaeger-service-2024-10-28","CreationTime":"2024-10-28T00:00:00.517Z","Aliases":{}},{"Index":"jaeger-span-2024-10-31","CreationTime":"2024-10-31T00:00:00.55Z","Aliases":{}},{"Index":"jaeger-service-2024-10-31","CreationTime":"2024-10-31T00:00:00.155Z","Aliases":{}},{"Index":"jaeger-span-2024-10-27","CreationTime":"2024-10-27T00:00:00.319Z","Aliases":{}},{"Index":"jaeger-span-2024-10-29","CreationTime":"2024-10-29T00:00:00.237Z","Aliases":{}},{"Index":"jaeger-service-2024-10-27","CreationTime":"2024-10-27T00:00:02.093Z","Aliases":{}},{"Index":"jaeger-service-2024-10-29","CreationTime":"2024-10-29T00:00:03.637Z","Aliases":{}},{"Index":"jaeger-span-2024-10-25","CreationTime":"2024-10-25T00:00:00.24Z","Aliases":{}},{"Index":"jaeger-span-2024-10-28","CreationTime":"2024-10-28T00:00:00.229Z","Aliases":{}},{"Index":"jaeger-span-2024-10-30","CreationTime":"2024-10-30T00:00:00.231Z","Aliases":{}},{"Index":"jaeger-span-2024-11-01","CreationTime":"2024-11-01T00:00:00.142Z","Aliases":{}},{"Index":"jaeger-service-2024-10-25","CreationTime":"2024-10-25T00:00:02.468Z","Aliases":{}},{"Index":"jaeger-service-2024-10-30","CreationTime":"2024-10-30T00:00:01.286Z","Aliases":{}},{"Index":"jaeger-service-2024-11-01","CreationTime":"2024-11-01T00:00:01.143Z","Aliases":{}},{"Index":"jaeger-span-2024-10-26","CreationTime":"2024-10-26T00:00:00.364Z","Aliases":{}}]}
{"level":"info","ts":1730501701.360637,"caller":"es-index-cleaner/main.go:105","msg":"Deleting indices","indices":[{"Index":"jaeger-span-2024-10-25","CreationTime":"2024-10-25T00:00:00.24Z","Aliases":{}},{"Index":"jaeger-service-2024-10-25","CreationTime":"2024-10-25T00:00:02.468Z","Aliases":{}}]}
Error: failed to delete indices: jaeger-span-2024-10-25,jaeger-service-2024-10-25,, request failed, status code: 404, body: {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [jaeger-service-2024-10-25]","index_uuid":"56pWYj7FRgiU-YimsljoYg","index":"jaeger-service-2024-10-25"}],"type":"index_not_found_exception","reason":"no such index [jaeger-service-2024-10-25]","index_uuid":"56pWYj7FRgiU-YimsljoYg","index":"jaeger-service-2024-10-25"},"status":404}

Screenshot

No response

Additional context

No response

Jaeger backend version

v1.64.0

SDK

No response

Pipeline

No response

Storage backend

Elasticsearch v7.17.26

Operating system

Linux

Deployment model

Kubernetes

Deployment configs

apiVersion: batch/v1
kind: CronJob
...
spec:
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      parallelism: 1
      template:
        metadata:
          annotations:
            prometheus.io/scrape: "false"
            sidecar.istio.io/inject: "false"
          creationTimestamp: null
          labels:
            app: jaeger
            app.kubernetes.io/component: cronjob-es-index-cleaner
            app.kubernetes.io/instance: jaeger
            app.kubernetes.io/managed-by: jaeger-operator
            app.kubernetes.io/name: jaeger-es-index-cleaner
            app.kubernetes.io/part-of: jaeger
        spec:
          containers:
          - args:
            - "7"
            - $my-es-hostname
            env:
            - name: ES_TLS_ENABLED
              value: "true"
            - name: ES_TLS_CA
              value: /es/certificates/ca-certificates.crt
            envFrom:
            - secretRef:
                name: jaeger-elasticsearch
            image: docker.io/jaegertracing/jaeger-es-index-cleaner:1.56.0
            imagePullPolicy: IfNotPresent
            name: jaeger-es-index-cleaner
            resources:
              limits:
                cpu: "1"
                ephemeral-storage: 250Mi
                memory: 1Gi
              requests:
                cpu: 250m
                ephemeral-storage: 250Mi
                memory: 256Mi
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /es/certificates/
              name: certificates
              readOnly: true
          dnsPolicy: ClusterFirst
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext: {}
          serviceAccount: jaeger
          serviceAccountName: jaeger
          terminationGracePeriodSeconds: 30
          volumes:
          - configMap:
              defaultMode: 420
              name: jaeger-cabundle
            name: certificates
  schedule: 55 23 * * *
  successfulJobsHistoryLimit: 3
  suspend: false

sfudeus added the bug label Jan 7, 2025
skirtan1 added a commit to skirtan1/jaeger that referenced this issue Jan 7, 2025
Fix for bug jaegertracing#6497. Add the query param ignore_unavailable
so that the index client does not error when deleting
already-deleted indices.

Signed-off-by: Shreyas Kirtane <[email protected]>
@skirtan1 skirtan1 linked a pull request Jan 7, 2025 that will close this issue
skirtan1 commented Jan 7, 2025

Elasticsearch has an optional query parameter, ignore_unavailable, that makes it ignore missing indices. #6502 adds an option to es-index-cleaner so that this parameter can be set, which should resolve the problem.
