Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KeyVault 429 TooManyRequests led to infinite loop in reconciler workqueue #1483

Open
JasonXD-CS opened this issue Apr 1, 2024 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@JasonXD-CS
Copy link

JasonXD-CS commented Apr 1, 2024

What steps did you take and what happened:
We have an Azure Key Vault with average requests lower than the KeyVault throttling limit.
and recently ran into outage when started using CSI Driver with auto rotation with 3 hours interval.
We had regular scale up at peak hours and that triggered Key Vault throttling and continue to be throttling for hours until we disable auto rotation.
As a result, none of the services could be created as it's stuck in ContainerCreating state and we had to revert back to use KeyVaultAgent.

After investigation we discovered a few issue with reconciler implementation:

  1. Current auto rotation design is inefficient and not scalable as it's rotating secrets per pod, which makes a lot of unnecessary requests. If a deployment running 1000 replicas is downloading 10 secrets each. The amount of extra requests made is 10000 vs 10 if rotating per deployment.

  2. According to this post Understanding how many calls are made to KeyVault?

"In case of rotation, the list of all existing pods is generated and the contents in the mount are rotated linearly. A workqueue is created in the driver, all 10k pods added to queue and then the queue is processed"

However, workqueue does not handle 429 and no exponential backoff.

newObjectVersions, errorReason, err := secretsstore.MountContent(ctx, providerClient, string(paramsJSON), string(secretsJSON), spcps.Status.TargetPath, string(permissionJSON), oldObjectVersions)
if err != nil {
r.generateEvent(pod, corev1.EventTypeWarning, mountRotationFailedReason, fmt.Sprintf("provider mount err: %+v", err))
return fmt.Errorf("failed to rotate objects for pod %s/%s, err: %w", spcps.Namespace, spcps.Status.PodName, err)
}

So each task in the queue doesn't know anything about 429 throttling and it will just continue to process these requests in the workqueue without backoff. So it doesn't give any time for Key Vault recover as it continues to make these requests.
This is amplified when there are thousands of nodes pulling from the same KeyVault.

  1. It requeues the failed request back to the workqueue after 10 seconds delay with no maxRequeueLimit. So that creates an infinite loop if KeyVault could not recover from throttling
    https://github.com/kubernetes-sigs/secrets-store-csi-driver/blob/bf86dbf98ad3e32a0f55e52d6a411abd6784f7fb/pkg/rotation/reconciler.go#L242C1-L243C1

What did you expect to happen:

  1. KeyVault throttling 429 TooManyRequest should be properly handled in the reconciler.
  2. Exponential backoff for rotation requests processing.

Anything else you would like to add:
Regarding #1, the current design is not scalable.
Would love to hear from the team what's the plan for optimizing this going forward.

Also, does the polling interval starts counting when all rotation requests are processed? Or it starts as soon as List SPCPC is invoked?
If it starts as soon as List SPCPC is invoked and add these task to workqueue, Does that mean new iteration will add more rotation task to the workqueue despite previous iteration didn't finish them and the queue just continue to pile up?

Which provider are you using:
Azure Key Vault.

Environment:

  • Secrets Store CSI Driver version: (use the image tag): v1.4.1
  • Kubernetes version: (use kubectl version): v1.27.9
@JasonXD-CS JasonXD-CS added the kind/bug Categorizes issue or PR as related to a bug. label Apr 1, 2024
@JasonXD-CS JasonXD-CS changed the title KeyVault 429 TooManyRequests led to infinite loop in workqueue without exponential backoff KeyVault 429 TooManyRequests led to infinite loop in workqueue Apr 1, 2024
@JasonXD-CS JasonXD-CS changed the title KeyVault 429 TooManyRequests led to infinite loop in workqueue KeyVault 429 TooManyRequests led to infinite loop in reconciler workqueue Apr 3, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 7, 2024
@Gil-Tohar-Forter
Copy link

Can this be prioritized, it doesn't make sense that a request is made to Vault for each pod, it should really be one per deployment, since all the pods for a given deployment are using the same secret.

With large scale services, sercret rotation becomes an unusable feature in fear of 429s from the secret provider or simply crashing the secret provider

@aramase
Copy link
Member

aramase commented Jul 25, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 25, 2024
@hakman
Copy link
Member

hakman commented Aug 30, 2024

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Aug 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.
Projects
None yet
Development

No branches or pull requests

6 participants