KeyVault 429 TooManyRequests led to infinite loop in reconciler workqueue #1483
Labels
kind/bug
lifecycle/frozen
What steps did you take and what happened:
We have an Azure Key Vault whose average request rate is below the Key Vault throttling limit.
We recently ran into an outage after we started using the CSI Driver with auto rotation at a 3-hour interval.
A regular scale-up at peak hours triggered Key Vault throttling, and the throttling continued for hours until we disabled auto rotation.
As a result, no services could be created because their pods were stuck in the ContainerCreating state, and we had to revert to KeyVaultAgent.
After investigating, we found a few issues with the reconciler implementation:
1. The current auto rotation design is inefficient and does not scale because it rotates secrets per pod, which generates many unnecessary requests. If a deployment runs 1000 replicas that each download 10 secrets, each rotation makes 10000 requests, versus 10 if rotation were done per deployment.
2. See the post "Understanding how many calls are made to KeyVault?". However, the workqueue does not handle 429 responses and applies no exponential backoff.
secrets-store-csi-driver/pkg/rotation/reconciler.go
Lines 401 to 405 in bf86dbf
Each task in the queue knows nothing about 429 throttling, so the reconciler keeps processing requests from the workqueue without backing off. Because the requests never stop, Key Vault gets no time to recover.
This is amplified when there are thousands of nodes pulling from the same KeyVault.
https://github.com/kubernetes-sigs/secrets-store-csi-driver/blob/bf86dbf98ad3e32a0f55e52d6a411abd6784f7fb/pkg/rotation/reconciler.go#L242C1-L243C1
What did you expect to happen:
Anything else you would like to add:
Regarding #1, the current design is not scalable.
We would love to hear from the team what the plan is for optimizing this going forward.
Also, does the polling interval start counting once all rotation requests have been processed, or does it start as soon as List SPCPC is invoked?
If it starts as soon as List SPCPC is invoked and the tasks are added to the workqueue, does that mean a new iteration adds more rotation tasks to the workqueue even though the previous iteration has not finished, so the queue just keeps piling up?
Which provider are you using:
Azure Key Vault.
Environment:
Kubernetes version (`kubectl version`): v1.27.9