Client.Timeout exceeded (30s) on validation webhooks when updating Ingress objects #11255
This issue is currently awaiting triage. If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Is it possible for you to get logs and events from those 20-30 seconds?
/remove-kind bug
Excluding lines containing:
I'm left with these for the time period mentioned above + 15 sec (09:29:26 - 09:29:59):
But for completeness' sake, here is the full dump (redacted for business info, and with all request logs excluded): nginx-logs-ing-update-2024-04-15 09 33 02.414.txt
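For reference, the filtering was done along these lines; the exclusion patterns and deployment name below are placeholders, since the actual list is not included in the thread:

```shell
# Placeholder patterns and names; adjust to the real exclusion list.
kubectl logs -n ingress deploy/ingress-nginx-controller --since=15m \
  | grep -vE 'pattern-one|pattern-two' \
  > nginx-logs-filtered.txt
```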
There are lots of SSL errors, and if you are saying that you submit lots of new Ingress JSON to the api-server while the controller pod needs to reconcile & sync under a resource crunch, then that could explain the stats you are seeing.
No, we have fairly low utilization on all nodes. These are the three nodes at the same point in time (marked with the arrows) as the previously mentioned test. CPU to the left, memory to the right.
Yes, full deployment manifest here.
Events that make nginx sync data can be seen here; my test from earlier is marked, a period of low «sync volume». Yes, there is an unfortunate number of SSL errors. Zooming out, these are rather constant, and I don't feel like they should affect the validation of my config to the point where it takes 20s. Following the same argument, there should be periods of low resource crunch where we see validation durations of only a few seconds, but we don't. I can try to eliminate the SSL errors before going further, to rule out this suspicion.
I want to give enough time to this so we can get to the root cause. If you ping me on Slack, we can see how to do that.
We had a video chat and ran through some config and settings to validate the «normal stuff». We SSHed into the node of the master nginx pod to look for clues in the dmesg logs, especially anything related to connection tracking.
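For anyone following along, the checks were roughly of this shape (the exact commands were not recorded in the thread):

```shell
# Look for connection-tracking trouble in the kernel log on the node
# that runs the controller pod.
dmesg -T | grep -iE 'conntrack|nf_conntrack'

# Compare the current number of conntrack entries against the table limit.
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
```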
I tried packet dumping on our nginx pods, and it seems like all the time is spent inside nginx. The TCP and TLS handshakes happen immediately, and then there is a 7s hold-up waiting for a response from nginx on the admission webhook port 8443. This tells me this is not a network problem.
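A rough sketch of such a capture, assuming access to the node or a debug container sharing the pod's network namespace:

```shell
# Capture only admission webhook traffic (port 8443) so the gap between the
# TLS handshake and the HTTP response is visible in the packet timestamps.
tcpdump -i any -nn 'tcp port 8443' -w admission-8443.pcap
```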
Thanks. This data is proof that when you say "network", you really mean … But as I hinted on the video call, the truly useful information for this is to know if a test like …
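The suggested test is truncated above; one test that exercises mostly the admission path, assuming the webhook is registered with sideEffects: None, is a server-side dry-run (the manifest name is a placeholder):

```shell
# A server-side dry-run still invokes validating admission webhooks, so the
# elapsed time is dominated by the webhook round-trip rather than any reload.
time kubectl apply --dry-run=server -f my-ingress.yaml
```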
Before going deeper down the network route, I came across a scent of what I think is the smoking gun: adding lots of paths seems to correlate well with the validation times we are seeing. More path items within one of our ingresses produce longer validation times. If paths are spread out over multiple ingresses it seems to be okay, but cramming many paths into one ingress seems to make things explode. Our …
If I pick one of our ingresses, not even a special one, just one of them, edit a label and apply it – I see the usual 7s wait in our …
Here I go from 2 paths:
To 21 paths:
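For illustration only (this is not the author's actual manifest; all names are made up), an Ingress with many paths under a single host looks roughly like this:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-many-paths
spec:
  ingressClassName: nginx
  rules:
    - host: example.invalid
      http:
        paths:
          - path: /app-1
            pathType: Prefix
            backend:
              service:
                name: app-1
                port:
                  number: 80
          - path: /app-2
            pathType: Prefix
            backend:
              service:
                name: app-2
                port:
                  number: 80
          # ...and so on, up to the 21 path entries used in the test above
```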
The validation webhook becomes super slow:
Adding more paths makes it go over Kubernetes' admission webhook max timeout of 30s, at which point the apply fails with the Client.Timeout error. With the object above applied, you can see how other ingress updates to the same nginx server are also affected, presumably because the validation webhook makes nginx render all ingress objects and validate the full config each time (rightfully so):
In total, this does not generate much more config; it is just a 2% increase in nginx location directives:
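One way to get such a count, assuming the controller deployment name and the "ingress" namespace used in this cluster, is to dump the rendered config and count location blocks before and after the change:

```shell
# Rough count: nginx -T prints the fully rendered configuration,
# and each location block contributes one "location " line.
kubectl exec -n ingress deploy/ingress-nginx-controller -- nginx -T 2>/dev/null \
  | grep -c 'location '
```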
The full ingress manifests, and the diff between them. I can try to put together steps to reproduce this locally.
Awesome data.
I've figured out where this comes from! The added latency on the admission controller webhook is triggered by the combination of the annotation … and many paths in the ingress.
For our production cluster, we run 270 ingresses, and 4 of these had this annotation; most of them also had a good handful of paths. Removing the annotation mitigated our problem. I've also reproduced this locally; if you are interested, I can share the config, commands and manifests. Moving this (and the corresponding modsecurity flag) to the main configmap works fine, though; it does not add any additional latency. We did not do that, however, as we have a WAF in front of our nginx-ingress that does the same job.
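The exact annotation name is truncated above, but since the comment ties it to the ModSecurity settings, the global ConfigMap equivalent would look roughly like this (the ConfigMap name and namespace are assumptions about this install):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress
data:
  # Enables ModSecurity (and the OWASP core rule set) once, globally,
  # instead of per Ingress via annotations.
  enable-modsecurity: "true"
  enable-owasp-modsecurity-crs: "true"
```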
Profit. 🌮
Thank you very much for the update. Awesome work. Love this kind of result :-) Cheers. If this is no longer an issue, kindly consider closing it. Once again, awesome, and thanks a lot.
I do think I'd classify this as a regression bug; while more config obviously adds validation time, the amount of validation time we've seen here is unreasonably high compared to the amount of config that actually gets added. I'm thinking I can try to profile where the time is spent, and maybe that can be used as input for some future optimizations. I'll close this if that does not yield any good results, or if no one else picks this up within a reasonable time. Thanks for your help and guidance, @longwuyuan.
There is painful & intensive work in progress that relates to this kind of circumstance. One such effort is a PR trying to split the data-plane away from the control-plane. That paradigm shift introduces improved security, performance and, hopefully, easier management. If you provide the data that you hinted at, it can only help the developers' decisions. The action item arising from that data is not something I can comment on, because the complications are too much to summarise in writing. Thank you very much for your info.
This is stale, but we won't close it automatically; just bear in mind that the maintainers may be busy with other tasks and will reach your issue ASAP. If you have any question or request to prioritize this, please reach out.
Seeing this same issue in our clusters; I assume setting these globally is a decent solution? This is directly related and is only solved by messing with readiness and liveness probes: #11689
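Purely as a sketch of the probe tuning mentioned here (these are hypothetical values, not the change from #11689; the /healthz path and port 10254 are the controller's defaults):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 10254
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 5
readinessProbe:
  httpGet:
    path: /healthz
    port: 10254
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```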
Excuse me, does your nginx use remote storage, like NFS?
What happened:
We continue to hit the (max) timeout on our validation webhook when applying ingress manifests.
failed to call webhook: Post "https://ingress-nginx-controller-admission.ingress.svc:443/networking/v1/ingresses?timeout=30s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
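The timeout=30s in the URL reflects the webhook's configured timeoutSeconds, which Kubernetes caps at 30. It can be inspected with something like this (the webhook configuration name is the ingress-nginx default and may differ):

```shell
kubectl get validatingwebhookconfiguration ingress-nginx-admission \
  -o jsonpath='{.webhooks[0].timeoutSeconds}{"\n"}'
```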
It is consistently high, around the 20s mark, while general load or ingress applies in quick succession can push it to 30s, where deploy pipelines start to fail.
The above image shows a graph of validation time, a metric exposed by nginx itself, over 24 hours earlier this week.
This is me adding a label, to illustrate one simple update:
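Roughly the kind of update being timed here (namespace and ingress name are placeholders):

```shell
# A trivial label change, so almost all elapsed time is the admission webhook.
time kubectl -n my-namespace label ingress my-ingress debug=timing --overwrite
```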
This is in a medium-sized cluster (… server_names, 4778 locations).
Request rate
Over the same time period as above.
Performance of pods
To comment on this: it looks and feels quite bearable. Spikes in CPU are assumed to be nginx reloads and validation runs. Over the same time period as above.
90 days trends:
The image above shows the number of ingresses over the last 90 days.
The image above shows the validation webhook duration over the last 90 days. This mostly supports an organic growth of sorts, except for the quick change marked in the picture above; this has been tracked down to 10 ingresses (serving the same host) that changed from 1 host to 3, so the collection of ~60 paths over 1 host became ~180 over 3 hosts.
See an example of such an ingress post-change.
What you expected to happen:
I've seen people mention far better performance than 20-30s on their validation webhook in other issues around here, and that with larger clusters and larger nginx config files. So my expectation would be in the 1-5s range.
This PR will probably help us in cases where multiple ingresses get applied at the same time, but one or a few single applies should probably not take 20s?
NGINX Ingress controller version: nginx/1.21.6, release v1.9.5
Kubernetes version (use kubectl version):
torvald@surdeig ~ $ kubectl version --short
Client Version: v1.25.0
Kustomize Version: v4.5.7
Server Version: v1.27.10-gke.1055000
Environment:
uname -a: 5.15.133+
kubectl get -n ingress all -o wide
See an example of an ingress, the same as mentioned above in the «What happened» section.
How to reproduce this issue:
I think this would be unfeasible, but I'm happy to assist with more details.