Load balancer IP cleared from all ingresses when upgrading nginx-ingress-controller #11087
@alfredkrohmer is this on a cloud-provider or on baremetal + metallb etc.?

/remove-kind bug
/triage needs-information
This is on OCI (OKE) with a Network Load Balancer as a service, but as I pointed out with the code references, this seems to be generic behavior that triggers when a new version of the controller is rolled out with a different label set than the old pods.
I saw it happen on minikube with metallb as well. Just like another issue, this also impacts the external-dns use case.

/triage accepted

I think it will be interesting to see if it happens when you set the LoadBalancer IP.
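A minimal sketch of what testing that via the chart values could look like (the address is a placeholder, and the key path assumes the standard ingress-nginx chart layout):

```yaml
# values.yaml — sketch for testing a statically assigned LB IP;
# 203.0.113.10 is a placeholder and must be an address your cloud
# provider can actually assign.
controller:
  service:
    loadBalancerIP: 203.0.113.10
```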
OKE does support that field, but only for public IPs, and unfortunately we don't have test clusters with public subnets available 🙁
Not sure what the right approach to fix this would be. I can think of a couple of options:
Update: We realised we inject …

I can add a +1 to this; we see this often, even when not updating, e.g. we see the same symptoms every time the controller pod is terminated.

We run 2 replicas of the nginx-ingress-controller. We don't use Helm, we use the direct k8s manifest import. I would expect that if one replica is terminated, as long as there is another replica that becomes leader, it should not update the statuses?

Update: the leader election errors seem to be irrelevant.
If the DNS is queried at exactly the wrong moment (while no record exists), the negative cache TTL can extend the outage beyond the actual recovery, which I think makes this worse.
Any idea / opinion on how to fix this?
@alfredkrohmer any chance you can test setting the LoadBalancerIP in the values file and report back with the behaviour?
This doesn't really work for us, as we don't pick the load balancer IP statically. It comes from the status of the service, for which a backing load balancer in the cloud is created.
/assign

/kind feature
Any news regarding this issue?
Oh wow, not sure why I haven't seen this before, but there is actually a flag to not remove the ingress status on shutdown 🤦

ingress-nginx/pkg/flags/flags.go, line 138 in 886956e

Now I wonder if this should be set to false by default in the Helm chart.
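For anyone needing a workaround right away: assuming the flag at that line is `--update-status-on-shutdown` (its name in current releases, default `true`), it can be turned off through the chart's generic `controller.extraArgs` mechanism. A minimal sketch:

```yaml
# values.yaml — sketch of a workaround, assuming the flag referenced
# above is --update-status-on-shutdown; extraArgs entries are rendered
# as --key=value flags on the controller container.
controller:
  extraArgs:
    update-status-on-shutdown: "false"
```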
We have the same issue when upgrading to 4.10.x or 4.11.x versions.
Hello there, I'm still facing this issue with multiple nginx ingress controller versions. I have tried the below ingress-nginx helm chart versions on an AWS EKS cluster (version 1.30). It shows the below errors. I have tried to change the permissions/scope of the ClusterRole and given it [create, get & update], but no luck. Any updates regarding this issue? Thanks in advance.
There is a deprecated field you can try to use, https://kubernetes.io/docs/concepts/services-networking/service/#external-ips, but it depends on the cloud-provider too. The values file key:value pair is here.
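For reference, that key in the chart's values would look roughly like this (the IP is a placeholder, and `controller.service.externalIPs` assumes the standard chart layout):

```yaml
# values.yaml — sketch only; 203.0.113.10 is a placeholder address and
# the key path assumes the standard ingress-nginx chart layout.
controller:
  service:
    externalIPs:
      - 203.0.113.10
```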
@longwuyuan I'm using AWS, and this field should be filled automatically with the DNS name of the NLB; I can't set it statically. But the error is very misleading if this is the solution. In your opinion, is this issue related to tagging AWS resources?
There are 2 aspects that are impacting this.

That field is the closest we have to influence your situation.
cc @Gacko as he sometimes tests on AWS. The project CI does not run any tests on AWS, and it's also not easy to document and test all the possible migration path possibilities.
@hamza-louis I think you got lost here. It looks like your problem is absolutely unrelated to what is being discussed in this issue.
Another solution would be to extract the labels from the service selector. This is where the load balancer IP comes from anyway.
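To make that concrete, here is a rough sketch (not the project's actual code; the function signature and plumbing are assumptions) of how the check could match pods via the published Service's selector, which stays stable across chart upgrades, instead of the pod's own labels:

```go
// Rough sketch of the suggestion above: derive the pod selector from the
// published Service instead of copying the controller pod's own labels,
// which include helm.sh/chart and app.kubernetes.io/version and therefore
// change on every upgrade.
package status

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/kubernetes"
)

// isRunningMultiplePods reports whether more than one controller pod backs
// the publish Service. namespace and serviceName are illustrative inputs.
func isRunningMultiplePods(ctx context.Context, client kubernetes.Interface, namespace, serviceName string) (bool, error) {
	svc, err := client.CoreV1().Services(namespace).Get(ctx, serviceName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	// The Service selector stays stable across chart upgrades, unlike the
	// version-bearing pod labels.
	selector := labels.SelectorFromSet(svc.Spec.Selector)
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: selector.String(),
	})
	if err != nil {
		return false, err
	}
	return len(pods.Items) > 1, nil
}
```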
What happened:
When updating the nginx-ingress-controller Helm chart to a new version (in this case: 4.9.1 to 4.10.0), the current leader pod logs these messages:
This led to the load balancer IP (10.1.2.3) being removed from the status of all ingresses managed by the ingress controller, which in turn led to external-dns deleting all the DNS records for these ingresses, which caused an outage. The newly elected leader from the updated deployment then put the load balancer IP back in the ingress status and external-dns recreated all the records.

This does not happen during normal pod restarts, only during version upgrades (we retroactively found the same logs during our last upgrade from 4.9.0 to 4.9.1).
What you expected to happen:
nginx-ingress-controller should not clear the status of ingresses when it shuts down during a version upgrade.
NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version):

Kubernetes version (use kubectl version): v1.27.2

Environment:
Cloud provider or hardware configuration: doesn't matter
OS (e.g. from /etc/os-release): doesn't matter
Kernel (e.g. uname -a): doesn't matter
Install tools: doesn't matter
Basic cluster related info: doesn't matter
How was the ingress-nginx-controller installed: with values.yaml:

How to reproduce this issue:
Run kubectl get ingress -A -w in a background terminal to observe the load balancer IP being removed from all the ingresses managed by the controller.

Anything else we need to know:
The error message seems to be coming from here:
ingress-nginx/internal/ingress/status/status.go, line 135 in 9c384c7
This line is normally not reached when isRunningMultiplePods returns true:

ingress-nginx/internal/ingress/status/status.go, lines 130 to 133 in 9c384c7
Judging by the code of this function:
ingress-nginx/internal/ingress/status/status.go, lines 238 to 252 in 9c384c7
it tries to find pods with the same labels as the currently running pod and only returns true if it finds at least one such pod. During a version upgrade, it is very likely that this returns false, because the Helm chart adds the version of the chart and the version of the ingress controller as labels to the pods; hence the new pods of the updated deployment are not considered anymore by this function.
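For illustration, the version-bearing labels the chart puts on the pods look roughly like this (the exact versions below are examples, not taken from the affected cluster):

```yaml
# Pod labels on an old replica (illustrative values):
app.kubernetes.io/version: "1.9.6"
helm.sh/chart: ingress-nginx-4.9.1
# Pod labels on a replica from the upgraded deployment:
app.kubernetes.io/version: "1.10.0"
helm.sh/chart: ingress-nginx-4.10.0
```

Because the old leader matches pods against its own full label set, the new replicas with the bumped version labels no longer count, so the shutdown path treats itself as the last pod and clears the ingress status.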