
Load balancer IP cleared from all ingresses when upgrading nginx-ingress-controller #11087

Open
alfredkrohmer opened this issue Mar 11, 2024 · 25 comments
Labels: kind/feature · priority/important-longterm · triage/accepted · triage/needs-information

@alfredkrohmer

What happened:

When updating the nginx-ingress-controller Helm chart to a new version (in this case: 4.9.1 to 4.10.0), the current leader pod logs these messages:

I0305 11:08:37.122872 7 nginx.go:379] "Shutting down controller queues"
I0305 11:08:37.137382 7 status.go:135] "removing value from ingress status" address=[{"ip":"10.1.2.3"}]
I0305 11:08:37.146479 7 status.go:304] "updating Ingress status" namespace="service-a-rc" ingress="service-a-rc" currentValue=[{"ip":"10.1.2.3"}] newValue=[]
I0305 11:08:37.146522 7 status.go:304] "updating Ingress status" namespace="service-b-rc" ingress="service-b-rc" currentValue=[{"ip":"10.1.2.3"}] newValue=[]
I0305 11:08:37.146593 7 status.go:304] "updating Ingress status" namespace="kube-system" ingress="kube-oidc-proxy" currentValue=[{"ip":"10.1.2.3"}] newValue=[]
I0305 11:08:37.146648 7 status.go:304] "updating Ingress status" namespace="monitoring" ingress="prometheus-k8s" currentValue=[{"ip":"10.1.2.3"}] newValue=[]
I0305 11:08:37.146907 7 status.go:304] "updating Ingress status" namespace="kube-system" ingress="dex" currentValue=[{"ip":"10.1.2.3"}] newValue=[]
I0305 11:08:37.146954 7 status.go:304] "updating Ingress status" namespace="monitoring" ingress="thanos-sidecar" currentValue=[{"ip":"10.1.2.3"}] newValue=[]
I0305 11:08:37.960259 7 nginx.go:387] "Stopping admission controller"
E0305 11:08:37.960342 7 nginx.go:326] "Error listening for TLS connections" err="http: Server closed"
I0305 11:08:37.960354 7 nginx.go:395] "Stopping NGINX process"
2024/03/05 11:08:37 [notice] 720#720: signal process started
I0305 11:08:39.013339 7 nginx.go:408] "NGINX process has stopped"
I0305 11:08:39.013371 7 sigterm.go:44] Handled quit, delaying controller exit for 10 seconds
I0305 11:08:49.013693 7 sigterm.go:47] "Exiting" code=0

This led to the load balancer IP (10.1.2.3) being removed from the status of all ingresses managed by the ingress controller, which in turn led to external-dns deleting all the DNS records for these ingresses, which caused an outage.

The newly elected leader from the updated deployment then put the load balancer IP back in the ingress status and external-dns recreated all the records.

This does not happen during normal pod restarts, only during version upgrades (we retroactively found the same logs during our last upgrade from 4.9.0 to 4.9.1).

What you expected to happen:

nginx-ingress-controller should not clear the status of ingresses when it shuts down during a version upgrade.

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

NGINX Ingress controller
  Release:       v1.10.0
  Build:         71f78d49f0a496c31d4c19f095469f3f23900f8a
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.25.3

Kubernetes version (use kubectl version): v1.27.2

Environment:

  • Cloud provider or hardware configuration: doesn't matter

  • OS (e.g. from /etc/os-release): doesn't matter

  • Kernel (e.g. uname -a): doesn't matter

  • Install tools: doesn't matter

  • Basic cluster related info: doesn't matter

  • How was the ingress-nginx-controller installed:

$ helm template --repo https://kubernetes.github.io/ingress-nginx ingress-nginx -f values.yaml | kubectl apply -f-

with values.yaml:

fullnameOverride: nginx-ingress-internal
controller:
  kind: Deployment
  containerName: nginx-ingress-controller
  electionID: nginx-ingress-controller-internal-leader
  ingressClass: nginx-internal
  ingressClassResource:
    controllerValue: k8s.goto.com/nginx-ingress-internal
    enabled: true
    ingressClassByName: true
    name: nginx-internal
  admissionWebhooks:
    certManager:
      enabled: true
  allowSnippetAnnotations: false
  enableAnnotationValidations: true
  config:
    strict-validate-path-type: true
    use-proxy-protocol: false
  dnsPolicy: Default
  minReadySeconds: 60
  extraArgs:
    shutdown-grace-period: 60
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
      scrapeInterval: 60s
  replicaCount: "2"
  resources:
    limits:
      memory: 2Gi
    requests:
      cpu: 100m
      memory: 512Mi
  service:
    type: LoadBalancer
    externalTrafficPolicy: Local
    annotations:
      # cloud-specific annotations
      # ...
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                  - ingress-nginx
              - key: app.kubernetes.io/instance
                operator: In
                values:
                  - nginx-ingress-controller-internal
              - key: app.kubernetes.io/component
                operator: In
                values:
                  - controller
          topologyKey: kubernetes.io/hostname
  topologySpreadConstraints:
    - labelSelector:
        matchLabels:
          app.kubernetes.io/component: controller
          app.kubernetes.io/instance: nginx-ingress-controller-internal
          app.kubernetes.io/name: ingress-nginx
      maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule

How to reproduce this issue:

  1. Install some version of nginx-ingress-controller with Helm, using a service of type LoadBalancer.
  2. Upgrade or downgrade to a different version.
  3. Observe the log lines above; optionally run kubectl get ingress -A -w in a background terminal to watch the load balancer IP being removed from all the ingresses managed by the controller.

Anything else we need to know:

The error message seems to be coming from here:

klog.InfoS("removing value from ingress status", "address", addrs)

This line is normally not reached when isRunningMultiplePods returns true:

if s.isRunningMultiplePods() {
	klog.V(2).InfoS("skipping Ingress status update (multiple pods running - another one will be elected as master)")
	return
}

Judging by the code of this function:

podLabel := make(map[string]string)
for k, v := range k8s.IngressPodDetails.Labels {
	if k != "pod-template-hash" && k != "controller-revision-hash" && k != "pod-template-generation" {
		podLabel[k] = v
	}
}
pods, err := s.Client.CoreV1().Pods(k8s.IngressPodDetails.Namespace).List(context.TODO(), metav1.ListOptions{
	LabelSelector: labels.SelectorFromSet(podLabel).String(),
})
if err != nil {
	return false
}
return len(pods.Items) > 1

it builds a label selector from the current pod's labels (minus the three hash labels above) and returns true only if more than one pod matches, i.e. at least one other pod besides itself. During a version upgrade, this very likely returns false, because the Helm chart adds the chart version and the controller version as pod labels, so the new pods of the updated Deployment no longer match the old leader's label set.
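
For illustration, the version-bearing labels the chart stamps onto the controller pods look roughly like this (values are examples based on the chart versions above; the exact label set depends on the chart release):

# old leader pod, chart 4.9.1
labels:
  app.kubernetes.io/name: ingress-nginx
  app.kubernetes.io/instance: nginx-ingress-controller-internal
  app.kubernetes.io/component: controller
  app.kubernetes.io/version: "1.9.6"    # controller version
  helm.sh/chart: ingress-nginx-4.9.1    # chart version

# new pod, chart 4.10.0: the two version labels differ, so a selector
# built from the old leader's labels no longer matches this pod
labels:
  app.kubernetes.io/version: "1.10.0"
  helm.sh/chart: ingress-nginx-4.10.0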

@alfredkrohmer alfredkrohmer added the kind/bug Categorizes issue or PR as related to a bug. label Mar 11, 2024
@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Mar 11, 2024
@longwuyuan (Contributor)

@alfredkrohmer is this on a cloud provider or on bare metal + MetalLB, etc.?

  • The new-bug report template asks questions that may sometimes seem small, unrelated, or irrelevant to some people
  • But that data is more often than not essential for actionable work
  • If you can help out and provide all that info, it would help readers

@longwuyuan (Contributor)

/remove-kind bug

  • Let's add the bug label back after a successful reproduction
    //triage needs-information

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels Mar 11, 2024
@longwuyuan (Contributor)

/triage needs-information

@k8s-ci-robot k8s-ci-robot added the triage/needs-information Indicates an issue needs more information in order to work on it. label Mar 11, 2024
@alfredkrohmer (Author)

This is on OCI (OKE) with a Network Load Balancer as a service, but as I pointed out with the code references, this seems to be generic behavior that triggers when a new version of the controller is rolled out with a different label set than the old pods.

@longwuyuan (Contributor) commented Mar 11, 2024

I saw it happen on minikube with MetalLB.

Just like another feature, this also impacts the external-dns use case.

/triage accepted

I think it would be interesting to see if it happens when you set the LoadBalancer IP

(if OKE/OCI supports that field)

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 11, 2024
@alfredkrohmer (Author)

OKE does support that field, but only for public IPs and unfortunately we don't have test clusters with public subnets available 🙁

@alfredkrohmer (Author)

Not sure what the right approach to fix this would be. I can think of a couple of options:

  • exclude more label keys here, i.e. the labels containing the app and Helm chart versions
    • this would bleed Helm chart specifics into the code base
    • if a label were ever removed from the Helm chart later on, this would either break again during that upgrade, or the removed label would first need to be added to the code base (which sounds wrong)
  • make the list of excluded label keys configurable (see the sketch after this list)
    • the Helm chart would need to supply this list
  • provide a configuration flag to skip clearing the ingress status altogether
    • if enabled by default in the Helm chart, uninstalling it wouldn't clear the status
    • if disabled by default in the Helm chart, this would most likely still cause outages on version upgrades for most users who don't explicitly enable it
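
A minimal sketch of the second option; the flag name and helper below are hypothetical, not existing controller code, and only illustrate how a configurable exclusion list could replace the hard-coded one:

package main

import (
	"fmt"
	"slices"
)

// filterPodLabels drops the excluded keys from a pod's label set, mirroring
// how isRunningMultiplePods builds its selector for peer pods. A hypothetical
// --status-update-excluded-pod-labels flag could let the Helm chart add its
// version-bearing labels to the exclusion list.
func filterPodLabels(podLabels map[string]string, excluded []string) map[string]string {
	out := make(map[string]string)
	for k, v := range podLabels {
		if !slices.Contains(excluded, k) {
			out[k] = v
		}
	}
	return out
}

func main() {
	podLabels := map[string]string{
		"app.kubernetes.io/name":    "ingress-nginx",
		"app.kubernetes.io/version": "1.10.0",
		"helm.sh/chart":             "ingress-nginx-4.10.0",
		"pod-template-hash":         "abc123",
	}
	excluded := []string{"pod-template-hash", "helm.sh/chart", "app.kubernetes.io/version"}
	// prints map[app.kubernetes.io/name:ingress-nginx]
	fmt.Println(filterPodLabels(podLabels, excluded))
}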

@DocX commented Mar 22, 2024

Update: We realised we inject ec2-instance-id labels into all pods in our cluster, so the controller pods always think they are the last one and run the cleanup. So our issue is not specific to controller version updates.

I can add a +1 to this; we see it often, even when not updating, e.g.:

  • a liveness probe fails and the pod is terminated and restarted
  • a node is broken or scaled down, so pods need to be rescheduled to a different node, which again triggers termination
  • the deployment is updated (e.g. some config or our custom cluster labels change)

We see the same symptoms every time the controller pod is terminated:

  • the endpoint is cleaned up on the ingress resources
  • external-dns removes the DNS records
  • a new ingress-controller pod starts
  • it updates the endpoint on the ingress resources
  • external-dns re-adds the DNS records

We run 2 replicas of the nginx-ingress-controller. We don't use Helm, we use the direct k8s manifest import.

I would expect that if one replica is terminated, as long as there is another replica that becomes leader, it should not update the statuses?

Update: we also see some errors around leader election during the shutting down, similar to #8996 - so it may be compounded issue.

Update: the leader election errors seem to be irrelevant.

@robert-heinzmann-logmein

If DNS is queried at just the wrong time (while no record exists), the negative-cache TTL can extend the outage beyond the actual recovery, which I think makes this worse.
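
As an aside, the negative-caching window can be read off the zone's SOA record; per RFC 2308, resolvers cache NXDOMAIN answers for min(SOA TTL, SOA MINIMUM). For example (example.com is a placeholder zone):

$ dig +noall +answer SOA example.com
example.com.  3600  IN  SOA  ns.example.com. hostmaster.example.com. 2024031101 7200 3600 1209600 300
# the last SOA field (300) is the MINIMUM, i.e. up to 5 minutes of cached NXDOMAIN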

@longwuyuan (Contributor)

cc @strongjz @rikatz @tao12345666333

@alfredkrohmer (Author)

Any idea / opinion on how to fix this?

@longwuyuan (Contributor)

@alfredkrohmer any chance you can test setting the LoadBalancerIP in the values file and report back on the behaviour?

@alfredkrohmer (Author)

This doesn't really work for us, as we don't pick the load balancer IP statically; it comes from the status of the service, for which a backing load balancer is created in the cloud.

@longwuyuan (Contributor)

/assign

@longwuyuan (Contributor)

/kind feature
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority labels Apr 29, 2024
@robert-heinzmann-logmein

Any news regarding this issue?

@alfredkrohmer (Author) commented Jul 8, 2024

Oh wow, not sure why I haven't seen this before, but there is actually a flag to not remove the ingress status on shutdown 🤦

updateStatusOnShutdown = flags.Bool("update-status-on-shutdown", true,

if !s.UpdateStatusOnShutdown {

Now I wonder if this should be set to false by default in the Helm chart.
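
For anyone who wants to opt out today, the flag can be passed through the chart's extraArgs (a sketch based on the values.yaml above; the flag defaults to true):

controller:
  extraArgs:
    update-status-on-shutdown: "false"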

@Tadcas commented Oct 18, 2024

We have the same issue when upgrading to the 4.10.x or 4.11.x versions.
This led to the load balancer IP being removed from the status of all ingresses managed by the ingress controller, which in turn led to external-dns deleting all the DNS records for these ingresses.
We are using AKS, version 1.29.8.

@hamza-louis

Hello there, I'm still facing this issue with multiple NGINX ingress controller versions. I have tried several ingress-nginx Helm chart versions.

AWS EKS cluster - version 1.30

It shows the errors below:

(screenshot of errors)

I have tried changing the permissions/scope of the ClusterRole, granting it create, get, and update, but no luck.

Any updates regarding this issue?

Thanks in advance

@longwuyuan (Contributor) commented Oct 22, 2024

There is a deprecated field you can try (https://kubernetes.io/docs/concepts/services-networking/service/#external-ips), but it depends on the cloud provider too.

The values file key:value pair is here

@hamza-louis

@longwuyuan I'm using AWS, and this field should be filled automatically with the DNS name of the NLB; I can't set it statically.

But the error is quite misleading if this is the solution. In your opinion, is this issue related to tagging AWS resources?

@longwuyuan (Contributor)

There are 2 aspects impacting this.

  • One factor is that the controller code does not contain anything to do with the AWS LB. The controller creates a service of --type LoadBalancer. That itself is an upstream Kubernetes resource and not a CRD that we create.

  • The second factor is that the cloud-controller-manager watches for events related to a service of --type LoadBalancer. So if such a service is created, the cloud-controller-manager triggers events in the cloud-infrastructure provider.

That field is the closest we have to influence your situation.

@longwuyuan
Copy link
Contributor

cc @Gacko as he sometimes tests on AWS. The project CI does not do any tests on AWS, and it's also not easy to document and test all the possible migration paths.

@alfredkrohmer (Author)

@hamza-louis I think you got lost here. It looks like your problem is absolutely unrelated to what is being discussed in this issue.

@sathieu (Contributor) commented Jan 28, 2025

Not sure what the right approach to fix this would be. I can think of a couple of options:

Another solution would be to extract the labels from the service selector - this is where the load balancer IP comes from anyway.
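
A minimal sketch of that idea, assuming the status updater has access to the published service (client, namespace, and serviceName below stand in for whatever the updater actually holds): list peer pods via the service's spec.selector instead of the leader pod's own labels:

// Sketch: count controller pods via the published service's selector rather
// than the leader pod's own (version-bearing) labels.
svc, err := client.CoreV1().Services(namespace).Get(context.TODO(), serviceName, metav1.GetOptions{})
if err != nil {
	return false
}
pods, err := client.CoreV1().Pods(namespace).List(context.TODO(), metav1.ListOptions{
	LabelSelector: labels.SelectorFromSet(svc.Spec.Selector).String(),
})
if err != nil {
	return false
}
// More than one matching pod means another replica can take over leadership,
// so the outgoing leader should not clear the ingress status.
return len(pods.Items) > 1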
