
readinessProbe endpoint 10254:/healthz as AWS NLB health check endpoint and using proxy protocol v2 (PPv2) fails #10982

Closed
youwalther65 opened this issue Feb 12, 2024 · 15 comments
Labels: help wanted, kind/bug, needs-priority, needs-triage, triage/needs-information

youwalther65 commented Feb 12, 2024

What happened:

I am running the NGINX Ingress Controller (for demo purposes with just one pod) in an AWS environment behind an NLB with a target group in IP mode.
When using the readinessProbe endpoint 10254:/healthz as the AWS NLB health check endpoint together with proxy protocol v2 (PPv2), the NLB IP target goes into the "Unhealthy" state (even though the corresponding pod is Ready).
Without proxy protocol v2 (PPv2), i.e. after removing the service annotation service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: '*', everything works fine and the NLB IP target is healthy.

What you expected to happen:

I expect the readinessProbe endpoint 10254:/healthz to support PPv2 as well, so that the NLB IP targets are reported as "Healthy".
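
For reference, this is the combination from the Helm values below that triggers the failure: the NLB health check is pointed at the metrics/health port 10254 while PPv2 is enabled for all target groups.

# Excerpt from the controller Service annotations (full values further below)
service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: /healthz
service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "10254"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: http
# PPv2 applies to all target groups, so health check connections carry a PROXY header too
service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: '*'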

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

$ helm list -n ingress-nginx 
NAME         	NAMESPACE    	REVISION	UPDATED                                	STATUS  	CHART              	APP VERSION
ingress-nginx	ingress-nginx	1       	2024-02-10 11:46:12.697149101 +0000 UTC	deployed	ingress-nginx-4.9.1	1.9.6 

Kubernetes version (use kubectl version):

$ kubectl version
Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.0-eks-c417bb3

Environment:

  • Cloud provider or hardware configuration: AWS EKS with AWS Load Balancer Controller
$ helm list -n kube-system  | grep aws-lb-ctlr
aws-lb-ctlr                   	kube-system	3       	2024-02-09 19:20:46.107688341 +0000 UTC	deployed	aws-load-balancer-controller-1.7.1         	v2.7.1     
  • How was the ingress-nginx-controller installed:
$ helm get values -n ingress-nginx ingress-nginx
USER-SUPPLIED VALUES:
controller:
  autoscaling:
    enabled: false
    maxReplicas: 4
    minReplicas: 2
  config:
    use-proxy-protocol: "true"
  metrics:
    enabled: true
    port: 10254
    service:
      annotations:
        prometheus.io/port: "10254"
        prometheus.io/scrape: "true"
  podAnnotations:
    prometheus.io/port: "10254"
    prometheus.io/scrape: "true"
  resources:
    requests:
      cpu: 100m
      memory: 200Mi
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-additional-resource-tags: auto-delete=no
      service.beta.kubernetes.io/aws-load-balancer-attributes: load_balancing.cross_zone.enabled=true
      service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: /healthz
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "10254"
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: http
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-success-codes: 200-299
      service.beta.kubernetes.io/aws-load-balancer-name: nginx-ingress-nlb
      service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
      service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: '*'
      service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
      service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: deregistration_delay.timeout_seconds=270
      service.beta.kubernetes.io/aws-load-balancer-type: external
  topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        app.kubernetes.io/instance: ingress-nginx
    maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
  - labelSelector:
      matchLabels:
        app.kubernetes.io/instance: ingress-nginx
    maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
  • Current State of the controller:
$ k get all -n ingress-nginx 

NAME                                          READY   STATUS    RESTARTS   AGE
pod/ingress-nginx-controller-b78b6d44-6qdbv   1/1     Running   0          2d3h

NAME                                         TYPE           CLUSTER-IP       EXTERNAL-IP                                                      PORT(S)                      AGE
service/ingress-nginx-controller             LoadBalancer   172.20.209.179   nginx-ingress-nlb-<redacted>.elb.<redacted>.amazonaws.com   80:30174/TCP,443:31214/TCP   2d3h
service/ingress-nginx-controller-admission   ClusterIP      172.20.204.172   <none>                                                           443/TCP                      2d3h
service/ingress-nginx-controller-metrics     ClusterIP      172.20.212.123   <none>                                                           10254/TCP                    2d3h

NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ingress-nginx-controller   1/1     1            1           2d3h

NAME                                                DESIRED   CURRENT   READY   AGE
replicaset.apps/ingress-nginx-controller-b78b6d44   1         1         1       2d3h
@youwalther65 youwalther65 added the kind/bug Categorizes issue or PR as related to a bug. label Feb 12, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Feb 12, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@longwuyuan
Contributor

  • Post all the log messages of the LB from CloudTrail or other sources
  • Post all the log messages of the controller pod
  • Post all the output of k get events -A
  • Ask AWS support as well
  • For debugging you can also try this: uninstall the Helm chart and install via the manifest as explained in the deployment guide, after editing the manifest for the required AWS annotations and the ConfigMap to enable proxy protocol. Then report the difference.
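
A rough sketch of that manifest-based debugging install (the manifest URL and the exact edits are assumptions; follow the deployment guide for the correct version):

$ helm uninstall -n ingress-nginx ingress-nginx
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.9.6/deploy/static/provider/aws/deploy.yaml
# enable proxy protocol in the controller ConfigMap ...
$ kubectl patch configmap -n ingress-nginx ingress-nginx-controller \
    --type merge -p '{"data":{"use-proxy-protocol":"true"}}'
# ... and add the AWS annotations from the Helm values above to the Service, e.g.:
$ kubectl annotate service -n ingress-nginx ingress-nginx-controller \
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol='*' --overwrite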

/triage needs-information
/help

If anyone has resources to test, it would help if they could reproduce this with a standard vanilla install (plus just enabling proxy protocol), because the project CI does not test against clouds like AWS and also does not have a free AWS account.

@k8s-ci-robot
Contributor

@longwuyuan:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

  • Post all the log messages of the LB from CloudTrail or other sources
  • Post all the log messages of the controller pod
  • Post all the output of k get events -A
  • Ask AWS support as well
  • For debugging you can also try this: uninstall the Helm chart and install via the manifest as explained in the deployment guide, after editing the manifest for the required AWS annotations and the ConfigMap to enable proxy protocol. Then report the difference.

/triage needs-information
/help

If anyone has resources to test, it would help if they could reproduce this with a standard vanilla install (plus just enabling proxy protocol), because the project CI does not test against clouds like AWS and also does not have a free AWS account.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added triage/needs-information Indicates an issue needs more information in order to work on it. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Feb 13, 2024
@youwalther65
Author

youwalther65 commented Feb 13, 2024

Before further testing/debugging, @longwuyuan, can someone confirm that my approach should work? Can you confirm that you expect port 10254 to support PPv2 as well?

@longwuyuan
Contributor

I am not able to comment on proxy-protocol-version-2 support.

cc @tao12345666333 @rikatz @cpanato @strongjz

@strongjz
Member

We use NGINX 1.21, so proxy protocol v2 should work:

To accept the PROXY protocol v2, NGINX Plus R16 and later or NGINX Open Source 1.13.11 and later

https://docs.nginx.com/nginx/admin-guide/load-balancer/using-proxy-protocol/

Is the port open between the node group and the load balancer in the security group? I'm not sure the cloud controller does that by default when you put it in the annotation.
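
For illustration, two hedged ways to check those points from the CLI (the target group ARN and security group ID are placeholders you would have to look up yourself):

# Shows the health state and failure reason the NLB reports for each target
$ aws elbv2 describe-target-health --target-group-arn <target-group-arn>
# Lists the inbound rules of the node/pod security group, to verify port 10254 is allowed
$ aws ec2 describe-security-group-rules --filters Name=group-id,Values=<security-group-id>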

@youwalther65
Author

youwalther65 commented Feb 15, 2024

@strongjz Valid point. But yes, it is. And just manually turning off PPv2 on the NLB target group for the NGINX Ingress controller port makes the target pod healthy again (so the SG is not an issue at all).
From my limited Go experience I know that both the application port and the health check port would need to listen with PPv2 enabled. Because port 10254 of the NGINX Ingress controller is used for metrics scraping as well, I just wonder if this is really the case.

$ kubectl get deploy -n ingress-nginx ingress-nginx-controller  -o yaml
        ports:
        - containerPort: 80
          name: http
          protocol: TCP
        - containerPort: 443
          name: https
          protocol: TCP
        - containerPort: 10254
          name: metrics
          protocol: TCP
        - containerPort: 8443
          name: webhook
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 10254
            scheme: HTTP
          initialDelaySeconds: 10
...
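
For illustration, a quick way to check against the pod from the output above that port 10254 only speaks plain HTTP. curl's --haproxy-protocol sends a PROXY protocol v1 header rather than v2, but it is enough to show that the port does not parse a PROXY header at all; the results in the trailing comments are what I would expect, not output captured from this setup:

$ kubectl -n ingress-nginx port-forward pod/ingress-nginx-controller-b78b6d44-6qdbv 10254:10254
# in a second terminal:
$ curl -i http://127.0.0.1:10254/healthz                    # plain HTTP: 200 OK
$ curl -i --haproxy-protocol http://127.0.0.1:10254/healthz # PROXY header prepended: expect 400 Bad Request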

@strongjz
Member

/assign @Gacko

@Gacko
Member

Gacko commented Mar 6, 2024

Hello @youwalther65!

As you already assumed, the health check port of the Ingress NGINX pod is sadly neither using nor supporting PROXY protocol.

If I read your output correctly, you are using the AWS Load Balancer Controller. I have also already noted that it enables PROXY protocol for the health check port if it is enabled for the traffic port. If I remember correctly, that wasn't the case for the legacy in-tree controller, but I need to check that.

Basically, it shouldn't make a difference whether you're using the IP target mode, the instance target mode or even externalTrafficPolicy: Local: the AWS Load Balancer Controller always configures the health check port to use PROXY protocol if it is enabled for the traffic port, and neither Ingress NGINX's health check port (which is used for both IP and instance target mode) nor kube-proxy or its replacements like Cilium support PROXY protocol on their health check port.

Lemme do some testing around that. I'll come back later!

Regards
Marco

@Gacko
Member

Gacko commented Mar 6, 2024

Side note: According to your profile you're working for AWS. So maybe you even have more insights regarding NLB capabilities and configuration than me. 😬

@Gacko
Member

Gacko commented Mar 8, 2024

Hello @youwalther65!

I just played around with the AWS Load Balancer Controller and NLBs a bit and came to an interesting result. Heads up: I know you're using IP mode, but please keep on reading, I'll get back to that in the end.

When you created a CLB using the In-Tree Load Balancer Controller in the past, you could have your traffic flow through node ports using PROXY protocol while the health check on the health check node port (externalTrafficPolicy: Local) was still plain HTTP. In case you were using externalTrafficPolicy: Cluster, Kubernetes did not allocate such a health check node port and configured your CLB to just use the corresponding traffic ports, i.e. the node ports. Interestingly, AWS then automatically also used PROXY protocol for the health check.

So in the end the CLB was more or less always using the right protocol, independent of whether the health check is handled by kube-proxy, which does not support PROXY protocol, for implementing externalTrafficPolicy: Local, or by Ingress NGINX itself in case of externalTrafficPolicy: Cluster, which requires PROXY protocol when it is enabled. Basically... magic happens.

This magic is not happening anymore with NLBs. I tried to deploy an NLB using the In-Tree Load Balancer Controller with PROXY protocol enabled. That didn't work, as the old controller doesn't seem to support PROXY protocol for NLBs. At least adding the service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*" annotation didn't result in an NLB with PROXY protocol enabled in the AWS console.

As this was just a test, I went on to configure an NLB in the recommended way, using the AWS Load Balancer Controller. As you can probably already imagine, this controller correctly set up an NLB with PROXY protocol as requested. But compared to the CLB created by the In-Tree Load Balancer Controller, this one was using PROXY protocol for both the traffic ports and the health check port. So if you also configure externalTrafficPolicy: Local, the NLB will end up health checking kube-proxy with PROXY protocol, which then fails.

So you can either turn off PROXY protocol to make the health check work again while using externalTrafficPolicy: Local (fun fact: client IPs are still being preserved this way as long as client IP preservation isn't turned off in the NLB) or switch over to externalTrafficPolicy: Cluster, which then really requires PROXY protocol as it does not preserve the client IP, even with NLB client IP preservation enabled.

As you're using the IP mode in your particular use case, this means you either keep the traffic ports for the health check and do not change it to the metrics port (which also does not support PROXY protocol) while PROXY protocol is enabled, or you turn off PROXY protocol when using any port other than the traffic ports.

Actually I'm wondering why you're even using a different port for the health checks. Your NLB has two listeners, one for each traffic port (HTTP & HTTPS) and each of them has its own target group. This means each target group can use the right traffic port (HTTP for HTTP, HTTPS for HTTPS) for its own health check. There is no need to override it to use the metrics port from my point of view. So in the end it shouldn't make a difference whether you're using PROXY protocol in IP mode as long as you're using the traffic ports, because IP mode doesn't rely on any (health check) node ports at all.
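
In other words, a minimal sketch of the Service annotations following this recommendation (an assumption on my side, keeping only what seems required; defaults apply for everything omitted): PROXY protocol stays enabled and no health check overrides are set, so each target group health checks its own traffic port, which does accept PROXY protocol as long as use-proxy-protocol is enabled in the controller ConfigMap.

controller:
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-name: nginx-ingress-nlb
      service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
      service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
      service.beta.kubernetes.io/aws-load-balancer-type: external
      service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: '*'
      # no healthcheck-port/-path/-protocol overrides: each target group health
      # checks its traffic port, which speaks PROXY protocol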

Maybe you can bring some light into your use case and explain why you want to use the metrics port for health checks.

Regards
Marco

@Gacko
Member

Gacko commented Mar 8, 2024

Hello @youwalther65!

As you already assumed, the health check port of the Ingress NGINX pod is sadly neither using nor supporting PROXY protocol.

If I read your output correctly, you are using the AWS Load Balancer Controller. I have also already noted that it enables PROXY protocol for the health check port if it is enabled for the traffic port. If I remember correctly, that wasn't the case for the legacy in-tree controller, but I need to check that.

Basically, it shouldn't make a difference whether you're using the IP target mode, the instance target mode or even externalTrafficPolicy: Local: the AWS Load Balancer Controller always configures the health check port to use PROXY protocol if it is enabled for the traffic port, and neither Ingress NGINX's health check port (which is used for both IP and instance target mode) nor kube-proxy or its replacements like Cilium support PROXY protocol on their health check port.

Lemme do some testing around that. I'll come back later!

Regards Marco

Some addition to that: I was going on the assumption that we are using 10254 for health checking Ingress NGINX. Basically that is true, but only for Kubernetes itself. So the kubelet uses that port to determine whether a pod is ready and healthy, but that's not the port being used by load balancers. Those use either the traffic ports or the kube-proxy-provided health check port, depending on your value of externalTrafficPolicy.

So to bring it back to your use case: please use the traffic ports, especially when using PROXY protocol, as the "health check port" 10254 does not support PROXY protocol, and load balancers normally do the same, i.e. use the traffic ports.
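
For completeness, which port a load balancer health checks in instance mode can be read from the Service itself; healthCheckNodePort is only allocated with externalTrafficPolicy: Local. A small sketch (the output depends on your actual Service):

$ kubectl get service -n ingress-nginx ingress-nginx-controller \
    -o jsonpath='{.spec.externalTrafficPolicy}{" "}{.spec.healthCheckNodePort}{"\n"}'

In IP target mode, as used in this issue, no node ports are involved at all and the health check simply goes to the pod's traffic ports.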

@Gacko
Member

Gacko commented Mar 11, 2024

Closing this for now. Re-open if the information provided does not solve your issue or you see a way of improving the current behavior (I do not as it's highly provider specific).

/close

@k8s-ci-robot
Contributor

@Gacko: Closing this issue.

In response to this:

Closing this for now. Re-open if the information provided does not solve your issue or you see a way of improving the current behavior (I do not as it's highly provider specific).

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@youwalther65
Author

@Gacko Thank you for the investigation and for providing the details. In the end it was basically my assumption/misunderstanding that the readinessProbe endpoint 10254:/healthz could be used as the AWS NLB health check endpoint.
