
Incorrect handling of long URLs [draft] #11243

Closed
phantom943 opened this issue Apr 10, 2024 · 17 comments
Labels
• kind/support: Categorizes issue or PR as a support question.
• needs-priority
• needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
• triage/needs-information: Indicates an issue needs more information in order to work on it.

Comments

@phantom943

phantom943 commented Apr 10, 2024

What happened:
I am getting intermittent HTTP errors when querying services via ingress if the URL is longer than ~2000 characters.
What you expected to happen:
Idempotency (no intermittent errors when querying the same stateless endpoint)
NGINX Ingress controller version:
v1.1.1
Kubernetes version (use kubectl version):

Environment:

  • Cloud provider or hardware configuration: Baremetal

  • OS (e.g. from /etc/os-release):

  • Kernel (e.g. uname -a):

  • Install tools:

    • Please mention how/where the cluster was created, e.g. kubeadm/kops/minikube/kind, etc.
  • Basic cluster related info:

    • kubectl version
    • kubectl get nodes -o wide
  • How was the ingress-nginx-controller installed:

    • If helm was used then please show output of helm ls -A | grep -i ingress
    • If helm was used then please show output of helm -n <ingresscontrollernamespace> get values <helmreleasename>
    • If helm was not used, then copy/paste the complete precise command used to install the controller, along with the flags and options used
    • if you have more than one instance of the ingress-nginx-controller installed in the same cluster, please provide details for all the instances
  • Current State of the controller:

    • kubectl describe ingressclasses
    • kubectl -n <ingresscontrollernamespace> get all -A -o wide
    • kubectl -n <ingresscontrollernamespace> describe po <ingresscontrollerpodname>
    • kubectl -n <ingresscontrollernamespace> describe svc <ingresscontrollerservicename>
  • Current state of ingress object, if applicable:

    • kubectl -n <appnamespace> get all,ing -o wide
    • kubectl -n <appnamespace> describe ing <ingressname>
    • If applicable, your complete and exact curl/grpcurl command (redacted if required) and the response to the curl/grpcurl command with the -v flag
  • Others:

    • Any other related information, like:
      • copy/paste of the snippet (if applicable)
      • kubectl describe ... of any custom configmap(s) created and in use
      • Any other related information that may help

How to reproduce this issue:
Make the same curl GET request to a web service in k8s 30 times in a row. The URL has to be long (over 2000 characters). Observe the 200/error rate (in my case the error is 505, but I suspect it might differ by application).
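A minimal reproduction sketch of that loop (the hostname, path, and token value below are placeholders, not the real service):

```bash
# Build a ~2500-character query parameter and send the same GET 30 times,
# then count the returned status codes (URL is a placeholder).
LONG_PARAM=$(head -c 2500 /dev/zero | tr '\0' 'a')
URL="https://ingress.example.com/api/endpoint?token=${LONG_PARAM}"

for i in $(seq 1 30); do
  curl -s -o /dev/null -w '%{http_code}\n' "$URL"
done | sort | uniq -c
```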
Anything else we need to know:
When I do the same curl GET request to it 30 times in a row, ~10 times (30%) I get an HTTP 505 error, and the other 20 times I get 200 OK.
Some relevant info:

  • Web service logs are empty (even with the highest debug level)
  • nginx logs for both the 200 and 505 cases are identical (except for the response size). They both get routed to the same service.
  • If I do a kubectl port-forward of my service's port to my machine, all 30 out of 30 requests complete with HTTP 200 (a rough sketch of that comparison is after this list).
  • The URL I have is quite long (2500 characters) (I didn't design the service, so please don't judge). If I truncate it to 1000 characters, I get 30 out of 30 requests with HTTP 200. The actual cutoff where it starts breaking is about 1500-1700 characters (letters and numbers).
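Roughly how the port-forward comparison was done (namespace, service name, and ports are placeholders):

```bash
# Bypass the ingress entirely: forward the Service port to localhost and
# repeat the same 30 requests (names and ports are placeholders).
kubectl -n my-namespace port-forward svc/my-service 8080:80 &
PF_PID=$!
sleep 2   # give the port-forward a moment to establish

LONG_PARAM=$(head -c 2500 /dev/zero | tr '\0' 'a')
for i in $(seq 1 30); do
  curl -s -o /dev/null -w '%{http_code}\n' \
    "http://127.0.0.1:8080/api/endpoint?token=${LONG_PARAM}"
done | sort | uniq -c

kill "$PF_PID"   # stop the background port-forward
```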
phantom943 added the kind/bug label (Categorizes issue or PR as related to a bug.) on Apr 10, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) and needs-priority labels on Apr 10, 2024
@longwuyuan
Contributor

/remove-kind bug

  • Why the word "draft"?
  • Did you mean URLs that are over 1600 chars in length?
  • How does it work if you use plain old vanilla Nginx v1.25 as a reverse proxy instead of ingress-nginx?

/kind support
/triage needs-information

k8s-ci-robot added the kind/support (Categorizes issue or PR as a support question.) and triage/needs-information (Indicates an issue needs more information in order to work on it.) labels and removed the kind/bug (Categorizes issue or PR as related to a bug.) label on Apr 10, 2024
@phantom943
Author

Hello @longwuyuan !

  • The word "draft" is just because I haven't finished filling in all the details (like how the controller was installed and whatnot). I am still collecting details from my colleagues.
  • Yep, correct. URLs over 1600 chars in length.
  • I am not sure if we can do that. I'll try, but that requires a lot of setup at this point.

@longwuyuan
Contributor

OK. Can you also describe why your URL is so long?

@phantom943
Author

@longwuyuan well, it's a very bad choice by the end app developers - they are passing an OpenID token via a URL parameter -.-
Can't do much about that though

@longwuyuan
Contributor

OK, thank you for the info. That explains the use case.

@longwuyuan
Contributor

I am checking the nginx specs and the HTTP specs. Maybe you can do the same. This project's code will not set that limit, for sure.

cc @tao12345666333 @rikatz if you already know the spec limit for the length of an HTTP URL

@longwuyuan
Contributor

If you already have the complete error message from the controller logs, please copy/paste it here

@longwuyuan
Contributor

Also, controller v1.1.1 is not supported anymore. Is that the real version of the controller in use?

@phantom943
Author

Hey @longwuyuan
thanks a lot for your suggestions!
We have indeed tried large-client-header-buffers, to no avail.
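For reference, roughly how we applied that setting; the namespace and ConfigMap name below are the Helm chart defaults and may differ from your install:

```bash
# Set large-client-header-buffers on the controller ConfigMap
# (namespace/ConfigMap name assumed to be the chart defaults).
kubectl -n ingress-nginx patch configmap ingress-nginx-controller \
  --type merge -p '{"data":{"large-client-header-buffers":"4 16k"}}'

# Confirm the rendered nginx.conf picked it up:
kubectl -n ingress-nginx exec deploy/ingress-nginx-controller -- \
  grep large_client_header_buffers /etc/nginx/nginx.conf
```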
We have also just updated to the latest version of the controller (v1.10.0), also with no effect; the bug is still there.
I don't believe it's a matter of a setting, because otherwise it would have reproduced 100% of the time, not 30%.
Another interesting thing to note is that this seems to be dependent on the application: I have another application where, if I put a long token into the URL and just return it as a response, no 505 errors occur. So it could be a combination of a bug in the application and a bug in the ingress controller itself.
Do you have any suggestions on how to debug this maybe?

@longwuyuan
Contributor

  • if you can run one request with 1800 chars in the URL and it does not fail, then I agree that this is not related to the large-client-header-buffers setting

  • I would next find the threshold at which the success rate begins to drop below 100%, e.g. use load-generation tools and send incremental volumes, but in batches (see the sketch after this list):

    • 10 requests with 1800 chars in URL
    • 100 ditto
    • 500 ditto
    • and so on
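Something along these lines (the URL is a placeholder; adjust the batch sizes to taste):

```bash
# For each batch size, count how many requests with a ~1800-character URL
# come back as HTTP 200 (URL below is a placeholder, not the real service).
URL="https://ingress.example.com/api/endpoint?token=$(head -c 1800 /dev/zero | tr '\0' 'a')"

for batch in 10 100 500; do
  ok=0
  for i in $(seq 1 "$batch"); do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$URL")
    [ "$code" = "200" ] && ok=$((ok + 1))
  done
  echo "batch=$batch: $ok/$batch returned 200"
done
```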

@longwuyuan
Contributor

Also, still waiting on the exact and real complete error message lines

@phantom943
Author

Hello!
So, we have managed to locate the root issue. It turns out the issue wasn't in the nginx controller after all.
The issue was packet fragmentation: the packet we were sending was bigger than the MTU of the machines en route.

So, sometimes the packet would arrive in full, and sometimes it would arrive in two fragments. The problem was with the server: when the packet was fragmented, it received the first fragment and (because of a bug they have in the code) started to interpret that half of the packet as the whole thing. The HTTP headers are actually located at the end of the packet, so they were in the second fragment. So the server failed to process that half of the packet correctly and just replied with 505 instead of waiting for the rest of it to arrive and processing it in full.
Why the packet got fragmented randomly about 1/3 of the time, we have no idea, but we managed to find a workaround.
We placed an additional nginx server in front of our target container on the same physical node, which just relays the request to the correct port on the server. But nginx CAN handle fragmented packets correctly and can reassemble them. So it reassembled the packet, and since after that stage the packet didn't traverse any other machines, it arrived in full at the target server.
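For anyone checking for the same thing, a quick way to probe whether packets of a given size cross the path unfragmented (the hostname and payload size below are placeholders, and ping -M do is Linux-specific):

```bash
# Send pings with the Don't Fragment bit set; 1472 bytes of payload plus
# 28 bytes of ICMP/IP headers corresponds to the common 1500-byte MTU.
ping -c 3 -M do -s 1472 ingress.example.com

# Compare against the interface MTUs on the nodes along the way:
ip -o link show | awk '{print $2, $4, $5}'
```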

@longwuyuan
Contributor

Thanks for the update.
If the apps in the pods have listening sockets created by dev servers like Jetty, Django, etc., it implies that folks did not put an nginx in front of them, inside the pod.

Your solution of putting an nginx webserver on the node sounds too much like an anti-pattern, but since I am on the outside, I would not know any better.

@phantom943
Author


Hey @longwuyuan
indeed you are right. They did not have any nginx or ASGI servers or anything, just a pure node.js server.
So we actually put the nginx into a sidecar container in the same pod as the application, so hopefully it's not an anti-pattern.
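Conceptually, the sidecar's relay config is just a local proxy_pass; a minimal sketch (the ports are placeholders, not our real values):

```bash
# Drop a relay server block into the sidecar nginx; the stock nginx image
# includes /etc/nginx/conf.d/*.conf by default (ports are placeholders).
cat > /etc/nginx/conf.d/relay.conf <<'EOF'
server {
    listen 8080;                          # port the Service targets
    location / {
        proxy_pass http://127.0.0.1:3000; # the node.js server in the same pod
    }
}
EOF
nginx -t   # validate the configuration
```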

@longwuyuan
Contributor

Thanks for updating. It helps future readers.
