Intermittent request aborted from proxied service #235

Open
dipack95 opened this issue Feb 27, 2025 · 6 comments


Hi Andy -- we've been running Piko in Prod for a couple of weeks now, and have noticed that the services proxied by Piko will sometimes throw request aborted errors when responding to requests.

The error message is:

External BadRequestError: BadRequestError: request aborted
    at IncomingMessage.onAborted (/main/.yarn/cache/raw-body-npm-2.5.1-9dd1d9fff9-280bedc12d.zip/node_modules/raw-body/index.js:238:10)
    at IncomingMessage.emit (node:events:517:28)
    at IncomingMessage.emit (node:domain:489:12)
    at IncomingMessage._destroy (node:_http_incoming:224:10)
    at _destroy (node:internal/streams/destroy:109:10)
    at IncomingMessage.destroy (node:internal/streams/destroy:71:5)
    at abortIncoming (node:_http_server:781:9)
    at socketOnClose (node:_http_server:775:3)
    at TLSSocket.emit (node:events:529:35)
    at TLSSocket.emit (node:domain:489:12)
    at node:net:350:12
    at Socket.done (node:_tls_wrap:657:7)
    at Object.onceWrapper (node:events:632:26)
    at Socket.emit (node:events:517:28)
    at Socket.emit (node:domain:489:12)
    at TCP.<anonymous> (node:net:350:12)
    at TCP.callbackTrampoline (node:internal/async_hooks:128:17)

Our setup is:

  1. Service to be proxied running in ECS, with the Piko agent running as a process in the same container
  2. 2 or more instances of the same service run at any given time in our ECS service, and they all use the same endpoint-id to connect to our Piko cluster (hosted on EKS)
  3. Our backend attempts to contact the service through the Piko cluster, and about 5-10% of all requests end in request aborted issues
    1. We've noticed that while the NodeJS service throws a request aborted status message with an error code of 400, Piko converts that into a 502 upstream unreachable by the time the response is read back from the Piko cluster -- presumably due to this code (a rough sketch of the pattern follows this list)
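
To illustrate what we think is happening, here is a minimal `httputil.ReverseProxy` sketch of the pattern we assume the linked code follows (this is not Piko's actual code, and the addresses are made up). When the round trip to the upstream fails, the proxy never sees the upstream's own status code and can only return a gateway error:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Hypothetical upstream address; not our real service.
	upstream, err := url.Parse("http://localhost:3000")
	if err != nil {
		log.Fatal(err)
	}

	proxy := httputil.NewSingleHostReverseProxy(upstream)

	// If the round trip to the upstream fails (EOF, connection reset, ...),
	// the proxy never receives the upstream's 400 "request aborted" response,
	// so all it can report to the caller is a gateway error.
	proxy.ErrorHandler = func(w http.ResponseWriter, r *http.Request, err error) {
		log.Printf("proxy request: %v", err) // e.g. "EOF"
		http.Error(w, "upstream unreachable", http.StatusBadGateway)
	}

	log.Fatal(http.ListenAndServe(":8080", proxy))
}
```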

Things we've tried:

  1. Confirmed that request headers and content types are respected, including using Accept-Encoding: identity to stop Go's built-in HTTP transport from transparently re-encoding responses (see the sketch after this list)
  2. Increased the connection idle timeout on the Piko cluster's ingress to 120 seconds
  3. Confirmed that neither the services proxied by the Piko agent nor the Piko agent itself are dying; both are still up and running when the requests are abruptly aborted
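
For item 1, this is roughly what we mean on the Go client side; a minimal sketch with a made-up URL, not our actual client code:

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Option 1: disable the transport's transparent gzip handling entirely.
	client := &http.Client{
		Transport: &http.Transport{DisableCompression: true},
	}

	// Option 2: set Accept-Encoding explicitly; Go's transport does not add
	// its own gzip handling when the header is already present on the request.
	req, err := http.NewRequest(http.MethodGet, "http://localhost:8080/health", nil) // made-up URL
	if err != nil {
		panic(err)
	}
	req.Header.Set("Accept-Encoding", "identity")

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status, resp.Header.Get("Content-Encoding"))
}
```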

Do you have any thoughts around why such a thing might be happening? Please let me know if you'd like more details!

dipack95 changed the title from "Intermittent errors from proxied service" to "Intermittent request aborted from proxied service" on Feb 27, 2025

andydunstall commented Feb 28, 2025

Hey @dipack95, I'm not sure what "request aborted" means. Are you sure the upstream service responds with a valid 400 response to the agent? Rather than closing the connection or something, such that the agent never receives the response?

The Piko agent doesn't care what status code the upstream returns, so as long as it received a valid HTTP response it will be routed back to the client.

What do the agent logs show? You linked to this, which is only called if a network error occurs during the HTTP request, and logs what the error is (as a warning). Also, can you check whether the logs say the agent was disconnected from the cluster at any point?


dipack95 commented Feb 28, 2025

I'm not sure what "request aborted" means. Are you sure the upstream service responds with a valid 400 response to the agent? Rather than closing the connection or something, such that the agent never receives the response?

Yeah, what I think is happening here is that the connection between the service and the Piko agent is being broken such that the Piko agent never receives the response from downstream, hence the upstream unreachable that the Piko agent returns to the caller.

Piko agent log for `request aborted` response from service
{
      "msg": "request",
      "request": {
        "duration": "33.47156ms",
        "path": "redacted",
        "response_headers": {
          "X-Content-Type-Options": [
            "nosniff"
          ],
          "Content-Type": [
            "application/json"
          ]
        },
        "method": "POST",
        "proto": "HTTP/1.1",
        "host": "piko-cluster-url",
        "request_headers": {
          "X-Transcend-Token": [
            "redacted"
          ],
          "Accept": [
            "application/json"
          ],
          "X-Datadog-Parent-Id": [
            "redacted"
          ],
          "Connection": [
            "close"
          ],
          "X-Datadog-Trace-Id": [
            "redacted"
          ],
          "X-Forwarded-Proto": [
            "https"
          ],
          "X-Piko-Authorization": [
            "redacted"
          ],
          "Accept-Encoding": [
            "gzip"
          ],
          "X-Data-Silo-Type": [
            "redacted"
          ],
          "X-Forwarded-Port": [
            "8000"
          ],
          "X-Amzn-Trace-Id": [
            "redacted"
          ],
          "X-Sombra-Request-Id": [
            "redacted"
          ],
          "X-Piko-Forward": [
            "true"
          ],
          "X-Datadog-Sampling-Priority": [
            "2"
          ],
          "X-Sombra-Organization-Id": [
            "redacted"
          ],
          "X-Forwarded-For": [
            "redacted"
          ],
          "Content-Length": [
            "2868"
          ],
          "X-Request-Type": [
            "redacted"
          ],
          "X-Sombra-Organization-Uri": [
            "redacted"
          ],
          "Content-Type": [
            "application/json"
          ],
          "X-Data-Silo-Id": [
            "redacted"
          ],
          "X-Piko-Endpoint": [
            "redacted"
          ]
        },
        "status": 502
      },
      "level": "Error",
      "service": "service-name",
      "subsystem": "proxy.http.access",
      "source": "stderr",
      "timestamp": 1740762834954,
      "ts": "2025-02-28T17:13:54.954Z",
      "endpoint-id": "redacted"
}

There are also these really interesting log lines, logged just before the log line above:

[
  {
    "msg": "proxy request",
    "level": "Error",
    "service": "redacted",
    "subsystem": "proxy.http",
    "source": "stderr",
    "error": "EOF",
    "timestamp": 1740762834954,
    "ts": "2025-02-28T17:13:54.953Z",
    "endpoint-id": "redacted"
  },
  {
    "log": "",
    "level": "Error",
    "service": "redacted",
    "source": "stderr",
    "timestamp": 1740762834954
  }
]

Our agents are not configured to log any specific subsystems, so if there are any you recommend enabling that aren't already enabled by default, please let me know.

Also, can you check whether the logs say the agent was disconnected from the cluster at any point?

I've checked the logs, and the agent has not logged the upstream disconnected warning in that timeframe, leading me to think that the Piko agent is still alive and able to contact the cluster.

We've enabled the following --log.subsystems:

[
        'admin',
        'cluster',
        'proxy.access',
        'proxy.http',
        'proxy.tcp',
        'reporter',
        'server',
        'upstream',
]

Attached is the warning log from the Piko cluster for the same request

Piko Cluster log for the same request
{
    "host": "host-name",
    "service": "service-name",
    "message": "request",
    "attributes": {
      "request": {
        "duration": "35.620829ms",
        "path": "redacted",
        "response_headers": {
          "X-Content-Type-Options": [
            "nosniff"
          ],
          "Content-Length": [
            "33"
          ],
          "Date": [
            "Fri, 28 Feb 2025 17:13:54 GMT"
          ],
          "Content-Type": [
            "application/json"
          ]
        },
        "method": "POST",
        "proto": "HTTP/1.1",
        "host": "piko-cluster-url",
        "request_headers": {
          "X-Transcend-Token": [
            "redacted"
          ],
          "Accept": [
            "application/json"
          ],
          "X-Datadog-Parent-Id": [
            "redacted"
          ],
          "X-Datadog-Trace-Id": [
            "redacted"
          ],
          "X-Forwarded-Proto": [
            "https"
          ],
          "X-Piko-Authorization": [
            "redacted"
          ],
          "X-Data-Silo-Type": [
            "redacted"
          ],
          "X-Forwarded-Port": [
            "8000"
          ],
          "X-Amzn-Trace-Id": [
            "redacted"
          ],
          "X-Sombra-Request-Id": [
            "redacted"
          ],
          "X-Piko-Forward": [
            "true"
          ],
          "X-Datadog-Sampling-Priority": [
            "2"
          ],
          "X-Sombra-Organization-Id": [
            "redacted"
          ],
          "X-Forwarded-For": [
            "redacted"
          ],
          "Content-Length": [
            "2868"
          ],
          "X-Request-Type": [
            "redacted"
          ],
          "X-Sombra-Organization-Uri": [
            "redacted"
          ],
          "Content-Type": [
            "application/json"
          ],
          "X-Data-Silo-Id": [
            "redacted"
          ],
          "X-Piko-Endpoint": [
            "redacted"
          ]
        },
        "status": 502
      },
      "level": "warn",
      "subsystem": "proxy.access",
      "ts": "2025-02-28T17:13:54.955Z"
    }
}

@andydunstall (Owner)

Yeah, what I think is happening here is that the connection between the service and the Piko agent is being broken such that the Piko agent never receives the response from downstream, hence the upstream unreachable that the Piko agent returns to the caller.

I agree, in which case it doesn't sound like there is an issue with Piko (but instead an issue with the service)? So can I close this issue? (Or I might be misunderstanding?)


dipack95 commented Mar 6, 2025

@andydunstall I was wondering if you have ever experienced something similar with your usage of Piko!

Anyway, I plan on modifying the parameters here, and seeing if that makes a difference in the aborted connections we're seeing.

I will keep you posted here on what I find - I'd appreciate it if you could keep this issue open for a bit longer.

@andydunstall (Owner)

I was wondering if you have ever experienced something similar with your usage of Piko!

Nope, I haven't

Anyway, I plan on modifying the parameters here,

Those are for dial timeouts and idle connections? I'm not sure that will affect "active" connections? Plus if the issue is a timeout I'd expect a timeout error rather than EOF?

I'd appreciate it if you could keep this issue open for a bit longer

Yep sure!


dipack95 commented Mar 6, 2025

Those are for dial timeouts and idle connections? I'm not sure that will affect "active" connections? Plus if the issue is a timeout I'd expect a timeout error rather than EOF?

For context, the agent is used to proxy a service that usually responds within a few seconds but can frequently take a few minutes to respond synchronously. The Piko cluster is currently behind an AWS ALB, and AFAIK the ALB does not send keep-alive packets on open connections, which can be an issue for the aforementioned proxied service.

My hunch is that the low idle connection limit, and keep-alive timeouts in the Piko agent that are lower than what our service expects, might be causing some open connections to be terminated unceremoniously.
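
To make the plan concrete, these are the standard net/http transport knobs I believe the linked code configures (an illustrative sketch only; the values are guesses, not Piko's defaults):

```go
package main

import (
	"net"
	"net/http"
	"time"
)

func main() {
	// Illustrative values only; these are guesses, not Piko's actual defaults.
	transport := &http.Transport{
		DialContext: (&net.Dialer{
			Timeout:   30 * time.Second, // dial timeout to the upstream
			KeepAlive: 30 * time.Second, // TCP keep-alive probe interval
		}).DialContext,
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 100,
		IdleConnTimeout:     90 * time.Second, // pooled idle connections are closed after this
	}

	client := &http.Client{Transport: transport}
	_ = client // an agent would use this client/transport for its upstream requests
}
```

That said, as you note above, IdleConnTimeout only applies to pooled idle connections rather than in-flight requests, so it may not fully explain the EOFs; that's part of what I want to verify.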

I plan on taking a look at this next week, so hopefully I have an update for you one way or another!
