
How to deal with empty content-length #10

Open
9mm opened this issue Dec 17, 2018 · 45 comments

9mm commented Dec 17, 2018

Hey, I'm opening an issue because I saw a branch and left a comment on it, but I don't know whether comments trigger email notifications.

I'm wondering if you plan to merge the branch that handles the empty content-length + HEAD request case, as I'm dealing with a million timeouts on an API that returns no Content-Length when it errors.

petrohi (owner) commented Dec 17, 2018

Can you give an example of a request that is timing out?

9mm commented Dec 17, 2018

I'll log that when I get back from the DR.

9mm commented Dec 17, 2018

btw, the requests are all identical; it's the responses that are weird.

This may be a much deeper problem. I'm trying to scale to 25,000 QPS, and there's some crazy bottleneck somewhere that I can't figure out.

Memory and CPU seem fine, and inside Observer there is nothing with a high MsgQ or Reductions count.

The way I'm measuring the total request time is like this:

    # :timer.tc returns {elapsed_microseconds, result}
    {response_time_us, response} = :timer.tc(fn ->
      MachineGun.get(url,
        [{"accept", "*/*"}, {"user-agent", "Test/1.0"}],
        %{request_timeout: @request_timeout, pool_group: :app})
    end)

and

config :machine_gun, app: %{
  pool_size: 300,          # Poolboy size [1]
  pool_max_overflow: 1000, # Poolboy max_overflow [1]
  pool_timeout: 10_000,
  conn_opts: %{}           # Gun connection options [2]
}

The request timeout is set to 1000 ms.

I can see the time reported by :timer.tc go wildly crazy sometimes, but I'm 99.9% sure the API is not the issue (this is a super-high-volume API that handles literally millions of queries per second). I notice the timeouts get worse when my system is bottlenecked, which leads me to believe something is locking up or blocking. I've also noticed that the times reported by :timer.tc are longer than 1000 ms plus change (as much as 5000 ms, which causes Task.async to fail).

I've logged in when servers are under EXTREME load to see this. (This is contrary to what I said above, where it happens even when not under extreme load. The problem is that when I log in while NOT under extreme load, I can't see anything in Observer that would signal the system is loaded enough to warrant a timeout, so I also look when it's under very heavy load, hoping to exaggerate the issue enough to see it.) Here is when it's exaggerated:

[screenshot: Observer, 2018-12-17 11:11:54 AM]

[screenshot: Observer, 2018-12-17 11:12:01 AM]

But again, this also happens when Observer looks as innocent as a fly (as far as I can tell).

FWIW, this is how my API works:

  1. Requests come in at roughly 100,000 queries per minute, i.e. about 1,666 per second; spread across 10 nodes that is about 166 per second each... a reasonable number.
  2. To serve each of these requests I need to hit 3 separate API endpoints, get their data, ignore failed responses/timeouts, aggregate the data, and return the aggregation to the original request (so it's 3 outgoing requests inside of 1 incoming request). To achieve this I use Task.async/await, with machine_gun running in each task.
  3. I have a 1000 ms timeout, which means, to me, there is a fairly restricted upper limit on resource usage. Let's say 100% of the requests time out: that's 166 incoming requests * 3 outgoing requests = 498 outgoing requests per second; with a timeout of 1000 ms, it should never really go above ~500 in-flight processes. That seems "sane" and very scalable to me... but something is definitely not right here, and it's bottlenecking. To further prove this, I can enable an API that normally responds more slowly (a 4th endpoint), and when it times out a lot, it drags down the other endpoints and causes them to report timeouts too. So something is definitely happening "inside" gun or machine_gun that is affected by resources other than the API itself.
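A rough sketch of the fan-out in step 2 might look like this (hedged: the Aggregator module, fetch_all/1, and the use of Task.yield_many instead of plain Task.async/await are my illustration, not the actual code from this thread):

```elixir
# Hypothetical sketch of the fan-out described above: the three MachineGun
# requests run in parallel tasks; failures and timeouts are dropped.
defmodule Aggregator do
  @request_timeout 1_000

  def fetch_all(urls) do
    urls
    |> Enum.map(fn url ->
      Task.async(fn ->
        MachineGun.get(url,
          [{"accept", "*/*"}, {"user-agent", "Test/1.0"}],
          %{request_timeout: @request_timeout, pool_group: :app})
      end)
    end)
    # yield_many/2 waits at most @request_timeout for all tasks together,
    # so one slow endpoint cannot crash the caller the way await/1 can
    |> Task.yield_many(@request_timeout)
    |> Enum.flat_map(fn {task, result} ->
      case result || Task.shutdown(task, :brutal_kill) do
        {:ok, {:ok, response}} -> [response]  # successful HTTP response
        _failed_or_timed_out -> []            # ignore errors and timeouts
      end
    end)
  end
end
```

The aggregated list then only contains the endpoints that answered in time, matching the "ignore failed responses/timeouts" behaviour described in step 2.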

9mm commented Dec 17, 2018

To expand on this sentence: "but again, this also happens when observer looks as innocent as a fly (as far as i can tell)".

These requests come in at a very gentle ebb and flow every few minutes, +/- 5% traffic.

Right now, the traffic level is RIGHT on the cusp where timeouts start to happen. Yet when I look at usage, memory/CPU/scheduler all seem fine. (I'm new to Elixir and Erlang, so I could be misreading it, but MsgQ is 0 and Reductions is not an insane number; I'd be happy to show screenshots.)

I can watch the requests per second increase by 3-5% and suddenly timeouts start appearing. Because this is an API, all requests are formatted 100% the same way, if that makes a difference; it's not like certain 'user requests', which are all different, are failing.

When the request volume drops a few percent, the timeouts vanish. If I add servers when the volume is increased, they handle the SAME number of queries per second that previously timed out, without timing out. In other words, adding more servers at the same volume removes all timeouts, which proves to me it's not the actual API but something happening in either gun or the machine_gun wrapper.

9mm commented Dec 17, 2018

And as one final point: a small percentage of the requests return no body and no Content-Length (usually when there's a client-side error, a small fraction of the time). That alone doesn't come close to explaining what I described above, but I figured it wouldn't hurt to fix that as well, just to eliminate a variable in the root-cause search (who knows what kind of weird chain of events it could cause).

petrohi commented Dec 18, 2018

Issue #1 is about requests with an empty body, not responses, so it is not the problem in your case.

I do see the message queue for gun_sup at 152. Can you correlate the size of this message queue with the timeouts?

9mm commented Dec 18, 2018

When it's severely under load, it bounces all around, and it's hard to get a perfect snapshot because Observer changes every second... but when it's NOT under load (yet there's still a ton of timeouts) it looks like this... very calm (as far as I can tell):

[screenshot]

[screenshot]

If I sort by MsgQ there's literally not a single 1; they're all 0.

[screenshot]

Sorted by Reds:

[screenshot]

[screenshot]

[screenshot]

Here's an overview of my internal stats tracker, where I log every failure type and response time (using the timer code above), as well as all error codes returned from machine_gun:

[screenshot]

9mm commented Dec 18, 2018

I should also mention the average response time is 11 ms, so it makes logical sense that it would be weird for random requests to time out. It's because they're just a hair overloaded: when the volume drops to, say, ~2000 RPS vs. 2100-2500, there are 0 timeouts.

9mm commented Dec 18, 2018

OK, so just to reconfirm that I'm understanding how pool sizes work...

Right now I have these settings:

config :machine_gun, app: %{
  pool_size: 300,          # Poolboy size [1]
  pool_max_overflow: 5_000, # Poolboy max_overflow [1]
  pool_timeout: 500,
  conn_opts: %{}           # Gun connection options [2]
}

I'm getting 61 RPS on one machine, and it just got 23 pool timeouts. How does that math add up? The pool timeout value is set to 500 ms.

That means 61 * 3 = 183 outgoing requests per second, each of which times out halfway into the second. How would I ever overflow 5,000 workers?
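As a sanity check, the expected steady-state pool occupancy can be estimated with Little's law (concurrent workers ≈ arrival rate × hold time per request); this is just back-of-the-envelope math with the numbers quoted above:

```elixir
# Little's law estimate: occupancy = arrival_rate * average_hold_time
rate_per_second = 61 * 3          # 183 outgoing requests per second
worst_case_hold = 0.5             # each worker held at most ~500 ms
expected_occupancy = rate_per_second * worst_case_hold
# ~92 concurrent workers, nowhere near pool_size 300 + max_overflow 5_000
```

Which is why pool timeouts at this volume look so surprising.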

9mm commented Dec 18, 2018

I set it to 50,000 and it's still getting pool timeouts. Very weird.

petrohi commented Dec 18, 2018

Can you try one thing: define identical configurations for 4 pools (e.g. app0, app1, etc.), then use a random pool for each request. Do your numbers change?
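That suggestion might look something like this (a hedged sketch; the pool-group names and the wrapper module are hypothetical, not part of MachineGun itself):

```elixir
# Hypothetical sketch: spread requests across 4 identically configured
# pool groups to reduce contention on any single poolboy manager process.
defmodule RandomPool do
  @pool_groups [:app0, :app1, :app2, :app3]

  def get(url, headers, opts \\ %{}) do
    pool = Enum.random(@pool_groups)
    MachineGun.get(url, headers, Map.put(opts, :pool_group, pool))
  end
end
```

If the poolboy manager process is the bottleneck, sharding like this should reduce the pool timeouts even at the same overall request rate.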

9mm commented Dec 19, 2018

I did try that, and I don't perceive any difference.

9mm commented Dec 19, 2018

Is there a way to log when a connection is closed? Perhaps these endpoints prefer regular requests instead of keep-alive, and they're closing connections quickly, which results in poolboy continuously reconnecting.

9mm commented Dec 19, 2018

@petrohi, when you define multiple configurations, do you put them all in the same config block, or can I copy and paste the entire thing? I was wondering about this when I first added the library, but IMO you should raise a warning if the pool group doesn't exist. I'm thinking that this entire time it hasn't even been using the actual pool group, because I defined 2 groups in separate config blocks... maybe. I'm testing it now.

9mm commented Dec 19, 2018

OK... is there ANY other way to see what the current config options for machine_gun are? I've literally set my config to:

    pool_size: 3,         # Poolboy size [1]
    pool_max_overflow: 3, # Poolboy max_overflow [1]

and there's literally no change at 100 RPS... it's almost like it hasn't been reading my config this entire time. This is exactly how I have it written in config/config.exs:

config :machine_gun,
  api: %{
    pool_size: 30,           # Poolboy size [1]
    pool_max_overflow: 1_000, # Poolboy max_overflow [1]
    pool_timeout: 500,
    conn_opts: %{}            # Gun connection options [2]
  }

And how I use it:

  MachineGun.get(url,
    [{"accept", "*/*"}, {"user-agent", "Test/1.0"}],
    %{request_timeout: @request_timeout, pool_group: :api})

9mm commented Dec 19, 2018

OK, now I see pool timeouts if I put them both at 3; it took a minute. Regardless, even with max overflow set to 50_000, it's weird that I'm seeing pool timeouts. I don't see how it's possible to fill up that many workers at this volume.

petrohi commented Dec 20, 2018

You are setting the config correctly. One way to see the actual values is to go to deps/machine_gun in your project and insert IO.inspect calls in the places where you see pool_size, pool_timeout, etc. being used.
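For example, IO.inspect with a label makes the values easy to spot in the logs. The snippet below is a self-contained illustration, not MachineGun's actual source:

```elixir
# IO.inspect returns its argument unchanged, so it can wrap any expression
# inside deps/machine_gun without altering behavior.
config = %{pool_size: 30, pool_max_overflow: 1_000}

pool_size = IO.inspect(config[:pool_size], label: "machine_gun pool_size")
# logs: machine_gun pool_size: 30
```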

petrohi commented Dec 20, 2018

Defining separate sections, each starting with config :machine_gun, and putting all pool groups comma-separated under one section should produce identical configurations.
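In other words, these two forms should be equivalent (a sketch using the pool-group names already seen in this thread; the specific option values are illustrative):

```elixir
# Form 1: all pool groups under one config block
config :machine_gun,
  app: %{pool_size: 300, pool_max_overflow: 1_000, pool_timeout: 500},
  api: %{pool_size: 30, pool_max_overflow: 1_000, pool_timeout: 500}

# Form 2: separate config blocks; Mix merges later keys into the same app env
config :machine_gun,
  app: %{pool_size: 300, pool_max_overflow: 1_000, pool_timeout: 500}

config :machine_gun,
  api: %{pool_size: 30, pool_max_overflow: 1_000, pool_timeout: 500}
```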

petrohi commented Dec 20, 2018

My general point is that pool timeouts are not only a function of pool size. Poolboy manages pool state with a dedicated process, which means all operations on the pool are serialized. Even if the pool size is big enough, the rate of pool operations may overload the pool process itself.

Can you try this: determine the maximum rate you can serve with a single MachineGun pool group without getting pool timeouts. Assuming this rate does not saturate your CPU or network interface, add another pool group and double the rate. Do you get pool timeouts then?

9mm commented Dec 20, 2018

The pool timeouts are quite rare; that's not really the issue. That was mostly me brainstorming to make sure it was actually reading my config.

The main issue is actual/regular timeouts (machine_gun returns :error with timeout as the reason). The thing is, the API isn't actually timing out; it's machine_gun taking a long time to process something (like gun backing up), which makes it think the response timed out. It's as if the time it takes gun to respond is being counted against the timeout (or maybe gun is doing this).

I know it's not the API causing the problem, because I get fewer timeouts the fewer outbound requests I make, even when the volume to the API remains 100% the same.

petrohi commented Dec 21, 2018

So you get an error with the :request_timeout reason?

9mm commented Dec 21, 2018

Yes correct.

petrohi commented Dec 21, 2018

My suspicion is that the MachineGun worker gets blocked on starting a connection with the gun supervisor--this is the only place where it blocks. Do you see non-zero message queues on gun-related processes when these timeouts happen?

9mm commented Dec 21, 2018

@petrohi, that's what's weird. If it's REALLY bad, yes, I see backups from 2 things: inet_gethost_native as well as gun.

That said, this is part of what makes the issue so frustrating, as I wrote above: it doesn't have to be extremely bad for timeouts to occur.

90% of the time, when it's on the "cusp", I will get timeouts 20% of the time, and that will slowly increase until it DOES get bad and I see MsgQ backups.

But before it gets to that point, I can't find anything in Observer that would indicate a problem. MsgQ will be 0, Reductions will seem normal, and the scheduler won't be going crazy.

It's like an invisible bottleneck where the CPU just starts overloading.

The Gigalixir guy was thinking maybe inet_gethost_native is getting backed up on DNS lookups, but that still doesn't explain why MsgQ is 0 in Observer while there are 20-30% timeouts. MsgQ backups only start occurring at something like 80% timeouts.

9mm commented Dec 21, 2018

You know, I did see ninenines/gun#183.

I'm not sure if that's related, but he also mentions 100% CPU. And when this happens, I get a whole bunch of proc_lib_hack processes.

petrohi commented Dec 21, 2018

And setting request_timeout in config does not affect this behaviour in any way, correct?

9mm commented Dec 21, 2018

Well, right now it's at 500 ms, and I experience issues at ~300 requests per second, which is basically beans as far as performance goes (on an 8-vCPU instance with 4 GB of memory). I've tried playing with the timeout; I got it as low as 200 ms and still experienced issues, so I stopped fiddling with it, as there seems to be a deeper problem.

petrohi commented Dec 21, 2018

What if you make it much larger? Like 10 seconds?

9mm commented Dec 21, 2018

Lol that will be interesting. I'll try it.

9mm commented Dec 21, 2018

Does this configuration look correct? I'm not sure whether I set it up right, or which of these machine_gun overwrites. I'm mostly asking about the http_opts and opts keys of conn_opts:

  api: %{
    pool_size: 100,           # Poolboy size
    pool_max_overflow: 5_000, # Poolboy max_overflow
    pool_timeout: 500,
    conn_opts: %{             # Gun connection options (https://ninenines.eu/docs/en/gun/1.3/manual/gun/)
      http_opts: %{
        keepalive: 5000
      },
      opts: %{
        retry: 5,
        retry_timeout: 5000,
        trace: true
      }
    }
  }

9mm commented Dec 21, 2018

Correction: it seems it should be like this:

config :machine_gun,
  api: %{
    pool_size: 100,           # Poolboy size
    pool_max_overflow: 5000, # Poolboy max_overflow
    pool_timeout: 500,
    conn_opts: %{             # Gun connection options (https://ninenines.eu/docs/en/gun/1.3/manual/gun/)
      retry: 5,
      retry_timeout: 5000,
      trace: false,
      http_opts: %{
        keepalive: 5000
      }
    }
  }

petrohi commented Dec 21, 2018

Try adding request_timeout: 10_000 under pool_timeout: 500.
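Applied to the config shown earlier in the thread, that would look roughly like this (a sketch reusing the values already posted above):

```elixir
config :machine_gun,
  api: %{
    pool_size: 100,           # Poolboy size
    pool_max_overflow: 5_000, # Poolboy max_overflow
    pool_timeout: 500,
    request_timeout: 10_000,  # default per-request timeout, overridable per call
    conn_opts: %{}            # Gun connection options
  }
```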

9mm commented Dec 21, 2018

Great idea about increasing the timeout. That made some things really stand out as wrong.

I first moved everything back to local so I don't have to deploy. I enabled my 3 APIs. I set the MachineGun request timeout to 10 s.

I then refreshed 2-3 times a second. This is, quite literally, 2-3 RPS.

I would see a 100 ms response time and then bam, a VERY long response... up to 6 seconds, from just one of the APIs (which makes the whole request take long, because it uses async/await to wait on the longest of the 3 requests).

I do a few more refreshes and then bam, another 6000 ms response where it hangs.

If I take that EXACT URL and put it in my browser and refresh it there, it NEVER has a delay.

petrohi commented Dec 21, 2018

So it is occasionally slow with that one API and never slow with the others?

Can you try forcing HTTP/1 just in case this is some bug in HTTP/2 implementation:

...
conn_opts: %{ 
  protocols: [:http],
  ...

9mm commented Dec 21, 2018

I tried forcing [:http], and it doesn't make a difference.

Only that 1 endpoint has a delay that long (5000 ms), and it's the slowest of the 3 endpoints. BUT I think the issue isn't with that one endpoint, but rather (maybe) with slower responses in general. Perhaps the slower the endpoint, the more it exaggerates the issue. The problem still occurs in production with that slow endpoint disabled, but I notice it gets way worse, way more quickly, when that endpoint is enabled. I think that locally, right now while testing, it's not as pronounced because the 2 remaining APIs are faster. I think the problem still applies to those other 2, but because they almost always respond in 100 ms, it doesn't appear until production, where they're under very heavy load, and then it appears again.

This explains why things report as "timing out" even though the API seems fine and they're not "actually" timing out.

9mm commented Dec 21, 2018

Here are the response times (in ms) from refreshing the URL directly in Chrome:

244, 102, 207, 243, 208, 211, 255, 149, 202, 276, 102

And here they are when using machine_gun, with the timeout set to 10_000 ms:

4636, 97, 459, 101, 102, 4690, 4538, 102

9mm commented Dec 21, 2018

So yes, this makes it very clear why things are reporting timeouts when they aren't really timing out.

Those 2 APIs I mentioned with a response time of 100-200 ms: if I put my localhost under a light load with Apache Bench, I can get them to report a 400 ms response time, similar to my example above, but less pronounced.

That lines up with what I said above about something exaggerating the issue on that one endpoint. Maybe it's the response time or something else, I'm not sure, but the issue applies to all endpoints, not just that one.

9mm commented Dec 21, 2018

Also, I should point out that if I set pool_size to 1, reboot Phoenix, and refresh 1-2 times, it happens right off the bat, which leads me to believe that changing the pool size or fiddling with pool settings wouldn't really affect it.

9mm commented Dec 21, 2018

Here's a normal request when all 3 endpoints return fast:

[screenshot: trace of a fast request]

Here is what it looks like when 1 takes 5 seconds. I wish I knew how to read this, lol:

[screenshot: trace of a slow request]

9mm commented Dec 21, 2018

Another bad one, for good measure:

[screenshot: trace of another slow request]

9mm commented Dec 21, 2018

If you wanted me to put a debug log line somewhere in your library to help track down whether it's on your side or in gun, where should I put it?

petrohi commented Dec 25, 2018

These graphs are very helpful--thanks. I will look into them in a few days.

petrohi commented Dec 26, 2018

Can you share an example cURL command for that API?

9mm commented Dec 26, 2018

I could send it privately if you tell me your email

petrohi commented Dec 28, 2018

peter dot hizalev at gmail
