
How to deal with empty content-length #10

Open
9mm opened this issue Dec 17, 2018 · 45 comments

9mm commented Dec 17, 2018

Hey, I'm opening an issue because I saw a branch and left a comment on it, but I don't know whether comments trigger email notifications.

I'm wondering if you plan to merge the branch that handles the empty content-length + HEAD request case, as I'm dealing with a million timeouts on an API that returns no Content-Length when it errors.

petrohi (owner) commented Dec 17, 2018

Can you give an example of a request that is timing out?

9mm commented Dec 17, 2018

I'll log that when I get back from the DR.

9mm commented Dec 17, 2018

btw, the requests are all identical; it's the responses that are weird.

This may be a much deeper problem. I'm trying to scale to 25,000 QPS, and there's some crazy bottleneck somewhere that I can't figure out.

Memory and CPU seem fine, and inside Observer there is nothing with a high MsgQ or Reductions count.

The way I'm measuring the total request time is like this:

    # :timer.tc returns {elapsed_microseconds, result}
    {response_time_us, response} = :timer.tc(fn ->
      MachineGun.get(url,
        [{"accept", "*/*"}, {"user-agent", "Test/1.0"}],
        %{request_timeout: @request_timeout, pool_group: :app})
    end)

and

config :machine_gun, app: %{
  pool_size: 300,          # Poolboy size [1]
  pool_max_overflow: 1000, # Poolboy max_overflow [1]
  pool_timeout: 10_000,
  conn_opts: %{}           # Gun connection options [2]
}

The request timeout is set to 1000 ms.

I can see the time reported by :timer.tc go wildly crazy sometimes, but I'm 99.9% sure the API is not the issue (this is a super-high-volume API that handles literally millions of queries per second). I notice the timeouts get worse when my system is bottlenecked, which leads me to believe something is locking up or blocking. I've also noticed that the times reported by :timer.tc are longer than 1000 ms plus change (as much as 5000 ms, which causes Task.async to fail).

I've logged in when servers are under EXTREME load to see this. (This is contrary to what I said above, where it happens even when not under extreme load. The problem is that when I log in while NOT under extreme load, I can't see anything in Observer that would signal the system is loaded enough to warrant a timeout, so I also look when it's under very heavy load, hoping to exaggerate the issue enough to see it.) Here is when it's exaggerated:

[screenshot: Observer, 2018-12-17 11:11:54 AM]

[screenshot: Observer, 2018-12-17 11:12:01 AM]

But again, this also happens when Observer looks as innocent as a fly (as far as I can tell).

FWIW, this is how my API works:

  1. Requests come in at roughly 100,000 queries per minute, i.e. about 1,666 per second; spread across 10 nodes that is about 166 per second each... a reasonable number.
  2. To serve each of these requests I need to hit 3 separate API endpoints, get their data, ignore failed responses/timeouts, aggregate the data, and return the aggregation to the original request (so it's 3 outgoing requests inside of 1 incoming request). To achieve this I use Task.async/await, with machine_gun running in each task.
  3. I have a 1000 ms timeout, which means, to me, there is a fairly restricted upper limit on resource usage. Let's say 100% of the requests time out: that's 166 incoming requests * 3 outgoing requests = 498 outgoing requests per second; with a timeout of 1000 ms, it should never really go above ~500 in-flight processes. That seems "sane" and very scalable to me... but something is definitely not right here, and it's bottlenecking. To further prove this, I can enable an API that normally responds more slowly (a 4th endpoint), and when it times out a lot, it drags down the other endpoints and causes them to report timeouts too. So something is definitely happening "inside" gun or machine_gun that is affected by resources other than the API itself.
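A rough sketch of the fan-out in step 2 might look like this (hedged: the Aggregator module, fetch_all/1, and the use of Task.yield_many instead of plain Task.async/await are my illustration, not the actual code from this thread):

```elixir
# Hypothetical sketch of the fan-out described above: the three MachineGun
# requests run in parallel tasks; failures and timeouts are dropped.
defmodule Aggregator do
  @request_timeout 1_000

  def fetch_all(urls) do
    urls
    |> Enum.map(fn url ->
      Task.async(fn ->
        MachineGun.get(url,
          [{"accept", "*/*"}, {"user-agent", "Test/1.0"}],
          %{request_timeout: @request_timeout, pool_group: :app})
      end)
    end)
    # yield_many/2 waits at most @request_timeout for all tasks together,
    # so one slow endpoint cannot crash the caller the way await/1 can
    |> Task.yield_many(@request_timeout)
    |> Enum.flat_map(fn {task, result} ->
      case result || Task.shutdown(task, :brutal_kill) do
        {:ok, {:ok, response}} -> [response]  # successful HTTP response
        _failed_or_timed_out -> []            # ignore errors and timeouts
      end
    end)
  end
end
```

The aggregated list then only contains the endpoints that answered in time, matching the "ignore failed responses/timeouts" behaviour described in step 2.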

9mm commented Dec 17, 2018

To expand on this sentence: "but again, this also happens when observer looks as innocent as a fly (as far as i can tell)".

These requests come in at a very gentle ebb and flow every few minutes, +/- 5% traffic.

Right now, the traffic level is RIGHT on the cusp where timeouts start to happen. Yet when I look at usage, memory/CPU/scheduler all seem fine. (I'm new to Elixir and Erlang, so I could be misreading it, but MsgQ is 0 and Reductions is not an insane number; I'd be happy to show screenshots.)

I can watch the requests per second increase by 3-5% and suddenly timeouts start appearing. Because this is an API, all requests are formatted 100% the same way, if that makes a difference; it's not like certain 'user requests', which are all different, are failing.

When the request volume drops a few percent, the timeouts vanish. If I add servers when the volume is increased, they handle the SAME number of queries per second that previously timed out, without timing out. In other words, adding more servers at the same volume removes all timeouts, which proves to me it's not the actual API but something happening in either gun or the machine_gun wrapper.

9mm commented Dec 17, 2018

And as one final point: a small percentage of the requests return no body and no Content-Length (usually when there's a client-side error, a small fraction of the time). That alone doesn't come close to explaining what I described above, but I figured it wouldn't hurt to fix that as well, just to eliminate a variable in the root-cause search (who knows what kind of weird chain of events it could cause).

petrohi commented Dec 18, 2018

Issue #1 is about requests with an empty body, not responses, so it is not the problem in your case.

I do see the message queue for gun_sup at 152. Can you correlate the size of this message queue with the timeouts?

9mm commented Dec 18, 2018

When it's severely under load, it bounces all around, and it's hard to get a perfect snapshot because Observer changes every second... but when it's NOT under load (yet there's still a ton of timeouts) it looks like this... very calm (as far as I can tell):

[screenshot]

[screenshot]

If I sort by MsgQ there's literally not a single 1; they're all 0.

[screenshot]

Sorted by Reds:

[screenshot]

[screenshot]

[screenshot]

Here's an overview of my internal stats tracker, where I log every failure type and response time (using the timer code above), as well as all error codes returned from machine_gun:

[screenshot]

9mm commented Dec 18, 2018

I should also mention the average response time is 11 ms, so it makes logical sense that it would be weird for random requests to time out. It's because they're just a hair overloaded: when the volume drops to, say, ~2000 RPS vs. 2100-2500, there are 0 timeouts.

9mm commented Dec 18, 2018

OK, so just to reconfirm that I'm understanding how pool sizes work...

Right now I have these settings:

config :machine_gun, app: %{
  pool_size: 300,          # Poolboy size [1]
  pool_max_overflow: 5_000, # Poolboy max_overflow [1]
  pool_timeout: 500,
  conn_opts: %{}           # Gun connection options [2]
}

I'm getting 61 RPS on one machine, and it just got 23 pool timeouts. How does that math add up? The pool timeout value is set to 500 ms.

That means 61 * 3 = 183 outgoing requests per second, each of which times out halfway into the second. How would I ever overflow 5,000 workers?
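As a sanity check, the expected steady-state pool occupancy can be estimated with Little's law (concurrent workers ≈ arrival rate × hold time per request); this is just back-of-the-envelope math with the numbers quoted above:

```elixir
# Little's law estimate: occupancy = arrival_rate * average_hold_time
rate_per_second = 61 * 3          # 183 outgoing requests per second
worst_case_hold = 0.5             # each worker held at most ~500 ms
expected_occupancy = rate_per_second * worst_case_hold
# ~92 concurrent workers, nowhere near pool_size 300 + max_overflow 5_000
```

Which is why pool timeouts at this volume look so surprising.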

9mm commented Dec 18, 2018

I set it to 50,000 and it's still getting pool timeouts. Very weird.

petrohi commented Dec 18, 2018

Can you try one thing: define identical configurations for 4 pools (e.g. app0, app1, etc.), then use a random pool for each request. Do your numbers change?
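That suggestion might look something like this (a hedged sketch; the pool-group names and the wrapper module are hypothetical, not part of MachineGun itself):

```elixir
# Hypothetical sketch: spread requests across 4 identically configured
# pool groups to reduce contention on any single poolboy manager process.
defmodule RandomPool do
  @pool_groups [:app0, :app1, :app2, :app3]

  def get(url, headers, opts \\ %{}) do
    pool = Enum.random(@pool_groups)
    MachineGun.get(url, headers, Map.put(opts, :pool_group, pool))
  end
end
```

If the poolboy manager process is the bottleneck, sharding like this should reduce the pool timeouts even at the same overall request rate.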

9mm commented Dec 19, 2018

I did try that, and I don't perceive any difference.

9mm commented Dec 19, 2018

Is there a way to log when a connection is closed? Perhaps these endpoints prefer regular requests instead of keep-alive, and they're closing connections quickly, which results in poolboy continuously reconnecting.

9mm commented Dec 19, 2018

@petrohi, when you define multiple configurations, do you put them all in the same config block, or can I copy and paste the entire thing? I was wondering about this when I first added the library, but IMO you should raise a warning if the pool group doesn't exist. I'm thinking that this entire time it hasn't even been using the actual pool group, because I defined 2 groups in separate config blocks... maybe. I'm testing it now.

9mm commented Dec 19, 2018

OK... is there ANY other way to see what the current config options for machine_gun are? I've literally set my config to:

    pool_size: 3,         # Poolboy size [1]
    pool_max_overflow: 3, # Poolboy max_overflow [1]

and there's literally no change at 100 RPS... it's almost like it hasn't been reading my config this entire time. This is exactly how I have it written in config/config.exs:

config :machine_gun,
  api: %{
    pool_size: 30,           # Poolboy size [1]
    pool_max_overflow: 1_000, # Poolboy max_overflow [1]
    pool_timeout: 500,
    conn_opts: %{}            # Gun connection options [2]
  }

And how I use it:

  MachineGun.get(url,
    [{"accept", "*/*"}, {"user-agent", "Test/1.0"}],
    %{request_timeout: @request_timeout, pool_group: :api})

9mm commented Dec 19, 2018

OK, now I see pool timeouts if I put them both at 3; it took a minute. Regardless, even with max overflow set to 50_000, it's weird that I'm seeing pool timeouts. I don't see how it's possible to fill up that many workers at this volume.

petrohi commented Dec 20, 2018

You are setting the config correctly. One way to see the actual values is to go to deps/machine_gun in your project and insert IO.inspect calls in the places where you see pool_size, pool_timeout, etc. being used.
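For example, IO.inspect with a label makes the values easy to spot in the logs. The snippet below is a self-contained illustration, not MachineGun's actual source:

```elixir
# IO.inspect returns its argument unchanged, so it can wrap any expression
# inside deps/machine_gun without altering behavior.
config = %{pool_size: 30, pool_max_overflow: 1_000}

pool_size = IO.inspect(config[:pool_size], label: "machine_gun pool_size")
# logs: machine_gun pool_size: 30
```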

petrohi commented Dec 20, 2018

Defining separate sections, each starting with config :machine_gun, and putting all pool groups comma-separated under one section should produce identical configurations.
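In other words, these two forms should be equivalent (a sketch using the pool-group names already seen in this thread; the specific option values are illustrative):

```elixir
# Form 1: all pool groups under one config block
config :machine_gun,
  app: %{pool_size: 300, pool_max_overflow: 1_000, pool_timeout: 500},
  api: %{pool_size: 30, pool_max_overflow: 1_000, pool_timeout: 500}

# Form 2: separate config blocks; Mix merges later keys into the same app env
config :machine_gun,
  app: %{pool_size: 300, pool_max_overflow: 1_000, pool_timeout: 500}

config :machine_gun,
  api: %{pool_size: 30, pool_max_overflow: 1_000, pool_timeout: 500}
```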

petrohi commented Dec 20, 2018

My general point is that pool timeouts are not only a function of pool size. Poolboy manages pool state with a dedicated process, which means all operations on the pool are serialized. Even if the pool size is big enough, the rate of pool operations may overload the pool process itself.

Can you try this: determine the maximum rate you can serve with a single MachineGun pool group without getting pool timeouts. Assuming this rate does not saturate your CPU or network interface, add another pool group and double the rate. Do you get pool timeouts then?

9mm commented Dec 20, 2018

The pool timeouts are quite rare; that's not really the issue. That was mostly me brainstorming to make sure it was actually reading my config.

The main issue is actual/regular timeouts (machine_gun returns :error with timeout as the reason). The thing is, the API isn't actually timing out; it's machine_gun taking a long time to process something (like gun backing up), which makes it think the response timed out. It's as if the time it takes gun to respond is being counted against the timeout (or maybe gun is doing this).

I know it's not the API causing the problem, because I get fewer timeouts the fewer outbound requests I make, even when the volume to the API remains 100% the same.

petrohi commented Dec 21, 2018

So you get an error with the :request_timeout reason?

9mm commented Dec 21, 2018

Yes correct.

petrohi commented Dec 21, 2018

My suspicion is that the MachineGun worker gets blocked on starting a connection with the gun supervisor--this is the only place where it blocks. Do you see non-zero message queues on gun-related processes when these timeouts happen?

9mm commented Dec 21, 2018

@petrohi, that's what's weird. If it's REALLY bad, yes, I see backups from 2 things: inet_gethost_native as well as gun.

That said, this is part of what makes the issue so frustrating, as I wrote above: it doesn't have to be extremely bad for timeouts to occur.

90% of the time, when it's on the "cusp", I will get timeouts 20% of the time, and that will slowly increase until it DOES get bad and I see MsgQ backups.

But before it gets to that point, I can't find anything in Observer that would indicate a problem. MsgQ will be 0, Reductions will seem normal, and the scheduler won't be going crazy.

It's like an invisible bottleneck where the CPU just starts overloading.

The Gigalixir guy was thinking maybe inet_gethost_native is getting backed up on DNS lookups, but that still doesn't explain why MsgQ is 0 in Observer while there are 20-30% timeouts. MsgQ backups only start occurring at something like 80% timeouts.

9mm commented Dec 21, 2018

You know, I did see ninenines/gun#183.

I'm not sure if that's related, but he also mentions 100% CPU. And when this happens, I get a whole bunch of proc_lib_hack processes.

petrohi commented Dec 21, 2018

And setting request_timeout in config does not affect this behaviour in any way, correct?

9mm commented Dec 21, 2018

Well, right now it's at 500 ms, and I experience issues at ~300 requests per second, which is basically beans as far as performance goes (on an 8-vCPU instance with 4 GB of memory). I've tried playing with the timeout; I got it as low as 200 ms and still experienced issues, so I stopped fiddling with it, as there seems to be a deeper problem.

petrohi commented Dec 21, 2018

What if you make it much larger? Like 10 seconds?

9mm commented Dec 21, 2018

Lol that will be interesting. I'll try it.

9mm commented Dec 21, 2018

Does this configuration look correct? I'm not sure whether I set it up right, or which of these machine_gun overwrites. I'm mostly asking about the http_opts and opts keys of conn_opts:

  api: %{
    pool_size: 100,           # Poolboy size
    pool_max_overflow: 5_000, # Poolboy max_overflow
    pool_timeout: 500,
    conn_opts: %{             # Gun connection options (https://ninenines.eu/docs/en/gun/1.3/manual/gun/)
      http_opts: %{
        keepalive: 5000
      },
      opts: %{
        retry: 5,
        retry_timeout: 5000,
        trace: true
      }
    }
  }

9mm commented Dec 21, 2018

Correction: it seems it should be like this:

config :machine_gun,
  api: %{
    pool_size: 100,           # Poolboy size
    pool_max_overflow: 5000, # Poolboy max_overflow
    pool_timeout: 500,
    conn_opts: %{             # Gun connection options (https://ninenines.eu/docs/en/gun/1.3/manual/gun/)
      retry: 5,
      retry_timeout: 5000,
      trace: false,
      http_opts: %{
        keepalive: 5000
      }
    }
  }

petrohi commented Dec 21, 2018

Try adding request_timeout: 10_000 under pool_timeout: 500.
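Applied to the config shown earlier in the thread, that would look roughly like this (a sketch reusing the values already posted above):

```elixir
config :machine_gun,
  api: %{
    pool_size: 100,           # Poolboy size
    pool_max_overflow: 5_000, # Poolboy max_overflow
    pool_timeout: 500,
    request_timeout: 10_000,  # default per-request timeout, overridable per call
    conn_opts: %{}            # Gun connection options
  }
```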

9mm commented Dec 21, 2018

Great idea about increasing the timeout. That made some things really stand out as wrong.

I first moved everything back to local so I don't have to deploy. I enabled my 3 APIs. I set the MachineGun request timeout to 10 s.

I then refreshed 2-3 times a second. This is, quite literally, 2-3 RPS.

I would see a 100 ms response time and then bam, a VERY long response... up to 6 seconds, from just one of the APIs (which makes the whole request take long, because it uses async/await to wait on the longest of the 3 requests).

I do a few more refreshes and then bam, another 6000 ms response where it hangs.

If I take that EXACT URL and put it in my browser and refresh it there, it NEVER has a delay.

petrohi commented Dec 21, 2018

So it is occasionally slow with that one API and never slow with the others?

Can you try forcing HTTP/1 just in case this is some bug in HTTP/2 implementation:

...
conn_opts: %{ 
  protocols: [:http],
  ...

9mm commented Dec 21, 2018

I tried forcing [:http], and it doesn't make a difference.

Only that 1 endpoint has a delay that long (5000 ms), and it's the slowest of the 3 endpoints. BUT I think the issue isn't with that one endpoint, but rather (maybe) with slower responses in general. Perhaps the slower the endpoint, the more it exaggerates the issue. The problem still occurs in production with that slow endpoint disabled, but I notice it gets way worse, way more quickly, when that endpoint is enabled. I think that locally, right now while testing, it's not as pronounced because the 2 remaining APIs are faster. I think the problem still applies to those other 2, but because they almost always respond in 100 ms, it doesn't appear until production, where they're under very heavy load, and then it appears again.

This explains why things report as "timing out" even though the API seems fine and they're not "actually" timing out.

9mm commented Dec 21, 2018

Here are the response times (in ms) from refreshing the URL directly in Chrome:

244, 102, 207, 243, 208, 211, 255, 149, 202, 276, 102

And here they are when using machine_gun, with the timeout set to 10_000 ms:

4636, 97, 459, 101, 102, 4690, 4538, 102

9mm commented Dec 21, 2018

So yes, this makes it very clear why things are reporting timeouts when they aren't really timing out.

Those 2 APIs I mentioned with a response time of 100-200 ms: if I put my localhost under a light load with Apache Bench, I can get them to report a 400 ms response time, similar to my example above, but less pronounced.

That lines up with what I said above about something exaggerating the issue on that one endpoint. Maybe it's the response time or something else, I'm not sure, but the issue applies to all endpoints, not just that one.

9mm commented Dec 21, 2018

Also, I should point out that if I set pool_size to 1, reboot Phoenix, and refresh 1-2 times, it happens right off the bat, which leads me to believe that changing the pool size or fiddling with pool settings wouldn't really affect it.

9mm commented Dec 21, 2018

Here's a normal request when all 3 endpoints return fast:

[screenshot: trace of a fast request]

Here is what it looks like when 1 takes 5 seconds. I wish I knew how to read this, lol:

[screenshot: trace of a slow request]

9mm commented Dec 21, 2018

Another bad one, for good measure:

[screenshot: trace of another slow request]

9mm commented Dec 21, 2018

If you wanted me to put a debug log line somewhere in your library to help track down whether it's on your side or in gun, where should I put it?

petrohi commented Dec 25, 2018

These graphs are very helpful--thanks. I will look into them in a few days.

petrohi commented Dec 26, 2018

Can you share an example cURL command for that API?

9mm commented Dec 26, 2018

I could send it privately if you tell me your email

petrohi commented Dec 28, 2018

peter dot hizalev at gmail
