How to deal with empty content-length #10
Can you give an example of a request that is timing out? |
I'll log that when I get back from the DR. |
BTW the requests are all identical; it's the responses that are weird. This may be a much deeper problem. I'm trying to scale to 25,000 QPS and there's some crazy bottleneck somewhere that I can't figure out. Memory and CPU seem fine, and inside observer there is nothing with a high MsgQ or Reductions. The way I'm measuring total time is like this:
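(The original snippet isn't preserved; a minimal sketch of this kind of measurement, assuming `:timer.tc/1` wrapped around a `MachineGun.get/3` call with a placeholder URL, might look like this:)

```elixir
# Not the author's actual code: wrap the MachineGun call in :timer.tc/1 so the
# measurement covers the whole client round trip (pool checkout included),
# not just the remote API's own latency.
{micros, result} =
  :timer.tc(fn ->
    MachineGun.get("https://api.example.com/endpoint", [])
  end)

IO.puts("request took #{div(micros, 1000)}ms (#{inspect(elem(result, 0))})")
```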
Response timeout is set to 1000ms. I can see that the time reported by that measurement sometimes goes wildly crazy, but I'm 99.9% sure the API is not the issue (this is a super-high-volume API that handles literally millions of queries per second). I notice the timeouts get worse when my system is bottlenecked, which leads me to believe something is locking it up or blocking. I've also noticed timeouts reported when I've logged in while servers are under EXTREME load to watch for this (which is contrary to what I said above about it happening when not under extreme load; the problem is that when I log in when NOT under extreme load, I can't see anything in observer that would signal the system is under load and would warrant a timeout, so I also look when it's under very heavy load to hopefully exaggerate the issue enough to see it). Here is when it's exaggerated: but again, this also happens when observer looks as innocent as a fly (as far as I can tell). FWIW this is how my API works:
|
To expand on that sentence: these requests come in at a very gentle ebb and flow every few minutes, +/- 5% traffic. Right now the traffic level is RIGHT on the cusp where timeouts start to happen. Yet when I look at usage, memory/CPU/scheduler seem fine (I'm new to Elixir and Erlang, so I could be misreading it, but MsgQ is 0 and reductions are not an insane number; I'd be happy to show screenshots). I can watch the requests per second increase by 3-5% and suddenly timeouts start appearing. Because this is an API, all requests are formatted 100% the same way, if that makes a difference; it's not like certain 'user requests', which are all different, are failing. When the request volume drops a few %, the timeouts vanish. And if I add servers when the volume is up, they handle the SAME number of queries per second that was timing out before, without timing out. In other words, adding more servers at the same volume removes all timeouts, which proves to me it's not the actual API but something happening in either gun or the machine_gun wrapper. |
And as one final point, a small % of the requests return no body and no content-length (if there's a client-side error, usually a small fraction of the time). That alone doesn't come close to explaining what I described above, but I figured it wouldn't hurt to fix it as well, just to eliminate a variable in the hunt for the root cause (who knows what kind of weird chain of events it could cause). |
Issue #1 is about requests with an empty body, not responses, so it's not an issue in your case. I do see a message queue for |
When it's severely under load it bounces all around, and it's kind of hard to get a perfect snapshot because observer changes every second... but when it's NOT under load (yet there are still a ton of timeouts) it looks like this... very calm (as far as I can tell): If I sort by reductions: Here's an overview of my internal stats tracker, where I log every failure type and response time (using the timer code above) as well as all error codes returned from machine_gun: |
I should also mention the average response time is 11ms, so it makes logical sense that it would be weird for some random ones to be timing out. It's because they're just a HAIR overloaded: when the volume drops to, say, ~2000 RPS vs 2100-2500, there are 0 timeouts. |
OK, so just to reconfirm that I understand how pool sizes work... Right now I have these settings:
I'm getting 61 RPS on one machine and it just got 23 pool timeouts. How does that math add up? The timeout value is set to 500ms; that means 61 * 3 = 183 outgoing requests per second (each incoming request fans out to my 3 APIs), each of which gives up halfway into the second. How would I ever overflow 5000? |
I set it to 50,000 and it's still getting pool timeouts. Very weird. |
Can you try one thing: define identical configurations for 4 pools (e.g. app0, app1, etc.), then use a random pool when requesting. Do your numbers change? |
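(A sketch of that experiment, assuming the pool keys used elsewhere in this thread (pool_size, pool_max_overflow, pool_timeout, request_timeout) and a per-request pool_group option; the sizes below are placeholders:)

```elixir
# config/config.exs -- four identical pool groups (placeholder sizes)
config :machine_gun,
  app0: %{pool_size: 16, pool_max_overflow: 16, pool_timeout: 1_000, request_timeout: 1_000},
  app1: %{pool_size: 16, pool_max_overflow: 16, pool_timeout: 1_000, request_timeout: 1_000},
  app2: %{pool_size: 16, pool_max_overflow: 16, pool_timeout: 1_000, request_timeout: 1_000},
  app3: %{pool_size: 16, pool_max_overflow: 16, pool_timeout: 1_000, request_timeout: 1_000}
```

```elixir
# At the call site, pick one of the four groups at random for each request.
pool_group = :"app#{:rand.uniform(4) - 1}"
MachineGun.get("https://api.example.com/endpoint", [], %{pool_group: pool_group})
```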
I did try that and I don't perceive any difference. |
Is there a way to log when a connection is closed? Perhaps these endpoints prefer regular requests instead of keep-alive and are closing connections quickly, which results in poolboy continuously reconnecting. |
@petrohi when you define multiple configurations, do you do it all in the same config block, or can I copy and paste the entire thing? I was thinking about this when I first added the library, but IMO you should raise a warning if the pool group doesn't exist. I'm thinking that this entire time it hasn't even been using the actual pool group, because I defined 2 groups in separate config blocks... maybe. I'm testing it now. |
OK... is there ANY other way to see what the current config options for machine_gun are? I've literally set my config to...
And there's literally no change at 100 RPS... it's almost like it hasn't been reading my config this entire time. This is exactly how I have it written in
And how I use it:
|
OK, now I see pool timeouts if I put them both at 3; it took a minute. Regardless, even if it's set to a max overflow of 50_000, it's weird that I'm seeing pool timeouts. I don't see how it's possible to fill up that many workers at this volume. |
You are setting the config correctly. One way to see what the actual values are is to go to deps/machine_gun in your project and insert |
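(Another option, without editing deps: dump what the application environment actually holds at runtime, e.g. from a remote console on the running node:)

```elixir
# Print whatever machine_gun will read from its application environment.
IO.inspect(Application.get_all_env(:machine_gun), label: "machine_gun config")
```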
Defining separate sections starting with |
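(If I understand standard Mix config merging correctly, repeated `config :machine_gun, ...` calls are merged key by key, with later calls only overriding the same key, so both of the following forms should leave app0 and app1 defined:)

```elixir
# Two separate blocks...
config :machine_gun, app0: %{pool_size: 16, pool_max_overflow: 16}
config :machine_gun, app1: %{pool_size: 16, pool_max_overflow: 16}

# ...end up equivalent to one block defining both groups:
config :machine_gun,
  app0: %{pool_size: 16, pool_max_overflow: 16},
  app1: %{pool_size: 16, pool_max_overflow: 16}
```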
My general point is that pool timeouts are not only a function of pool size. Poolboy manages pool state with a dedicated process, which means all operations on the pool are serialized. Even if the pool size is big enough, the rate of pool operations may overload the pool process. Can you try this: determine the maximum rate you can serve with a single MachineGun pool group without getting pool timeouts. Assuming this rate is not saturating your CPU or network interface, add another pool group and double the rate. Do you get pool timeouts now? |
The pool timeouts are quite rare; that's not really the issue. That was mostly me brainstorming to make sure it was actually reading my config. The main issue is actual/regular timeouts (machine_gun returns :error with timeout as the reason). The thing is, the API isn't actually timing out; it's machine_gun taking a long time to process something (like gun backing up), which makes it think the response timed out. It's like it's including the time it takes for gun to respond in the timeout amount (or maybe gun is doing this). I know it's not the API causing the problem, because I get fewer timeouts the fewer outbound requests I make, even if the volume to the API remains 100% the same. |
So you get an error with :request_timeout as the reason? |
Yes correct. |
My suspicion is that the MachineGun worker gets blocked on starting a connection with the gun supervisor; this is the only place where it blocks. Do you see non-zero message queues on gun-related processes when these timeouts happen? |
@petrohi that's what's weird. If it's REALLY bad, yes, I see backups from 2 things. With that said, this is part of what makes this issue so frustrating, as I wrote above: it doesn't have to be extremely bad for timeouts to occur. 90% of the time, when it's on the "cusp", I will get timeouts 20% of the time, and it will slowly increase until it DOES get bad and I see MsgQ backups. But before it gets to that point I can't find anything in observer that would seem to indicate a problem: MsgQ will be 0, reductions will seem normal, and the scheduler won't be going crazy. It's like this invisible bottleneck where the CPU just starts overloading. The Gigalixir guy was thinking maybe inet_gethost_native is getting backed up on DNS lookups, but that still doesn't explain why MsgQ is 0 in observer while there are 20-30% timeouts. MsgQ backups start occurring at around 80% timeouts. |
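(One way to check the DNS theory directly, instead of eyeballing observer, is to sample that process's message queue; a small sketch:)

```elixir
# Sample the native DNS resolver's message queue; a consistently large value
# here would support the "backed up on DNS lookups" theory.
case Process.whereis(:inet_gethost_native) do
  nil -> IO.puts("inet_gethost_native is not running")
  pid -> IO.inspect(Process.info(pid, :message_queue_len), label: "inet_gethost_native")
end
```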
You know, I did see ninenines/gun#183. I'm not sure if that's related, but he also mentions 100% CPU. And when this happens, I get a whole bunch of |
And setting request_timeout in config does not affect this behaviour in any way, correct? |
Well, right now it's at 500ms and I experience issues at ~300 requests per second, which is basically beans as far as performance goes (on an 8 vCPU instance with 4GB of memory). I've tried playing with the timeout; I got it as low as 200ms and still experienced issues, so I stopped trying to fiddle with it, as there seems to be a deeper problem. |
What if you make it much larger? Like 10 seconds? |
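(That change would just be the request_timeout value in the pool config; a sketch with placeholder sizes:)

```elixir
# config/config.exs -- placeholder sizes; the relevant change is request_timeout
config :machine_gun,
  default: %{
    pool_size: 16,
    pool_max_overflow: 16,
    pool_timeout: 1_000,
    request_timeout: 10_000  # 10 seconds instead of 500ms
  }
```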
Lol that will be interesting. I'll try it. |
Does this configuration look correct? I'm not sure if I set it up right, or which of these machine_gun overwrites. I'm mostly talking about the
|
Correction: it seems to be like this:
|
Try adding |
Great idea about increasing the timeout; that made some things really stand out as wrong. I first put everything back locally so I don't have to deploy. I enabled my 3 APIs and set the MG request timeout to 10s. I then refreshed 2-3 times a second, which is quite literally 2-3 RPS. I would see a 100ms response time and then, bam, a VERY long response... up to 6 seconds, from just one of the APIs (which makes the whole request take long, because it uses async/await to wait on the longest of the 3 requests). I do a few more refreshes and then, bam, another 6000ms response where it hangs. If I take that EXACT URL and put it in my browser and refresh it there, it NEVER has a delay. |
So it's occasionally slow with that one API and never slow with the others? Can you try forcing HTTP/1, just in case this is some bug in the HTTP/2 implementation:
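(The exact snippet isn't preserved; assuming the pool config passes gun's connection options through a conn_opts map, forcing HTTP/1 would look something like this:)

```elixir
config :machine_gun,
  default: %{
    pool_size: 16,
    pool_max_overflow: 16,
    # Passed through to gun; restrict protocol negotiation to HTTP/1.x only.
    conn_opts: %{protocols: [:http]}
  }
```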
|
I tried forcing [:http] and it doesn't make a difference. Only that 1 endpoint has a delay that long (5000ms), and it's the slowest of the 3 endpoints. BUT I think the issue isn't with that one endpoint, but rather (maybe) with slower responses in general; perhaps the slower the endpoint is, the more it exaggerates the issue exponentially. The problem still applies in production when that slow endpoint is disabled, but I do notice the issue gets way worse, way more quickly, when that endpoint is enabled. I think that locally, right now while testing, it's not as pronounced because the 2 remaining APIs are faster. The problem still applies to these other 2, but because they almost always respond in 100ms, it doesn't appear until they get to production, where they're under very heavy load, and then it appears again. This makes sense of why things report as "timing out" even though the API seems fine and they're not "actually" timing out. |
Here are 10 refreshes of the response time with the URL directly in Chrome: 244, 102, 207, 243, 208, 211, 255, 149, 202, 276, 102. Here it is when using machine_gun, with the timeout set to 10_000ms: 4636, 97, 459, 101, 102, 4690, 4538, 102 |
So yes, this makes it very clear why things are reporting timeouts when they really aren't timing out. Those 2 APIs I mentioned, which have a response time of 100-200ms: if I put my localhost under a light load with Apache Bench, I can get them to report a 400ms response time, similar to my example above but less pronounced. That kind of goes in line with what I said above, where something causes the issue to be exaggerated on that one endpoint. Maybe it's response time or something else, I'm not sure, but the issue applies to all endpoints, not just that 1. |
Also I should point out, if I put |
If you were to have me put a log/debug line somewhere in your library to help track down whether it's on your side or in gun, where should I put it? |
These graphs are very helpful, thanks. I will look into them in a few days. |
Can you share an example cURL request for that API? |
I could send it privately if you tell me your email |
peter dot hizalev at gmail |
Hey, I am making an issue because I saw a branch and made a comment, but I don't know if comments send email notifications.
I'm wondering if you plan to merge that branch with the empty content-length + HEAD request handling, as I'm dealing with a million timeouts on an API which returns no content-length when there's an error.