[🐛 Bug]: Autoscaling jobs issue after keda 2.16.1 upgrade #2542
Comments
Let's read my PR kedacore/keda#6437 |
@amardeep2006, thank you for creating this issue. We will troubleshoot it as soon as we can.
Info for maintainers: Triage this issue by using labels.
If information is missing, add a helpful comment and then
If the issue is a question, add the
If the issue is valid but there is no time to troubleshoot it, consider adding the
If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C),
add the applicable
After troubleshooting the issue, please add the Thank you! |
If you want scaler trigger against request without cap |
This has left me a bit confused. I thought HPA applied only to Deployments. Will it work for Jobs as well? |
You can also create multiple scalers with different metadata easily under config |
Actually |
Thanks for your blazing-fast response, @VietND96. I am testing both suggestions one by one: passing platformName in capabilities, and the Helm chart config. |
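For readers following the first suggestion, here is a minimal sketch of what "passing platformName in capabilities" means at the W3C WebDriver level. It only builds the New Session request body with the standard library. The helper name `build_new_session_payload` and the value `"linux"` are assumptions for illustration; the thread does not state which platform value was used.

```python
import json


def build_new_session_payload(browser_name, platform_name=None):
    """Build a W3C WebDriver New Session request body.

    platform_name is optional; when omitted, the request carries only
    browserName, which is the case described in this issue.
    """
    always_match = {"browserName": browser_name}
    if platform_name:
        # Assumption: the value must agree with what the scaler is configured to match.
        always_match["platformName"] = platform_name
    return {"capabilities": {"alwaysMatch": always_match}}


without_platform = build_new_session_payload("chrome")
with_platform = build_new_session_payload("chrome", "linux")  # "linux" is illustrative

print(json.dumps(without_platform))
print(json.dumps(with_platform))
```

Whichever form the tests send, the point made in this thread is that the scaler metadata and the test capabilities need to agree on whether platformName is present.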
As I mentioned in the PR, the target is a Grid with autoscaling Nodes, non-autoscaling Nodes, relay Nodes, etc. The scaler needs to isolate and count the exact number of ongoing sessions plus pending sessions (without overlap if multiple ScaledJobs exist at the same time) and then send that count to KEDA. KEDA takes care of the rest, passing the metrics on to the K8s HPA. |
I confirm that both suggestions worked in my initial round of sanity testing. I appreciate your help. For now I will keep platformName empty in the values file, since many teams use the solution without platformName in capabilities and we do not have a relay/Windows node use case for now. I will share my long-term observations over a week. |
Yes, I will also try to find time to write down all the details that users need to know to scale the Grid with KEDA 2.16.1+ |
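The counting idea described above can be sketched roughly as follows. This is not the KEDA scaler's actual code (see kedacore/keda#6437 for that); the function names, the dictionary shapes, and the exact-match rule on browserName and platformName are simplifying assumptions made here for illustration.

```python
def matches(caps, browser_name, platform_name):
    """Assumed rule: browserName and platformName must both match exactly.

    A request without platformName is treated as platformName == "".
    """
    if caps.get("browserName", "") != browser_name:
        return False
    return caps.get("platformName", "").lower() == platform_name.lower()


def count_for_scaler(pending_requests, ongoing_sessions, browser_name, platform_name=""):
    """Return pending + ongoing sessions that belong to one scaler."""
    pending = sum(1 for c in pending_requests if matches(c, browser_name, platform_name))
    ongoing = sum(1 for c in ongoing_sessions if matches(c, browser_name, platform_name))
    return pending + ongoing


# Hypothetical Grid state: capabilities of queued requests and running sessions.
pending_requests = [
    {"browserName": "chrome"},
    {"browserName": "chrome"},
    {"browserName": "chrome", "platformName": "linux"},
]
ongoing_sessions = [
    {"browserName": "chrome"},
    {"browserName": "firefox"},
]

chrome_any = count_for_scaler(pending_requests, ongoing_sessions, "chrome")
chrome_linux = count_for_scaler(pending_requests, ongoing_sessions, "chrome", "linux")
print(chrome_any, chrome_linux)
```

Under this assumed rule each request is counted by at most one scaler, which is the "without overlap" property the comment asks for when several ScaledJobs exist side by side.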
In chart 0.38.3, the default value of |
After checking other issues, I think my issue mostly correlates with #2464 |
Hi, the scaling type in #2464 is Deployment. Are you using it? If yes, |
Hi, |
In the screenshot above, the 9 ongoing sessions and the 11 pending requests have the same capabilities |
I also noticed that in this part:
Will always be
Which is different from another recommendation kedacore/keda#6437 (comment) |
Oops, thank you for pointing out the template issue. I removed the default and left it empty in values.yaml, but the template handles it a different way |
Chrome node: https://hub.docker.com/layers/selenium/node-chrome/nightly/images/sha256-dcd1cc89e7c442fb66248945595da5180edc08c459bb3c796021dba7603ffded |
Wait a moment; I will bump chart 0.38.4 to fix a typo in values and this issue in the template. |
I also saw this, but it looks like Grid UI behavior will not impact DefaultSlotMatcher in the Grid function. |
@farioas chart |
Hi, But I'm still not happy with the overall performance after updating keda from Whereas before the selenium test pipeline completed in 26 minutes, now it takes 60 minutes. |
Is it due to not having enough Nodes scaled up to pick up the request instantly? |
To mitigate this, I've already added platformName to the test capabilities as well as in the scaledjob. The rest remains the same on the infra side. As far as I can tell, the problem lies somewhere in the keda calculations. I never used to see a queue size greater than 1, but now I often see 8, 10 and so on. |
I noted your feedback. Will try to reproduce and fix it if possible. |
We are also experiencing this performance degradation. We deploy the Grid with the Helm chart. When I added the platform name, the ScaledJob became inactive. This is a serious issue for us; please assist. |
I was able to achieve almost the same level of performance as I had in keda 2.15.1:
Removed |
Hi @farioas, the |
@farioas in my case they are both 0 but we still experience a 50% performance degradation |
No, it was about:
|
We're also seeing something like this. Keda is just not creating enough workers to empty the queue. There will be six jobs in the queue and Keda will only schedule three workers and refuse to schedule more. Oddly, it almost looks like Keda consistently schedules exactly half as many workers as there are pending and active jobs, rounded up. I think we're going to have to roll back unless this is fixed soon. |
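To make the reported pattern concrete, the observation above ("half as many workers as pending and active jobs, rounded up") can be written as a one-line calculation. This restates the commenter's observation only; it is not a formula taken from KEDA, and the function name is hypothetical.

```python
import math


def observed_workers(pending, active):
    """Workers apparently scheduled, per the observation in this comment."""
    return math.ceil((pending + active) / 2)


def needed_workers(pending, active):
    """Workers needed if every pending and active job gets its own worker."""
    return pending + active


# The case from the comment: six jobs in the queue, none running yet.
print(observed_workers(6, 0), "scheduled vs", needed_workers(6, 0), "needed")
```

If the pattern holds, the shortfall grows with the queue, which would be consistent with the longer pipeline times reported earlier in the thread.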
What happened?
I tried upgrading the Grid (4.27.0) from Helm chart version 0.38.0 to 0.38.2 (trunk branch), and KEDA does not seem to be picking up pending sessions from the queue.
I am passing just one capability in my tests: browserName: 'chrome'
The autoscaling type is Job. I have tried both the default and accurate strategies. Is there some breaking change in 0.38.2?
Command used to start Selenium Grid with Docker (or Kubernetes)
Relevant log output
Operating System
v1.28.15-eks-7f9249a
Docker Selenium version (image tag)
4.27.0-20241225
Selenium Grid chart version (chart version)
0.38.2