[🐛 Bug]: Autoscaling jobs issue after keda 2.16.1 upgrade #2542

Open
amardeep2006 opened this issue Dec 27, 2024 · 33 comments
Labels
I-autoscaling-k8s Issue relates to autoscaling in Kubernetes, or the scaler in KEDA R-awaiting-retest

Comments

@amardeep2006
Contributor

What happened?

I tried upgrading Grid 4.27.0 from Helm chart version 0.38.0 to 0.38.2 (trunk branch), and KEDA does not seem to be picking up pending sessions from the queue.
I am passing just one capability in my tests, browserName: 'chrome':

capabilities: [{
    browserName: 'chrome',
    'se:downloadsEnabled': true
}],

The autoscaling type is Job. I have tried both the default and accurate strategies. Is there a breaking change in 0.38.2?

Command used to start Selenium Grid with Docker (or Kubernetes)

Using Helm chart 0.38.2.
Note: I took the Helm chart from the trunk branch today; its latest commit is https://github.com/SeleniumHQ/docker-selenium/commit/d01680cba3feb3d050d9ff667aaa9816fca8e33a


global:
  seleniumGrid:
    # Image registry for all selenium components
    imageRegistry: myrepo/selenium-grid
    # Image tag for all selenium components
    imageTag: "4.27.0-20241225"
    # Image tag for browser's nodes
    nodesImageTag: "4.27.0-20241225"
    # Image tag for browser's video recorder
    imagePullSecret: ""
    # Log level for all components. Possible values describe here: https://www.selenium.dev/documentation/grid/configuration/cli_options/#logging
    logLevel: INFO
    # -- Whether to enable structured logging
    structuredLogs: true    
    # kubectl image is used to execute kubectl commands in utility jobs
    kubectlImage: myrepo/bitnami/kubectl:latest
isolateComponents: false
# Basic auth settings for Selenium Grid
basicAuth:
  # Enable or disable basic auth
  enabled: true
  # -- Username for basic auth
  username: $GRID_USERNAME
  # -- Password for basic auth
  password: $GRID_PASSWORD  
  # -- Embed the basic auth "u:p@" in few URLs e.g. SE_NODE_GRID_URL.
  embeddedUrl: false
autoscaling:
  enabled: true
  scalingType: job
  scaledOptions:
    minReplicaCount: 0
    maxReplicaCount: $MAX_REPLICAS_COUNT
    pollingInterval: 20  
  # terminationGracePeriodSeconds: 5400 #default
  # Options for KEDA ScaledJobs (only used when scalingType is set to "job"). See https://keda.sh/docs/latest/concepts/scaling-jobs/#scaledjob-spec
  scaledJobOptions:
    scalingStrategy:
      # Change this from "default" to "accurate" or "eager" when the calculation problem is fixed
      # -- Scaling strategy for KEDA ScaledJob
      strategy: default  
customLabels: {"app-id": "selgrid", "app-tier": "application", "app-name": "selgrid"}
tls:
  ingress:
    enabled: true
ingress:
  # Name of ingress class to select which controller will implement ingress resource
  # Custom annotations for ingress resource
  annotations:
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "900"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "900"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "900"
    nginx.ingress.kubernetes.io/proxy-body-size: 100m
  # Default host for the ingress resource
  hostname: $INGRESS_DOMAIN
  tls:
    - secretName: sel-grid-tls-secret-dynamic
      hosts:
        - $INGRESS_DOMAIN
router:
  imagePullPolicy: Always
distributor:
  imagePullPolicy: Always  
eventBus:
  imagePullPolicy: Always
sessionMap:
  imagePullPolicy: Always
sessionQueue:
  imagePullPolicy: Always    
hub:
  # Custom sub path for the hub deployment
  # subPath: /selenium  
  imagePullPolicy: Always
  # Resources for container
  resources:
    requests:
      memory: "200Mi"
      cpu: "100m"
    limits:
      memory: "9Gi"
      cpu: "4"
  extraEnvironmentVariables:
    - name: SE_JAVA_OPTS
      value: "-Xmx8192m"  
chromeNode:
# Number of chrome nodes Only used when Autoscaling is false
  replicas: $FIXED_CHROME_REPLICAS
  imagePullPolicy: Always
  # /dev/shm volume
  dshmVolumeSizeLimit: "2Gi"
  # Resources for chrome-node container
  resources:
    requests:
      memory: "100Mi"
      cpu: "100m"
    limits:
      memory: "2Gi"
      cpu: "2"
  extraEnvironmentVariables: 
    - name: "SE_NODE_ENABLE_MANAGED_DOWNLOADS"
      value: "true"
    # - name: "SE_VNC_NO_PASSWORD"
    #   value: "1"
    # - name: "SE_VNC_VIEW_ONLY"
    #   value: "1"
  terminationGracePeriodSeconds: 5400
edgeNode:
  replicas: $FIXED_EDGE_REPLICAS
  imagePullPolicy: Always
  # /dev/shm volume
  dshmVolumeSizeLimit: "2Gi"
  # Resources for edge-node container
  resources:
    requests:
      memory: "100Mi"
      cpu: "100m"
    limits:
      memory: "2Gi"
      cpu: "2"
  extraEnvironmentVariables: 
    - name: "SE_NODE_ENABLE_MANAGED_DOWNLOADS"
      value: "true"
    # - name: "SE_VNC_NO_PASSWORD"
    #   value: "1"
    # - name: "SE_VNC_VIEW_ONLY"
    #   value: "1"
  terminationGracePeriodSeconds: 5400
firefoxNode:
  enabled: false
  imagePullPolicy: Always
  # /dev/shm volume
  dshmVolumeSizeLimit: "2Gi"
  # Resources for firefox-node container
  resources:
    requests:
      memory: "1Gi"
      cpu: "1"
    limits:
      memory: "2Gi"
      cpu: "2"
  extraEnvironmentVariables: 
    - name: "SE_NODE_ENABLE_MANAGED_DOWNLOADS"
      value: "true"
    # - name: "SE_VNC_NO_PASSWORD"
    #   value: "1"
    # - name: "SE_VNC_VIEW_ONLY"
    #   value: "1"
  autoscaling:
    scaledOptions:
      minReplicaCount: 0
      maxReplicaCount: 3
  terminationGracePeriodSeconds: 5400
keda:
  image:  
    keda:
      registry: myrepo
      # -- Image name of KEDA operator
      repository: myrepo/kedacore/keda
      # -- Image tag of KEDA operator. Optional, given app version of Helm chart is used by default
      tag: $KEDA_VERSION
    metricsApiServer:
      registry: myrepo
      # -- Image name of KEDA Metrics API Server
      repository: myrepo/kedacore/keda-metrics-apiserver
      # -- Image tag of KEDA Metrics API Server. Optional, given app version of Helm chart is used by default
      tag: $KEDA_VERSION
    webhooks:
      registry: myrepo
      # -- Image name of KEDA admission-webhooks
      repository: myrepo/kedacore/keda-admission-webhooks
      # -- Image tag of KEDA admission-webhooks . Optional, given app version of Helm chart is used by default
      tag: $KEDA_VERSION
    # -- Image pullPolicy for all KEDA components
  podLabels:
    # -- Pod labels for KEDA operator
    keda: {"app-id": "selgrid", "app-tier": "application", "app-name": "selgrid"}
    # -- Pod labels for KEDA Metrics Adapter
    metricsAdapter: {"app-id": "selgrid", "app-tier": "application", "app-name": "selgrid"}
    # -- Pod labels for KEDA Admission webhooks
    webhooks: {"app-id": "selgrid", "app-tier": "application", "app-name": "selgrid"}

Relevant log output

2024-12-27T10:06:00Z	INFO	setup	maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
2024-12-27T10:06:00Z	INFO	setup	Starting manager
2024-12-27T10:06:00Z	INFO	setup	KEDA Version: 2.16.1
2024-12-27T10:06:00Z	INFO	setup	Git Commit: ce14b239e0300f388b0425aef68154d8070cd66f
2024-12-27T10:06:00Z	INFO	setup	Go Version: go1.23.4
2024-12-27T10:06:00Z	INFO	setup	Go OS/Arch: linux/amd64
2024-12-27T10:06:00Z	INFO	setup	Running on Kubernetes 1.28+	{"version": "v1.28.15-eks-7f9249a"}
2024-12-27T10:06:00Z	INFO	setup	WARNING: KEDA 2.16.1 hasn't been tested on Kubernetes v1.28.15-eks-7f9249a
2024-12-27T10:06:00Z	INFO	setup	You can check recommended versions on https://keda.sh
2024-12-27T10:06:00Z	INFO	starting server	{"name": "health probe", "addr": "[::]:8081"}
I1227 10:06:00.649846       1 leaderelection.go:254] attempting to acquire leader lease mer-merselgrid-dev-mer-sel-grid/operator.keda.sh...
I1227 10:06:17.014943       1 leaderelection.go:268] successfully acquired lease mer-merselgrid-dev-mer-sel-grid/operator.keda.sh
2024-12-27T10:06:17Z	INFO	Starting EventSource	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v1alpha1.ScaledObject"}
2024-12-27T10:06:17Z	INFO	Starting EventSource	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v2.HorizontalPodAutoscaler"}
2024-12-27T10:06:17Z	INFO	Starting Controller	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
2024-12-27T10:06:17Z	INFO	Starting EventSource	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "source": "kind source: *v1alpha1.TriggerAuthentication"}
2024-12-27T10:06:17Z	INFO	Starting Controller	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
2024-12-27T10:06:17Z	INFO	Starting EventSource	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "source": "kind source: *v1alpha1.ScaledJob"}
2024-12-27T10:06:17Z	INFO	Starting Controller	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
2024-12-27T10:06:17Z	INFO	Starting EventSource	{"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource", "source": "kind source: *v1alpha1.CloudEventSource"}
2024-12-27T10:06:17Z	INFO	Starting Controller	{"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource"}
2024-12-27T10:06:17Z	INFO	Starting EventSource	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "source": "kind source: *v1alpha1.ClusterTriggerAuthentication"}
2024-12-27T10:06:17Z	INFO	Starting Controller	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
2024-12-27T10:06:17Z	INFO	Starting EventSource	{"controller": "clustercloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "ClusterCloudEventSource", "source": "kind source: *v1alpha1.ClusterCloudEventSource"}
2024-12-27T10:06:17Z	INFO	Starting Controller	{"controller": "clustercloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "ClusterCloudEventSource"}
2024-12-27T10:06:17Z	INFO	Starting EventSource	{"controller": "cert-rotator", "source": "kind source: *v1.Secret"}
2024-12-27T10:06:17Z	INFO	Starting EventSource	{"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-12-27T10:06:17Z	INFO	Starting EventSource	{"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-12-27T10:06:17Z	INFO	Starting Controller	{"controller": "cert-rotator"}
2024-12-27T10:06:17Z	INFO	cert-rotation	starting cert rotator controller
2024-12-27T10:06:17Z	INFO	cert-rotation	no cert refresh needed
2024-12-27T10:06:17Z	INFO	cert-rotation	certs are ready in /certs
2024-12-27T10:06:17Z	INFO	Starting workers	{"controller": "cert-rotator", "worker count": 1}
2024-12-27T10:06:17Z	INFO	cert-rotation	no cert refresh needed
2024-12-27T10:06:17Z	ERROR	cert-rotation	Webhook not found. Unable to update certificate.	{"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "error": "ValidatingWebhookConfiguration.admissionregistration.k8s.io \"keda-admission\" not found"}
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).ensureCerts
	/workspace/vendor/github.com/open-policy-agent/cert-controller/pkg/rotator/rotator.go:822
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).Reconcile
	/workspace/vendor/github.com/open-policy-agent/cert-controller/pkg/rotator/rotator.go:791
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:303
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:224
2024-12-27T10:06:17Z	INFO	cert-rotation	Ensuring CA cert	{"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-12-27T10:06:17Z	INFO	Starting workers	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "worker count": 1}
2024-12-27T10:06:17Z	INFO	Starting workers	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "worker count": 1}
2024-12-27T10:06:17Z	INFO	Starting workers	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "worker count": 5}
2024-12-27T10:06:17Z	INFO	Reconciling ScaledJob	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"org-selgrid-selenium-node-chrome","namespace":"mer-merselgrid-dev-mer-sel-grid"}, "namespace": "mer-merselgrid-dev-mer-sel-grid", "name": "org-selgrid-selenium-node-chrome", "reconcileID": "c7dd5fba-37ed-495f-8796-8286dad16274"}
2024-12-27T10:06:17Z	INFO	Starting workers	{"controller": "clustercloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "ClusterCloudEventSource", "worker count": 1}
2024-12-27T10:06:17Z	INFO	Starting workers	{"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource", "worker count": 1}
2024-12-27T10:06:17Z	INFO	Starting workers	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "worker count": 1}
2024-12-27T10:06:17Z	INFO	KubeAPIWarningLogger	unknown field "status.authenticationsTypes"
2024-12-27T10:06:17Z	INFO	KubeAPIWarningLogger	unknown field "status.triggersTypes"
2024-12-27T10:06:17Z	INFO	RolloutStrategy: immediate, No jobs owned by the previous version of the scaledJob	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"org-selgrid-selenium-node-chrome","namespace":"mer-merselgrid-dev-mer-sel-grid"}, "namespace": "mer-merselgrid-dev-mer-sel-grid", "name": "org-selgrid-selenium-node-chrome", "reconcileID": "c7dd5fba-37ed-495f-8796-8286dad16274"}
2024-12-27T10:06:17Z	INFO	Initializing Scaling logic according to ScaledJob Specification	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"org-selgrid-selenium-node-chrome","namespace":"mer-merselgrid-dev-mer-sel-grid"}, "namespace": "mer-merselgrid-dev-mer-sel-grid", "name": "org-selgrid-selenium-node-chrome", "reconcileID": "c7dd5fba-37ed-495f-8796-8286dad16274"}
2024-12-27T10:06:17Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "org-selgrid-selenium-node-chrome", "scaledJob.Namespace": "mer-merselgrid-dev-mer-sel-grid", "Number of running Jobs": 0}
2024-12-27T10:06:17Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "org-selgrid-selenium-node-chrome", "scaledJob.Namespace": "mer-merselgrid-dev-mer-sel-grid", "Number of pending Jobs": 0}
2024-12-27T10:06:17Z	INFO	Reconciling ScaledJob	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"org-selgrid-selenium-node-edge","namespace":"mer-merselgrid-dev-mer-sel-grid"}, "namespace": "mer-merselgrid-dev-mer-sel-grid", "name": "org-selgrid-selenium-node-edge", "reconcileID": "0c1ff265-5fa8-41f4-a218-776a7bab2dd8"}
2024-12-27T10:06:17Z	INFO	RolloutStrategy: immediate, No jobs owned by the previous version of the scaledJob	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"org-selgrid-selenium-node-edge","namespace":"mer-merselgrid-dev-mer-sel-grid"}, "namespace": "mer-merselgrid-dev-mer-sel-grid", "name": "org-selgrid-selenium-node-edge", "reconcileID": "0c1ff265-5fa8-41f4-a218-776a7bab2dd8"}
2024-12-27T10:06:17Z	INFO	Initializing Scaling logic according to ScaledJob Specification	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"org-selgrid-selenium-node-edge","namespace":"mer-merselgrid-dev-mer-sel-grid"}, "namespace": "mer-merselgrid-dev-mer-sel-grid", "name": "org-selgrid-selenium-node-edge", "reconcileID": "0c1ff265-5fa8-41f4-a218-776a7bab2dd8"}
2024-12-27T10:06:17Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "org-selgrid-selenium-node-edge", "scaledJob.Namespace": "mer-merselgrid-dev-mer-sel-grid", "Number of running Jobs": 0}
2024-12-27T10:06:17Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "org-selgrid-selenium-node-edge", "scaledJob.Namespace": "mer-merselgrid-dev-mer-sel-grid", "Number of pending Jobs": 0}

Operating System

v1.28.15-eks-7f9249a

Docker Selenium version (image tag)

4.27.0-20241225

Selenium Grid chart version (chart version)

0.38.2

@VietND96
Member

Please read my PR kedacore/keda#6437.
The client should now send requests with the platformName capability set to match the KEDA scaler metadata.

@VietND96
Member

If you want the scaler to trigger for requests without the platformName capability, set the platformName metadata to empty in the chart, under the hpa config of each Node (the chart default is Linux).
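
A minimal sketch of that setting in a values file, assuming the key path chromeNode.hpa.platformName (inferred from the hpa config key of each Node; check it against the values.yaml of your chart version):

```yaml
# Sketch only: the key path is assumed from the "hpa" section of each Node.
chromeNode:
  hpa:
    # Empty value: the scaler should also count requests that do not
    # carry a platformName capability.
    platformName: ""
```

The same key would be repeated under edgeNode or firefoxNode for the other browsers.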

@amardeep2006
Contributor Author

This has left me a bit confused. I thought hpa applied only to Deployments. Will it work for Jobs as well?

@VietND96
Member

You can also easily create multiple scalers with different metadata under the crossBrowsers config. It is an array map in which each item follows the same schema as the corresponding Node section. See the values file cross-browsers-values.yaml for a sample.

@VietND96
Member

Actually, hpa is only the config key name; it does not determine whether the scaling type is Job or Deployment. It holds the scaler trigger parameters that are passed to the ScaledObject or ScaledJob. Only autoscaling.scalingType decides the scaling type.

@amardeep2006
Contributor Author

Thanks for your blazing-fast response, @VietND96. I am testing both suggestions one by one: passing platformName in the capabilities, and the Helm chart config.
I will report back on this issue.

@VietND96
Member

VietND96 commented Dec 27, 2024

As I mentioned in the PR, the target is a Grid that mixes autoscaling Nodes, non-autoscaling Nodes, relay Nodes, etc. The scaler needs to isolate and count the exact number of ongoing plus pending sessions (without overlap when multiple ScaledJobs exist at the same time) and send that number to KEDA. KEDA takes care of the rest, passing the metrics to the K8s HPA.
Scaling behaves correctly only when the scaler implementation counts and exposes the correct number to KEDA.
To ensure the count is correct, the condition that triggers the scaler has to be strict.
The config must match in three areas (so that other existing Nodes do not affect the count): request capabilities, Node stereotypes, and scaler trigger parameters.
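
As an illustration of the three areas lining up, here is a sketch for a Chrome Node pinned to Linux. The request capabilities are written as YAML only for side-by-side comparison, the fragments are excerpts rather than a complete manifest, and the exact value spelling ("Linux") is an assumption; SE_NODE_PLATFORM_NAME and the selenium-grid trigger metadata keys are the ones that appear in the ScaledJob excerpts in this thread.

```yaml
# 1. Request capabilities sent by the test client (YAML for comparison only)
capabilities:
  browserName: chrome
  platformName: Linux

# 2. Node stereotype, set through the Node container environment
env:
  - name: SE_NODE_PLATFORM_NAME
    value: Linux

# 3. Scaler trigger parameters in the ScaledJob
triggers:
  - type: selenium-grid
    metadata:
      browserName: chrome
      platformName: Linux
```

If any one of the three differs, for example a request without platformName against a trigger that expects Linux, the scaler may not count that request for this ScaledJob.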

@amardeep2006
Contributor Author

amardeep2006 commented Dec 27, 2024

I confirm that both suggestions worked in my initial round of sanity testing. I appreciate your help. For now I will keep platformName empty in the values file, since many teams use the solution without platformName in their capabilities and we do not have a relay/Windows node use case yet. I will share my longer-term observations over the next week.

@VietND96
Member

Yes, I will also try to find time to write down all the details users need to know to scale the Grid with KEDA 2.16.1+.

@VietND96 VietND96 pinned this issue Dec 27, 2024
@VietND96 VietND96 added R-awaiting-retest I-autoscaling-k8s Issue relates to autoscaling in Kubernetes, or the scaler in KEDA and removed needs-triaging labels Jan 2, 2025
@VietND96
Member

VietND96 commented Jan 2, 2025

In chart 0.38.3, the default value of platformName is changed to empty, so that it works for the common use case when users are getting started.
For an advanced Grid with multiple Node stereotypes, users can set platformName to isolate the autoscaling Nodes.

@farioas

farioas commented Jan 3, 2025

After upgrading to 0.38.3 and KEDA 2.16.1 with strategy: default, I see behavior similar to that in the first message.

image

Number of pending Jobs is 0 all the time

2025-01-03T18:53:32Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}

Here are the capabilities set for the session:

{
  "browserName": "chrome",
  "goog:chromeOptions": {
    "args": [
      "--safebrowsing-disable-download-protection",
      "--start-maximized",
      "--remote-allow-origins=*"
    ],
    "extensions": [],
    "prefs": {
      "profile.default_content_settings.popups": 0,
      "download.default_directory": "/home/seluser/Downloads/"
    }
  }
}

And here are the triggers from the ScaledJob object:

  triggers:
  - authenticationRef:
      name: selenium-grid-selenium-scaler-trigger-auth
    metadata:
      browserName: chrome
      browserVersion: ""
      nodeMaxSessions: "1"
      platformName: ""
      sessionBrowserName: chrome
      unsafeSsl: "false"
    type: selenium-grid
    useCachedMetrics: false

@farioas

farioas commented Jan 3, 2025

After checking other issues, I think my issue mostly correlates with #2464.

@VietND96
Member

VietND96 commented Jan 3, 2025

Hi, the scaling type in #2464 is Deployment. Are you using that? If so, strategy: default will not affect it, since it applies only to Job.

@farioas

farioas commented Jan 3, 2025

Hi,
No, I'm using scalingType: job.

@VietND96
Member

VietND96 commented Jan 3, 2025

In the screenshot above, the 9 ongoing sessions and 11 pending requests all have the same browserName capability and do not specify platformName, right? What about the Node image tag: are they all the same, 20250101?

@farioas

farioas commented Jan 3, 2025

I also noticed that in this part:

        {{- if and (eq (include "seleniumGrid.useKEDA" $) "true") }}
          - name: SE_NODE_PLATFORM_NAME
            value: {{ default "Linux" .node.hpa.platformName | quote }}
        {{- end }}

This will always be Linux, since the template's default function treats "" as empty and falls back to the default value. So in the ScaledJob I have the following env variables:

          - name: SE_NODE_BROWSER_VERSION
            value: ""
          - name: SE_NODE_PLATFORM_NAME
            value: Linux

This differs from the other recommendation in kedacore/keda#6437 (comment).

@VietND96
Member

VietND96 commented Jan 3, 2025

Oops, thank you for pointing out the template issue. I removed the default and left it empty in values.yaml, but the template handles it differently.

@farioas

farioas commented Jan 3, 2025

In the screenshot above, the 9 ongoing sessions and 11 pending requests all have the same browserName capability and do not specify platformName, right? What about the Node image tag: are they all the same, 20250101?

Chrome node: https://hub.docker.com/layers/selenium/node-chrome/nightly/images/sha256-dcd1cc89e7c442fb66248945595da5180edc08c459bb3c796021dba7603ffded
Hub: https://hub.docker.com/layers/selenium/hub/4.27.0-20250101/images/sha256-f2d8ad305ab19542096d929e8262fee96c6215aca4a15ad35495fec960570ede

@VietND96
Member

VietND96 commented Jan 3, 2025

Wait a moment; I will bump chart 0.38.4 to fix a typo in values and this issue in the template.

@farioas

farioas commented Jan 3, 2025

Trying this approach:

- name: SE_NODE_PLATFORM_NAME
  value: {{ if hasKey .node.hpa "platformName" }}{{ .node.hpa.platformName | quote }}{{ else }}"Linux"{{ end }}

It looks like the platform is treated as Windows if it is not set.

image

@VietND96
Member

VietND96 commented Jan 3, 2025

I also saw this, but it looks like this Grid UI behavior does not affect how DefaultSlotMatcher works in the Grid.

@VietND96
Member

VietND96 commented Jan 6, 2025

@farioas chart 0.38.4 is out with the fix. Can you try and confirm?

@farioas

farioas commented Jan 6, 2025

Hi,
I tested it on Friday, so far so good.

But I'm still not happy with the overall performance after updating KEDA from 2.15.1 to 2.16.1 and selenium-grid from 0.37.1 to 0.38.4.

The Selenium test pipeline used to complete in 26 minutes; now it takes 60 minutes.

@VietND96
Member

VietND96 commented Jan 6, 2025

Is it because not enough Nodes are scaled up to pick up the requests immediately?
I suspect it is because "platformName": "" is treated as Windows when the Node registers to the Hub. That leads the scaler to count an incorrect number of pending and ongoing sessions and then scale out incorrectly.
In Grid core, I opened a PR to handle an empty platformName: SeleniumHQ/selenium#15036
Once that is finalized, I will return to the scaler implementation and align the counting with the core behavior. Hopefully that will help.

@farioas

farioas commented Jan 6, 2025

To mitigate this, I've already added platformName to the test capabilities as well as in the scaledjob. The rest remains the same on the infra side.

As far as I can tell, the problem lies somewhere in the keda calculations. I never used to see a queue size greater than 1, but now I often see 8, 10 and so on.

@VietND96
Member

VietND96 commented Jan 6, 2025

I have noted your feedback and will try to reproduce and fix it if possible.
CI also runs a small-scale test and publishes the result here: https://github.com/SeleniumHQ/docker-selenium/blob/trunk/.keda/results_test_k8s_autoscaling_job_count_strategy_default.md. It shows the number of new pods matching the number of new incoming requests after a few iterations.

@KyriosGN0
Contributor

We are also experiencing this performance degradation. We deploy the Grid with the Helm chart. When I added the platform name, the ScaledJob became inactive. This is a serious issue for us; please assist.

@farioas

farioas commented Jan 8, 2025

I was able to achieve almost the same level of performance as I had in keda 2.15.1:

autoscaling:
  scaledJobOptions:
    scalingStrategy:
     successfulJobsHistoryLimit: 0
     failedJobsHistoryLimit: 0
  scaledOptions:
    pollingInterval: 10

I removed the history limits for Jobs by setting them to 0 (they were set to 1 before) and decreased pollingInterval from 20 to 10 seconds.

@VietND96
Member

VietND96 commented Jan 9, 2025

Hi @farioas, the history limit is your own additional config, right? As far as I remember, it is not included in the default values.

@KyriosGN0
Contributor

@farioas in my case both are 0, but we still experience a 50% performance degradation.

@farioas

farioas commented Jan 9, 2025

Hi @farioas, the historylimit is your additional config, right? Since I remember, this has not been included in the default values.

No, it was about:

     successfulJobsHistoryLimit: 0
     failedJobsHistoryLimit: 0

@isc-aray

isc-aray commented Jan 21, 2025

We're also seeing something like this. Keda is just not creating enough workers to empty the queue. There will be six jobs in the queue and Keda will only schedule three workers and refuse to schedule more. Oddly, it almost looks like Keda consistently schedules exactly half as many workers as there are pending and active jobs, rounded up. I think we're going to have to roll back unless this is fixed soon.
