node-feature-discovery sends excessive LIST requests to the API server #1891

Open
jslouisyou opened this issue Sep 30, 2024 · 3 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@jslouisyou

What happened: node-feature-discovery of gpu-operator sends excessive LIST requests to the API server

What you expected to happen:
Recently I received several alerts from my Kubernetes cluster reporting that the API server took a very long time to serve LIST requests from gpu-operator. Here are the alert and the rule I'm using:

  • Alert:
Long API server 99%-tile Latency
LIST: 29.90 seconds while nfd.k8s-sigs.io/v1alpha1/nodefeatures request.
  • Rule: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{subresource!~"(log|exec|portforward|proxy)",verb!~"^(?:CONNECT|WATCHLIST|WATCH)$"} [10m])) WITHOUT (instance)) > 10
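Note that the 29.90 seconds in the alert is an estimate: histogram_quantile interpolates linearly inside the bucket that contains the requested rank, so the value depends on the bucket boundaries. A minimal Python sketch of that interpolation follows. It is simplified from what Prometheus does, and the bucket bounds and counts are made-up illustration values, not data from this cluster.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs, including
    a final (math.inf, total_count) bucket, as in Prometheus "le" buckets.
    Simplified sketch: assumes all finite bounds are positive.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    prev_count = buckets[i - 1][1] if i > 0 else 0.0
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical cumulative counts for LIST requests (illustration only).
example = [(10.0, 900), (20.0, 950), (30.0, 1000), (math.inf, 1000)]
print(histogram_quantile(0.99, example))  # roughly 28 seconds
```

With these made-up counts the 99th percentile lands in the 20 to 30 second bucket, so any reported value between 20 and 30 only says that the slowest requests fell somewhere in that range.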

I also found that all gpu-operator-node-feature-discovery-worker pods send GET requests to the API server to query the nodefeatures resource (I assume the pods need this to get information about node labels). Here is part of the audit log:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Request","auditID":"df926f36-8c1f-488e-ac88-11690e24660a","stage":"ResponseComplete","requestURI":"/apis/nfd.k8s-sigs.io/v1alpha1/namespaces/gpu-operator/nodefeatures/sra100-033","verb":"get","user":{"username":"system:serviceaccount:gpu-operator:node-feature-discovery","uid":"da2306ea-536f-455d-bf18-817299dd5489","groups":["system:serviceaccounts","system:serviceaccounts:gpu-operator","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["gpu-operator-node-feature-discovery-worker-49qq6"],"authentication.kubernetes.io/pod-uid":["65dfb997-221e-4a5c-92df-7ff111ea6137"]}},"sourceIPs":["75.17.103.53"],"userAgent":"nfd-worker/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"nodefeatures","namespace":"gpu-operator","name":"sra100-033","apiGroup":"nfd.k8s-sigs.io","apiVersion":"v1alpha1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2024-08-07T01:35:20.355504Z","stageTimestamp":"2024-08-07T01:35:20.676700Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"gpu-operator-node-feature-discovery\" of ClusterRole \"gpu-operator-node-feature-discovery\" to ServiceAccount \"node-feature-discovery/gpu-operator\""}}
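To check how often each worker pod hits the nodefeatures resource, the audit log can be tallied per pod, verb, and minute. This is a minimal Python sketch, assuming the audit log is available as one JSON event per line in the format shown above. The function name and the trimmed sample event are illustrative and not part of NFD or Kubernetes.

```python
import json
from collections import Counter


def count_nodefeature_requests(lines):
    """Count completed nodefeatures requests per (pod, verb, minute)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Each request is logged at several stages; count it only once.
        if event.get("stage") != "ResponseComplete":
            continue
        if event.get("objectRef", {}).get("resource") != "nodefeatures":
            continue
        extra = event.get("user", {}).get("extra", {})
        pod = extra.get("authentication.kubernetes.io/pod-name", ["unknown"])[0]
        # Truncate the RFC 3339 timestamp to minute resolution.
        minute = event["requestReceivedTimestamp"][:16]
        counts[(pod, event["verb"], minute)] += 1
    return counts


# Trimmed copy of the event above, keeping only the fields the function reads.
sample_event = {
    "stage": "ResponseComplete",
    "verb": "get",
    "user": {
        "extra": {
            "authentication.kubernetes.io/pod-name": [
                "gpu-operator-node-feature-discovery-worker-49qq6"
            ]
        }
    },
    "objectRef": {"resource": "nodefeatures", "name": "sra100-033"},
    "requestReceivedTimestamp": "2024-08-07T01:35:20.355504Z",
}
counts = count_nodefeature_requests([json.dumps(sample_event)])
for (pod, verb, minute), n in sorted(counts.items()):
    print(pod, verb, minute, n)
```

Run against a real audit log (for example by passing an open file object), a steady count of one get per pod per minute would match the behavior described here, while any list entries would show which clients are behind the slow LIST calls.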

I find it strange that LIST requests take this long when my k8s cluster has only 300 GPU nodes, and I don't understand why the node-feature-discovery-worker pods send a GET request every minute.

Do you have any information about this problem?
If there are any parameters that can be changed or if you could provide any ideas, I would be very grateful.

Thanks!

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?: node-feature-discovery was deployed as part of installing NVIDIA's gpu-operator. I used gpu-operator v23.3.2.

Environment:

  • Kubernetes version (use kubectl version): k8s v1.21.6, v1.29.5
  • Cloud provider or hardware configuration: On-premise
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04.4 LTS
  • Kernel (e.g. uname -a): 5.4.0-113-generic
  • Install tools: nfd was deployed by gpu-operator from NVIDIA
  • Network plugin and version (if this is a network-related bug): calico
jslouisyou added the kind/bug label on Sep 30, 2024
@marquiz
Contributor

marquiz commented Oct 9, 2024

@jslouisyou thank you for reporting the issue. I think you're hitting the problem that #1811 (and #1810, #1815) addresses. Those will be part of the upcoming v0.17 release of NFD.

A possible workaround for NFD v0.16 could be to run NFD with -feature-gates NodeFeatureAPI=false. But I'm not sure if this is viable with the NVIDIA gpu-operator. @ArangoGutierrez ?

Looks like the need for v0.17 is urgent.

@jslouisyou
Author

Thanks @marquiz for updating this issue!

It seems that NodeFeatureAPI can be disabled through NVIDIA's gpu-operator. Here's the relevant snippet from the gpu-operator values.yaml:

node-feature-discovery:
  enableNodeFeatureApi: true

ref) https://github.com/NVIDIA/gpu-operator/blob/5a94df5b8c8ec9d3a19212869cab726a3c5910b3/deployments/gpu-operator/values.yaml#L474-L475

But I'm not sure whether gpu-operator requires this feature. @ArangoGutierrez, could you please check?

Thanks!

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Jan 8, 2025