node-feature-discovery sends excessive LIST requests to the API server #1891
Labels: kind/bug, lifecycle/stale
**What happened**: `node-feature-discovery` deployed by `gpu-operator` sends excessive `LIST` requests to the API server.
**What you expected to happen**:
Recently I got several alerts from the K8s cluster saying that the API server took a long time to serve `LIST` requests from `gpu-operator`. Here are the alert and the rule that I'm using:

```
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{subresource!~"(log|exec|portforward|proxy)",verb!~"^(?:CONNECT|WATCHLIST|WATCH)$"} [10m])) WITHOUT (instance)) > 10
```
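To narrow this down, I think a query along the following lines should show whether the slow `LIST` requests are the ones hitting the `nodefeatures` CRD. This is only a sketch against the standard `apiserver_request_duration_seconds_bucket` metric, and the `resource="nodefeatures"` label value is my assumption about how the NFD CRD shows up:

```
# Sketch: p99 latency of LIST requests against the nodefeatures CRD only.
# Assumes the standard apiserver_request_duration_seconds_bucket metric;
# the resource="nodefeatures" label value is an assumption.
histogram_quantile(0.99,
  sum(rate(apiserver_request_duration_seconds_bucket{verb="LIST", resource="nodefeatures"}[10m])) by (le)
)
```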
I also found that all `gpu-operator-node-feature-discovery-worker` pods are trying to send `GET` requests to the API server to query the `nodefeatures` resource (I assume these pods need to get information about node labels). Here is part of the audit log:

I think it is strange that it takes this long to process `LIST` requests when my k8s cluster only has 300 GPU nodes, and I don't understand why the `node-feature-discovery-worker` pods are sending a `GET` request every minute.

Do you have any information about this problem?
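If each `node-feature-discovery-worker` pod really queries the API server about once per minute, a per-verb rate query against `nodefeatures` should make the total load visible. Again, this is only a sketch using the standard `apiserver_request_total` metric, and the `resource="nodefeatures"` label value is my assumption about how the CRD is reported:

```
# Sketch: per-verb request rate against the nodefeatures CRD.
# Uses the standard apiserver_request_total metric exposed by kube-apiserver;
# the resource="nodefeatures" label value is an assumption.
sum(rate(apiserver_request_total{resource="nodefeatures"}[10m])) by (verb)
```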
If there are any parameters that can be changed or if you could provide any ideas, I would be very grateful.
Thanks!
**How to reproduce it (as minimally and precisely as possible)**:

**Anything else we need to know?**:
`node-feature-discovery` was deployed during installation with `gpu-operator` from NVIDIA. I used `gpu-operator` v23.3.2.

**Environment**:
- Kubernetes version (use `kubectl version`): k8s v1.21.6, v1.29.5
- OS (e.g., `cat /etc/os-release`): Ubuntu 20.04.4 LTS
- Kernel (e.g., `uname -a`): 5.4.0-113-generic
- Install tools: `nfd` was deployed by `gpu-operator` from NVIDIA
- Network plugin: calico