
add configurable pagination to nfd-master #2000

Open
wants to merge 10 commits into
base: master

ivelichkovich

Addresses scalability and API server load concerns for large clusters by adding configurable pagination to the informer cache of nfd-master.

Related to: #1998
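For readers unfamiliar with list pagination, here is a minimal, self-contained sketch of the idea: the client asks for at most `limit` items per request and follows an opaque continue token until the server reports no more pages. It mirrors the `limit`/`continue` fields of Kubernetes list requests, but the `page` type, the in-memory store, and the offset-based token are hypothetical stand-ins for illustration, not code from this PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// page is a hypothetical stand-in for one chunk of a paginated list response.
type page struct {
	Items    []string
	Continue string // opaque token; empty when there are no more pages
}

// listPage simulates a server-side list call over an in-memory store.
// limit <= 0 means "no pagination": everything is returned in one response.
func listPage(store []string, limit int64, cont string) page {
	start := 0
	if cont != "" {
		// In this sketch the token is simply the offset of the next item.
		start, _ = strconv.Atoi(cont)
	}
	end := len(store)
	if limit > 0 && start+int(limit) < end {
		end = start + int(limit)
	}
	p := page{Items: store[start:end]}
	if end < len(store) {
		p.Continue = strconv.Itoa(end)
	}
	return p
}

// listAll fetches every item page by page and reports how many requests it took.
func listAll(store []string, limit int64) (items []string, requests int) {
	cont := ""
	for {
		p := listPage(store, limit, cont)
		requests++
		items = append(items, p.Items...)
		if p.Continue == "" {
			return items, requests
		}
		cont = p.Continue
	}
}

func main() {
	store := make([]string, 10)
	for i := range store {
		store[i] = "nodefeature-" + strconv.Itoa(i)
	}
	items, requests := listAll(store, 3)
	fmt.Printf("fetched %d items in %d requests\n", len(items), requests)
}
```

Each response stays bounded by the page size, at the cost of more round trips.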

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 4, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ivelichkovich
Once this PR has been reviewed and has the lgtm label, please assign marquiz for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@k8s-ci-robot
Contributor

Hi @ivelichkovich. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 4, 2025

netlify bot commented Jan 4, 2025

Deploy Preview for kubernetes-sigs-nfd ready!

Name Link
🔨 Latest commit 0e1fe4d
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-nfd/deploys/677a04e11a67c500089fa099
😎 Deploy Preview https://deploy-preview-2000--kubernetes-sigs-nfd.netlify.app

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jan 4, 2025
@ivelichkovich
Author

/ok-to-test

@k8s-ci-robot
Contributor

@ivelichkovich: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/ok-to-test


Member

@TessaIO TessaIO left a comment


Thanks for the work. One small comment below. Could you please add documentation for this new option?

@@ -122,6 +122,8 @@ func initFlags(flagset *flag.FlagSet) (*master.Args, *master.ConfigOverrideArgs)
"in the same format as in the config file (i.e. json or yaml). These options")
flagset.BoolVar(&args.EnableLeaderElection, "enable-leader-election", false,
"Enables a leader election. Enable this when running more than one replica on nfd master.")
flagset.Int64Var(&args.ListSize, "node-feature-informer-list-size", 0,
Member

I suggest a shorter name, like this.

Suggested change
- flagset.Int64Var(&args.ListSize, "node-feature-informer-list-size", 0,
+ flagset.Int64Var(&args.ListSize, "informer-list-size", 0,

Author

Sure thing

Author

Changed the flag and updated the docs. Let me know if you'd prefer the doc update to go somewhere else.
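As an illustration of the renamed flag, the following standalone sketch registers it on a fresh `flag.FlagSet` and parses a sample command line. The flag name and the default of 0 are taken from the suggestion above; the `args` struct and the `parseArgs` helper are hypothetical stand-ins for `master.Args` and `initFlags`, not the PR's code.

```go
package main

import (
	"flag"
	"fmt"
)

// args is a hypothetical, trimmed-down stand-in for master.Args.
type args struct {
	ListSize int64
}

// parseArgs registers the flag on a fresh FlagSet and parses argv.
func parseArgs(argv []string) (*args, error) {
	a := &args{}
	flagset := flag.NewFlagSet("nfd-master", flag.ContinueOnError)
	flagset.Int64Var(&a.ListSize, "informer-list-size", 0,
		"The list size to use when listing NodeFeature objects to sync informer cache.")
	if err := flagset.Parse(argv); err != nil {
		return nil, err
	}
	return a, nil
}

func main() {
	a, err := parseArgs([]string{"-informer-list-size=200"})
	if err != nil {
		panic(err)
	}
	fmt.Println("list size:", a.ListSize)
}
```

When the flag is omitted, `ListSize` keeps the registered default.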

@ivelichkovich
Author

ivelichkovich commented Jan 5, 2025

I wasn't sure whether we'd want to default to 500 (the default list pagination size), which keeps the new default behavior consistent with the old behavior, or to 200, which matches the default GC pagination size. I'm open to either, but I lean towards making them consistent and setting the default here to 200. Ref: https://github.com/kubernetes-sigs/node-feature-discovery/pull/2001/files
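To make the trade-off between the two candidate defaults concrete, this small sketch computes how many list requests a full sync would need for a given page size. The cluster size of 5000 is an assumed figure for illustration only, and `pagesNeeded` is a hypothetical helper, not part of nfd-master.

```go
package main

import "fmt"

// pagesNeeded returns how many list requests are needed to fetch total
// objects with the given page size. A page size <= 0 is treated as a
// single unpaginated request.
func pagesNeeded(total, pageSize int64) int64 {
	if total <= 0 || pageSize <= 0 {
		return 1
	}
	// Integer ceiling division.
	return (total + pageSize - 1) / pageSize
}

func main() {
	const nodes = 5000 // assumed cluster size, for illustration only
	for _, size := range []int64{500, 200} {
		fmt.Printf("page size %d -> %d requests\n", size, pagesNeeded(nodes, size))
	}
}
```

A smaller page size means more requests during the initial sync, but each response is smaller.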

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 5, 2025
@ivelichkovich ivelichkovich marked this pull request as draft January 5, 2025 03:16
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 5, 2025
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 5, 2025
@ivelichkovich ivelichkovich marked this pull request as ready for review January 5, 2025 03:38
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 5, 2025
@k8s-ci-robot k8s-ci-robot requested a review from marquiz January 5, 2025 03:38
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 5, 2025
@@ -101,6 +102,7 @@ func newNfdController(config *restclient.Config, nfdApiControllerOptions nfdApiC
if opts.ResourceVersion == "0" {
Author

@ivelichkovich ivelichkovich Jan 5, 2025

By the way, I noticed the "(TODO: find out why)" note about the scalability of this resource version override. While researching pagination, I found that it is likely due to this snippet of code: https://github.com/kubernetes/kubernetes/blob/ace55542575fb098b3e413692bbe2bc20d2348ba/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L600-L616. If you set the resource version to 0, the API server serves the request from its cache and doesn't use pagination; otherwise, pagination defaults to 500. That may explain why it blows up on large clusters.

Author

So by setting this, we make the request go to etcd instead of being served from the API server cache. I found some WIP in k/k that seems to imply they're working on improving this behavior so that you'll be able to paginate from the API server cache, but AFAICT it's not supported yet. It would be good to track this, though: kubernetes/kubernetes#108003
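The behavior discussed in this thread can be sketched as a small list-options tweak: when a list request arrives with resource version "0", clear it so the request is not served from the API server cache, and set a page limit. This is an assumption-laden illustration, not the PR's code: `listOptions` is a hypothetical stand-in that carries only two of the fields of the real `metav1.ListOptions`, and `newTweakListOptions` is a made-up helper name.

```go
package main

import "fmt"

// listOptions is a hypothetical stand-in with only the two fields discussed
// here; the real type is metav1.ListOptions.
type listOptions struct {
	ResourceVersion string
	Limit           int64
}

// newTweakListOptions returns a function that adjusts list options before
// the request is sent.
func newTweakListOptions(listSize int64) func(*listOptions) {
	return func(opts *listOptions) {
		// Per the discussion above, resource version "0" is served from the
		// API server cache without pagination, so it is cleared here and a
		// page limit is applied instead.
		if opts.ResourceVersion == "0" {
			opts.ResourceVersion = ""
			opts.Limit = listSize
		}
	}
}

func main() {
	tweak := newTweakListOptions(200)

	initial := listOptions{ResourceVersion: "0"}
	tweak(&initial)
	fmt.Printf("initial sync: resourceVersion=%q limit=%d\n", initial.ResourceVersion, initial.Limit)

	later := listOptions{ResourceVersion: "12345"}
	tweak(&later)
	fmt.Printf("later list:   resourceVersion=%q limit=%d\n", later.ResourceVersion, later.Limit)
}
```

Requests that carry any other resource version pass through unchanged in this sketch.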

Contributor

@marquiz marquiz left a comment


Thank you @ivelichkovich for the enhancement. A few small comments below.
/ok-to-test
/cc @ArangoGutierrez @adrianchiris

@@ -122,6 +122,8 @@ func initFlags(flagset *flag.FlagSet) (*master.Args, *master.ConfigOverrideArgs)
"in the same format as in the config file (i.e. json or yaml). These options")
flagset.BoolVar(&args.EnableLeaderElection, "enable-leader-election", false,
"Enables a leader election. Enable this when running more than one replica on nfd master.")
flagset.Int64Var(&args.ListSize, "informer-list-size", 200,
Contributor

Just pondering (bike-shedding) the naming: would -informer-paginate or -informer-page-size be more descriptive? Thoughts?

@@ -122,6 +122,8 @@ func initFlags(flagset *flag.FlagSet) (*master.Args, *master.ConfigOverrideArgs)
"in the same format as in the config file (i.e. json or yaml). These options")
flagset.BoolVar(&args.EnableLeaderElection, "enable-leader-election", false,
"Enables a leader election. Enable this when running more than one replica on nfd master.")
flagset.Int64Var(&args.ListSize, "informer-list-size", 200,
"The list size to use when listing node features to sync informer cache.")
Contributor

I'd suggest using the name of the CRD.

Suggested change
- "The list size to use when listing node features to sync informer cache.")
+ "The list size to use when listing NodeFeature objects to sync informer cache.")

### -informer-list-size

The `-informer-list-size` flag is used to control pagination during informer cache sync on nfd-master startup.
This is useful to control load on api-server/etcd as listing `nodefeatures` can be expensive, especially in large clusters.
Contributor

Suggested change
- This is useful to control load on api-server/etcd as listing `nodefeatures` can be expensive, especially in large clusters.
+ This is useful to control load on api-server/etcd as listing NodeFeature objects can be expensive, especially in large clusters.

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 8, 2025
@k8s-ci-robot
Contributor

@ivelichkovich: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-node-feature-discovery-verify-docs-master | 0e1fe4d | link | true | /test pull-node-feature-discovery-verify-docs-master |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

4 participants