Removing labels when node NotReady #933

@TomerNewman (Contributor) commented Nov 17, 2024

Until now, nodes retained their kmod labels even when they became NotReady for any reason (e.g., a reboot).
With this change, these labels are removed whenever a node is NotReady, to address a potential race condition.


Besides updating the unit tests, I also tested this manually with simple-kmod, and it works as expected (the label was deleted when the node became NotReady).


Note

I also use IsDeprecatedKernelModuleReadyNodeLabel when deciding which labels to remove. If this is not relevant, please let me know, and I will update the PR accordingly.

/cc @yevgeny-shnaidman @ybettan
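
As a rough illustration of the label selection described above, here is a minimal, self-contained sketch. The label layout (a `kmm.node.kubernetes.io/` prefix and a `.ready` suffix) and the helpers `isReadyLabel`, `isDeprecatedReadyLabel`, and `readyLabelsToRemove` are assumptions made for this example; in the PR the matching is done by utils.IsKernelModuleReadyNodeLabel and utils.IsDeprecatedKernelModuleReadyNodeLabel, whose exact rules may differ.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Assumed label layout, for illustration only. The real checks live in the
// project's utils package and may use different rules.
const (
	assumedPrefix = "kmm.node.kubernetes.io/"
	assumedSuffix = ".ready"
)

// middlePart strips the assumed prefix and suffix and reports whether both were present.
func middlePart(label string) (string, bool) {
	if !strings.HasPrefix(label, assumedPrefix) || !strings.HasSuffix(label, assumedSuffix) {
		return "", false
	}
	return strings.TrimSuffix(strings.TrimPrefix(label, assumedPrefix), assumedSuffix), true
}

// isReadyLabel is a hypothetical stand-in for the current-format check:
// prefix + "<namespace>.<name>" + suffix.
func isReadyLabel(label string) bool {
	middle, ok := middlePart(label)
	if !ok {
		return false
	}
	parts := strings.Split(middle, ".")
	return len(parts) == 2 && parts[0] != "" && parts[1] != ""
}

// isDeprecatedReadyLabel is a hypothetical stand-in for the deprecated-format
// check: prefix + "<name>" + suffix, with no namespace component.
func isDeprecatedReadyLabel(label string) bool {
	middle, ok := middlePart(label)
	return ok && middle != "" && !strings.Contains(middle, ".")
}

// readyLabelsToRemove returns, in sorted order, every label key that matches
// either the current or the deprecated format.
func readyLabelsToRemove(labels map[string]string) []string {
	var out []string
	for label := range labels {
		if isReadyLabel(label) || isDeprecatedReadyLabel(label) {
			out = append(out, label)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	labels := map[string]string{
		"kubernetes.io/hostname":                           "worker-0",
		"kmm.node.kubernetes.io/default.simple-kmod.ready": "",
		"kmm.node.kubernetes.io/simple-kmod.ready":         "",
	}
	fmt.Println(readyLabelsToRemove(labels))
}
```

Checking both formats means a node that still carries an old-style label is cleaned up as well, which appears to be the intent of the note above.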

@k8s-ci-robot added labels on Nov 17, 2024: cncf-cla: yes (the PR's author has signed the CNCF CLA) and needs-ok-to-test (the PR requires an org member to verify it is safe to test).
@k8s-ci-robot
Contributor

Hi @TomerNewman. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/M label (a PR that changes 30-99 lines, ignoring generated files) on Nov 17, 2024.

netlify bot commented Nov 17, 2024

Deploy Preview for kubernetes-sigs-kmm ready!

Name Link
🔨 Latest commit a2a4f0b
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kmm/deploys/67445780397e5e000880667c
😎 Deploy Preview https://deploy-preview-933--kubernetes-sigs-kmm.netlify.app

@yevgeny-shnaidman
Contributor

/ok-to-test

@k8s-ci-robot added the ok-to-test label (a non-member PR verified by an org member as safe to test) and removed the needs-ok-to-test label on Nov 18, 2024.
Comment on lines 133 to 142
modifiedNode := node.DeepCopy()
patchFrom := client.MergeFrom(modifiedNode)
for label := range modifiedNode.GetLabels() {
	if ok, _, _ := utils.IsKernelModuleReadyNodeLabel(label); ok ||
		utils.IsDeprecatedKernelModuleReadyNodeLabel(label) {
		delete(node.ObjectMeta.Labels, label)
	}
}
if err := r.client.Patch(ctx, &node, patchFrom); err != nil {
	return ctrl.Result{}, fmt.Errorf("could not patch node %s: %v", node.Name, err)
Contributor

How about we use r.nodeAPI.UpdateLabels instead?

Contributor Author

How about we use r.nodeAPI.UpdateLabels instead?

Good idea
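
For context on the snippet discussed in this thread: it snapshots the node, deletes labels from the live object, and sends the difference as a merge patch. In a JSON merge patch (RFC 7386), a key is removed by setting it to null. The sketch below builds such a patch body with the standard library only; `labelRemovalPatch` and the example label key are hypothetical names for illustration, not part of the project or of controller-runtime.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// labelRemovalPatch builds a JSON merge patch body that deletes the given
// label keys. In a merge patch, a null value means "remove this key".
func labelRemovalPatch(labelsToRemove []string) ([]byte, error) {
	labels := make(map[string]interface{}, len(labelsToRemove))
	for _, label := range labelsToRemove {
		labels[label] = nil // marshals to JSON null
	}
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"labels": labels,
		},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := labelRemovalPatch([]string{"example.com/a.ready"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Routing the change through a single helper such as the suggested r.nodeAPI.UpdateLabels keeps this patching detail in one place instead of repeating it in the reconciler.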

@TomerNewman force-pushed the bugfix/deleting-label-when-node-notready branch from e668a21 to bb7d3dd on November 18, 2024 17:08
if !r.nodeAPI.IsNodeSchedulable(&node) {
	var labelsToRemove []string
Contributor

Tomer, wdyt about moving this whole if block into a nodeAPI method? This code does not need any data from the NMC; it only looks at the node.

Contributor

+1

Contributor Author

sure

@TomerNewman force-pushed the bugfix/deleting-label-when-node-notready branch from bb7d3dd to 3098b4b on November 19, 2024 10:49
@k8s-ci-robot added the size/L label (a PR that changes 100-499 lines, ignoring generated files) and removed the size/M label on Nov 19, 2024.
NodeBecomeReadyAfter(node *v1.Node, checkTime metav1.Time) bool
RemoveAllKmodLabels(ctx context.Context, node *v1.Node) error
Contributor

Suggested change
- RemoveAllKmodLabels(ctx context.Context, node *v1.Node) error
+ RemoveKmodReadyLabels(ctx context.Context, node *v1.Node) error

Contributor

I would suggest renaming it "RemoveNodeReadyLabels", since it is not tied to a specific kmod but applies to any kmod on the node.

Contributor Author

@TomerNewman commented Nov 19, 2024

I would suggest renaming it "RemoveNodeReadyLabels", since it is not tied to a specific kmod but applies to any kmod on the node.

Isn't RemoveNodeReadyLabels too general? For example, I think a node could have a label for its own Ready status.

Contributor

ok, then RemoveNodeKmodReadyLabels. @ybettan wdyt?

Contributor

I would go with RemoveNodeReadyLabels.

There is no "ready" label on the node by default:

$ kubectl get node/minikube -o yaml | yq '.metadata.labels'
beta.kubernetes.io/arch: amd64
beta.kubernetes.io/os: linux
kubernetes.io/arch: amd64
kubernetes.io/hostname: minikube
kubernetes.io/os: linux
minikube.k8s.io/commit: 5883c09216182566a63dff4c326a6fc9ed2982ff
minikube.k8s.io/name: minikube
minikube.k8s.io/primary: "true"
minikube.k8s.io/updated_at: 2024_11_25T11_06_54_0700
minikube.k8s.io/version: v1.33.1
node-role.kubernetes.io/control-plane: ""
node.kubernetes.io/exclude-from-external-load-balancers: ""

"Ready" on the node is a condition, not a label:

$ kubectl get node/minikube -o yaml | yq '.status.conditions[] | select (.type == "Ready")'
lastHeartbeatTime: "2024-11-25T09:07:08Z"
lastTransitionTime: "2024-11-25T09:07:08Z"
message: kubelet is posting ready status
reason: KubeletReady
status: "True"
type: Ready
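
To make the distinction concrete in code, here is a small sketch that decides readiness from conditions and never from labels. `Node`, `NodeCondition`, and `isNodeReady` are simplified, hypothetical stand-ins for the corresponding Kubernetes API types, defined locally so the example is self-contained.

```go
package main

import "fmt"

// NodeCondition is a simplified stand-in for the Kubernetes node condition type.
type NodeCondition struct {
	Type   string
	Status string
}

// Node is a simplified stand-in holding only what this example needs.
type Node struct {
	Labels     map[string]string
	Conditions []NodeCondition
}

// isNodeReady reports whether the node has a "Ready" condition whose status
// is "True". Labels are deliberately ignored.
func isNodeReady(node Node) bool {
	for _, cond := range node.Conditions {
		if cond.Type == "Ready" {
			return cond.Status == "True"
		}
	}
	return false // no Ready condition reported
}

func main() {
	node := Node{
		Labels:     map[string]string{"kubernetes.io/os": "linux"},
		Conditions: []NodeCondition{{Type: "Ready", Status: "True"}},
	}
	fmt.Println("ready:", isNodeReady(node))
}
```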

Contributor Author

You are right, I changed it to RemoveNodeReadyLabels then.

@codecov-commenter

codecov-commenter commented Nov 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.65%. Comparing base (fa23a9b) to head (a2a4f0b).
Report is 147 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #933      +/-   ##
==========================================
- Coverage   79.09%   72.65%   -6.45%     
==========================================
  Files          51       65      +14     
  Lines        5109     5756     +647     
==========================================
+ Hits         4041     4182     +141     
- Misses        882     1395     +513     
+ Partials      186      179       -7     


@TomerNewman force-pushed the bugfix/deleting-label-when-node-notready branch from 3098b4b to bb35bc3 on November 25, 2024 10:05
@TomerNewman
Contributor Author

Codecov Report

Attention: Patch coverage is 78.57143% with 3 lines in your changes missing coverage. Please review.

Project coverage is 72.60%. Comparing base (fa23a9b) to head (3098b4b).
Report is 146 commits behind head on main.

Files with missing lines Patch % Lines
internal/controllers/nmc_reconciler.go 0.00% 2 Missing and 1 partial ⚠️

Noticed. I added another unit test to cover the failing scenario.

Until now, nodes kept their kmod labels when they became not ready.
Now we delete these labels while a node is not ready, because of a potential race condition.
@TomerNewman force-pushed the bugfix/deleting-label-when-node-notready branch from bb35bc3 to a2a4f0b on November 25, 2024 10:54
@ybettan
Contributor

ybettan commented Nov 25, 2024

/approve
/assign @yevgeny-shnaidman

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: TomerNewman, ybettan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (the PR has been approved by an approver from all required OWNERS files) on Nov 25, 2024.
@yevgeny-shnaidman
Contributor

/lgtm

@k8s-ci-robot added the lgtm label ("Looks good to me": the PR is ready to be merged) on Nov 25, 2024.
@k8s-ci-robot merged commit 3596563 into kubernetes-sigs:main on Nov 25, 2024.
22 checks passed