
previous node taint is wiped when kubelet restarts #119645

Open
Chaunceyctx opened this issue Jul 28, 2023 · 14 comments · May be fixed by #129485
Assignees
Labels
kind/bug, sig/node, triage/accepted

Comments

@Chaunceyctx (Contributor) commented Jul 28, 2023

What happened?

I have a Kubernetes cluster (v1.27.2) containing one node. I set evictionHard: nodefs.available: 90% and wrote a large amount of data to the kubelet root dir (8GB used of 10GB total) to trigger eviction. The node.kubernetes.io/disk-pressure taint was added to this node as expected. But when the kubelet restarted, the previous disk-pressure taint was unexpectedly wiped, and a pending pod was then scheduled onto the node during this window. I checked the kubelet logs and found:

kubelet restart:

I0727 11:09:05.782733 3813426 flags.go:64] FLAG: --tls-private-key-file=""
I0727 11:09:05.782737 3813426 flags.go:64] FLAG: --topology-manager-policy="none"
I0727 11:09:05.782741 3813426 flags.go:64] FLAG: --topology-manager-scope="container"
I0727 11:09:05.782745 3813426 flags.go:64] FLAG: --v="8"
I0727 11:09:05.782749 3813426 flags.go:64] FLAG: --version="false"
I0727 11:09:05.782754 3813426 flags.go:64] FLAG: --vmodule=""
I0727 11:09:05.782757 3813426 flags.go:64] FLAG: --volume-plugin-dir="/usr/libexec/kubernetes/kubelet-plugins/volume/exec/"
I0727 11:09:05.782761 3813426 flags.go:64] FLAG: --volume-stats-agg-period="1m0s"
I0727 11:09:05.796690 3813426 mount_linux.go:222] Detected OS with systemd

kubelet updates node status to NodeHasNoDiskPressure:

I0727 11:09:10.858628 3813426 kubelet_node_status.go:632] "Recording event message for node" node="192.168.2.107" event="NodeHasNoDiskPressure"
I0727 11:09:10.858640 3813426 kubelet_node_status.go:762] "Setting node status condition code" position=9 node="192.168.2.107"

eviction manager starts to synchronize:

I0727 11:09:13.373393 3813426 eviction_manager.go:292] "Eviction manager: synchronize housekeeping"

Q: Why does the kubelet report NodeHasNoDiskPressure?
A: The eviction manager has not yet executed its synchronize method when the node status is first updated, so the status update sees no disk pressure.
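
For reference, one way to confirm this ordering on the node itself is to grep the kubelet journal for the two messages above. This is only a sketch and assumes the kubelet runs as a systemd unit named kubelet (consistent with the "Detected OS with systemd" line in the startup log):

# Compare the timestamps: the NodeHasNoDiskPressure status update appears
# before the eviction manager's first "synchronize housekeeping" pass.
$ journalctl -u kubelet --since "10 minutes ago" | grep -E "NodeHasNoDiskPressure|synchronize housekeeping"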

Normally, taints that were previously applied to a node should be retained across a kubelet restart; otherwise the restart briefly opens a window in which pods can be scheduled onto a node that is still under pressure. Maybe we need to make sure that these taints cannot be deleted when the kubelet restarts?
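
For what it's worth, the transient window can be observed from outside the node with a small polling loop. This is only a sketch; the node name 192.168.2.107 is taken from the logs above, so substitute your own:

# Poll the node's taint keys once per second; restart the kubelet in another
# terminal and the disk-pressure taint disappears for a few seconds before
# the eviction manager re-adds it.
$ while true; do
    echo "$(date +%T) $(kubectl get node 192.168.2.107 -o jsonpath='{.spec.taints[*].key}')"
    sleep 1
  done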

What did you expect to happen?

The previous node taint is not wiped when the kubelet restarts.

How can we reproduce it (as minimally and precisely as possible)?

Restart the kubelet repeatedly after disk-pressure eviction has been triggered and observe node.spec.taints (see the sketch below).
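
A minimal end-to-end sketch of the reproduction, assuming a systemd-managed kubelet whose config file is /var/lib/kubelet/config.yaml and whose root dir is /var/lib/kubelet (both paths, and the node name 192.168.2.107 from the logs above, are assumptions; adjust them to your setup):

# 1. In the kubelet config, set the aggressive threshold from this report
#    (evictionHard: nodefs.available: "90%") and restart the kubelet.
$ systemctl restart kubelet
# 2. Fill the filesystem backing the kubelet root dir far enough to cross
#    the threshold, then wait for the disk-pressure taint to appear.
$ fallocate -l 8G /var/lib/kubelet/fill-disk.img
$ kubectl get node 192.168.2.107 -o jsonpath='{.spec.taints}'
# 3. Restart the kubelet again and re-check the taints: right after the
#    restart the node.kubernetes.io/disk-pressure taint is gone, and it only
#    comes back once the eviction manager runs its first synchronize.
$ systemctl restart kubelet
$ kubectl get node 192.168.2.107 -o jsonpath='{.spec.taints}'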

Anything else we need to know?

No response

Kubernetes version

1.27.2

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@Chaunceyctx added the kind/bug label on Jul 28, 2023
@k8s-ci-robot added the needs-sig and needs-triage labels on Jul 28, 2023
@Chaunceyctx (Contributor, Author)

/sig node
/assign

@k8s-ci-robot added the sig/node label and removed the needs-sig label on Jul 28, 2023
@Chaunceyctx (Contributor, Author) commented Jul 28, 2023

@ffromani (Contributor)

/cc

@tzneal (Contributor) commented Aug 2, 2023

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Aug 2, 2023
@Chaunceyctx (Contributor, Author) commented Aug 3, 2023

@klueska @dashpole @liggitt @derekwaynecarr @pacoxu PTAL. Thanks!

@heartlock (Contributor) commented Aug 11, 2023

I have encountered the same bug. Is there a plan to fix it?

@SuQiucheng

I hope the community can fix this bug as soon as possible.

@waiterQ commented Aug 11, 2023

I have also run into this before. Is there a plan to fix it?

@Chaunceyctx (Contributor, Author)

@ffromani @tzneal I believe the impact of this problem is actually quite small: the node's taints disappear briefly and then reappear. So I would like to ask whether it is still necessary for us to fix this problem.

@kannon92 moved this to Triaged in SIG Node Bugs on Jul 22, 2024
@SuQiucheng

I have also met this problem. I wonder whether fixing it is worthwhile.

@pacoxu (Member) commented Dec 24, 2024

> @ffromani @tzneal I believe the impact of this problem is actually quite small: the node's taints disappear briefly and then reappear. So I would like to ask whether it is still necessary for us to fix this problem.

I feel this is a bug that should be fixed, although the impact doesn't seem particularly significant.

During the kubelet startup process, a check should be performed when setting the condition, rather than clearing the taints first.

cc @SergeyKanzhelev @kannon92

@Chaunceyctx (Contributor, Author)

> > @ffromani @tzneal I believe the impact of this problem is actually quite small: the node's taints disappear briefly and then reappear. So I would like to ask whether it is still necessary for us to fix this problem.
>
> I feel this is a bug that should be fixed, although the impact doesn't seem particularly significant.
>
> During the kubelet startup process, a check should be performed when setting the condition, rather than clearing the taints first.
>
> cc @SergeyKanzhelev @kannon92

Yes, I think this is a TOCTOU (time-of-check to time-of-use) bug. I will rethink how to fix it; the previous PR cannot fix this issue.

@pacoxu (Member) commented Jan 8, 2025

@Chaunceyctx can you reproduce it with v1.32? I cannot reproduce it with v1.32 locally.

I tried v1.27 and I can reproduce it. I am not sure why. Can you give it a try?

@Chaunceyctx (Contributor, Author)

> @Chaunceyctx can you reproduce it with v1.32? I cannot reproduce it with v1.32 locally.
>
> I tried v1.27 and I can reproduce it. I am not sure why. Can you give it a try?

OK, I will try to reproduce it with v1.32.

Projects
Status: Triaged
8 participants