
previous node taint is wiped when kubelet restarts #119645

Open
Chaunceyctx opened this issue Jul 28, 2023 · 14 comments · May be fixed by #129485
Assignees
Labels
kind/bug, sig/node, triage/accepted

Comments

@Chaunceyctx (Contributor) commented Jul 28, 2023

What happened?

I have a Kubernetes cluster (v1.27.2) containing one node. I set evictionHard: nodefs.available: 90% and wrote a large amount of data to the kubelet root dir (8GB used of 10GB total) to trigger eviction. The node.kubernetes.io/disk-pressure taint was added to this node as expected. But when the kubelet restarted, the previous disk-pressure taint was unexpectedly wiped, and a pending pod was then scheduled onto the node during this window. I checked the kubelet logs and found:

kubelet restart:

I0727 11:09:05.782733 3813426 flags.go:64] FLAG: --tls-private-key-file=""
I0727 11:09:05.782737 3813426 flags.go:64] FLAG: --topology-manager-policy="none"
I0727 11:09:05.782741 3813426 flags.go:64] FLAG: --topology-manager-scope="container"
I0727 11:09:05.782745 3813426 flags.go:64] FLAG: --v="8"
I0727 11:09:05.782749 3813426 flags.go:64] FLAG: --version="false"
I0727 11:09:05.782754 3813426 flags.go:64] FLAG: --vmodule=""
I0727 11:09:05.782757 3813426 flags.go:64] FLAG: --volume-plugin-dir="/usr/libexec/kubernetes/kubelet-plugins/volume/exec/"
I0727 11:09:05.782761 3813426 flags.go:64] FLAG: --volume-stats-agg-period="1m0s"
I0727 11:09:05.796690 3813426 mount_linux.go:222] Detected OS with systemd

kubelet updates node status to NodeHasNoDiskPressure:

I0727 11:09:10.858628 3813426 kubelet_node_status.go:632] "Recording event message for node" node="192.168.2.107" event="NodeHasNoDiskPressure"
I0727 11:09:10.858640 3813426 kubelet_node_status.go:762] "Setting node status condition code" position=9 node="192.168.2.107"

eviction manager starts to synchronize:

I0727 11:09:13.373393 3813426 eviction_manager.go:292] "Eviction manager: synchronize housekeeping"

Q: Why does the kubelet report NodeHasNoDiskPressure?
A: The eviction manager has not yet executed its synchronize method when the node status is first updated, so the status update sees no disk pressure.
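
For reference, one way to confirm this ordering on the node itself is to grep the kubelet journal for the two messages above. This is only a sketch and assumes the kubelet runs as a systemd unit named kubelet (consistent with the "Detected OS with systemd" line in the startup log):

# Compare the timestamps: the NodeHasNoDiskPressure status update appears
# before the eviction manager's first "synchronize housekeeping" pass.
$ journalctl -u kubelet --since "10 minutes ago" | grep -E "NodeHasNoDiskPressure|synchronize housekeeping"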

Normally, taints that were previously applied to a node should be retained across a kubelet restart; otherwise the restart briefly opens a window in which pods can be scheduled onto a node that is still under pressure. Maybe we need to make sure that these taints cannot be deleted when the kubelet restarts?
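
For what it's worth, the transient window can be observed from outside the node with a small polling loop. This is only a sketch; the node name 192.168.2.107 is taken from the logs above, so substitute your own:

# Poll the node's taint keys once per second; restart the kubelet in another
# terminal and the disk-pressure taint disappears for a few seconds before
# the eviction manager re-adds it.
$ while true; do
    echo "$(date +%T) $(kubectl get node 192.168.2.107 -o jsonpath='{.spec.taints[*].key}')"
    sleep 1
  done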

What did you expect to happen?

The previous node taint is not wiped when the kubelet restarts.

How can we reproduce it (as minimally and precisely as possible)?

Restart the kubelet repeatedly after disk-pressure eviction has been triggered and observe node.spec.taints (see the sketch below).
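
A minimal end-to-end sketch of the reproduction, assuming a systemd-managed kubelet whose config file is /var/lib/kubelet/config.yaml and whose root dir is /var/lib/kubelet (both paths, and the node name 192.168.2.107 from the logs above, are assumptions; adjust them to your setup):

# 1. In the kubelet config, set the aggressive threshold from this report
#    (evictionHard: nodefs.available: "90%") and restart the kubelet.
$ systemctl restart kubelet
# 2. Fill the filesystem backing the kubelet root dir far enough to cross
#    the threshold, then wait for the disk-pressure taint to appear.
$ fallocate -l 8G /var/lib/kubelet/fill-disk.img
$ kubectl get node 192.168.2.107 -o jsonpath='{.spec.taints}'
# 3. Restart the kubelet again and re-check the taints: right after the
#    restart the node.kubernetes.io/disk-pressure taint is gone, and it only
#    comes back once the eviction manager runs its first synchronize.
$ systemctl restart kubelet
$ kubectl get node 192.168.2.107 -o jsonpath='{.spec.taints}'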

Anything else we need to know?

No response

Kubernetes version

1.27.2

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@Chaunceyctx added the kind/bug label on Jul 28, 2023
@k8s-ci-robot added the needs-sig and needs-triage labels on Jul 28, 2023
@Chaunceyctx (Contributor, Author)

/sig node
/assign

@k8s-ci-robot added the sig/node label and removed the needs-sig label on Jul 28, 2023
@Chaunceyctx (Contributor, Author) commented Jul 28, 2023

@ffromani (Contributor)

/cc

@tzneal (Contributor) commented Aug 2, 2023

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Aug 2, 2023
@Chaunceyctx (Contributor, Author) commented Aug 3, 2023

@klueska @dashpole @liggitt @derekwaynecarr @pacoxu PTAL. Thanks!

@heartlock (Contributor) commented Aug 11, 2023

I have encountered the same bug. Is there a plan to fix it?

@SuQiucheng

I hope the community can fix this bug as soon as possible.

@waiterQ commented Aug 11, 2023

I have also run into this before. Is there a plan to fix it?

@Chaunceyctx (Contributor, Author)

@ffromani @tzneal I believe the impact of this problem is actually quite small: the node's taints disappear briefly and then reappear. So I would like to ask whether it is still necessary for us to fix this problem.

@kannon92 moved this to Triaged in SIG Node Bugs on Jul 22, 2024
@SuQiucheng

I have also met this problem. I wonder whether fixing it is worthwhile.

@pacoxu (Member) commented Dec 24, 2024

> @ffromani @tzneal I believe the impact of this problem is actually quite small: the node's taints disappear briefly and then reappear. So I would like to ask whether it is still necessary for us to fix this problem.

I feel this is a bug that should be fixed, although the impact doesn't seem particularly significant.

During the kubelet startup process, a check should be performed when setting the condition, rather than clearing the taints first.

cc @SergeyKanzhelev @kannon92

@Chaunceyctx (Contributor, Author)

> > @ffromani @tzneal I believe the impact of this problem is actually quite small: the node's taints disappear briefly and then reappear. So I would like to ask whether it is still necessary for us to fix this problem.
>
> I feel this is a bug that should be fixed, although the impact doesn't seem particularly significant.
>
> During the kubelet startup process, a check should be performed when setting the condition, rather than clearing the taints first.
>
> cc @SergeyKanzhelev @kannon92

Yes, I think this is a TOCTOU (time-of-check to time-of-use) bug. I will rethink how to fix it; the previous PR cannot fix this issue.

@pacoxu (Member) commented Jan 8, 2025

@Chaunceyctx can you reproduce it with v1.32? I cannot reproduce it with v1.32 locally.

I tried v1.27 and I can reproduce it. I am not sure why. Can you give it a try?

@Chaunceyctx (Contributor, Author)

> @Chaunceyctx can you reproduce it with v1.32? I cannot reproduce it with v1.32 locally.
>
> I tried v1.27 and I can reproduce it. I am not sure why. Can you give it a try?

OK, I will try to reproduce it with v1.32.

Projects
Status: Triaged
8 participants