CPU Soft Lockup when doing heavy IO via Kernel NFS server on local ephemeral storage #4307
Comments
No idea if this is related, but googling for CPU soft lockup and a few other keywords led me to this thread: https://bbs.archlinux.org/viewtopic.php?id=264127&p=3. It appears kernels are affected up until 5.15, which appears to be what we are running.
Hey @snowzach, thanks for reporting this issue! A question:
@koooosh the ENA "TX hasn't completed" error is probably the most relevant here. awslabs/amazon-eks-ami#1704 mentions upgrading the ENA driver to 2.12.0 as a potential fix, but the 5.15 kernel is already on 2.13.0. We should try the repro.
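The version reasoning above (a fix suggested in 2.12.0, a kernel already shipping 2.13.0) can be sketched as a numeric comparison. This is a minimal illustration only; `parse_version` and `includes_fix` are hypothetical helper names, and the sketch assumes plain dotted numeric versions.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '2.13.0' into (2, 13, 0)."""
    return tuple(int(part) for part in text.strip().split("."))


def includes_fix(installed: str, fixed_in: str) -> bool:
    """Return True when the installed driver is at or past the fixed version."""
    # Compare numerically: a plain string comparison would rank "2.9.1"
    # above "2.12.0", which is wrong.
    return parse_version(installed) >= parse_version(fixed_in)


# Driver versions quoted in the comment above.
print(includes_fix("2.13.0", "2.12.0"))  # True
```

If the check passes, as it does here, the driver upgrade alone is unlikely to be the missing fix, which is why a repro is the next step.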
Hi @koooosh, I have experienced it with older versions too. I could try with v1.26.2 as well. What's also interesting is that I believe I also experienced it when using AL2 nodes, but I don't recall the versions of those. I am not super familiar with how kernel versions map to Kubernetes versions: is kernel 5.15 the highest I would be able to run with Kube 1.25? I am not super keen to upgrade Kubernetes, but I would consider it if I knew for sure it would fix the problem.
Here's another interesting one that VERY closely matches the issue I am having: awslabs/amazon-eks-ami#454, which was resolved with a kernel patch from quite a few years ago.
Just another thought: since the NFS server runs in the kernel, is there some sort of throttling/policing of system resources by kubelet that could be causing this?
That's correct - the Bottlerocket k8s-1.25 variant specifically uses kernel 5.15. If you'd like to use kernel 6.1, you would have to use k8s 1.28.
Alright... So I decided to try creating a new cluster with 1.25 and was going to upgrade a version at a time to see if the issue stopped happening. I've been having the issue in us-west-2, and I spun this one up in us-east-2 to try to reproduce it... and it didn't have the issue. Same NFS server container, same Kube version 1.25, same Bottlerocket version (testing with 1.27.1 now). The CNI plugin wasn't the same, so I tried upgrading to match, and still had the issue. I'm at a loss, unless there's some underlying hardware difference you can't see. The only difference I can think of is that some of the plugins differ between them: the cluster without the issue has newer versions of the EFS driver, kube-proxy, EBS driver, and CoreDNS, and the cluster with the issue has Grafana Agent collecting logs/metrics. That's it... Anything else I can check!?
Edit: I noticed the platform version is different:
Hey @snowzach, could you please provide more details about your setup? You mentioned the EFS (CSI?) driver, but aren't you running your own NFS server? I'm trying to understand how the EFS driver comes into the picture if the goal is to use your own NFS server.
On a side note, I did a first test with
Another question I have for you is related to your experience of using and configuring the ephemeral storage. I tried to use
I build the node using Karpenter with the following settings (I'm using a slightly old version of Karpenter)
To mount the NVMes I just specify an
As far as configuring ephemeral storage, I think it would be good to loosen the directories. I also think it would be good to allow setting mount options. I mount with
Alright, so I am seeing that this is NOT necessarily related to Bottlerocket. Info:
I'm starting to wonder if this is the ENA driver or something to do with the NVMes (or them fighting with each other). Attached is a log from the AL2 instance with the same failure. It looks almost exactly the same, except it says xfs instead of ext4.
Hey @snowzach, I gave this one more try, but I still can't replicate what you are experiencing. Please let us know what additional configurations you are setting or the flags you are passing to
Created a cluster with AMI
[ssm-user@control]$ apiclient get settings.kernel
{
  "settings": {
    "kernel": {
      "lockdown": "integrity",
      "sysctl": {
        "fs.file-max": "512000000",
        "net.core.default_qdisc": "fq",
        "net.core.rmem_max": "67108864",
        "net.core.wmem_max": "67108864",
        "net.ipv4.tcp_congestion_control": "htcp",
        "net.ipv4.tcp_mtu_probing": "1",
        "net.ipv4.tcp_rmem": "4096 87380 33554432",
        "net.ipv4.tcp_wmem": "4096 87380 33554432",
        "vm.min_free_kbytes": "524288"
      }
    }
  }
}
[ssm-user@control]$
Created RAID array, and mounted it manually, because I didn't want to use
mkdir /mnt/block
apiclient ephemeral-storage init -t ext4
mount /dev/md/ephemeral /mnt/block
Created container image to run the NFS server with the same Dockerfile you provided, and a slightly modified version of your entrypoint (I used the default ports):
#!/bin/bash
set -e
NFS_THREADS=${NFS_THREADS:-64}

function start() {
    # prepare /etc/exports
    fsid=0
    for i in "$@"; do
        echo "$i *(rw,fsid=$fsid,no_subtree_check,no_root_squash)" >> /etc/exports
        if [ -v gid ] ; then
            chmod 070 $i
            chgrp $gid $i
        fi
        echo "Serving $i"
        fsid=$((fsid + 1))
    done

    # start rpcbind if it is not started yet
    set +e
    /usr/sbin/rpcinfo 127.0.0.1 > /dev/null; s=$?
    set -e
    if [ $s -ne 0 ]; then
        echo "Starting rpcbind"
        /sbin/rpcbind -w
    fi

    mount -t nfsd nfds /proc/fs/nfsd
    /usr/sbin/rpc.mountd
    /usr/sbin/exportfs -r
    # -G 10 to reduce grace time to 10 seconds (the lowest allowed)
    # -V 3: enable NFSv3
    /usr/sbin/rpc.nfsd -G 10 $NFS_THREADS
    /sbin/rpc.statd --no-notify
    echo "NFS started with $NFS_THREADS threads"
}

function stop()
{
    echo "Stopping NFS"
    /usr/sbin/rpc.nfsd 0
    /usr/sbin/exportfs -au
    /usr/sbin/exportfs -f
    kill $( pidof rpc.mountd )
    umount /proc/fs/nfsd
    echo > /etc/exports
    exit 0
}

trap stop TERM
start "$@"

# Keep the container running
sleep infinity
Deployed the pods in the cluster, with:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nfs
spec:
  selector:
    matchLabels:
      name: nfs
  template:
    metadata:
      labels:
        name: nfs
    spec:
      containers:
        - name: nfs
          # image: registry.k8s.io/volume-nfs:latest
          image: <>.dkr.ecr.us-west-2.amazonaws.com/nfs-server-problem:v3
          ports:
            - name: nfs
              containerPort: 2049
            - name: mountd
              containerPort: 20048
            - name: rpcbind
              containerPort: 111
          securityContext:
            privileged: true
          volumeMounts:
            - mountPath: /exports
              name: ephemeral
      volumes:
        - name: ephemeral
          hostPath:
            path: /mnt/block
            type: Directory
      nodeSelector:
        node.kubernetes.io/instance-type: i3en.2xlarge
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nfs-client
spec:
  selector:
    matchLabels:
      name: nfs-client
  template:
    metadata:
      labels:
        name: nfs-client
    spec:
      containers:
        - name: nfs-client
          command: ["sleep", "infinity"]
          image: fedora:41
          securityContext:
            privileged: true
      nodeSelector:
        node.kubernetes.io/instance-type: m5.xlarge
---
apiVersion: v1
kind: Service
metadata:
  name: nfs-1-26-1
spec:
  selector:
    name: nfs
  ports:
    - name: nfs
      port: 2049
    - name: mountd
      port: 20048
    - name: rpcbind
      port: 111
Mounted the NFS server in the client:
[root@nfs-client-dz59r /]# mount -t nfs -o vers=4.2 ${NFS_1_26_1_SERVICE_HOST}:/ /mnt
# Test the file was created in the remote filesystem
[root@nfs-client-dz59r /]# touch /mnt/test
# Perform the test
[root@nfs-client-dz59r /]# bonnie++ -d /mnt/ -u root -c 10
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 2.00 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
nfs-clie 31008M::10 1015k 99 549m 33 271m 21 892k 99 515m 21 +++++ +++
Latency 8920us 419ms 1335ms 1349ms 9199us 8192us
Version 2.00 ------Sequential Create------ --------Random Create--------
nfs-client-dz59r -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 0 10 +++++ +++ 0 11 0 10 0 12 0 11
Latency 11897us 25us 10324us 6796us 1563us 3903us
1.98,2.00,nfs-client-dz59r,10,1732664184,31008M,,8192,5,1015,99,562528,33,277052,21,892,99,527541,21,+++++,+++,16,,,,,1426,10,+++++,+++,3865,11,1441,10,9554,12,3912,11,8920us,419ms,1335ms,1349ms,9199us,8192us,11897us,25us,10324us,6796us,1563us,3903us
[root@nfs-client-dz59r /]#
# From the host
bash-5.1# ls /mnt/block/test
/mnt/block/test
## After starting the test
bash-5.1# ls /mnt/block/
Bonnie.270 lost+found test
## No logs regarding incomplete TX/RX
bash-5.1# journalctl -k | grep "TX hasn't completed"
# Nothing
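For comparing runs between clusters, the machine-readable CSV line that bonnie++ prints at the end of the run above can be parsed rather than read off the table. The sketch below is a rough illustration: the field positions are an assumption, inferred by lining the CSV up with the human-readable table in this comment (not taken from the bonnie++ documentation), the KiB/s unit is likewise assumed, and `parse_bonnie_csv` and `kib_to_mib` are hypothetical helper names.

```python
# Assumed column positions (0-based), inferred by matching the CSV values
# against the table printed in the same run.
BONNIE_FIELDS = {
    "seq_out_per_chr": 9,
    "seq_out_block": 11,
    "seq_out_rewrite": 13,
    "seq_in_per_chr": 15,
    "seq_in_block": 17,
}


def parse_bonnie_csv(line):
    """Extract the sequential throughput figures from one bonnie++ CSV line."""
    cols = line.strip().split(",")
    result = {"host": cols[2], "file_size": cols[5]}
    for name, idx in BONNIE_FIELDS.items():
        raw = cols[idx]
        # Runs of '+' appear where the table shows +++++ (no figure reported).
        result[name] = None if raw.startswith("+") else int(raw)
    return result


def kib_to_mib(kib_per_s):
    """Convert an assumed KiB/s figure to MiB/s."""
    return kib_per_s / 1024


# CSV line copied from the run above.
CSV_LINE = (
    "1.98,2.00,nfs-client-dz59r,10,1732664184,31008M,,8192,5,1015,99,"
    "562528,33,277052,21,892,99,527541,21,+++++,+++,16,,,,,1426,10,"
    "+++++,+++,3865,11,1441,10,9554,12,3912,11,8920us,419ms,1335ms,"
    "1349ms,9199us,8192us,11897us,25us,10324us,6796us,1563us,3903us"
)

parsed = parse_bonnie_csv(CSV_LINE)
print(round(kib_to_mib(parsed["seq_out_block"])))  # 549
print(round(kib_to_mib(parsed["seq_in_block"])))   # 515
```

The two printed values line up with the 549m and 515m block figures in the table, which is the only check on the assumed field positions.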
Yeah, I have no idea what the issue is... It very much seems like some sort of hardware/driver issue, maybe related to ENA + NVMe contention... just a guess. Another interesting thing I have done to resolve the issue (thus far) is to disable the sync option on the NFS server. Since it's basically an ephemeral drive, I don't care about syncing. I'm guessing that has reduced the contention/load, and I have been unable to trigger the error since. Like I said, I set up the exact same thing in US East 2 and I could not make the issue happen. I have no idea at this point.
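The comment above does not say exactly how sync was disabled. One plausible reading, sketched here as an assumption, is the `async` export option from exports(5), added to the same export line the entrypoint script writes. `build_exports` is a hypothetical helper for illustration, not part of the original image.

```python
def build_exports(paths, sync=True, clients="*"):
    """Build /etc/exports content in the shape the entrypoint script writes."""
    # Assumption: "disabling sync" means exporting with the `async` option.
    mode = "sync" if sync else "async"
    lines = []
    for fsid, path in enumerate(paths):
        options = f"rw,{mode},fsid={fsid},no_subtree_check,no_root_squash"
        lines.append(f"{path} {clients}({options})")
    return "\n".join(lines) + "\n"


print(build_exports(["/exports"], sync=False), end="")
# /exports *(rw,async,fsid=0,no_subtree_check,no_root_squash)
```

With `async`, the server may reply to writes before the data reaches stable storage, so a server crash can lose data. That trade-off fits a cache on ephemeral disks, but not storage that has to survive a reboot.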
Thanks for your patience @snowzach, and I'm glad you have a workaround. We will engage with the ENA team to understand what could be the root cause. They may have a better way to reproduce it, but thankfully you provided those logs, so that's great! On the
Hi @snowzach, we've tried a few more times to reproduce this issue internally to no avail. Have you discovered anything unique about the cluster where this was occurring that might help track it down? Also, could you please confirm for me that both the AL2 and the Bottlerocket nodes that showed this error ran on the same EKS cluster? If you are able to open a ticket with AWS support and provide as many details as possible about the instances this happened to, we'd really appreciate it. This might help us troubleshoot whether there is more going on, like hardware issues.
Hi @rpkelly, I haven't really discovered anything unique in this case unfortunately. I can confirm I have seen the issue both on AL2 and Bottlerocket on the same EKS cluster. I can open a ticket with AWS.
Platform I'm building on:
Running a very simple NFS server container on bottlerocket-aws-k8s-1.25-x86_64-v1.26.1-943d9a41
Dockerfile:
Entrypoint script:
Essentially I run this on an AWS i3en with local flash provisioned as ephemeral storage shared via this NFS server. It's a high-performance cache drive. Testing with i3en.2xlarge
What I expected to happen:
It would be a super fast NFS server sharing this ephemeral storage.
What actually happened:
I can mount this storage from another i3en.2xlarge instance and mostly it works unless we really push it. If I run the disk testing tool bonnie++ -d /the/nfs/share -u nobody and wait, within a minute or two the machine will start displaying errors in the logs about
watchdog: BUG: soft lockup - CPU#7 stuck for 22s!
as well as
ena 0000:00:06.0 eth2: TX hasn't completed, qid 5, index 801. 26560 msecs since last interrupt, 41910 msecs since last napi execution, napi scheduled: 1
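To count or timestamp these events across a long kernel log, the two messages quoted above can be matched with regular expressions. This is a minimal sketch: the patterns are derived only from the two sample lines in this report, so other kernel or ENA driver versions may word the messages differently, and `classify` is a hypothetical helper name.

```python
import re

# Patterns built from the two sample messages quoted in this issue.
SOFT_LOCKUP_RE = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<seconds>\d+)s!"
)
ENA_TX_RE = re.compile(
    r"ena (?P<pci>\S+) (?P<iface>\S+): TX hasn't completed, "
    r"qid (?P<qid>\d+), index (?P<index>\d+)\. "
    r"(?P<since_irq_ms>\d+) msecs since last interrupt, "
    r"(?P<since_napi_ms>\d+) msecs since last napi execution, "
    r"napi scheduled: (?P<napi_scheduled>\d+)"
)


def classify(line):
    """Return a dict describing a matching kernel log line, or None."""
    m = SOFT_LOCKUP_RE.search(line)
    if m:
        return {
            "kind": "soft_lockup",
            "cpu": int(m["cpu"]),
            "seconds": int(m["seconds"]),
        }
    m = ENA_TX_RE.search(line)
    if m:
        return {
            "kind": "ena_tx_timeout",
            "iface": m["iface"],
            "qid": int(m["qid"]),
            "since_irq_ms": int(m["since_irq_ms"]),
            "since_napi_ms": int(m["since_napi_ms"]),
        }
    return None


# Sample lines copied from the report above.
SAMPLES = [
    "watchdog: BUG: soft lockup - CPU#7 stuck for 22s!",
    "ena 0000:00:06.0 eth2: TX hasn't completed, qid 5, index 801. "
    "26560 msecs since last interrupt, 41910 msecs since last napi "
    "execution, napi scheduled: 1",
]

for sample in SAMPLES:
    print(classify(sample))
```

Feeding it the lines of a saved kernel log (for example the attached bottlerocket-log.txt) would show whether the ENA timeouts start before or after the first soft lockup.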
How to reproduce the problem:
Run the container, run bonnie++ on the NFS share.
It's very reliably reproduced.
Attached is a log:
bottlerocket-log.txt