Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Manual Volume detach case not handled #2537

Open
CoreyCook8 opened this issue Sep 28, 2024 · 4 comments
Open

Manual Volume detach case not handled #2537

CoreyCook8 opened this issue Sep 28, 2024 · 4 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@CoreyCook8
Copy link

CoreyCook8 commented Sep 28, 2024

What happened:

After a volume was manually detached from a VM, two pods were mistakenly using the same Volume as their mounted volume.

What you expected to happen:

In an AWS cluster, this same issue gives this error message:

  Warning  FailedMount  14s (x6 over 30s)  kubelet            MountVolume.MountDevice failed for volume "pvc-XXXXX" : rpc error: code = Internal desc = Failed to find device path /dev/xvdaa. refusing to mount /dev/nvme3n1 because it claims to be volX but should be volY

I would expect this to be handled in a similar manner.

How to reproduce it:

  1. Create a pod that mounts a PVC.
  2. After the pod is running, manually detach the disk using the azure portal. (The pod will still show as running)
  3. Create another pod that mounts a PVC and assign it to the same node.
  4. Both pods should be running at this point.
  5. Delete & recreate the first pod
  6. The pod should go into a running state without the volume attaching
  7. At this point, they will both be using the same Volume
  8. To verify you can exec into both pods, create a file in the mounted directory in one and verify that it's shown in the other pod

Anything else we need to know?:

Environment:

  • CSI Driver version: 1.30.4
  • Kubernetes version (use kubectl version): 1.28.9
  • OS (e.g. from /etc/os-release): Ubuntu 20.04.6 LTS
  • Kernel (e.g. uname -a): Linux 5.4.0-1138-azure # 145-Ubuntu SMP Fri Aug 30 16:04:18 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others:
@andyzhangx
Copy link
Member

this is expected since on Azure VM, device name is not bound to disk name, e.g. disk1 is mounted as /dev/sdc, and when disk1 is manually detached and disk2 is attached to the VM, disk2 is mounted as /dev/sdc, if you delete & recreate the first pod with disk1 volume, then disk1 would still use /dev/sdc since at that time CSI driver thinks that disk1 is still attached to the VM, it would just reuse the previous device name(de/sdc).

BTW, manual volume detach is not supported CSI driver scenario, that's out of CSI driver control.

@CoreyCook8
Copy link
Author

I understand that manual detach is out of the control of the csi driver. But, I would expect the CSI driver to ensure that a new pod is using the volume it has requested and not another pod's volume. If the pod is deleted, and the new pod is attached to the same VM, I would expect the csi driver to check the drive, and make sure the expected volume == the actual volume.

Or, when attaching the second disk to the same drive as the first disk, it would realize that a disk should already be there / realize that the first disk is no longer attached.

@andyzhangx
Copy link
Member

due to the manual detach, the kubelet thinks that the disk1 is already attached to the node, thus CSI driver won't be called (no NodeStageVolume call) to verify the drive.

When attaching disk2 to the VM, using the same device name(/dev/sdc) is actually ok (this is also out of CSI driver control, it's controlled by linux disk kernel driver), I think the main problem is that when you do the manual detach, you should reschedule the first pod to other node, that would work. Otherwise we don't have a solution how to make this work since it's out of CSI driver control

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.
Projects
None yet
Development

No branches or pull requests

4 participants