PVs created from RunnerSet are not reused - results in PVs buildup #2282
Comments
Hello! Thank you for filing an issue. The maintainers will triage your issue shortly. In the meantime, please take a look at the troubleshooting guide for bug reports. If this is a feature request, please review our contribution guidelines.
We had a similar problem. The cluster was set up zone-redundant, but the storage was LRS (locally redundant): a new pod came up on a node in another zone and the available PV was not attached. With 3 zones the PVs quickly piled up. We now schedule the runners in one zone only. Check the attach-detach controller events for additional information.
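For reference, pinning the runners to a single zone can be done with a standard nodeSelector on the runner pod template. A minimal sketch, assuming the RunnerSet pod template accepts the usual Kubernetes scheduling fields; the zone value is a placeholder:

```yaml
# Fragment of a RunnerSet spec (assumption: the pod template accepts standard
# Kubernetes scheduling fields). Pinning runner pods to one zone keeps them
# co-located with zonal volumes; the zone value below is a placeholder.
spec:
  template:
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: westeurope-1
```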
Similar for us. In our case there could already be 115 "Available" PVs with the name from this volumeClaimTemplates entry:

```yaml
volumeClaimTemplates:
  - metadata:
      name: var-lib-docker
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 30Gi
      storageClassName: var-lib-docker
```

PVs keep accumulating this way. Any idea why this would happen?
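Whether those freed PVs can ever be re-bound depends partly on the StorageClass named in that claim, which isn't shown in the comment. A hypothetical definition, mirroring the classes compared further down this thread (the provisioner and settings are assumptions, not the commenter's actual configuration):

```yaml
# Hypothetical reconstruction of a StorageClass named var-lib-docker;
# the provisioner and settings are assumptions for illustration only.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: var-lib-docker
provisioner: ebs.csi.aws.com      # assumed CSI driver; substitute your cluster's
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: false
```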
I assume this is related to this issue. The problem is that the StatefulSet isn't being scaled; instead, additional StatefulSets are added to the RunnerSet. So when the old StatefulSets are deleted (again, not scaled down), their PVs persist.
For us this seems to happen on Azure AKS when the newly scaled-up pod cannot be immediately scheduled in the cluster and needs to wait for Azure node autoscaling. After AKS adds a new node, the new pod is scheduled, but then in the PVC events you can see the following:
Even though there are plenty of free PVs, every time right after a node scale-up the first PVC fails to attach an existing free PV and a new one is created. When the next runner pod is scheduled on this new node, it does manage to attach one of the existing PVs, so it seems like some race condition between the ARC controller pod and the CSI auto-provisioner? Every time after a node scale-up we get one extra PV.
@mhuijgen Did you ever figure out any sort of solution for this? We're running into the exact same issue using EKS with autoscaling and the EBS CSI driver for dynamic PV provisioning.
@benmccown No, unfortunately not. In our tests the same also occurs occasionally even without node scaling, making this feature unusable for us at this time; node scale-up just makes the issue appear more often. It seems to be a race condition between the runner controller trying to link the new PVC to an existing volume and the auto-provisioner in the cluster creating a new PV.
@mhuijgen Thanks for the response. For yourself (and anyone else who runs into this issue), I think I've come up with the best workaround I can think of for the moment, which is basically to abandon dynamic storage provisioning entirely and use static PVs. I'll provide details on our workaround, but first a brief summary of our setup and use case for the ARC maintainers in case they read this.

Our Setup and Problem Details

We are using cluster-autoscaler to manage our EKS autoscaling groups. We have a dedicated node group for our Actions runners (I'll use the term CI runners). We use node labels, node taints, and resource requests so that only GitHub CI pods run on the CI runner node group, meaning each CI pod runs in a 1:1 relationship with nodes (one pod per node). We have 0 set as the minimum autoscaling size for this node group. We're using a [...]

We're regularly scaling down to zero nodes in periods of inactivity, but we might have a burst of activity where several workflows are kicked off and thus several CI pods are created and scheduled. Cluster autoscaler then responds in turn, scales out our ASG, and joins new nodes to the cluster to execute the CI workloads. Without any sort of image cache we waste ~2m30s on every single CI job pulling our container image into the dind (Docker-in-Docker) container within the CI pod. We could set [...]

The issue we're seeing has already been detailed well by @mhuijgen and is definitely some sort of race condition, as they've said. The result (in my testing) is that in a few short days we had 20+ persistent volumes provisioned and the number kept growing. Aside from the orphaned PVs (and resulting EBS volumes) costing us money, the major downside is that almost 100% of the time when a new CI job is scheduled, a pod is created, and the resulting worker node is created (by cluster autoscaler due to a scale-out operation), a new PV is created rather than an existing one reused, which completely eliminates the point of an image cache and any of its performance benefits.

Workaround

The workaround for us is to provision static PVs with Terraform instead of letting the EBS CSI controller manage dynamic volume provisioning. We're using Terraform to deploy/manage our EKS cluster, EKS node groups, and associated resources (Helm charts and raw k8s resources too). I set up a basic for loop in Terraform that provisions the static PVs. This way the CSI controller isn't trying to create dynamic PVs at all and the volumes are always reused, so the race condition is eliminated by removing one of the two parties participating in the "race".

The downsides are that EBS volumes are availability-zone specific, so I have to put the node group in a single subnet and availability zone, and you're paying for the maximum number of EBS volumes, which is a downside I guess, except that the bug we're running into means you'll eventually end up with WAY more volumes than your max autoscaling size anyway. I'll probably set up a GitHub workflow that runs 10 parallel jobs daily to ensure the container image is pulled down and up to date on the PVs. Hope this helps someone in the future.
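For anyone taking this approach without Terraform, the loop described above essentially stamps out PersistentVolume objects pointing at pre-created EBS volumes. A minimal sketch of one such static PV, with placeholder names, volume ID, and zone (an illustration of the idea, not the commenter's actual configuration):

```yaml
# Illustrative static PV for a pre-created EBS volume. The name, storage class,
# volume ID, and zone are placeholders; one such PV would exist per cached volume.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ci-image-cache-0
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ci-image-cache        # a class with no dynamic provisioner behind it
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0123456789abcdef0   # placeholder: ID of the pre-created EBS volume
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.ebs.csi.aws.com/zone
              operator: In
              values:
                - us-east-1a              # placeholder: the node group's single AZ
```

A matching volumeClaimTemplates entry would then reference storageClassName: ci-image-cache; with no dynamic provisioner acting on that class, the claims can only bind to these pre-created volumes.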
I think I am hitting the same bug. As far as I can tell, it began after my transition from the built-in EBS provisioner to the EBS CSI provisioner. For example, a dynamically allocated PV/PVC with a StorageClass that looks like this works correctly (PVs don't build up forever):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
parameters:
  fsType: ext4
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: false
```

However, a dynamically allocated PV/PVC with a StorageClass that looks like this builds up PVs:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
parameters:
  csi.storage.k8s.io/fstype: xfs
  encrypted: "true"
  type: gp3
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: false
```

I think this issue is related or a duplicate: #2266
Here's some info I found. Like @mhuijgen, I noticed peculiar warnings in our events:

Each of these warnings coincided with the creation of a new volume. Checking on the CSINode object for the node, I saw:

```yaml
apiVersion: storage.k8s.io/v1
kind: CSINode
# [...]
spec:
  drivers:
    # [...]
    - # [...]
      name: ebs.csi.aws.com
      topologyKeys:
        - topology.ebs.csi.aws.com/zone
```

So I came to the conclusion there is indeed a race condition: somehow, if the CSI node doesn't have its topology keys set at the moment a volume is requested, a new volume is created, even though there could be plenty available. This explains why this issue only happens with pods scheduled on fresh nodes.

So I've put a workaround in place. In short, it consists of: [...]

So far, it's been working great. Our EBS volumes are consistently being reused. I don't know the exact root cause here, but I'm pretty sure it's not ARC's fault. As a matter of fact, it seems someone is able to reproduce the issue here with simple [...]
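A general mitigation for this kind of fresh-node race (not necessarily the workaround the commenter used) is to register new nodes with a startup taint that the EBS CSI node plugin removes once it is ready, so runner pods and their PVCs can't land on a node before the driver has published its topology keys. A sketch, assuming an eksctl-style managed node group definition; the node group name is a placeholder and the exact schema differs for other provisioning tools:

```yaml
# Sketch: taint CI nodes at registration so pods wait for the EBS CSI node
# plugin, which is expected to remove the taint once it has registered.
# The node group name is a placeholder; the schema shown is eksctl-style.
managedNodeGroups:
  - name: ci-runners
    taints:
      - key: ebs.csi.aws.com/agent-not-ready
        value: "true"
        effect: NoExecute
```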
Checks
Controller Version
0.22.0
Helm Chart Version
v0.7.2
CertManager Version
v1.6.1
Deployment Method
Helm
cert-manager installation
Checks
Resource Definitions
To Reproduce
Describe the bug
I configured volumeClaimTemplates for the RunnerSet, and the number of replicas is 5. The volumeClaimTemplates contain 2 persistent volume mappings, one for Docker and another for Gradle. The runners are configured with ephemeral: true.
At the start of the RunnerSet deployment, 10 PVs (5 × 2: one for Docker and one for Gradle per runner) are created and bound to the runner pods. When a newly assigned workflow runs and completes on a runner, the runner pod is deleted and a brand-new pod is created that listens for jobs. Unfortunately, the newly created pod does not attach the recently freed, available PVs from the deleted runner pod; instead, a new set of PVs is created and attached to it.
Over time these redundant PVs accumulate and the system runs out of disk space.
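For illustration, a RunnerSet matching this description might look roughly like the sketch below; the names, storage classes, and sizes are hypothetical, since the actual Resource Definitions are not included in this report:

```yaml
# Hypothetical RunnerSet matching the description: 5 ephemeral replicas with
# two volume claim templates (Docker and Gradle caches). Names, classes, and
# sizes are placeholders, not the reporter's actual manifest.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: example-runnerset
spec:
  replicas: 5
  ephemeral: true
  repository: example-org/example-repo    # placeholder
  selector:
    matchLabels:
      app: example-runnerset
  serviceName: example-runnerset
  template:
    metadata:
      labels:
        app: example-runnerset
  volumeClaimTemplates:
    - metadata:
        name: var-lib-docker
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 30Gi
        storageClassName: var-lib-docker   # placeholder class
    - metadata:
        name: gradle-cache
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
        storageClassName: gradle-cache     # placeholder class
```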
Describe the expected behavior
PVs created from the RunnerSet deployment should be reused efficiently.
When a newly assigned workflow runs and completes on a runner, the newly created pod should attach the recently freed, available PVs from the deleted runner pod.
Whole Controller Logs
Whole Runner Pod Logs
Additional Context
NA