failed to wait PVBs - 1.15.1 - Backup #8587
Could you also please check whether there is any information from the OVH-managed k8s kube-apiserver?
Hello, I didn't find any relevant information. I don't know whether the switch from Restic to Kopia is causing this problem. However, I think the backup is still effective even though it's marked as PartiallyFailed. `pvb | wc -l` I did some cleanup and increased Velero's CPU limit from 1 to 2, because in Grafana I saw that it easily goes over 1 CPU.
OK, I just did another try with the 2 CPU limit. Velero's CPU goes up to 1.8. This time the backup finished as Completed. I will check again with our daily backups. I'll also add that I have two types of schedules: one specifically for FSB and one specifically for snapshots. For both, I disabled snapshotMoveData. These problems also appeared after a cluster migration.
Could you expand on how you get the PVB count? We are hitting the same issue in one of our clusters.
`kubectl get podvolumebackups.velero.io -A` is my `pvb` alias: `kubectl get podvolumebackups.velero.io -A | wc -l`
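As an aside on that count: piping the raw `kubectl get` output through `wc -l` also counts the header row, so the figure is off by one. A minimal sketch (the real command needs a cluster, so the demonstration below uses simulated kubectl output):

```shell
# Against a real cluster, --no-headers gives the exact PVB count:
#   kubectl get podvolumebackups.velero.io -A --no-headers | wc -l
# Demonstration with simulated kubectl output:
simulated="NAMESPACE   NAME            AGE
velero      backup-pvb-1    2h
velero      backup-pvb-2    2h"
lines=$(printf '%s\n' "$simulated" | wc -l)
actual=$(printf '%s\n' "$simulated" | tail -n +2 | wc -l)
echo "lines: $((lines)), actual PVBs: $((actual))"
```

This prints `lines: 3, actual PVBs: 2`, i.e. the header-inclusive count is one higher than the real number of PVBs.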
@Gui13 What error did you get? Is it the same as this one?
Subject: Feedback on Last Night's Scheduled Backup. Hello, here is some feedback regarding last night's scheduled backup. As mentioned before, I have three schedules:
These schedules are replicated across 4 clusters. Last week, I was running Velero 1.14.1 with the AWS plugin 1.8.0. At the end of last week, I upgraded to 1.15.1 with the AWS plugin 1.11.0. During the upgrade, I had to set the Yesterday, I made two changes:
Results:
Example of Error Found in Loki:
Notably, the two production clusters (which have fewer applications) reported zero errors, while the two review/staging clusters (where development teams deploy frequently) experienced issues. In addition, only the staging applications have PVCs. Resources: from Grafana, I observed that Velero used the full 2 CPUs during each backup. Interestingly, the node-agent's memory usage didn't exceed approximately 600 Mi, which contrasts with what I observed in 1.14.1, where the node-agent often exceeded 1 Gi and failed to release memory (requiring a restart of the DaemonSet). I'm unsure whether resource management changed in 1.15.
@fwernert Thanks for your detailed feedback. Are all these three schedules running on all the clusters? How many PVBs are on these two staging clusters? There is a possible improvement to fix the timeout issue.
Check the PVB status via podvolume Backupper rather than API server to avoid timeout issue. Fixes vmware-tanzu#8587. Signed-off-by: Wenkai Yin(尹文开) <[email protected]>
Yes, all three schedules are on all the clusters. This morning, no PartiallyFailed; I don't know why. Staging 1 PVBs: 2100. PVBs are used only for snapshot backups, is that right? FSB doesn't need them? And the node-agent didn't release memory:
Hi @fwernert, I made some improvements to avoid the timeout issue.
@ywk253100 I tested the image. For me, the fix is working.
@NicoJDE Thanks!
I will try it now and give some feedback.
The backup succeeded with your latest docker image. However, it also worked last time, so I don't know how to measure the difference. Any recommendation on how to validate your fix? Do you know why Velero frequently uses a lot of CPU? What is it doing when it is not running backups?
I also just launched the FSB backup:
Could you trigger more backups from all these three schedules (e.g. create one backup every 5 minutes) and check whether the same error will happen again?
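A scripted version of that test could look like the sketch below. The schedule names `daily-fsb` and `daily-snapshot` are hypothetical placeholders (use `velero schedule get` to list the real ones); `velero backup create --from-schedule` triggers an ad-hoc backup from a schedule's template. With `DRY_RUN=1` the loop only prints the commands it would run:

```shell
# Hypothetical schedule names; replace with the output of `velero schedule get`.
DRY_RUN=1
for round in 1 2 3; do
  for schedule in daily-fsb daily-snapshot; do
    cmd="velero backup create test-${schedule}-${round} --from-schedule ${schedule} --wait"
    if [ "$DRY_RUN" = "1" ]; then
      echo "$cmd"        # print only; set DRY_RUN=0 to actually run
    else
      $cmd
    fi
  done
  # Wait 5 minutes between rounds (skipped in dry-run mode)
  if [ "$DRY_RUN" != "1" ] && [ "$round" -lt 3 ]; then sleep 300; fi
done
```

After the rounds complete, `velero backup get` (or the PVB listing above) shows whether any run ended PartiallyFailed.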
There are some maintenance activities for the backup repositories (one namespace maps to one repository); the default frequency is one hour. There are also scheduled backups in the cluster, so per my understanding this seems to be expected behavior. @Lyndon-Li, correct me if I'm wrong.
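If that hourly background activity is a CPU concern, the maintenance interval is configurable. A hedged sketch of a config change, assuming the `--default-repo-maintain-frequency` server flag (present in recent Velero releases) and the standard `velero` namespace and deployment names:

```shell
# Sketch: stretch repository maintenance from the 1h default to e.g. 6h by
# appending the flag to the velero server args (requires cluster access).
kubectl -n velero patch deployment velero --type=json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--default-repo-maintain-frequency=6h"}]'
```

Less frequent maintenance trades some repository housekeeping latency for lower steady-state CPU between backups.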
Our scheduled backups are only at night.
I can try yes. |
I ran 3 backups at 5-minute intervals.
Type of failure (occurred 26 times):
Then I just relaunched the PartiallyFailed backup and:
I don't know what happened, but it seems there are no PVB errors now with your fix. 👍 |
Check the PVB status via podvolume Backupper rather than calling API server to avoid API server issue. Fixes vmware-tanzu#8587. Signed-off-by: Wenkai Yin(尹文开) <[email protected]>
@fwernert Thanks!
Thank you @ywk253100! Could you tell us when 1.15.2 will be ready?
What steps did you take and what happened:
Any creation of a backup, manually or via a Schedule.
Errors:
Velero message: "failed to wait PVBs processed for the ItemBlock error: failed to list PVBs: the server was unable to return a response in the time allotted, but may still be processing the request (get podvolumebackups.velero.io)"
What did you expect to happen:
A Completed status, not PartiallyFailed.
Anything else you would like to add:
Environment:
- Velero version (use `velero version`):
- Velero features (use `velero client config get features`): features: <NOT SET>
- Kubernetes version (use `kubectl version`):
- Cloud provider or hardware configuration: OVH Managed Kubernetes, Bucket S3 OVH and Scaleway