failed to wait PVBs - 1.15.1 - Backup #8587

Open
fwernert opened this issue Jan 7, 2025 · 21 comments

fwernert commented Jan 7, 2025

What steps did you take and what happened:
Any backup creation, whether triggered manually or via a Schedule.

Errors:
Velero: message: /failed to wait PVBs processed for the ItemBlock error: /failed to list PVBs: the server was unable to return a response in the time allotted, but may still be processing the request (get podvolumebackups.velero.io)

What did you expect to happen:
A Completed status instead of PartiallyFailed.

Anything else you would like to add:

Environment:

  • Velero version (use velero version):
Client:
        Version: v1.15.1
        Git commit: 32499fc287815058802c1bc46ef620799cca7392
Server:
        Version: v1.15.1
  • Velero features (use velero client config get features):
    features: <NOT SET>
  • Kubernetes version (use kubectl version):
    Client Version: v1.31.2
    Kustomize Version: v5.4.2
    Server Version: v1.31.1
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
    OVH Managed Kubernetes, Bucket S3 OVH and Scaleway

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"

blackpiglet (Contributor) commented:

Could you also please check whether there is any information from the OVH-managed k8s kube-apiserver?
To me, this looks more like the k8s kube-apiserver didn't respond to the Velero client's requests.

blackpiglet added the Needs info (Waiting for information) label Jan 7, 2025

fwernert (Author) commented Jan 7, 2025

Hello,

I didn't find any relevant information.

I don't know if it is the switch from Restic to Kopia that is causing this problem. However, I think that the backup is still effective even if it's marked as PartiallyFailed.

pvb | wc -l
2175

I did some cleanup and increased Velero's CPU limit from 1 to 2, because in Grafana I saw that it easily goes over 1 CPU.
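
For reference, this is roughly how I raised the limit (a sketch; it assumes Velero runs in the velero namespace and isn't managed by Helm values that would overwrite a manual change):

$ kubectl -n velero set resources deployment/velero --limits=cpu=2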

fwernert (Author) commented Jan 7, 2025

OK, I just did another try with the 2 CPU limit. Velero's CPU usage goes up to about 1.8 CPUs.

[screenshot: Velero CPU usage]

This time the backup finished as Completed. I will check again with our daily backup.

I should also add that I have two types of schedules: one specifically for FSB and one specifically for snapshots. For both, I disabled snapshotMoveData.

Also, these problems appeared after a cluster migration.

Gui13 commented Jan 7, 2025

pvb | wc -l
2175

Could you expand on how you get the PVB count? We are hitting the same issue in one of our clusters.

fwernert (Author) commented Jan 7, 2025

pvb | wc -l
2175

Could you expand on how you get the PVB count? We are hitting the same issue in one of our clusters.

"kubectl get podvolumebackups.velero.io -A" is my pvb alias.

kubectl get podvolumebackups.velero.io -A | wc -l

ywk253100 (Contributor) commented:

@Gui13 What error did you get? Is it the same as this one: failed to list PVBs: the server was unable to return a response in the time allotted, but may still be processing the request (get podvolumebackups.velero.io)?

ywk253100 self-assigned this Jan 8, 2025

fwernert (Author) commented Jan 8, 2025

Subject: Feedback on Last Night’s Scheduled Backup

Hello,

Here is some feedback regarding last night's scheduled backup.

As mentioned before, I have three schedules:

  • 1 OVH FSB Backup
  • 1 Scaleway FSB Backup
  • 1 OVH CSI Backup

These schedules are replicated across 4 clusters.

Last week, I was running Velero 1.14.1 with the AWS plugin 1.8.0. At the end of last week, I upgraded to 1.15.1 with the AWS plugin 1.11.0. During the upgrade, I had to set the checkSumAlgorithm to "" for it to work properly.
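
For anyone hitting the same thing, this is roughly what I applied (a sketch; it assumes the BackupStorageLocation is named default, lives in the velero namespace, and that the AWS plugin reads the key as checksumAlgorithm):

$ kubectl -n velero patch backupstoragelocation default --type merge -p '{"spec":{"config":{"checksumAlgorithm":""}}}'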

Yesterday, I made three changes:

  1. Increased the CPU allocation for the deployment/velero from 1 CPU to 2 CPUs.
  2. Set snapshotMoveData to false for the FSB backups (see the sketch after this list).
  3. Switched from Restic to Kopia because of the Restic deprecation.
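
For change 2, the patch looked roughly like this (a sketch; it assumes the schedules live in the velero namespace and that snapshotMoveData sits under spec.template in the Schedule spec):

$ kubectl -n velero patch schedules.velero.io ovh-daily-fsb-backup --type merge -p '{"spec":{"template":{"snapshotMoveData":false}}}'
$ kubectl -n velero patch schedules.velero.io scaleway-daily-fsb-backup --type merge -p '{"spec":{"template":{"snapshotMoveData":false}}}'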

Results:

  • ovh-daily-fsb-backup: Completed
  • scaleway-daily-fsb-backup: Completed
  • ovh-daily-csi-snapshot: PartiallyFailed (54 errors related to PVBs)

Example of Error Found in Loki:

time="2025-01-08T04:03:08Z" level=error msg="failed to wait PVBs processed for the ItemBlock" backup=velero/ovh-daily-csi-snapshot-20250108040051 error="failed to list PVBs: the server was unable to return a response in the time allotted, but may still be processing the request (get podvolumebackups.velero.io)" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/backup/backup.go:748" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*kubernetesBackupper).waitUntilPVBsProcessed" logSource="pkg/backup/backup.go:726"

Notably, the two production clusters (which have fewer applications) reported zero errors, while the two review/staging clusters (where development teams deploy frequently) experienced issues. In addition, only the staging applications have PVCs.

Resources:

From Grafana, I observed that Velero utilized the full 2 CPUs during each backup. Interestingly, the node-agent’s memory usage didn’t exceed approximately 600 Mi, which contrasts with what I observed in 1.14.1, where the node-agent often exceeded 1 Gi and failed to release memory (requiring a restart of the DaemonSet).
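
(The restart I mention is just a rollout restart of the node-agent DaemonSet, roughly the following, assuming the default install layout:)

$ kubectl -n velero rollout restart daemonset/node-agent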

I’m unsure if there have been changes in how resource management is handled in 1.15.

ywk253100 (Contributor) commented:

@fwernert Thanks for your detailed feedback.

Are all three of these schedules (ovh-daily-fsb-backup, scaleway-daily-fsb-backup, and ovh-daily-csi-snapshot) on the two staging clusters? And did only the ovh-daily-csi-snapshot backup fail, while ovh-daily-fsb-backup and scaleway-daily-fsb-backup completed as expected?

How many PVBs are on these two staging clusters?

There is a possible improvement to fix the failed to wait PVBs processed for the ItemBlock issue. Would it be OK if I provide you a patch image and you verify it in your env?

ywk253100 added a commit to ywk253100/velero that referenced this issue Jan 9, 2025

Check the PVB status via podvolume Backupper rather than API server to avoid timeout issue

Fixes vmware-tanzu#8587
Signed-off-by: Wenkai Yin(尹文开) <[email protected]>

fwernert (Author) commented Jan 9, 2025

@fwernert Thanks for your detailed feedback.

Are all three of these schedules (ovh-daily-fsb-backup, scaleway-daily-fsb-backup, and ovh-daily-csi-snapshot) on the two staging clusters? And did only the ovh-daily-csi-snapshot backup fail, while ovh-daily-fsb-backup and scaleway-daily-fsb-backup completed as expected?

How many PVBs are on these two staging clusters?

There is a possible improvement to fix the failed to wait PVBs processed for the ItemBlock issue. Would it be OK if I provide you a patch image and you verify it in your env?

Yes, all three schedules are on all the clusters.

This morning there were no PartiallyFailed backups; I don't know why.

staging 1 PVBs: 2100
staging 2 PVBs: 2335
prod 1 PVBs: 2036
prod 2 PVBs: 2036

PVBs are used only for snapshot backups, is that right? FSB doesn't need them?

Also, the node-agent didn't release memory:

$ k top pod --sort-by memory
NAME                      CPU(cores)   MEMORY(bytes)
node-agent-r4b46          2m           1617Mi
node-agent-hm5n9          2m           583Mi
velero-77f9bd8bf6-5hpgk   10m          176Mi
node-agent-n45s6          1m           155Mi
node-agent-xwkzw          2m           152Mi
node-agent-9svf7          2m           100Mi
node-agent-2h8zg          1m           90Mi

ywk253100 (Contributor) commented:

Hi @fwernert, I made some improvements to avoid the the server was unable to return a response in the time allotted error when there are a lot of PVBs. I would appreciate it if you could try it in your env to verify whether it works.
Here is the Velero image: yinw/velero:verification01.

NicoJDE commented Jan 9, 2025

Hi,

We have the same problem. It appeared after updating from Velero 1.14 to Velero 1.15.1. Interestingly, since the update, the Kubernetes API server's memory spikes sharply at backup time.

[screenshot: Kubernetes API server memory usage, 2025-01-09 14:11:50]

NicoJDE commented Jan 9, 2025

@ywk253100 I tested the image. For me, the fix is working.

ywk253100 (Contributor) commented:

@NicoJDE Thanks!

fwernert (Author) commented Jan 9, 2025

I will try it now and give some feedback.

fwernert (Author) commented Jan 9, 2025

velero create backup --from-schedule ovh-daily-csi-snapshot
ovh-daily-csi-snapshot-20250109133853      Completed         0        49         2025-01-09 14:38:54 +0100 CET   3d        ovh-s3-bucket

[screenshot: Velero CPU usage]

The backup succeeded with your latest Docker image. However, it also worked last time, so I don't know how to measure the fix. Any recommendation on how to verify it?

Do you know why Velero frequently uses a lot of CPU? What is it doing when it is not running a backup?

fwernert (Author) commented Jan 9, 2025

I also just launched the FSB backup:
ovh-daily-fsb-backup-20250109143357 Completed 0 2 2025-01-09 15:33:57 +0100 CET 19d ovh-s3-bucket

ywk253100 (Contributor) commented:

The backup succeeded with your latest Docker image. However, it also worked last time, so I don't know how to measure the fix. Any recommendation on how to verify it?

Could you trigger more backups from all three schedules (e.g., create one backup every 5 minutes) and check whether the same error happens again?
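
A rough loop like this would do it (a sketch; it assumes the velero CLI is pointed at the affected cluster and uses your existing schedule names):

$ for s in ovh-daily-csi-snapshot ovh-daily-fsb-backup scaleway-daily-fsb-backup; do
    velero backup create --from-schedule "$s"
    sleep 300
  done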

Do you know why Velero frequently uses a lot of CPU? What is it doing when it is not running a backup?

There are some maintenance activities for the backup repositories (one namespace maps to one repository); the default frequency is one hour. There are also scheduled backups in the cluster, so per my understanding this seems to be the expected behavior. @Lyndon-Li Correct me if I'm wrong.
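
You can also see the repositories and their last maintenance time with something like this (a sketch; it assumes the velero namespace and that the status field is named lastMaintenanceTime, which is how I remember it):

$ kubectl -n velero get backuprepositories.velero.io -o custom-columns=NAME:.metadata.name,LAST_MAINTENANCE:.status.lastMaintenanceTime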

fwernert (Author) commented Jan 9, 2025

There are some maintenance activities for the backup repositories (one namespace maps to one repository); the default frequency is one hour. There are also scheduled backups in the cluster, so per my understanding this seems to be the expected behavior. @Lyndon-Li Correct me if I'm wrong.

Our scheduled backups are only at night.

Could you trigger more backups from all three schedules (e.g., create one backup every 5 minutes) and check whether the same error happens again?

I can try, yes.

fwernert (Author) commented Jan 9, 2025

I ran 3 backups at 5-minute intervals:

$ velero backup get
NAME                                       STATUS       ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION     SELECTOR
ovh-daily-csi-snapshot-20250109162052      Completed    0        0          2025-01-09 17:20:53 +0100 CET   3d        ovh-s3-bucket        <none>

scaleway-daily-fsb-backup-20250109162248   Completed         0        13         2025-01-09 17:25:03 +0100 CET   19d       scaleway-s3-bucket   <none>

ovh-daily-fsb-backup-20250109162145        PartiallyFailed   26       13         2025-01-09 17:22:45 +0100 CET   19d       ovh-s3-bucket        <none>

Type of failure (occurred 26 times):

             name: /velero-redis-data-verifybadge-redis-master-0-h8sqc message: /Error backing up item error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=pqp-verifybadge, name=velero-redis-data-verifybadge-redis-master-0-h8sqc): rpc error: code = Unknown desc = volumesnapshots.snapshot.storage.k8s.io "velero-redis-data-verifybadge-redis-master-0-h8sqc" not found

Then I just relaunched the PartiallyFailed backup and:

ovh-daily-fsb-backup-20250109162935 Completed 0 13 2025-01-09 17:29:36 +0100 CET 19d ovh-s3-bucket <none>

I don't know what happened, but it seems there are no PVB errors now with your fix. 👍

ywk253100 added a commit to ywk253100/velero that referenced this issue Jan 10, 2025

Check the PVB status via podvolume Backupper rather than calling API server to avoid API server issue

Fixes vmware-tanzu#8587
Signed-off-by: Wenkai Yin(尹文开) <[email protected]>

ywk253100 (Contributor) commented:

@fwernert Thanks!

ywk253100 added this to the v1.16 milestone Jan 10, 2025

fwernert (Author) commented:

Thank you @ywk253100! Could you tell us when 1.15.2 will be ready?
