failed to wait PVBs - 1.15.1 - Backup #8587

Open
fwernert opened this issue Jan 7, 2025 · 21 comments

fwernert commented Jan 7, 2025

What steps did you take and what happened:
Any backup creation, whether triggered manually or via a Schedule.

Errors:
Velero: message: /failed to wait PVBs processed for the ItemBlock error: /failed to list PVBs: the server was unable to return a response in the time allotted, but may still be processing the request (get podvolumebackups.velero.io)

What did you expect to happen:
A Completed status instead of PartiallyFailed.

Anything else you would like to add:

Environment:

  • Velero version (use velero version):
Client:
        Version: v1.15.1
        Git commit: 32499fc287815058802c1bc46ef620799cca7392
Server:
        Version: v1.15.1
  • Velero features (use velero client config get features):
    features: <NOT SET>
  • Kubernetes version (use kubectl version):
    Client Version: v1.31.2
    Kustomize Version: v5.4.2
    Server Version: v1.31.1
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
    OVH Managed Kubernetes, Bucket S3 OVH and Scaleway

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"

blackpiglet (Contributor) commented:

Could you also please check whether there is any information from the OVH-managed k8s kube-apiserver?
To me, this looks more like the k8s kube-apiserver didn't respond to the Velero client's requests.

blackpiglet added the Needs info (Waiting for information) label Jan 7, 2025

fwernert (Author) commented Jan 7, 2025

Hello,

I didn't find any relevant information.

I don't know if it is the switch from Restic to Kopia that is causing this problem. However, I think that the backup is still effective even if it's marked as PartiallyFailed.

pvb | wc -l
2175

I did some cleanup and increased Velero's CPU limit from 1 to 2, because in Grafana I saw that it easily goes over 1 CPU.
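
For reference, this is roughly how I raised the limit (a sketch; it assumes Velero runs in the velero namespace and isn't managed by Helm values that would overwrite a manual change):

$ kubectl -n velero set resources deployment/velero --limits=cpu=2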

fwernert (Author) commented Jan 7, 2025

OK, I just did another try with the 2 CPU limit. Velero's CPU usage goes up to about 1.8 CPUs.

[screenshot: Velero CPU usage]

This time the backup finished as Completed. I will check again with our daily backup.

I should also add that I have two types of schedules: one specifically for FSB and one specifically for snapshots. For both, I disabled snapshotMoveData.

Also, these problems appeared after a cluster migration.

Gui13 commented Jan 7, 2025

pvb | wc -l
2175

Could you expand on how you get the PVB count? We are hitting the same issue in one of our clusters.

fwernert (Author) commented Jan 7, 2025

pvb | wc -l
2175

Could you expand on how you get the PVB count? We are hitting the same issue in one of our clusters.

"kubectl get podvolumebackups.velero.io -A" is my pvb alias.

kubectl get podvolumebackups.velero.io -A | wc -l

ywk253100 (Contributor) commented:

@Gui13 What error did you get? Is it the same as this one: failed to list PVBs: the server was unable to return a response in the time allotted, but may still be processing the request (get podvolumebackups.velero.io)?

ywk253100 self-assigned this Jan 8, 2025

fwernert (Author) commented Jan 8, 2025

Subject: Feedback on Last Night’s Scheduled Backup

Hello,

Here is some feedback regarding last night's scheduled backup.

As mentioned before, I have three schedules:

  • 1 OVH FSB Backup
  • 1 Scaleway FSB Backup
  • 1 OVH CSI Backup

These schedules are replicated across 4 clusters.

Last week, I was running Velero 1.14.1 with the AWS plugin 1.8.0. At the end of last week, I upgraded to 1.15.1 with the AWS plugin 1.11.0. During the upgrade, I had to set the checkSumAlgorithm to "" for it to work properly.
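
For anyone hitting the same thing, this is roughly what I applied (a sketch; it assumes the BackupStorageLocation is named default, lives in the velero namespace, and that the AWS plugin reads the key as checksumAlgorithm):

$ kubectl -n velero patch backupstoragelocation default --type merge -p '{"spec":{"config":{"checksumAlgorithm":""}}}'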

Yesterday, I made three changes:

  1. Increased the CPU allocation for the deployment/velero from 1 CPU to 2 CPUs.
  2. Set snapshotMoveData to false for the FSB backups (see the sketch after this list).
  3. Switched from Restic to Kopia because of the Restic deprecation.
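
For change 2, the patch looked roughly like this (a sketch; it assumes the schedules live in the velero namespace and that snapshotMoveData sits under spec.template in the Schedule spec):

$ kubectl -n velero patch schedules.velero.io ovh-daily-fsb-backup --type merge -p '{"spec":{"template":{"snapshotMoveData":false}}}'
$ kubectl -n velero patch schedules.velero.io scaleway-daily-fsb-backup --type merge -p '{"spec":{"template":{"snapshotMoveData":false}}}'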

Results:

  • ovh-daily-fsb-backup: Completed
  • scaleway-daily-fsb-backup: Completed
  • ovh-daily-csi-snapshot: PartiallyFailed (54 errors related to PVBs)

Example of Error Found in Loki:

time="2025-01-08T04:03:08Z" level=error msg="failed to wait PVBs processed for the ItemBlock" backup=velero/ovh-daily-csi-snapshot-20250108040051 error="failed to list PVBs: the server was unable to return a response in the time allotted, but may still be processing the request (get podvolumebackups.velero.io)" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/backup/backup.go:748" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*kubernetesBackupper).waitUntilPVBsProcessed" logSource="pkg/backup/backup.go:726"

Notably, the two production clusters (which have fewer applications) reported zero errors, while the two review/staging clusters (where development teams deploy frequently) experienced issues. In addition, only the staging applications have PVCs.

Resources:

From Grafana, I observed that Velero utilized the full 2 CPUs during each backup. Interestingly, the node-agent’s memory usage didn’t exceed approximately 600 Mi, which contrasts with what I observed in 1.14.1, where the node-agent often exceeded 1 Gi and failed to release memory (requiring a restart of the DaemonSet).
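
(The restart I mention is just a rollout restart of the node-agent DaemonSet, roughly the following, assuming the default install layout:)

$ kubectl -n velero rollout restart daemonset/node-agent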

I’m unsure if there have been changes in how resource management is handled in 1.15.

ywk253100 (Contributor) commented:

@fwernert Thanks for your detailed feedback.

Are all three of these schedules (ovh-daily-fsb-backup, scaleway-daily-fsb-backup, and ovh-daily-csi-snapshot) on the two staging clusters? And did only the ovh-daily-csi-snapshot backup fail, while ovh-daily-fsb-backup and scaleway-daily-fsb-backup completed as expected?

How many PVBs are on these two staging clusters?

There is a possible improvement to fix the failed to wait PVBs processed for the ItemBlock issue. Would it be OK if I provide you a patch image and you verify it in your env?

ywk253100 added a commit to ywk253100/velero that referenced this issue Jan 9, 2025

Check the PVB status via podvolume Backupper rather than API server to avoid timeout issue

Fixes vmware-tanzu#8587
Signed-off-by: Wenkai Yin(尹文开) <[email protected]>

fwernert (Author) commented Jan 9, 2025

@fwernert Thanks for your detailed feedback.

Are all three of these schedules (ovh-daily-fsb-backup, scaleway-daily-fsb-backup, and ovh-daily-csi-snapshot) on the two staging clusters? And did only the ovh-daily-csi-snapshot backup fail, while ovh-daily-fsb-backup and scaleway-daily-fsb-backup completed as expected?

How many PVBs are on these two staging clusters?

There is a possible improvement to fix the failed to wait PVBs processed for the ItemBlock issue. Would it be OK if I provide you a patch image and you verify it in your env?

Yes, all three schedules are on all the clusters.

This morning there were no PartiallyFailed backups; I don't know why.

staging 1 PVBs: 2100
staging 2 PVBs: 2335
prod 1 PVBs: 2036
prod 2 PVBs: 2036

PVBs are used only for snapshot backups, is that right? FSB doesn't need them?

Also, the node-agent didn't release memory:

$ k top pod --sort-by memory
NAME                      CPU(cores)   MEMORY(bytes)
node-agent-r4b46          2m           1617Mi
node-agent-hm5n9          2m           583Mi
velero-77f9bd8bf6-5hpgk   10m          176Mi
node-agent-n45s6          1m           155Mi
node-agent-xwkzw          2m           152Mi
node-agent-9svf7          2m           100Mi
node-agent-2h8zg          1m           90Mi

ywk253100 (Contributor) commented:

Hi @fwernert, I made some improvements to avoid the the server was unable to return a response in the time allotted error when there are a lot of PVBs. I would appreciate it if you could try it in your env to verify whether it works.
Here is the Velero image: yinw/velero:verification01.

NicoJDE commented Jan 9, 2025

Hi,

We have the same problem. It appeared after updating from Velero 1.14 to Velero 1.15.1. Interestingly, since the update, the Kubernetes API server's memory spikes sharply at backup time.

[screenshot: Kubernetes API server memory usage, 2025-01-09 14:11:50]

NicoJDE commented Jan 9, 2025

@ywk253100 I tested the image. For me, the fix is working.

ywk253100 (Contributor) commented:

@NicoJDE Thanks!

fwernert (Author) commented Jan 9, 2025

I will try it now and give some feedback.

fwernert (Author) commented Jan 9, 2025

velero create backup --from-schedule ovh-daily-csi-snapshot
ovh-daily-csi-snapshot-20250109133853      Completed         0        49         2025-01-09 14:38:54 +0100 CET   3d        ovh-s3-bucket

[screenshot: Velero CPU usage]

The backup succeeded with your latest Docker image. However, it also worked last time, so I don't know how to measure the fix. Any recommendation on how to verify it?

Do you know why Velero frequently uses a lot of CPU? What is it doing when it is not running a backup?

fwernert (Author) commented Jan 9, 2025

I also just launched the FSB backup:
ovh-daily-fsb-backup-20250109143357 Completed 0 2 2025-01-09 15:33:57 +0100 CET 19d ovh-s3-bucket

ywk253100 (Contributor) commented:

The backup succeeded with your latest Docker image. However, it also worked last time, so I don't know how to measure the fix. Any recommendation on how to verify it?

Could you trigger more backups from all three schedules (e.g., create one backup every 5 minutes) and check whether the same error happens again?
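
A rough loop like this would do it (a sketch; it assumes the velero CLI is pointed at the affected cluster and uses your existing schedule names):

$ for s in ovh-daily-csi-snapshot ovh-daily-fsb-backup scaleway-daily-fsb-backup; do
    velero backup create --from-schedule "$s"
    sleep 300
  done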

Do you know why Velero frequently uses a lot of CPU? What is it doing when it is not running a backup?

There are some maintenance activities for the backup repositories (one namespace maps to one repository); the default frequency is one hour. There are also scheduled backups in the cluster, so per my understanding this seems to be the expected behavior. @Lyndon-Li Correct me if I'm wrong.
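
You can also see the repositories and their last maintenance time with something like this (a sketch; it assumes the velero namespace and that the status field is named lastMaintenanceTime, which is how I remember it):

$ kubectl -n velero get backuprepositories.velero.io -o custom-columns=NAME:.metadata.name,LAST_MAINTENANCE:.status.lastMaintenanceTime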

fwernert (Author) commented Jan 9, 2025

There are some maintenance activities for the backup repositories (one namespace maps to one repository); the default frequency is one hour. There are also scheduled backups in the cluster, so per my understanding this seems to be the expected behavior. @Lyndon-Li Correct me if I'm wrong.

Our scheduled backups are only at night.

Could you trigger more backups from all three schedules (e.g., create one backup every 5 minutes) and check whether the same error happens again?

I can try, yes.

fwernert (Author) commented Jan 9, 2025

I ran 3 backups at 5-minute intervals:

$ velero backup get
NAME                                       STATUS       ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION     SELECTOR
ovh-daily-csi-snapshot-20250109162052      Completed    0        0          2025-01-09 17:20:53 +0100 CET   3d        ovh-s3-bucket        <none>

scaleway-daily-fsb-backup-20250109162248   Completed         0        13         2025-01-09 17:25:03 +0100 CET   19d       scaleway-s3-bucket   <none>

ovh-daily-fsb-backup-20250109162145        PartiallyFailed   26       13         2025-01-09 17:22:45 +0100 CET   19d       ovh-s3-bucket        <none>

Type of failure (occurred 26 times):

             name: /velero-redis-data-verifybadge-redis-master-0-h8sqc message: /Error backing up item error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=pqp-verifybadge, name=velero-redis-data-verifybadge-redis-master-0-h8sqc): rpc error: code = Unknown desc = volumesnapshots.snapshot.storage.k8s.io "velero-redis-data-verifybadge-redis-master-0-h8sqc" not found

Then I just relaunched the PartiallyFailed backup and:

ovh-daily-fsb-backup-20250109162935 Completed 0 13 2025-01-09 17:29:36 +0100 CET 19d ovh-s3-bucket <none>

I don't know what happened, but it seems there are no PVB errors now with your fix. 👍

ywk253100 added a commit to ywk253100/velero that referenced this issue Jan 10, 2025

Check the PVB status via podvolume Backupper rather than calling API server to avoid API server issue

Fixes vmware-tanzu#8587
Signed-off-by: Wenkai Yin(尹文开) <[email protected]>

ywk253100 (Contributor) commented:

@fwernert Thanks!

ywk253100 added this to the v1.16 milestone Jan 10, 2025

fwernert (Author) commented:

Thank you @ywk253100! Could you tell us when 1.15.2 will be ready?
