Add backoff mechanism for ProvReq retry #7182
Conversation
Force-pushed from 1744723 to 9363a3b
-	defaultRetryTime = 10 * time.Minute
+	defaultRetryTime = 1 * time.Minute
+	maxBackoffTime = 10 * time.Minute
+	maxCacheSize = 1000
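For orientation, here is a minimal sketch of the doubling-with-cap pattern these constants suggest. It is an illustration only, not the PR's actual implementation; the package and helper names are made up.

```go
package backoff

import "time"

const (
	defaultRetryTime = 1 * time.Minute  // first retry delay after a failed scale-up
	maxBackoffTime   = 10 * time.Minute // upper bound on the retry delay
	maxCacheSize     = 1000             // cap on the number of ProvReqs tracked
)

// nextBackoff doubles the current retry delay after another failure,
// capping it at maxBackoffTime.
func nextBackoff(current time.Duration) time.Duration {
	next := 2 * current
	if next > maxBackoffTime {
		return maxBackoffTime
	}
	return next
}
```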
I'm not 100% sure if that'll be sufficient long-term. It doesn't sound like it'll scale very well, since it effectively disables any backoff when there are more than 1k failing ProvReqs - i.e., precisely when a backoff would be most useful for preventing starvation of other requests.
Can you add a TODO to clean up ProvReqs when they're resolved (i.e., they succeed, time out, or are deleted)?
To be clear, I'm OK with merging it as is, but I think we should leave a note that it's a stopgap solution.
Done. I also removed elements that are provisioned or failed.
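A rough sketch of the kind of cleanup described here, kept deliberately independent of the ProvisioningRequest API; the cache type, field names, and the caller-supplied resolved predicate are all illustrative, not the PR's code.

```go
package backoff

import "time"

// backoffCache is an illustrative stand-in for the PR's LRU-backed store:
// it maps a ProvReq key ("namespace/name") to its current retry delay.
type backoffCache struct {
	retryTime map[string]time.Duration
}

// pruneResolved removes entries that no longer need backoff tracking.
// The caller supplies resolved, which should report true for ProvReqs that
// were provisioned, failed permanently, or deleted from the cluster.
func (c *backoffCache) pruneResolved(resolved func(key string) bool) {
	for key := range c.retryTime {
		if resolved(key) {
			delete(c.retryTime, key)
		}
	}
}
```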
Force-pushed from c7dbebc to 242edc5
/hold
/lgtm
/approve
Please add a test case for this.
@@ -117,7 +117,8 @@ func TestProvisioningRequestPodsInjector(t *testing.T) {
	}
	for _, tc := range testCases {
		client := provreqclient.NewFakeProvisioningRequestClient(context.Background(), t, tc.provReqs...)
		injector := ProvisioningRequestPodsInjector{client, clock.NewFakePassiveClock(now)}
		backoffTime := map[string]time.Duration{key(notProvisionedRecentlyProvReqB): 2 * time.Minute}
Maybe add a new test case for the backed-off request scenario?
I put the backed-off request scenario in a separate test case.
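This is not the PR's actual test (which drives ProvisioningRequestPodsInjector with a fake clock and a backoffTime map); it is just a self-contained check of the illustrative nextBackoff sketch above, showing a repeatedly failing request backing off until it hits the cap.

```go
package backoff

import "testing"

// TestBackoffCapsAtMax exercises the nextBackoff sketch from above: repeated
// failures keep doubling the delay until it reaches maxBackoffTime and stays there.
func TestBackoffCapsAtMax(t *testing.T) {
	delay := defaultRetryTime
	for i := 0; i < 10; i++ {
		delay = nextBackoff(delay)
	}
	if delay != maxBackoffTime {
		t.Errorf("expected backoff to cap at %v, got %v", maxBackoffTime, delay)
	}
	if nextBackoff(delay) != maxBackoffTime {
		t.Errorf("backoff should stay at the cap once reached")
	}
}
```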
/approve cancel
"k8s.io/client-go/rest" | ||
"k8s.io/klog/v2" | ||
"k8s.io/utils/clock" | ||
) | ||
|
||
const ( | ||
defaultRetryTime = 10 * time.Minute | ||
defaultRetryTime = 1 * time.Minute |
Please make it a flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the initial retry time. What's the reasoning for making it configurable?
In this case let's make maxBackoffTime a flag too, and possibly maxCacheSize as well, so the cluster admin can override it as a mitigation if they run into issues with an overflowing cache.
Not that I like adding lots of flags to an already impressive collection, but Kueue depends on this feature, and CA has a much higher fix-to-release latency and cost (or so it seems), so let's not risk getting stuck with hardcoded values.
There is context here: kubernetes-sigs/kueue#2931. Kueue got some requests for this retry time.
@yaroslava-serdiuk I believe it's about kubernetes-sigs/kueue#2931 (reply in thread). But my previous comment stands: let's make it configurable and easy to mitigate. It's not like we have any data supporting this choice of hardcoded values, AFAIK.
The "right" initial value depends on (at least) 2 factors:
- size of the cluster and, so, the number of incoming provisioning requests.
- the performance/throughput of provrequ processing.
We have no control over the first item, and the second item is being improved right now. So there is no easy way of knowing what value is good now, and, more importantly what value will be OK in 3 months. Having it hardcoded makes any adjustment/finetuning much harder.
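For reference, a sketch of how these values could be surfaced as flags using Go's standard flag package. The flag names and wiring below are illustrative; cluster-autoscaler registers its flags through its own options plumbing, which may differ.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// Illustrative flag names; the actual PR/CA flags may be named and wired differently.
var (
	provReqInitialBackoffTime = flag.Duration("provisioning-request-initial-backoff-time", 1*time.Minute,
		"Initial backoff time for ProvisioningRequest retry after a failed scale-up.")
	provReqMaxBackoffTime = flag.Duration("provisioning-request-max-backoff-time", 10*time.Minute,
		"Maximum backoff time for ProvisioningRequest retry after a failed scale-up.")
	provReqMaxBackoffCacheSize = flag.Int("provisioning-request-max-backoff-cache-size", 1000,
		"Maximum number of ProvisioningRequests tracked in the retry backoff cache.")
)

func main() {
	flag.Parse()
	fmt.Printf("initial=%v max=%v cacheSize=%d\n",
		*provReqInitialBackoffTime, *provReqMaxBackoffTime, *provReqMaxBackoffCacheSize)
}
```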
Force-pushed from 242edc5 to 537f9f4
Force-pushed from 537f9f4 to 54e520c
Force-pushed from 54e520c to 525dad2
/label tide/merge-method-squash
for _, pr := range provReqs {
	if !isSupportedClass(pr) {
		klog.Warningf("Provisioning Class %s is not supported for ProvReq %s/%s", pr.Spec.ProvisioningClassName, pr.Namespace, pr.Name)
I don't think we want to log the warning here. This method will be used in batch processing of check-capacity requests with a custom isSupportedClass function, and we'd end up logging it for best-effort-atomic-scale-up requests. Can we move logging this to the isSupportedClass function defined in Process() instead?
I can't anchor a comment there, but what I mean is:

func(pr *provreqwrapper.ProvisioningRequest) bool {
	_, found := provisioningrequest.SupportedProvisioningClasses[pr.Spec.ProvisioningClassName]
	if !found {
		klog.Warningf("Provisioning Class %s is not supported for ProvReq %s/%s", pr.Spec.ProvisioningClassName, pr.Namespace, pr.Name)
	}
	return found
})
Done
Thanks for moving this to use an LRU cache, it looks much better! One small comment, otherwise it's good to go. Feel free to unhold when you're ready to merge this.
/lgtm
/hold
/unhold
/hold
Force-pushed from e0fb624 to 3520089
/unhold
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aleksandra-malinowska, yaroslava-serdiuk

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
/kind feature