Add backoff mechanism for ProvReq retry #7182
Conversation
Force-pushed from 1744723 to 9363a3b
-	defaultRetryTime = 10 * time.Minute
+	defaultRetryTime = 1 * time.Minute
+	maxBackoffTime = 10 * time.Minute
+	maxCacheSize = 1000
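For orientation, here is a minimal sketch of the doubling-with-cap pattern these constants suggest. It is an illustration only, not the PR's actual implementation; the package and helper names are made up.

```go
package backoff

import "time"

const (
	defaultRetryTime = 1 * time.Minute  // first retry delay after a failed scale-up
	maxBackoffTime   = 10 * time.Minute // upper bound on the retry delay
	maxCacheSize     = 1000             // cap on the number of ProvReqs tracked
)

// nextBackoff doubles the current retry delay after another failure,
// capping it at maxBackoffTime.
func nextBackoff(current time.Duration) time.Duration {
	next := 2 * current
	if next > maxBackoffTime {
		return maxBackoffTime
	}
	return next
}
```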
I'm not 100% sure if that'll be sufficient long-term. It doesn't sound like it'll scale very well, since it effectively disables any backoff when there are more than 1k failing ProvReqs - i.e., precisely when a backoff would be most useful for preventing starvation of other requests.
Can you add a TODO to clean up ProvReqs when they're resolved (i.e., they succeed, time out, or are deleted)?
To be clear, I'm OK with merging it as is, but I think we should leave a note that it's a stopgap solution.
Done. I also removed elements that are provisioned or failed.
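A rough sketch of the kind of cleanup described here, kept deliberately independent of the ProvisioningRequest API; the cache type, field names, and the caller-supplied resolved predicate are all illustrative, not the PR's code.

```go
package backoff

import "time"

// backoffCache is an illustrative stand-in for the PR's LRU-backed store:
// it maps a ProvReq key ("namespace/name") to its current retry delay.
type backoffCache struct {
	retryTime map[string]time.Duration
}

// pruneResolved removes entries that no longer need backoff tracking.
// The caller supplies resolved, which should report true for ProvReqs that
// were provisioned, failed permanently, or deleted from the cluster.
func (c *backoffCache) pruneResolved(resolved func(key string) bool) {
	for key := range c.retryTime {
		if resolved(key) {
			delete(c.retryTime, key)
		}
	}
}
```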
Force-pushed from c7dbebc to 242edc5
/hold
/lgtm
/approve
Please add a test case for this.
@@ -117,7 +117,8 @@ func TestProvisioningRequestPodsInjector(t *testing.T) {
	}
	for _, tc := range testCases {
		client := provreqclient.NewFakeProvisioningRequestClient(context.Background(), t, tc.provReqs...)
		injector := ProvisioningRequestPodsInjector{client, clock.NewFakePassiveClock(now)}
		backoffTime := map[string]time.Duration{key(notProvisionedRecentlyProvReqB): 2 * time.Minute}
Maybe add a new test case for the backed-off request scenario?
I put the backed-off request scenario in a separate test case.
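This is not the PR's actual test (which drives ProvisioningRequestPodsInjector with a fake clock and a backoffTime map); it is just a self-contained check of the illustrative nextBackoff sketch above, showing a repeatedly failing request backing off until it hits the cap.

```go
package backoff

import "testing"

// TestBackoffCapsAtMax exercises the nextBackoff sketch from above: repeated
// failures keep doubling the delay until it reaches maxBackoffTime and stays there.
func TestBackoffCapsAtMax(t *testing.T) {
	delay := defaultRetryTime
	for i := 0; i < 10; i++ {
		delay = nextBackoff(delay)
	}
	if delay != maxBackoffTime {
		t.Errorf("expected backoff to cap at %v, got %v", maxBackoffTime, delay)
	}
	if nextBackoff(delay) != maxBackoffTime {
		t.Errorf("backoff should stay at the cap once reached")
	}
}
```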
/approve cancel
"k8s.io/client-go/rest" | ||
"k8s.io/klog/v2" | ||
"k8s.io/utils/clock" | ||
) | ||
|
||
const ( | ||
defaultRetryTime = 10 * time.Minute | ||
defaultRetryTime = 1 * time.Minute |
Please make it a flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the initial retry time. What's the reasoning for making it configurable?
In this case let's make maxBackoffTime a flag too, and possibly maxCacheSize as well, so the cluster admin can override it as a mitigation if they run into issues with an overflowing cache.
Not that I like adding lots of flags to an already impressive collection, but Kueue depends on this feature, and CA has a much higher fix-to-release latency and cost (or so it seems), so let's not risk getting stuck with hardcoded values.
There is context here: kubernetes-sigs/kueue#2931. Kueue got some requests for this retry time.
@yaroslava-serdiuk I believe it's about kubernetes-sigs/kueue#2931 (reply in thread). But my previous comment stands: let's make it configurable and easy to mitigate. It's not like we have any data supporting this choice of hardcoded values, AFAIK.
The "right" initial value depends on (at least) 2 factors:
- size of the cluster and, so, the number of incoming provisioning requests.
- the performance/throughput of provrequ processing.
We have no control over the first item, and the second item is being improved right now. So there is no easy way of knowing what value is good now, and, more importantly what value will be OK in 3 months. Having it hardcoded makes any adjustment/finetuning much harder.
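For reference, a sketch of how these values could be surfaced as flags using Go's standard flag package. The flag names and wiring below are illustrative; cluster-autoscaler registers its flags through its own options plumbing, which may differ.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// Illustrative flag names; the actual PR/CA flags may be named and wired differently.
var (
	provReqInitialBackoffTime = flag.Duration("provisioning-request-initial-backoff-time", 1*time.Minute,
		"Initial backoff time for ProvisioningRequest retry after a failed scale-up.")
	provReqMaxBackoffTime = flag.Duration("provisioning-request-max-backoff-time", 10*time.Minute,
		"Maximum backoff time for ProvisioningRequest retry after a failed scale-up.")
	provReqMaxBackoffCacheSize = flag.Int("provisioning-request-max-backoff-cache-size", 1000,
		"Maximum number of ProvisioningRequests tracked in the retry backoff cache.")
)

func main() {
	flag.Parse()
	fmt.Printf("initial=%v max=%v cacheSize=%d\n",
		*provReqInitialBackoffTime, *provReqMaxBackoffTime, *provReqMaxBackoffCacheSize)
}
```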
Force-pushed from 242edc5 to 537f9f4
Force-pushed from 537f9f4 to 54e520c
Force-pushed from 54e520c to 525dad2
/label tide/merge-method-squash
for _, pr := range provReqs {
	if !isSupportedClass(pr) {
		klog.Warningf("Provisioning Class %s is not supported for ProvReq %s/%s", pr.Spec.ProvisioningClassName, pr.Namespace, pr.Name)
I don't think we want to log the warning here. This method will be used in batch processing of check-capacity requests with a custom isSupportedClass function, and we'd end up logging it for best-effort-atomic-scale-up requests. Can we move logging this to the isSupportedClass function defined in Process() instead?
I can't anchor a comment there, but what I mean is:

func(pr *provreqwrapper.ProvisioningRequest) bool {
	_, found := provisioningrequest.SupportedProvisioningClasses[pr.Spec.ProvisioningClassName]
	if !found {
		klog.Warningf("Provisioning Class %s is not supported for ProvReq %s/%s", pr.Spec.ProvisioningClassName, pr.Namespace, pr.Name)
	}
	return found
})
Done
Thanks for moving this to use an LRU cache, it looks much better! One small comment, otherwise it's good to go. Feel free to unhold when you're ready to merge this.
/lgtm
/hold
/unhold
/hold
Force-pushed from e0fb624 to 3520089
/unhold
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aleksandra-malinowska, yaroslava-serdiuk

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
/kind feature