KEP-672: Implement the DependsOn API #740
@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: vsoch. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Hi @andreyvelich. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
✅ Deploy Preview for kubernetes-sigs-jobset canceled.
// Complete status means the Succeeded counter equals the number of child Jobs.
// .spec.replicatedJobs["name==<JOB_NAME>"].replicas == .status.replicatedJobsStatus.name["name==<JOB_NAME>"].succeeded
CompleteStatus DependsOnStatus = "Complete"
What do we think about the Complete status here, given the discussion in #723?
We probably want to match the Kubernetes Job and use Complete, so keep as is.
Yea I think that is fine.
/ok-to-test |
I've added integration and unit tests. Let me know if that looks good to you! |
@andreyvelich: GitHub didn't allow me to assign the following users: vsoch, shravan-achar, akshaychitneni. Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
@@ -417,5 +417,126 @@ var _ = ginkgo.Describe("jobset webhook defaulting", func() {
},
updateShouldFail: true,
}),
ginkgo.Entry("DependsOn and StartupPolicy can't be set together", &testCase{
Please add a test case for StartupPolicy set to AnyOrder and allowing DependsOn
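A minimal sketch of the rule this test request implies, assuming that only a non-AnyOrder StartupPolicy conflicts with DependsOn. The PR states that most validations are implemented in CEL, so this Go version is purely illustrative; the types `startupPolicy`, `replicatedJob`, `jobSetSpec` and the function `validateStartupPolicy` are hypothetical stand-ins, not the real JobSet API.

```go
package main

import "fmt"

// startupPolicy is a hypothetical, simplified stand-in for the JobSet
// StartupPolicy. Order is assumed to be "AnyOrder" or "InOrder".
type startupPolicy struct {
	Order string
}

// replicatedJob is a hypothetical stand-in that keeps only the fields
// needed for this check.
type replicatedJob struct {
	Name      string
	DependsOn []string
}

type jobSetSpec struct {
	StartupPolicy  *startupPolicy
	ReplicatedJobs []replicatedJob
}

// validateStartupPolicy rejects specs that combine an ordering StartupPolicy
// with DependsOn. An unset policy or AnyOrder is assumed to be compatible.
func validateStartupPolicy(spec jobSetSpec) error {
	if spec.StartupPolicy == nil || spec.StartupPolicy.Order == "AnyOrder" {
		return nil
	}
	for _, rj := range spec.ReplicatedJobs {
		if len(rj.DependsOn) > 0 {
			return fmt.Errorf("replicatedJob %q: DependsOn and StartupPolicy can't be set together", rj.Name)
		}
	}
	return nil
}

func main() {
	jobs := []replicatedJob{
		{Name: "job-1"},
		{Name: "job-2", DependsOn: []string{"job-1"}},
	}
	// AnyOrder together with DependsOn is accepted in this sketch.
	fmt.Println(validateStartupPolicy(jobSetSpec{StartupPolicy: &startupPolicy{Order: "AnyOrder"}, ReplicatedJobs: jobs}))
	// InOrder together with DependsOn is rejected.
	fmt.Println(validateStartupPolicy(jobSetSpec{StartupPolicy: &startupPolicy{Order: "InOrder"}, ReplicatedJobs: jobs}))
}
```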
api/jobset/v1alpha2/jobset_types.go
Outdated
// only after the referenced ReplicatedJobs reach their desired state.
// The Order of ReplicatedJobs is defined by their enumeration in the slice.
// Note, that the first ReplicatedJob in the slice cannot use the DependsOn API.
// TODO (andreyvelich): Currently, only a single item is supported in the DependsOn list.
I don't think we should have TODO in user facing APIs.
Sure, I will remove it.
api/jobset/v1alpha2/jobset_types.go
Outdated
// only after the referenced ReplicatedJobs reach their desired state.
// The Order of ReplicatedJobs is defined by their enumeration in the slice.
// Note, that the first ReplicatedJob in the slice cannot use the DependsOn API.
// TODO (andreyvelich): Currently, only a single item is supported in the DependsOn list.
Suggested change:
- // TODO (andreyvelich): Currently, only a single item is supported in the DependsOn list.
+ // Currently, only a single item is supported in the DependsOn list.
// .spec.replicatedJobs["name==<JOB_NAME>"].replicas == .status.replicatedJobsStatus.name["name==<JOB_NAME>"].ready
ReadyStatus DependsOnStatus = "Ready"

// Complete status means the Succeeded counter equals the number of child Jobs.
I'm not sure this is actually true. A ReplicatedJob can be marked as succeeded if the success policy passes, which doesn't necessarily mean that all the replicated jobs finished.
"A ReplicatedJob can be marked as succeeded if the success policy passes, which doesn't necessarily mean that all the replicated jobs finished."
Can you give me an example of this use case? From my understanding, ReplicatedJob completion is based on the ReplicatedJob's SuccessPolicy: https://kubernetes.io/docs/concepts/workloads/controllers/job/#success-policy.
This means the ReplicatedJob will be in Complete status when this policy is met.
SuccessPolicy at the JobSet level only determines whether we mark the JobSet as Completed when All or Any ReplicatedJobs reach Complete status.
Ah I see. You are correct.
api/jobset/v1alpha2/jobset_types.go
Outdated
const (
	// Ready status means the Ready counter equals the number of child Jobs.
	// .spec.replicatedJobs["name==<JOB_NAME>"].replicas == .status.replicatedJobsStatus.name["name==<JOB_NAME>"].ready
I don't think the boolean expression is necessary.
Why ?
Should we modify this as follows:
.spec.replicatedJobs["name==<JOB_NAME>"].replicas ==
.status.replicatedJobsStatus.name["name==<JOB_NAME>"].ready +
.status.replicatedJobsStatus.name["name==<JOB_NAME>"].succeeded +
.status.replicatedJobsStatus.name["name==<JOB_NAME>"].failed
Yes.
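The two conditions discussed in this thread can be sketched as a small predicate. This is only an illustration under stated assumptions: `counters` and `reachedStatus` are hypothetical names, not the PR's actual types or helper, and the Ready case uses the ready + succeeded + failed sum proposed above.

```go
package main

import "fmt"

// DependsOnStatus mirrors the two status values named in this PR.
type DependsOnStatus string

const (
	ReadyStatus    DependsOnStatus = "Ready"
	CompleteStatus DependsOnStatus = "Complete"
)

// counters is a hypothetical, simplified stand-in for one entry of
// .status.replicatedJobsStatus; it is not the real JobSet type.
type counters struct {
	Ready, Succeeded, Failed int32
}

// reachedStatus reports whether a ReplicatedJob with the given number of
// replicas has reached the wanted status.
func reachedStatus(want DependsOnStatus, replicas int32, c counters) bool {
	switch want {
	case ReadyStatus:
		// Ready: every child Job is ready or has already finished.
		return replicas == c.Ready+c.Succeeded+c.Failed
	case CompleteStatus:
		// Complete: every child Job has succeeded.
		return replicas == c.Succeeded
	default:
		return false
	}
}

func main() {
	c := counters{Ready: 1, Succeeded: 1, Failed: 0}
	fmt.Println(reachedStatus(ReadyStatus, 2, c))    // true
	fmt.Println(reachedStatus(CompleteStatus, 2, c)) // false
}
```

Counting succeeded and failed Jobs toward Ready means a dependent ReplicatedJob is not blocked forever when a child Job finishes before the check runs.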
api/jobset/v1alpha2/jobset_types.go
Outdated
ReadyStatus DependsOnStatus = "Ready"

// Complete status means the Succeeded counter equals the number of child Jobs.
// .spec.replicatedJobs["name==<JOB_NAME>"].replicas == .status.replicatedJobsStatus.name["name==<JOB_NAME>"].succeeded
I don't think the boolean expression is necessary.
Oh, you mean for user-facing docs. Sure, I think we can remove it.
replicatedJobNames[rJob.Name] = rIdx
// Check that DependsOn references the previous ReplicatedJob.
if rIdx > 0 && rJob.DependsOn != nil {
	dependsOnIdx, ok := replicatedJobNames[rJob.DependsOn[0].Name]
Suggested change:
- dependsOnIdx, ok := replicatedJobNames[rJob.DependsOn[0].Name]
+ dependsOnIdx, ok := replicatedJobNames[rJob.DependsOn[rIdx - 1].Name]
I think you want to look at the previous job and not just the 0th element.
Actually, DependsOn is an array that is limited to a single element, as you can see here:
jobset/api/jobset/v1alpha2/jobset_types.go
Line 245 in d751c13
// +kubebuilder:validation:MaxItems=1
Does that make sense, @kannon92?
Ah yes. I got my streams crossed.
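To make the indexing point concrete, here is a minimal sketch of the ordering check, assuming DependsOn holds at most one entry (per MaxItems=1) and must name a ReplicatedJob that appears earlier in the slice. The types and the `validateDependsOn` function are hypothetical simplifications, not the PR's webhook code.

```go
package main

import "fmt"

// dependsOn and replicatedJob are hypothetical stand-ins that keep only
// the fields needed for the ordering check.
type dependsOn struct {
	Name string
}

type replicatedJob struct {
	Name      string
	DependsOn []dependsOn
}

// validateDependsOn returns the first ordering violation it finds, or nil.
func validateDependsOn(rJobs []replicatedJob) error {
	seen := make(map[string]int, len(rJobs))
	for idx, rJob := range rJobs {
		if len(rJob.DependsOn) > 0 {
			if idx == 0 {
				return fmt.Errorf("replicatedJob %q: the first ReplicatedJob cannot use DependsOn", rJob.Name)
			}
			if len(rJob.DependsOn) > 1 {
				return fmt.Errorf("replicatedJob %q: only a single DependsOn item is supported", rJob.Name)
			}
			// DependsOn[0] is the only element, so index 0 is correct no
			// matter where rJob sits in the ReplicatedJobs slice.
			dep := rJob.DependsOn[0].Name
			if _, ok := seen[dep]; !ok {
				return fmt.Errorf("replicatedJob %q: DependsOn must reference an earlier ReplicatedJob, got %q", rJob.Name, dep)
			}
		}
		// Register the name after the check so a job cannot depend on itself.
		seen[rJob.Name] = idx
	}
	return nil
}

func main() {
	// A chain: job-2 waits for job-1, and job-3 waits for job-2.
	chain := []replicatedJob{
		{Name: "job-1"},
		{Name: "job-2", DependsOn: []dependsOn{{Name: "job-1"}}},
		{Name: "job-3", DependsOn: []dependsOn{{Name: "job-2"}}},
	}
	fmt.Println(validateDependsOn(chain))

	// A forward reference: job-2 depends on job-3, which is defined later.
	forward := []replicatedJob{
		{Name: "job-1"},
		{Name: "job-2", DependsOn: []dependsOn{{Name: "job-3"}}},
		{Name: "job-3"},
	}
	fmt.Println(validateDependsOn(forward))
}
```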
rJobsReplicas[replicatedJob.Name] = replicatedJob.Replicas

// For depends on, the Job is created only after the previous replicatedJob reached the status.
if replicatedJob.DependsOn != nil && !isDependsOnJobReachedStatus(replicatedJob.DependsOn[0], rJobsReplicas[replicatedJob.DependsOn[0].Name], replicatedJobStatuses) {
I think this is a bug. You probably want the idx - 1 rather than 0.
Given how important this is for Kubeflow, I am wondering if an e2e test would be useful.
pkg/webhooks/jobset_webhook_test.go
Outdated
@@ -1487,10 +1487,152 @@ func TestValidateCreate(t *testing.T) {
},
}

dependsOnTests := []validationTestCase{
{
name: "dependsOn is valid",
Can we add a case where job-2 waits for job-1 and job-3 waits for job-2?
Improve API docs
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: andreyvelich, vsoch The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
76d53e5 to bf743bb
Yeah, maybe we could add one e2e test as part of JobSet: https://github.com/kubernetes-sigs/jobset/blob/main/test/e2e/e2e_test.go. |
c0b8760 to d76b286
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it:
I added an implementation of the DependsOn API. Most of the validations are implemented using CEL.
TODO list:
depends_on.go
/cc @ahg-g @kannon92 @danielvegamyhre @tenzen-y @vsoch
Which issue(s) this PR fixes:
Fixes: #672
Does this PR introduce a user-facing change?