Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-672: Implement the DependsOn API #740

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

andreyvelich
Copy link
Contributor

@andreyvelich andreyvelich commented Dec 27, 2024

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

I added implementation for the DependsOn API. The majority of validations are implemented using CEL.

TODO list:

  • Add unit tests for the depends_on.go
  • Add controller integration tests to verify CEL validation.
  • Add controller integration tests to verify Job creation.

/cc @ahg-g @kannon92 @danielvegamyhre @tenzen-y @vsoch

Which issue(s) this PR fixes:

Fixes: #672

Does this PR introduce a user-facing change?

Add the DependsOn API to support serial execution of ReplicatedJobs.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 27, 2024
@k8s-ci-robot k8s-ci-robot requested a review from ahg-g December 27, 2024 13:17
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Dec 27, 2024
@k8s-ci-robot
Copy link
Contributor

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: vsoch.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

I added implementation for the DependsOn API. The majority of validations are implemented using CEL.

TODO list:

  • Add unit tests for the depends_on.go
  • Add controller integration tests to verify CEL validation.
  • Add controller integration tests to verify Job creation.

/cc @ahg-g @kannon92 @danielvegamyhre @tenzen-y @vsoch

Which issue(s) this PR fixes:

Fixes: #672

Does this PR introduce a user-facing change?

Add the DependsOn API to support serial execution of ReplicatedJobs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 27, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @andreyvelich. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Dec 27, 2024
Copy link

netlify bot commented Dec 27, 2024

Deploy Preview for kubernetes-sigs-jobset canceled.

Name Link
🔨 Latest commit d76b286
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-jobset/deploys/677c5b97ebcb1b000823f38e


// Complete status means the Succeeded counter equals the number of child Jobs.
// .spec.replicatedJobs["name==<JOB_NAME>"].replicas == .status.replicatedJobsStatus.name["name==<JOB_NAME>"].succeeded
CompleteStatus DependsOnStatus = "Complete"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we think about Complete status here, given the discussion in: #723 ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to match the Kubernetes Job and use Complete, so keep as is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I think that is fine.

@kannon92
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 27, 2024
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 2, 2025
@andreyvelich
Copy link
Contributor Author

I've added integration and unit tests. Let me know if that looks good to you!
/assign @ahg-g @kannon92 @danielvegamyhre @tenzen-y @vsoch @shravan-achar @akshaychitneni

@k8s-ci-robot
Copy link
Contributor

@andreyvelich: GitHub didn't allow me to assign the following users: vsoch, shravan-achar, akshaychitneni.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

I've added integration and unit tests. Let me know if that looks good to you!
/assign @ahg-g @kannon92 @danielvegamyhre @tenzen-y @vsoch @shravan-achar @akshaychitneni

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@andreyvelich andreyvelich changed the title [WIP] KEP-672: Implement the DependsOn API KEP-672: Implement the DependsOn API Jan 3, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 3, 2025

// Complete status means the Succeeded counter equals the number of child Jobs.
// .spec.replicatedJobs["name==<JOB_NAME>"].replicas == .status.replicatedJobsStatus.name["name==<JOB_NAME>"].succeeded
CompleteStatus DependsOnStatus = "Complete"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to match the Kubernetes Job and use Complete, so keep as is.

@@ -417,5 +417,126 @@ var _ = ginkgo.Describe("jobset webhook defaulting", func() {
},
updateShouldFail: true,
}),
ginkgo.Entry("DependsOn and StartupPolicy can't be set together", &testCase{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test case for StartupPolicy set to AnyOrder and allowing DependsOn

// only after the referenced ReplicatedJobs reach their desired state.
// The Order of ReplicatedJobs is defined by their enumeration in the slice.
// Note, that the first ReplicatedJob in the slice cannot use the DependsOn API.
// TODO (andreyvelich): Currently, only a single item is supported in the DependsOn list.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should have TODO in user facing APIs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will remove it.

// only after the referenced ReplicatedJobs reach their desired state.
// The Order of ReplicatedJobs is defined by their enumeration in the slice.
// Note, that the first ReplicatedJob in the slice cannot use the DependsOn API.
// TODO (andreyvelich): Currently, only a single item is supported in the DependsOn list.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// TODO (andreyvelich): Currently, only a single item is supported in the DependsOn list.
// Currently, only a single item is supported in the DependsOn list.

// .spec.replicatedJobs["name==<JOB_NAME>"].replicas == .status.replicatedJobsStatus.name["name==<JOB_NAME>"].ready
ReadyStatus DependsOnStatus = "Ready"

// Complete status means the Succeeded counter equals the number of child Jobs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is actually true. A ReplicatedJob can be mark as succeeded if the success policy passes which doesn't necessarly mean that all the replicated jobs finished.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A ReplicatedJob can be mark as succeeded if the success policy passes which doesn't necessarly mean that all the replicated jobs finished.

Can you give me an example of this use-case? From my understanding, ReplicatedJob completion is based on ReplicatedJob's SuccessPolicy: https://kubernetes.io/docs/concepts/workloads/controllers/job/#success-policy.
Which means, ReplicatedJob will be in Complete status when this policy is met.

SuccessPolicy on the JobSet level just sets whether we should mark JobSet as Completed, when All or Any ReplicatedJobs reach Complete status.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. You are correct.


const (
// Ready status means the Ready counter equals the number of child Jobs.
// .spec.replicatedJobs["name==<JOB_NAME>"].replicas == .status.replicatedJobsStatus.name["name==<JOB_NAME>"].ready
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the boolean expression is necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ?

Should we modify this as follows:

.spec.replicatedJobs["name==<JOB_NAME>"].replicas == 
.status.replicatedJobsStatus.name["name==<JOB_NAME>"].ready + 
.status.replicatedJobsStatus.name["name==<JOB_NAME>"].succeeded + 
.status.replicatedJobsStatus.name["name==<JOB_NAME>"].failed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReadyStatus DependsOnStatus = "Ready"

// Complete status means the Succeeded counter equals the number of child Jobs.
// .spec.replicatedJobs["name==<JOB_NAME>"].replicas == .status.replicatedJobsStatus.name["name==<JOB_NAME>"].succeeded
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the boolean expression is necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you mean for user-facing docs. Sure, I think we can remove it.

replicatedJobNames[rJob.Name] = rIdx
// Check that DependsOn references the previous ReplicatedJob.
if rIdx > 0 && rJob.DependsOn != nil {
dependsOnIdx, ok := replicatedJobNames[rJob.DependsOn[0].Name]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dependsOnIdx, ok := replicatedJobNames[rJob.DependsOn[0].Name]
dependsOnIdx, ok := replicatedJobNames[rJob.DependsOn[rIdx - 1].Name]

I think you want to look at the previous job and not just the 0th element.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, DependsOn is an array which always has a single element, as you can see here:

// +kubebuilder:validation:MaxItems=1

Does it make sense @kannon92 ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes. I got my streams crossed.

rJobsReplicas[replicatedJob.Name] = replicatedJob.Replicas

// For depends on, the Job is created only after the previous replicatedJob reached the status.
if replicatedJob.DependsOn != nil && !isDependsOnJobReachedStatus(replicatedJob.DependsOn[0], rJobsReplicas[replicatedJob.DependsOn[0].Name], replicatedJobStatuses) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bug. You probably want the idx - 1 rather than 0.

Copy link
Contributor

@kannon92 kannon92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given how important this is for kubeflow, I am wondering if a e2e test would be useful.

@@ -1487,10 +1487,152 @@ func TestValidateCreate(t *testing.T) {
},
}

dependsOnTests := []validationTestCase{
{
name: "dependsOn is valid",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a case where job-2 waits for job-1 and job-3 waits for job-2?

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: andreyvelich, vsoch
Once this PR has been reviewed and has the lgtm label, please ask for approval from ahg-g. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andreyvelich andreyvelich force-pushed the issue-672-implement-depends-on branch from 76d53e5 to bf743bb Compare January 6, 2025 15:27
@andreyvelich
Copy link
Contributor Author

Given how important this is for kubeflow, I am wondering if a e2e test would be useful.

Yeah, maybe we could add one e2e test as part of JobSet: https://github.com/kubernetes-sigs/jobset/blob/main/test/e2e/e2e_test.go.
Will add it soon.

@andreyvelich andreyvelich force-pushed the issue-672-implement-depends-on branch from c0b8760 to d76b286 Compare January 6, 2025 22:39
@andreyvelich
Copy link
Contributor Author

@kannon92 @tenzen-y I've added the simple E2E test for DependsOn API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for the execution policy API in JobSet
7 participants