Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow mutating schedulingGates when the Jobset is suspended #623

Merged
merged 1 commit into from
Jul 25, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions pkg/webhooks/jobset_webhook.go
Original file line number Diff line number Diff line change
Expand Up @@ -318,6 +318,9 @@ func (j *jobSetWebhook) ValidateUpdate(ctx context.Context, old, newObj runtime.
mungedSpec.ReplicatedJobs[index].Template.Spec.Template.Labels = oldJS.Spec.ReplicatedJobs[index].Template.Spec.Template.Labels
mungedSpec.ReplicatedJobs[index].Template.Spec.Template.Spec.NodeSelector = oldJS.Spec.ReplicatedJobs[index].Template.Spec.Template.Spec.NodeSelector
mungedSpec.ReplicatedJobs[index].Template.Spec.Template.Spec.Tolerations = oldJS.Spec.ReplicatedJobs[index].Template.Spec.Template.Spec.Tolerations

// Pod Scheduling Gates can be updated for batch/v1 Job: https://github.com/kubernetes/kubernetes/blob/ceb58a4dbc671b9d0a2de6d73a1616bc0c299863/pkg/apis/batch/validation/validation.go#L662
mungedSpec.ReplicatedJobs[index].Template.Spec.Template.Spec.SchedulingGates = oldJS.Spec.ReplicatedJobs[index].Template.Spec.Template.Spec.SchedulingGates
Comment on lines +322 to +323
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the batch/job, we can mutate the schedulingGates only when the Job didn't start: https://github.com/kubernetes/kubernetes/blob/ceb58a4dbc671b9d0a2de6d73a1616bc0c299863/pkg/registry/batch/job/strategy.go#L194

So, shouldn't we introduce the same criteria for the mutable scheduling directive here, right?

Copy link
Contributor Author

@mimowo mimowo Jul 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For JobSet we just check if the Job is suspended. This is done in the entry if above if ptr.Deref(oldJS.Spec.Suspend, false) {.

The other part of the check is based on the status.startTime, but this field is not in the JobSet API:

type JobSetStatus struct {
// +optional
// +listType=map
// +listMapKey=type
Conditions []metav1.Condition `json:"conditions,omitempty"`
// Restarts tracks the number of times the JobSet has restarted (i.e. recreated in case of RecreateAll policy).
Restarts int32 `json:"restarts,omitempty"`
// RestartsCountTowardsMax tracks the number of times the JobSet has restarted that counts towards the maximum allowed number of restarts.
RestartsCountTowardsMax int32 `json:"restartsCountTowardsMax,omitempty"`
// TerminalState the state of the JobSet when it finishes execution.
// It can be either Complete or Failed. Otherwise, it is empty by default.
TerminalState string `json:"terminalState,omitempty"`
// ReplicatedJobsStatus track the number of JobsReady for each replicatedJob.
// +optional
// +listType=map
// +listMapKey=name
ReplicatedJobsStatus []ReplicatedJobStatus `json:"replicatedJobsStatus,omitempty"`
}
.

The difference between Job and JobSet is related to all of the fields (Annotations, Labels, NodeSelector, Tolerations too),.

Copy link
Contributor Author

@mimowo mimowo Jul 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the check above only verifies that oldJS is suspended. Which means that, IIUC, there is a bug that we cannot modify the template on suspending (oldJS unsuspended, but js suspended). I think this could render failures in Kueue for the suspend, I will double-check and open a separate PR / Issue for it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other part of the check is based on the status.startTime, but this field is not in the JobSet API:

Oh, I didn't know that! Thank you for the explanation!

I think this could render failures in Kueue for the suspend, I will double-check and open a separate PR / Issue for it.

I also feel that the behavior is a bug/regression. But, I agree with treating the regression as a separate issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I have confirmed that this currently makes the suspend fail for JobSet in Kueue: kubernetes-sigs/kueue#2691. I have opened an issue for JobSet #624, and the PR: #625. PTAL.

Copy link
Contributor Author

@mimowo mimowo Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change is rather straightforward (and follows the patterns of other fields) so I wasn't planning an e2e test here.

I think once this PR lands, we could have an extended e2e test scenario added for the other PR, where we:

  1. unsuspend a JobSet adding scheduling gates, and other PodTemplate fields
  2. suspend the JobSet reverting the scheduling gates and other PodTemplate fields (as Kueue would do)
  3. unsuspend again the JobSet with different values of the PodTemplate fields and check that the pods executes (JobSet completes)

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit + integration tests should be sufficient here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mimowo that sounds good. Maybe we create an issue documenting the goal of the e2e test so we can reference it.

Copy link
Contributor Author

@mimowo mimowo Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it work to just reference the proposal from the issue here: #624? I have added a comment there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit + integration tests should be sufficient here.

Ok, I have added the unit test dedicated for scheduling gates, while the integration test I extended the existing one for schedulingGates, because integration tests are heavier than unit, and the test case verifies that the mutation can happen, so any field that is immutable would fail the test (I confirmed it was failing before the change).

}
}

Expand Down
94 changes: 94 additions & 0 deletions pkg/webhooks/jobset_webhook_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -1561,6 +1561,100 @@ func TestValidateUpdate(t *testing.T) {
field.Invalid(field.NewPath("spec").Child("replicatedJobs"), "", "field is immutable"),
}.ToAggregate(),
},
{
name: "schedulingGates for pod template can be updated for suspended JobSet",
js: &jobset.JobSet{
ObjectMeta: validObjectMeta,
Spec: jobset.JobSetSpec{
ReplicatedJobs: []jobset.ReplicatedJob{
{
Name: "test-jobset-replicated-job-0",
Replicas: 2,
Template: batchv1.JobTemplateSpec{
Spec: batchv1.JobSpec{
Parallelism: ptr.To[int32](2),
Template: corev1.PodTemplateSpec{
Spec: corev1.PodSpec{
mimowo marked this conversation as resolved.
Show resolved Hide resolved
// Adding a scheduling gate
SchedulingGates: []corev1.PodSchedulingGate{
{
Name: "example.com/gate",
},
},
},
},
},
},
},
},
},
},
oldJs: &jobset.JobSet{
ObjectMeta: validObjectMeta,
Spec: jobset.JobSetSpec{
Suspend: ptr.To(true),
ReplicatedJobs: []jobset.ReplicatedJob{
{
Name: "test-jobset-replicated-job-0",
Replicas: 2,
Template: batchv1.JobTemplateSpec{
Spec: batchv1.JobSpec{
Parallelism: ptr.To[int32](2),
},
},
},
},
},
},
},
{
name: "schedulingGates for pod template cannot be updated for unsuspended JobSet",
js: &jobset.JobSet{
ObjectMeta: validObjectMeta,
Spec: jobset.JobSetSpec{
ReplicatedJobs: []jobset.ReplicatedJob{
{
Name: "test-jobset-replicated-job-0",
Replicas: 2,
Template: batchv1.JobTemplateSpec{
Spec: batchv1.JobSpec{
Parallelism: ptr.To[int32](2),
Template: corev1.PodTemplateSpec{
Spec: corev1.PodSpec{
// Adding a scheduling gate
SchedulingGates: []corev1.PodSchedulingGate{
{
Name: "example.com/gate",
},
},
},
},
},
},
},
},
},
},
oldJs: &jobset.JobSet{
ObjectMeta: validObjectMeta,
Spec: jobset.JobSetSpec{
ReplicatedJobs: []jobset.ReplicatedJob{
{
Name: "test-jobset-replicated-job-0",
Replicas: 2,
Template: batchv1.JobTemplateSpec{
Spec: batchv1.JobSpec{
Parallelism: ptr.To[int32](2),
},
},
},
},
},
},
want: field.ErrorList{
field.Invalid(field.NewPath("spec").Child("replicatedJobs"), "", "field is immutable"),
}.ToAggregate(),
},
{
name: "replicated job name cannot be updated",
js: &jobset.JobSet{
Expand Down
5 changes: 5 additions & 0 deletions test/integration/webhook/jobset_webhook_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -372,6 +372,11 @@ var _ = ginkgo.Describe("jobset webhook defaulting", func() {
},
updateJobSet: func(js *jobset.JobSet) {
js.Spec.ReplicatedJobs[0].Template.Spec.Template.Annotations["new"] = "annotation"
js.Spec.ReplicatedJobs[0].Template.Spec.Template.Spec.SchedulingGates = []corev1.PodSchedulingGate{
{
Name: "example.com/gate",
},
}
},
}),
ginkgo.Entry("updating pod template in running jobset is not allowed", &testCase{
Expand Down