Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate ways to validate Job templates at JobSet validation time #422

Open
Tracked by #438 ...
danielvegamyhre opened this issue Feb 13, 2024 · 15 comments
Open
Tracked by #438 ...
Assignees
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@danielvegamyhre
Copy link
Contributor

Currently we don't have a good way of validating the Job templates at JobSet validation time, so JobSets containing some error in the Job template somewhere are created successfully anyway, giving the user the false impression their workload spec is valid, only for them to be confused when no pods are created (either because the Job spec was invalid, or the pod spec was invalid, etc).

Debugging what the issue is requires checking the JobSet controller logs, which is not ideal.

We should see if we can validate at least some key parts of the Job template at JobSet creation time, in order to improve the user experience by rejecting the JobSet creation and giving them a clear error message.

@kannon92
Copy link
Contributor

I think we should consider bringing this to upstream..

it would be nice to be able to call the same validation logic in kube. Maybe we could consider moving the validation functions to staging and allow people to call it and return the validation errors…

@danielvegamyhre
Copy link
Contributor Author

For now I think the path of least resistance is to re-implement some of the core upstream Job validation logic into the JobSet admission webhook. If the upstream logic is ever available for higher level APIs like JobSet to use, we can revisit this.

@danielvegamyhre
Copy link
Contributor Author

danielvegamyhre commented Mar 13, 2024

I think this function encapsulates most of the logic we want to port over. Referencing 1.26 validation since that is our oldest supported version, and we don't want our validation logic referencing fields introduced in later versions.

@danielvegamyhre
Copy link
Contributor Author

/assign

@danielvegamyhre
Copy link
Contributor Author

Update on this: I have a working prototype using kubectl-validate packages to set up an openapiclient aware of the JobSet CRD and validate the CRD (which has the Job template embedded) and use that to create a validator. The time consuming part will now be updating all of our unit tests to actually use valid JobSet specs in order to pass this validation (e.g. containers cannot be empty, etc).

I discussed our use case with one of the developers of kubectl-validate and learned that it will only perform basic validation, enforcing the CRD manifest rules defining field data types, if they are required/optional etc.

More advanced conditional validation like the apiserver performs is part of the future work planned for kubectl-validate. I still think this will be useful to catch these categories of errors at JobSet creation time.

@ahg-g
Copy link
Contributor

ahg-g commented Mar 20, 2024

This sounds like the right approach!

I discussed our use case with one of the developers of kubectl-validate and learned that it will only perform basic validation, enforcing the CRD manifest rules defining field data types, if they are required/optional etc.

So no validation of field values? like setting an incorrect PodFailurePolicyAction for pod failure policy?

@danielvegamyhre
Copy link
Contributor Author

This sounds like the right approach!

I discussed our use case with one of the developers of kubectl-validate and learned that it will only perform basic validation, enforcing the CRD manifest rules defining field data types, if they are required/optional etc.

So no validation of field values? like setting an incorrect PodFailurePolicyAction for pod failure policy?

I just tested setting a pod failure policy action of InvalidAction and the validation passed, so yes it won't catch this error.

@kannon92
Copy link
Contributor

Hmm. So kubectl-validate will detect if someone didn't specify required fields for a Job template/Pod Template?

@danielvegamyhre
Copy link
Contributor Author

Hmm. So kubectl-validate will detect if someone didn't specify required fields for a Job template/Pod Template?

Yep, for example if you leave "containers" empty in your pod template it catches that. It also catches empty required fields in JobSet itself, although we default those.

@danielvegamyhre
Copy link
Contributor Author

Update on this: I synced with kubectl-validate maintainers and aligned on prioritizing improved validation for k8s native types (e.g. JobTemplate in this case), so it will more closely replicate some of the conditional validation done by the apiserver. They don't think 100% fidelity with apiserver validation will be possible in the near future, but we can get partial fidelity (perhaps 80%), which would be a huge improvement for JobSet.

They are working on a prototype which should be ready for testing in a few weeks.

@danielvegamyhre
Copy link
Contributor Author

Update: saw a demo of kubectl-validate feature requested for this use case and it looks promising. Just received instructions for how to include a custom build in JobSet to test with, so I plan on doing that soon.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 4, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 4, 2024
@danielvegamyhre
Copy link
Contributor Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 4, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.
Projects
None yet
Development

No branches or pull requests

5 participants