KEP-2170: Support the PodSpecOverrides API in TrainJob #2218

Open
andreyvelich opened this issue Aug 14, 2024 · 7 comments

@andreyvelich
Member

Related: #2170

We should be able to configure Pod spec overrides for TrainJob via the PodSpecOverrides API.

/area controller

@shravan-achar

/assign

@tenzen-y
Member

tenzen-y commented Oct 30, 2024

/assign

Thank you for taking this issue. Before the implementation, could you update our design? We have not yet decided how to handle situations where overrides conflict.

So it would be better to define, in the design proposal, the order in which overrides are applied and what happens when they conflict.
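To make the open question concrete, here is a minimal sketch of one possible policy, not the decided design. It assumes overrides are flat key/value maps applied in list order, that an override may replace a value from the base spec, and that two overrides setting the same field to different values is an error. The names `apply_overrides` and `OverrideConflict` are hypothetical and do not come from the KEP.

```python
from typing import Any, Dict, List


class OverrideConflict(Exception):
    """Raised when two overrides set the same field to different values."""


def apply_overrides(
    base: Dict[str, Any], overrides: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Apply overrides to a copy of base in list order, rejecting conflicts.

    Hypothetical policy (an assumption, not the KEP's decision): an override
    may replace a base value, but two overrides may not disagree on a field.
    """
    result = dict(base)  # shallow copy so the caller's base is not modified
    set_by: Dict[str, int] = {}  # field -> index of the first override that set it
    for index, override in enumerate(overrides):
        for key, value in override.items():
            if key in set_by and result[key] != value:
                raise OverrideConflict(
                    f"overrides[{set_by[key]}] and overrides[{index}] "
                    f"disagree on {key!r}"
                )
            result[key] = value
            set_by.setdefault(key, index)
    return result
```

Other policies are equally plausible, for example "last override wins" or a deep merge of nested fields; the design proposal would need to pick one and state it explicitly.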

@andreyvelich
Member Author

Yeah, I think it is a good idea to explain how the PodSpecOverrides API behaves.
@shravan-achar Could you please update the KEP before the implementation?

@shravan-achar

Makes sense. I will work on updating the KEP.

@shravan-achar

shravan-achar commented Jan 8, 2025

Although it is called podSpecOverrides, can we use this API to support customizing the trainer image? Specifically, the discussion on this PR and this issue aims to track use cases where customizing the entrypoint, packages, or working directory is needed for interactive TrainJob workloads.

In general, is there any way today, other than building a new image, to customize a trainer image for rapid iteration with TrainJob?

What are your thoughts, @andreyvelich @tenzen-y?

@andreyvelich
Member Author

andreyvelich commented Jan 8, 2025

> Although it is called podSpecOverrides, can we use this API to support customizing the trainer image?

What kind of customization do you mean? I remember we discussed before that parameters from the Trainer API take precedence over PodSpecOverrides: https://github.com/kubeflow/training-operator/blob/master/docs/proposals/2170-kubeflow-training-v2/README.md?plain=1#L900

Or are you talking about other customizations?
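The precedence described above can be pictured as layered defaults. The sketch below is only an illustration under assumptions: it treats the runtime template, the pod spec override, and the Trainer parameters as flat dictionaries, and the function name `resolve_container` and the example field values are hypothetical. The only behavior taken from the discussion is that Trainer values win over PodSpecOverrides.

```python
from typing import Any, Dict


def resolve_container(
    runtime: Dict[str, Any],
    pod_spec_override: Dict[str, Any],
    trainer: Dict[str, Any],
) -> Dict[str, Any]:
    """Layer values so that later layers win: runtime < override < trainer.

    The three-layer shape is an assumption made for illustration.
    """
    resolved = dict(runtime)
    resolved.update(pod_spec_override)
    # Trainer fields are applied last. Fields left unset (None) in the
    # Trainer do not erase values coming from the lower layers.
    resolved.update({k: v for k, v in trainer.items() if v is not None})
    return resolved
```

With this layering, an `image` set in both the override and the Trainer resolves to the Trainer's value, while a field set only in the override is kept.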

@shravan-achar

Other customizations, for example adding a few packages and specifying a working directory for a PyTorch-based TrainJob.

It's possible to do this indirectly by changing the container command (say, by adding a pip command), but this is not friendly to data scientists and ML engineers, as they need to know more about Kubernetes. I was wondering if we can provide a first-class way to specify the runtime environment (different from the trainingRuntime).

On the other hand, it might be okay to consider the two things as conflated. We could have podSpecOverrides simply support webhooks that inject custom containers / env into TrainJobs, and provide a way to specify the TrainJob runtime environment through a different API on the TrainJob.
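For reference, the indirect approach of changing the container command could look roughly like the sketch below. It is an assumption-laden illustration: the helper name `wrap_command` is hypothetical, and it assumes the trainer image ships `bash` and `pip`. It only builds the command list; how that list reaches the TrainJob is left open.

```python
import shlex
from typing import List, Optional


def wrap_command(
    command: List[str],
    packages: List[str],
    workdir: Optional[str] = None,
) -> List[str]:
    """Build a container command that installs packages, changes into
    workdir, and then runs the original command.

    Assumes the image provides bash and pip (not guaranteed in general).
    """
    steps = []
    if packages:
        # Quote each package spec so version pins and extras survive the shell.
        steps.append("pip install " + " ".join(shlex.quote(p) for p in packages))
    if workdir:
        steps.append("cd " + shlex.quote(workdir))
    # exec replaces the shell so the training process receives signals directly.
    steps.append("exec " + shlex.join(command))
    return ["bash", "-c", " && ".join(steps)]
```

For example, `wrap_command(["torchrun", "train.py"], ["tqdm"], "/workspace")` returns `["bash", "-c", "pip install tqdm && cd /workspace && exec torchrun train.py"]`. This shows why the approach asks users to reason about shells and container entrypoints, which is the friction described above.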
