KEP-2170: Support the PodSpecOverrides API in TrainJob #2218
Comments
/assign |
Thank you for taking this issue. Before implementing, could you update our design? We have not yet decided how to handle situations where overrides conflict. So it would be better to define the order of overriding, and what happens when there are conflicts, in the design proposal. |
Yeah, I think it is a good idea to explain how the PodSpecOverride API behaves. |
Makes sense. Will work on updating the KEP |
Although it is called podSpecOverrides, can we use this API to support customizing the trainer image? Specifically, discussion on this PR and this issue aims to track use-cases where customizing entrypoint, packages or the working directory is needed in the context of interactive trainJob workloads. In general, are there other means today to customize a trainer image for rapid iteration using TrainJob other than building a new image? What are your thoughts @andreyvelich @tenzen-y ? |
What kind of customization do you mean? I remember we discussed before that parameters from the Trainer API take precedence over PodSpecOverrides: https://github.com/kubeflow/training-operator/blob/master/docs/proposals/2170-kubeflow-training-v2/README.md?plain=1#L900 Or are you talking about other customizations? |
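To make the precedence question concrete, here is a rough sketch of one possible ordering. It is not the controller's actual logic, which lives in the Go codebase and is still to be defined in the KEP. It assumes three layers of container fields (runtime template, PodSpecOverrides, Trainer API) represented as plain dicts, where later layers win and unset (`None`) values never erase earlier ones. The function names, field names, and image names are all hypothetical.

```python
from typing import Any, Dict, List, Optional

Fields = Dict[str, Optional[Any]]


def resolve_container_fields(runtime: Fields, pod_spec_override: Fields, trainer: Fields) -> Fields:
    """Merge three layers of container fields; later layers take precedence.

    Assumed order (lowest to highest): runtime template, PodSpecOverrides,
    Trainer API. Only the last step (Trainer over PodSpecOverrides) is stated
    in this thread; the rest is an assumption for illustration.
    """
    resolved: Fields = {}
    for layer in (runtime, pod_spec_override, trainer):
        for key, value in layer.items():
            if value is not None:  # an unset field never erases an earlier value
                resolved[key] = value
    return resolved


def find_conflicts(pod_spec_override: Fields, trainer: Fields) -> List[str]:
    """Return fields set to different values in both PodSpecOverrides and Trainer."""
    return sorted(
        key
        for key, value in pod_spec_override.items()
        if value is not None
        and trainer.get(key) is not None
        and trainer[key] != value
    )


# Hypothetical example values.
runtime = {"image": "runtime/torch:base", "command": ["torchrun"], "workingDir": None}
override = {"image": "custom/torch:dev", "workingDir": "/workspace"}
trainer = {"image": "trainer/torch:v2"}

resolved = resolve_container_fields(runtime, override, trainer)
conflicts = find_conflicts(override, trainer)
print(resolved)
print(conflicts)
```

Whether a conflict like the `image` one here should be silently resolved in favor of the Trainer API or rejected by a validating webhook is exactly the kind of decision the design proposal would need to state.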
Other customizations such as adding a few packages and specifying a working directory for the PyTorch based TrainJob, for example. It's possible to do this indirectly with changing the container command (say add a On the other hand, it might be okay to consider the two things as conflated. We could consider podSpecOverrides to simply support webhooks to inject custom containers / env on train jobs, and have a way to specify the trainJob runtime environment through a different API on the trainJob |
Related: #2170
We should be able to configure Pod spec overrides for TrainJob via the PodSpecOverrides API.
/area controller