
A Trainer subclass for Decoder-Only LM with generation in evaluate() #33396

Open
skpig opened this issue Sep 10, 2024 · 4 comments · May be fixed by #32346
Labels: Feature request, trainer

Comments

@skpig (Contributor) commented Sep 10, 2024:

Feature request

The main feature request is a new Trainer subclass, similar to Seq2SeqTrainer, but suited to decoder-only LMs.

Motivation

Seq2SeqTrainer provides a great abstraction for encoder-decoder LMs when we need to run generation during evaluate().

But the current implementations of both Trainer and Seq2SeqTrainer are unsuitable for decoder-only LMs, because input_ids and labels differ between teacher-forcing training and generation-based evaluation.

For example, in instruction tuning (a preprocessing sketch follows the list):

  • During training (teacher-forcing):
    input_ids = 'Translate the following texts: {Text in Chinese...} {Text in English...}'
    labels    = 'Translate the following texts: {Text in Chinese...} {Text in English...}'
  • During evaluation:
    input_ids = 'Translate the following texts: {Text in Chinese...}'
    labels    = '{Text in English...}'
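
A rough preprocessing sketch of the two encodings (the tokenizer checkpoint and the prompt/answer strings are placeholders, not part of the request):

```python
# Rough sketch of the two encodings; "gpt2" and the strings are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "Translate the following texts: {Text in Chinese...} "
answer = "{Text in English...}"

# Training (teacher-forcing): the model sees prompt + answer, and labels are
# a copy of input_ids (the causal shift happens inside the model; prompt
# tokens are often masked to -100 so they don't contribute to the loss).
train_input_ids = tokenizer(prompt + answer)["input_ids"]
train_labels = list(train_input_ids)

# Evaluation: generation starts from the prompt alone, and the reference
# answer is tokenized separately for metrics such as BLEU.
eval_input_ids = tokenizer(prompt)["input_ids"]
eval_labels = tokenizer(answer)["input_ids"]
```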

So during evaluation we need to prepare two kinds of input_ids, one to compute the loss and one to compute BLEU metrics, which means extra columns in eval_dataset. However, Trainer._remove_unused_columns() prunes any column not accepted by model.forward() from both eval_dataset and train_dataset. During training this behaviour is expected (we only need the teacher-forcing inputs), but it makes evaluation difficult.
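
One partial escape hatch today (a workaround sketch, not the requested feature) is to disable column pruning entirely and handle the extra columns in a custom data collator:

```python
# Workaround sketch: remove_unused_columns is an existing TrainingArguments
# flag; "out" is a placeholder output directory.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    remove_unused_columns=False,  # Trainer no longer prunes dataset columns
)
```

The downside is that the flag applies to train_dataset as well, so the collator then has to filter the columns itself for each phase.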

This logic is nearly identical across all CausalLM models that generate during evaluation, making it highly reusable. Given the growing number of decoder-only LMs (CausalLMs) in the community, I strongly recommend adding a dedicated CausalTrainer to make this easier.
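
To make the idea concrete, here is a minimal sketch of what such a subclass could look like; the gen_-prefixed eval columns and the generation settings are assumptions for illustration, not an existing API:

```python
# Minimal sketch of a CausalTrainer; the "gen_*" eval-batch columns
# (prompt-only inputs plus reference ids) are hypothetical names.
import torch
from transformers import Trainer


class CausalTrainer(Trainer):
    def prediction_step(self, model, inputs, prediction_loss_only, ignore_keys=None):
        # Teacher-forcing loss on the full prompt + answer sequence.
        tf_inputs = {k: v for k, v in inputs.items() if not k.startswith("gen_")}
        loss, _, _ = super().prediction_step(
            model, tf_inputs, prediction_loss_only=True, ignore_keys=ignore_keys
        )
        if prediction_loss_only:
            return loss, None, None

        # Free-running generation from the prompt-only inputs.
        with torch.no_grad():
            generated = model.generate(
                input_ids=inputs["gen_input_ids"],
                attention_mask=inputs["gen_attention_mask"],
                max_new_tokens=128,  # assumed setting for the sketch
            )
        # Strip the prompt so compute_metrics only sees the completion.
        generated = generated[:, inputs["gen_input_ids"].shape[1]:]
        return loss, generated, inputs["gen_labels"]
```

Using this would still require remove_unused_columns=False (as above) and a collator that pads the gen_* columns to a common length.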

I may have missed something. If there is already a simpler way to customize such a Trainer, please let me know.

Your contribution

I'm willing to help submit a PR, but I'm not familiar with some of the integrations such as FSDP and DeepSpeed, so I may need help to finish this feature.

@skpig added the Feature request label on Sep 10, 2024
@LysandreJik (Member):

cc @muellerzr regarding the Trainer question, but maybe @Rocketknight1 can shed light on the NLP/training aspect

@zucchini-nlp (Member):

I guess the feature requested here is the same as in #32346.

@Rocketknight1 (Member):

Looks that way to me - we can add a note to #32346 saying that it fixes this issue! @skpig, is there anything missing from #32346 that you need for your use cases?

@skpig (Contributor, Author) commented Sep 11, 2024:

Fantastic, that's exactly what I need! Looking forward to the new feature! 😊

@zucchini-nlp linked pull request #32346 on Sep 11, 2024 that will close this issue