Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trainer: add predict with generate #32346

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

zucchini-nlp
Copy link
Member

@zucchini-nlp zucchini-nlp commented Jul 31, 2024

What does this PR do?

Fixes #26474, fixes #31462, fixes #33396 and fixes #31672. This PR adds possibility to generate and compute metrics on generated texts for decoder-only models.

The basic idea is almost same as in Seq2Seq Trainer, but decoder-only models need a prompt-only input for generation. While for loss computation we need the whole input. Therefore we can ask users to prepare train and eval datasets, so that the eval contains generation_inputs used for generation. Additionally, to make user's life easier, I added a possibility to pass in different collators for train and for eval/test datasets.

The args used for generation should be set via GenerationConfig, as imo that makes most sense instead of adding only max_length and num_beams as in Seq2SeqTrainer.

The code was tested with the below dummy train script.

import random
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoProcessor, BitsAndBytesConfig, Idefics2ForConditionalGeneration, TrainingArguments, Trainer

DEVICE = "cuda:0"
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
pad_token_id = processor.tokenizer.pad_token_id

lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules='.*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$',
    use_dora=False,
    init_lora_weights="gaussian"
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    quantization_config=bnb_config,
)

model.add_adapter(lora_config)
model.enable_adapters()


eval_dataset = load_dataset("nielsr/docvqa_1200_examples", split="test")
eval_dataset = eval_dataset.remove_columns(['id', 'words', 'bounding_boxes', 'answer'])
eval_dataset = eval_dataset.select(range(10))


class DataCollatorForGeneration:
    def __init__(self, processor, eval_mode=False):
        self.processor = processor
        self.image_token_id = processor.tokenizer.additional_special_tokens_ids[
            processor.tokenizer.additional_special_tokens.index("<image>")
        ]
        self.eval_mode = eval_mode

    def __call__(self, examples):
        texts, texts_eval = [], []
        images = []
        for example in examples:
            image = example["image"]
            question = example["query"]["en"]
            answer = random.choice(example["answers"])
            messages = [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Answer briefly."},
                        {"type": "image"},
                        {"type": "text", "text": question}
                    ]
                },
                {
                    "role": "assistant",
                    "content": [
                        {"type": "text", "text": answer}
                    ]
                }
            ]
            text = processor.apply_chat_template(messages, add_generation_prompt=False)
            text_eval = processor.apply_chat_template([messages[0]], add_generation_prompt=True)
            texts.append(text.strip())
            texts_eval.append(text_eval.strip())
            images.append([image])

        # Make sure we have right padding in train and left padding for eval parts
        processor.tokenizer.padding_side = "right"
        batch = processor(text=texts, images=images, return_tensors="pt", padding=True) 
        
        if self.eval_mode:
            processor.tokenizer.padding_side = "left"
            batch_eval = processor(text=texts, images=images, return_tensors="pt", padding=True)
            batch['generation_input_ids'] = batch_eval['input_ids']
            batch['generation_attention_mask'] = batch_eval['attention_mask']

        labels = batch["input_ids"].clone()
        labels[labels == processor.tokenizer.pad_token_id] = self.image_token_id
        batch["labels"] = labels

        return batch

gen_config = model.generation_config
gen_config.max_length = 200

training_args = TrainingArguments(
    max_steps=100,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=5,
    gradient_accumulation_steps=2,
    output_dir="tmp_delete",
    eval_strategy="steps",
    fp16=True,
    remove_unused_columns=False,
    report_to="none",
    predict_with_generate=True,
    generation_config=gen_config,
)

def custom_metrics(prediction_dict):
    # unmask for correct detokenization, because preds are padded to max length with -100
    preds = prediction_dict.predictions
    preds[preds == -100] = pad_token_id
    lbls = prediction_dict.label_ids
    lbls[lbls == -100] = pad_token_id

    # Decode and do magic for metrics
    preds = processor.batch_decode(preds)
    lbls = processor.batch_decode(lbls)
    bleu = rouge = 0
    return {"bleu" : bleu, "rouge": rouge}


trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorForGeneration(processor),
    eval_data_collator=DataCollatorForGeneration(processor, eval_mode=True),
    train_dataset=eval_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=custom_metrics,
)

print(trainer.evaluate())

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp zucchini-nlp requested a review from LysandreJik July 31, 2024 07:20
@LysandreJik
Copy link
Member

This seems good to me from a quick read, pinging @SunMarc and @muellerzr who have more experience with that code than I do.

@salrowili
Copy link

Can you please show an example how this code can work with HF text dataset (not the multimodal dataset) without Idefics2 processor? I mean using tokenizer.apply_chat_template ? how right and left padding would be handle in this case?

@zucchini-nlp
Copy link
Member Author

@salrowili it should be similar to Idefics with the only difference that instead of processor.tokenizer you have simply tokenizer. The main thing to note is that Trainer needs inputs for loss calculation which are prepared same was as they were always, and also needs inputs for generation which should be prepended with generation_ prefix and left-padded

Below is a modified version of Idefics script, should work for text models

class DataCollatorForGeneration:
    def __init__(self, tokenizer, eval_mode=False):
        self.tokenizer = tokenizer
        self.eval_mode = eval_mode

    def __call__(self, examples):
        texts, texts_eval = [], []
        images = []
        for example in examples:
            question = example["query"]["en"]
            answer = random.choice(example["answers"])
            messages = [
                {
                    "role": "user",
                    "content": f"Answer question: {question}"
                },
                {
                    "role": "assistant",
                    "content": answer
                }
            ]
            text = tokenizer.apply_chat_template(messages, add_generation_prompt=False)
            text_eval = tokenizer.apply_chat_template([messages[0]], add_generation_prompt=True)
            texts.append(text.strip())
            texts_eval.append(text_eval.strip())
            images.append([image])

        # Make sure we have right padding in train and left padding for eval parts
        tokenizer.padding_side = "right"
        batch = tokenizer(text=texts, return_tensors="pt", padding=True) 
        
        if self.eval_mode:
            tokenizer.padding_side = "left"
            batch_eval = tokenizer(text=texts, return_tensors="pt", padding=True)
            batch['generation_input_ids'] = batch_eval['input_ids']
            batch['generation_attention_mask'] = batch_eval['attention_mask']

        labels = batch["input_ids"].clone()
        labels[labels == tokenizer.pad_token_id] = -100 # Ignore index for CE-loss
        batch["labels"] = labels

@salrowili
Copy link

salrowili commented Aug 2, 2024

@salrowili it should be similar to Idefics with the only difference that instead of processor.tokenizer you have simply tokenizer. The main thing to note is that Trainer needs inputs for loss calculation which are prepared same was as they were always, and also needs inputs for generation which should be prepended with generation_ prefix and left-padded

Below is a modified version of Idefics script, should work for text models

class DataCollatorForGeneration:
    def __init__(self, tokenizer, eval_mode=False):
        self.tokenizer = tokenizer
        self.eval_mode = eval_mode

    def __call__(self, examples):
        texts, texts_eval = [], []
        images = []
        for example in examples:
            question = example["query"]["en"]
            answer = random.choice(example["answers"])
            messages = [
                {
                    "role": "user",
                    "content": f"Answer question: {question}"
                },
                {
                    "role": "assistant",
                    "content": answer
                }
            ]
            text = tokenizer.apply_chat_template(messages, add_generation_prompt=False)
            text_eval = tokenizer.apply_chat_template([messages[0]], add_generation_prompt=True)
            texts.append(text.strip())
            texts_eval.append(text_eval.strip())
            images.append([image])

        # Make sure we have right padding in train and left padding for eval parts
        tokenizer.padding_side = "right"
        batch = tokenizer(text=texts, return_tensors="pt", padding=True) 
        
        if self.eval_mode:
            tokenizer.padding_side = "left"
            batch_eval = tokenizer(text=texts, return_tensors="pt", padding=True)
            batch['generation_input_ids'] = batch_eval['input_ids']
            batch['generation_attention_mask'] = batch_eval['attention_mask']

        labels = batch["input_ids"].clone()
        labels[labels == tokenizer.pad_token_id] = -100 # Ignore index for CE-loss
        batch["labels"] = labels

@zucchini-nlp Thank you for the update. I have added some lines to the code to make a complete example for QA task.

from transformers import AutoModelForCausalLM, AutoTokenizer,BitsAndBytesConfig,TrainingArguments, Trainer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM
from peft import LoraConfig
import torch
from torchmetrics.text import SQuAD
from random import randrange
from transformers.utils import logging

dataset = load_dataset("Stanford/web_questions")

train_dataset=dataset["train"]
eval_dataset=dataset["test"]

eval_dataset = eval_dataset.select(range(256))

quant_config=BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_type=torch.bfloat16
)


model_id="meta-llama/Meta-Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
    torch_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

tokenizer.add_special_tokens({"pad_token":"</s>"})
pad_token_id = tokenizer.pad_token_id


model.resize_token_embeddings(len(tokenizer),pad_to_multiple_of=8)


gen_config=gen_config = model.generation_config
gen_config.max_length = 256

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model.add_adapter(peft_config)
model.enable_adapters()

tokenizer.chat_template = "{% for message in messages %}{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}{{ message['content'] }}{% if not loop.last %}{{ '  ' }}{% endif %}{% endfor %}{{ eos_token }}"

class DataCollatorForGeneration:
    def __init__(self, tokenizer, eval_mode=False):
        self.tokenizer = tokenizer
        self.eval_mode = eval_mode

    def __call__(self, examples):
        texts, texts_eval = [], []
        for example in examples:
            question = example["question"]
            answer = example["answers"][0] ### webquestion dataset has multiple answers so to make the code simple we choost the first answer
            messages = [
                {
                    "role": "user",
                    "content": f"Answer the following question: {question}"
                },
                {
                    "role": "assistant",
                    "content": answer
                }
            ]
            
            text = tokenizer.apply_chat_template(messages, add_generation_prompt=False,tokenize=False)
            text_eval = tokenizer.apply_chat_template([messages[0]],add_generation_prompt=True,tokenize=False)
            texts.append(text.strip())
            texts_eval.append(text_eval.strip())
            ## uncomment to check template format
            # print(text)
            #print(text_eval)
            #exit()
        # Make sure we have right padding in train and left padding for eval parts
        tokenizer.padding_side = "right"
        batch = tokenizer(text=texts, return_tensors="pt", padding=True) 
        
        if self.eval_mode:
            tokenizer.padding_side = "left"
            batch_eval = tokenizer(text=texts_eval, return_tensors="pt", padding=True)
            batch['generation_input_ids'] = batch_eval['input_ids']
            batch['generation_attention_mask'] = batch_eval['attention_mask']
        labels = batch["input_ids"].clone()
        labels[labels == tokenizer.pad_token_id] = -100 # Ignore index for CE-loss
        batch["labels"] = labels
        return batch




def custom_metrics(prediction_dict):
    # unmask for correct detokenization, because preds are padded to max length with -100
    preds = prediction_dict.predictions
    preds[preds == -100] = pad_token_id
    lbls = prediction_dict.label_ids
    lbls[lbls == -100] = pad_token_id

    # Decode and do magic for metrics
    preds = tokenizer.batch_decode(preds,skip_special_tokens=True)
    lbls = tokenizer.batch_decode(lbls,skip_special_tokens=True)
    ## uncomment if you want to see all special token (e.g, EOS)
    #preds = tokenizer.batch_decode(preds)
    #lbls = tokenizer.batch_decode(lbls)
    print("\n\n\n",'=' * 40,"Labels",'=' * 40)
    for item_x in lbls[:5]:
        print(item_x,"\n")
    print("\n",'=' * 40,"Predictions",'=' * 40)
    for item_x in preds[:5]:
        print(item_x,"\n")
    print("\n",'=' * 80)
    pred_list=[]
    label_list=[]
    idx=0
    ## visit https://lightning.ai/docs/torchmetrics/stable/text/squad.html for reference ##
    for x,y in zip(preds,lbls):
       pred_list.append({"prediction_text": x.split("?")[1], "id": idx})
       label_list.append({"answers": {"text": [y.split("?")[1]]}, "id": idx})
       squad = SQuAD()(pred_list,label_list)
       em_score=squad["exact_match"].item()
       f1_score=squad["f1"].item()
       idx+=1
    return {"exact_match" : em_score, "f1_score": f1_score}

def preprocess_logits_for_metrics(logits, labels):
        """Helper function for logits preprocessing for metrics"""
        preds = torch.argmax(logits, dim=-1)
        return preds, labels

training_args = TrainingArguments(
    per_device_train_batch_size=8,
    per_device_eval_batch_size=128,
    num_train_epochs=20,
    do_train=True,
    do_eval=True,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500000,
    bf16=True,
    output_dir="./test_predict",
    overwrite_output_dir=True,
    optim="adafactor",
    report_to="none",
    logging_steps=100000,
    remove_unused_columns=False,
    predict_with_generate=True,
    generation_config=gen_config)


trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorForGeneration(tokenizer),
    eval_data_collator=DataCollatorForGeneration(tokenizer, eval_mode=True),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=custom_metrics,
)

trainer.train()

Here is my comments and question to think about:

  • Evaluation is very slow. I had to increase the batch size to 128. Usually with seq2seq trainer batch size after 8 would not affect the speed but in this example more batch size led to more inference and evaluation speed. This become a problem with XLA (e.g. TPU) with FSDP. The evaluation will freeze forever.
  • Is there any way to speed up the evaluation with pack=True?
  • I think we need to integrate the code to trl library including all trainer codes in trl repo (e.g, https://github.com/huggingface/trl/blob/main/trl/trainer/sft_trainer.py)
  • The code need "report_to" to be assigned to none. Otherwise there will be problem with generation config format. The code you post accept the generation config in json format where the model.generation_config will be different (e.g, for Llama) .
  • We need to fed the prompt (e.g., instruction and prompt) to compute_metric or preprocess_logits_for_metrics in trainer and SFT trainer. This way we can split the completion (e.g. answer) from the whole sequence (e.g., instructing + Question).
  • In this code i did not use the "preprocess_logits_for_metrics" function. do you think doing some codes in preprocess_logits_for_metrics function can speed up the evaluation?

The output for this code is pretty decent :


 ======================================== Labels ========================================
Answer the following question: what does jamaican people speak?  Jamaican Creole English Language

Answer the following question: what did james k polk do before he was president?  Lawyer

Answer the following question: what is the oregon ducks 2012 football schedule?  University of Oregon

Answer the following question: who plays ken barlow in coronation street?  Tony Warren

Answer the following question: what happened after mr. sugihara died?  Yaotsu


 ======================================== Predictions ========================================
Answer the following question: what does jamaican people speak?  Jamaican Creole English Language#include<fstream.h>ndependent institution?

Answer the following question: what did james k polk do before he was president?  Lawyer(Redirected from James K. Polk

Answer the following question: what is the oregon ducks 2012 football schedule?  University of Oregon"""
University of Oregon Ducks football

Answer the following question: who plays ken barlow in coronation street?  Tony Warrendef乍

Answer the following question: what happened after mr. sugihara died?  Yaotsu``

Thank you

@zucchini-nlp
Copy link
Member Author

Thanks for feedback and testing the feature!

  • Hmm, let me see the speed issue, my guess is that generation in general will require more time as you are doing extra forward passes through the model. Increasing batch size makes it faster as now you generate 128 examples in one go. I will see if I can make generate faster, as we already do two forward passes, we could somehow use the past-key-values. Also will look into packing
  • Regarding TRL library, let's merge the PR in transformers first and we can port the changes to TRL if everyone is happy with the code
  • Are you reporting to WandB? I'll try to do report and see how to make it work
  • The question with prompts should be doable with current impl if we ask Trainer to include_inputs_for_metrics, iiuc that will let us see the prompt used by model and the whole completion generated

@zucchini-nlp
Copy link
Member Author

Seems like there's not much we can do about long evaluation time when generating. I tried to track how long it takes with decoder-only and encoder-decoder models. Indeed most encoder-decoder models are fast mainly because they're lightweight while the model you tried with is 8B parameters. I did several checks to verify that the evaluation speed is approximately same when we have models of similar size. Increasing batch size is one option to generate faster as you tried already. Another option is to generate only on a small sample of the eval set, and let users enable generation on the whole dataset if they want to

Also, logging on WandB is working for me and the generation config is logged as a simple dict, can you share what errors you got there @salrowili ?

@salrowili
Copy link

Hi @zucchini-nlp . When i state that the prediction is slow i compare it to this script here https://huggingface.co/docs/trl/en/sft_trainer, which is much faster. I think one possible way to solve this problem is to integrate your code to SFTTrainer class from trl repo and see if the speed has changed. Another way is to do it through eval_packing which will group couple of example together to fill the sequence. see : https://github.com/huggingface/trl/blob/314e8eb367cbfaf74c2e9717085346360e779508/trl/trainer/sft_trainer.py#L110 .

For wandb logging, to reproduce the error, just change report_to from None to wandb and you will get the error. but this issue is minor as we can overcome it by using wandb.init and wandb.log inside the code it self.

@zucchini-nlp
Copy link
Member Author

Oke, so it's SFTTrainer, then I'll see what is different there. For packing, we can calculate loss with packing but not generate, since generation tries to continue next several tokens from sequence and in a packed sequence there might be more than one.

In general, we had an idea to try out torch.nested_tensor which is the most similar to packing, but it won't be soon

@shubhamjain0594
Copy link

shubhamjain0594 commented Aug 28, 2024

Thanks for this PR @zucchini-nlp, hoping it gets merged soon. I am using similar thing internally to train decoder-only models for information extraction. I saw a concern that this is slower than traditional SFT Trainer and something I experienced as well.

My belief is that this might be mainly because in SFT Trainer, during prediction it predicts n+1 token using previous n tokens that come from prompt + ground truth. While in this case, previous n tokens come from prompt + predictions. So it cannot be parallelized same way as the SFT Trainer, where you can literally predict n+1, n+2... tokens in parallel.

My belief comes from the fact that I saw a drop in eval performance and increase in time when using predict_with_generate, compared to using SFTTrainer as it is.

for k, v in generation_inputs.items()
if k.replace("generation_", "") not in gen_keys and "generation" not in k
}
generated_tokens = self.model.generate(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we use model instead of self.model here? In the evaluation_loop(), the self.model is wrapped and the wrapped model may not always be the same as self.model. I think this is for the case when deepspeed zero3 is enabled and evalute_on_start is set to true.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For inference we don't wrap for distributed mode, but I changed for model because there are some other steps run before returning the model. The original code was adapted from seq2seq trainer, so I modified it there too

# Note: in torch.distributed mode, there's no point in wrapping the model
# inside a DistributedDataParallel as we'll be under `no_grad` anyways.
if not training:
return model

@zucchini-nlp
Copy link
Member Author

@shubhamjain0594 Yes, that's exactly what I meant that generation is expected to take more time than simple forward. As per the last comment from @salrowili , I compared SFT and HFTrained, with and without generation. I don't see any slow down caused by HFTrainer specifically, as the both of them rely on the same code to do training and evalution. The only diff is that SFT support packed dataset while HFTrainer doesn't.

The current implementation of predict_with_generate doesn't support packing for reasons mentioned above. To be more specific generation cannot happen if we pack dataset, because we don't support it in transformers yet. The model generates the next token given all prev tokens, and packing would result in several prompts being merged thus bad quality generation.

I think we can let users use a small sample of the eval set for generation, if they don't want to slow down evaluation loop. Applying some generation optimization tricks here might not be the optimal solution, as we are trying to verify how good the model is learning. The only technique I can think of that can be used is torch compile, but it is still a very new feature and I would rather not integrate it in GFTrainer yet.

As per WandB, it worked for me in SFT and HFTrained, the generation config is logged as a dict in parameters

So, I think I can request review from @muellerz now. The PR isn't very high priority, so feel free to take a look whenever you have bandwidth

@zucchini-nlp zucchini-nlp requested a review from muellerzr August 30, 2024 10:52
Copy link
Member

@SunMarc SunMarc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR @zucchini-nlp! This PR looks very good. I left a few comments ! In order to simplify things, I'm thinking that it would be easier to just add a generation_kwargs, WDYT @muellerzr @zucchini-nlp ?

src/transformers/trainer.py Outdated Show resolved Hide resolved
src/transformers/trainer.py Outdated Show resolved Hide resolved
src/transformers/trainer.py Outdated Show resolved Hide resolved
run.py Outdated Show resolved Hide resolved
seq2seq.py Outdated Show resolved Hide resolved
Comment on lines +4232 to +4237
synced_gpus = gen_kwargs.get("synced_gpus", default_synced_gpus)
if len(gen_kwargs) > 0:
unused_kwargs = gen_config.update(**gen_kwargs)
if unused_kwargs:
logger.warning_once(
"Following generation related kwargs were passed to `prediction_step` but not "
Copy link
Member

@SunMarc SunMarc Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that if you pass synced_gpus in gen_kwargs, the warning will appear since it will be in unused_kwargs. Maybe do pop instead. Also this will trigger the warning in other places also.

test_dataset: Dataset,
ignore_keys: Optional[List[str]] = None,
metric_key_prefix: str = "test",
**gen_kwargs,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you also need to add it for prediction_step

Comment on lines 3929 to 3930
# Therefore, generation_config should be available
self.gen_config = self.model.generation_config
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we also update the config with gen_kwargs ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean add kwargs from model.config to generation config? It shouldn't be necessary because the base model.generation_config should contain all generation related kwargs after the model is loaded. So we just need to make sure user-passed kwargs have higher priority than trainer.generation_config

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm taking about the gen_kwargs that you are passing in predict. I would expect that self.gen_config is updated when the user pass gen_kwargs in the predict function in all cases (important in the case we pass a generate kwargs such as synced_gpus ). By default, it is equal to self.model.generation_config but if the user passes it in TrainingArguments, it will be equal to self.args.generation_config.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see now, right, we should be updating i any case

Comment on lines 3811 to 3827
# Set generation-related kwargs
if self.args.predict_with_generate:
if self.args.generation_config is not None:
gen_config = self.args.generation_config
self.gen_config = copy.deepcopy(gen_config) # copy so we don't modify args.gen_config in-place
unused_kwargs = self.gen_config.update(**gen_kwargs)
if unused_kwargs:
logger.warning_once(
f"Following generation related kwargs were passed to `evaluate` but not used by `generate()`: "
f"{' '.join(unused_kwargs.keys())} .",
"Make sure there are no typos in the passed kwargs or do not pass unused kwargs.",
)
else:
# We assume the model can generate if predict-with-generate is True
# Therefore, generation_config should be available
self.gen_config = self.model.generation_config

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment here

Comment on lines 1523 to 1532
generation_config: Optional[GenerationConfig] = field(
default=None,
metadata={
"help": (
"The GenerationConfig that will be used during prediction. Args from this config ",
"will have higher priority than model's generation config. Anything not set by this config ",
"will fallback to `model.generation_config`.",
)
},
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we simplify a bit things if we also add a generation_kwargs as this is incompatible with generation_config + I don't think we want to merge both arguments into one. WDYT @muellerzr ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, maybe we can then allow users to pass generation_config as a dict also, then we can make a Config object of it ourselves. I see that TrainerSeq2Seq args also uses a config arg, so I thought we could later merge seq2seq args with trainerArgs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be better I think ! This way, we won't need to have **gen_kwargs in evaluate and predict function. cc @muellerzr @gante

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oke, now we can accept a dict or a config object in training args

@qiuosier
Copy link

qiuosier commented Aug 31, 2024

Hi @zucchini-nlp ! Thank you for adding this PR. I have been testing it and I have a few questions/thoughts:

  • When DeepSpeed zero3 is enabled, I am hitting the error mentioned in Bugfix for generation with an early-stopping process #32641 during the model.generate(). Hope we can merge that as well @SunMarc.
  • For the slower performance, I think this is expected since the new calls to model.gernerate() involves iterative decoding, which may have many forward passes. My understanding is, one forward pass for each new token, while the existing call to model() involves only single forward pass for each input sample.
  • For packing in SFT trainer, I think we can set packing=True and eval_packing=False to do packing for the training dataset only.
  • In your example code, I think you intended to get generation_input_ids from texts_eval (instead of texts). Otherwise, as we can see in the example outputs from @salrowili, every prediction includes the entire label.

@zucchini-nlp
Copy link
Member Author

@qiuosier Yes, in SFT one can pack train dataset and not pack the evaluation. I am not 100% sure it work with SFT out-of-the-box, since afair SFT doesn't accept the train_with_predict arg. We can work on adding SFT support after this PR is merged

  • In your example code, I think you intended to get generation_input_ids from texts_eval (instead of texts).

Oh, right, a typo hehe

@qiuosier
Copy link

qiuosier commented Sep 4, 2024

Hi @zucchini-nlp , the SFTConfig(args argument), is a subclass of TrainingArguments, so once your version of transformers is installed, we will be able to use the predict_with_generate in the args for in the SFTTrainer. I have to set dataset_kwargs={"skip_prepare_dataset": True} to customize the data preparation in the __init__() and use a customized data collator though. Because by default, the SFTTrainer will tokenize (and pack) the data in the __init__() call and the data collator is used for padding only. I think tokenizing the data in __init__() is more efficient as we only need to do it once, instead of doing it every epoch in the data collator.

@zucchini-nlp
Copy link
Member Author

@qiuosier oke, cool, then it might work out-of-the-box. Didn't really test it yet

@pdufour
Copy link

pdufour commented Nov 1, 2024

Any chance this can be merged?

@pdufour
Copy link

pdufour commented Nov 3, 2024

@zucchini-nlp I've been testing your branch and there were a couple issues that I fixed in regards to using it with SFTTrainer:

  1. [Draft] Add eval_data_collator arg trl#2311
  2. [Draft] Add autocast to prediction_step for SFTTrainer trl#2310

These can be landed after the transformers PR lands and a new version is released. I've tried to help you and re-merge in main into this branch, but when I made a PR it was not what I thought it'd be - zucchini-nlp#1. There weren't many changes though, just a couple conflicts. Let me know if there's anything I can do to help.

@zucchini-nlp
Copy link
Member Author

@pdufour hey! Thanks for opening PRs to integrate the feature in TRL. I agree that we should first get this merged in transformers.

For the progress, I am waiting for reviews from @muellerz . Can you review pls? 🤗 I will rebase main in a few hours

@R4ZZ3
Copy link

R4ZZ3 commented Nov 5, 2024

Hopefully this get's merged soon.
I thought it would be easy to implement some custom_metrics to calculate like mtbench scores at every eval step as previously something like bleu/wer were easy to calculate using just evaluate library

@tcz
Copy link

tcz commented Nov 26, 2024

So am I correct to assume that all existing collators, e.g. DataCollatorForCompletionOnlyLM will need to be modified so that they contain generation_input_ids and generation_attention_mask?

If so, it's less than ideal.

@tcz
Copy link

tcz commented Dec 12, 2024

While I hope this PR gets merged, here's a messy workaround I currently use:

import torch
from tqdm import tqdm

resume = False
for steps in tqdm(range(0, 1701, 100)):
    print(f"Steps: {steps}")

    if steps > 0:
        trainer = create_trainer(model, tokenizer, training_data['train'], steps)
        if resume:
            trainer.train(resume_from_checkpoint=True)
        else:
            trainer.train()
            resume = True
        
    model = FastLanguageModel.for_inference(model)

    test_prediction(model, training_data['test'], steps)

    model = FastLanguageModel.for_training(model)

I basically set max_steps on the trainer each time, set the model to eval mode, run my metrics and then resume_from_checkpoint with an increased number of steps. It's not very elegant, but it works for the time being.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
10 participants