Fix and update vllm-based GRPO Trainer implementation #85

FUJIsyu0515 · 2025-02-13T15:35:43Z

To fix the problems in the original vLLM-based GRPO trainer implementation, a new trainer, Qwen2VLGRPOVLLMTrainerModified, is provided (located in src/open-r1-multimodal/src/open_r1/trainer/vllm_grpo_trainer_modified.py).

It no longer uses RepeatRandomSampler, which avoids the doubling of training steps. Instead, for each prompt in a single original batch, it performs the repeated sampling and the loss calculation, consistent with the logic of Qwen2VLGRPOTrainer. In addition, num_generations no longer has to match the number of GPUs.

The new Trainer has replaced the original Qwen2VLGRPOVLLMTrainer in src/open-r1-multimodal/src/open_r1/grpo.py.
The vllm sampling logic has been corrected.
Multi-machine implementation is not yet completed (TODO).
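
For readers unfamiliar with why grouping per prompt matters, here is a minimal sketch of the group-relative normalization that GRPO applies to rewards. It is not the code from this PR. The function name group_advantages, the eps value, and the use of the population standard deviation are assumptions made for illustration, and rewards are assumed to be a flat list ordered prompt by prompt.

```python
from statistics import mean, pstdev


def group_advantages(rewards, num_generations, eps=1e-4):
    """Normalize rewards within each prompt's group of completions.

    Hypothetical helper: `rewards` is assumed to be laid out as
    [prompt1_gen1, prompt1_gen2, ..., prompt2_gen1, ...].
    """
    if len(rewards) % num_generations != 0:
        raise ValueError("expected num_generations rewards per prompt")
    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mu = mean(group)
        sigma = pstdev(group)  # assumption: population std, for simplicity
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


# Two prompts with three completions each; nothing here depends on the GPU count.
adv = group_advantages([1.0, 0.0, 0.0, 0.5, 0.5, 0.5], num_generations=3)
```

Because each group is normalized on its own, the only requirement is that all completions of a prompt are available together, which is what sampling them within one batch provides.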

TobiasLee · 2025-02-14T00:20:38Z

src/open-r1-multimodal/src/open_r1/trainer/vllm_grpo_trainer_modified.py

+            # group into pairs
+            all_multimodal_inputs = []
+            for prompt, image in zip(all_prompts_text, all_images):
+                for _ in range(self.num_generations):


Could we use the n sampling parameter in vLLM to avoid this for loop?

https://github.com/vllm-project/vllm/blob/main/vllm/sampling_params.py#L97C75-L100C70

TobiasLee · 2025-02-14T00:23:56Z

Here is my idea for avoiding the for loop that repeats each prompt num_generations times (the loop might be slow when num_generations becomes large, compared to using the repeat sampler directly):

        if self.args.use_vllm:
            # previous code remains the same 

            # Generate completions using vLLM: gather all prompts and use them in a single call
            all_prompts_text = gather_object(prompts_text)
            all_images = gather_object(images)

            # prepare all inputs = global batch size 
            all_multimodal_inputs = [{"prompt": prompt, "multi_modal_data": {"image": image}} for prompt, image in zip(all_prompts_text, all_images)]

            # Create sampling params with num_generations
            if self.accelerator.is_main_process:
                # Clone so the trainer's original sampling params are not modified
                sampling_params = self.sampling_params.clone()
                sampling_params.n = self.num_generations
            else:
                sampling_params = None

            # Single generate call with all prompts
            if self.accelerator.is_main_process:
                outputs = self.llm.generate(
                    all_multimodal_inputs,
                    sampling_params=sampling_params,
                    use_tqdm=False,
                )
                # Flatten outputs: [prompt1_gen1, prompt1_gen2..., prompt2_gen1, prompt2_gen2...]
                completion_ids = [out.token_ids for completion in outputs for out in completion.outputs]
            else:
                completion_ids = [None] * len(all_multimodal_inputs) * self.num_generations

            # [Keep the broadcasting and slicing logic unchanged...]
            completion_ids = broadcast_object_list(completion_ids, from_process=0)
            process_slice = slice(
                self.accelerator.process_index * len(prompts) * self.num_generations,
                (self.accelerator.process_index + 1) * len(prompts) * self.num_generations,
            )
            completion_ids = completion_ids[process_slice]

            # [Keep the padding and concatenation logic unchanged...]
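
The flattening and per-process slicing in the suggestion above can be checked in isolation. The sketch below uses stand-in classes (FakeCompletion, FakeRequestOutput) that only mimic the outputs / token_ids attributes used in the snippet; they are not vLLM types. The helper names are hypothetical, and it is assumed that the gathered prompts are ordered by process index, with the same number of prompts on every process.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeCompletion:
    """Stand-in for one sampled completion (only token_ids is modeled)."""
    token_ids: List[int]


@dataclass
class FakeRequestOutput:
    """Stand-in for the per-prompt result holding n completions."""
    outputs: List[FakeCompletion]


def flatten_completions(request_outputs):
    # Prompt-major order: [p1_gen1, p1_gen2, ..., p2_gen1, p2_gen2, ...]
    return [out.token_ids for req in request_outputs for out in req.outputs]


def process_slice(process_index, prompts_per_process, num_generations):
    # Each process owns a contiguous block of prompts_per_process * num_generations items.
    width = prompts_per_process * num_generations
    return slice(process_index * width, (process_index + 1) * width)


# Two processes with one prompt each, two generations per prompt.
num_generations = 2
fake_outputs = [
    FakeRequestOutput([FakeCompletion([p * 10 + g]) for g in range(num_generations)])
    for p in range(2)
]
flat = flatten_completions(fake_outputs)
rank1 = flat[process_slice(1, prompts_per_process=1, num_generations=num_generations)]
```

With this layout, multiplying the slice bounds by num_generations is what keeps every process aligned with the completions of its own prompts.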

stephenruan added 2 commits February 13, 2025 23:02

Update GRPO Trainer implementation based on vllm.

ec55d3b

Align to main branch.

294ac17

TobiasLee reviewed Feb 14, 2025
