
could not load weight to vllm after the first training step #132

Open
llliuxiao opened this issue Feb 24, 2025 · 6 comments

Comments

@llliuxiao

Hi! I am training with vLLM on 8 A100s.

After the reward for the first training step is computed, this error appears: "[rank0]: AssertionError: Attempted to load weight (torch.Size([0])) into parameter (torch.Size([1280, 3, 2, 14, 14]))".

The error comes from:

 if self.accelerator.is_main_process:
     llm_model = self.llm.llm_engine.model_executor.driver_worker.model_runner.model
     llm_model.load_weights(state_dict.items())
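For intuition, the shape mismatch in the traceback can be reproduced in isolation. Under ZeRO-3, each rank holds only a shard of every parameter, so a naive `state_dict()` yields 0-sized placeholder tensors for the parameters a rank does not fully own; vLLM's weight loader then compares shapes and raises. (The usual remedy is to gather the full parameters, e.g. with `deepspeed.zero.GatheredParameters`, before calling `load_weights`.) A minimal sketch of the failing check, in plain Python with illustrative shapes and an assert message modeled on the traceback:

```python
# Sketch of the shape check that fires inside vLLM's weight loader when
# it receives a ZeRO-3 placeholder instead of a full tensor. Shapes and
# the message format are illustrative, not vLLM's exact internals.

def load_weight(param_shape, weight_shape):
    # Mirrors the shape assertion in the weight loader.
    assert weight_shape == param_shape, (
        f"Attempted to load weight (torch.Size({list(weight_shape)})) "
        f"into parameter (torch.Size({list(param_shape)}))"
    )

param_shape = (1280, 3, 2, 14, 14)  # e.g. the Qwen2-VL patch-embed conv weight
placeholder = (0,)                  # 0-sized ZeRO-3 placeholder in the state_dict

try:
    load_weight(param_shape, placeholder)
except AssertionError as exc:
    print(exc)
```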

And here is part of my launch script:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
torchrun --nproc_per_node="6" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12345" \
    src/open_r1/embodied_grpo.py \
    --output_dir "ckpts" \
    --model_name_or_path Qwen/Qwen2-VL-2B-Instruct \

I was wondering if you've seen anything like that. Thanks!


tcy6 commented Feb 24, 2025

@llliuxiao This issue is most likely caused by ZeRO-3 TP. I saw someone work around it by changing zero3 to zero2; I don't know whether there is a solution that keeps zero3.


llliuxiao commented Feb 24, 2025

@llliuxiao This issue is most likely caused by ZeRO-3 TP. I saw someone work around it by changing zero3 to zero2; I don't know whether there is a solution that keeps zero3.

Wow! Do you have a link for that? @tcy6


tcy6 commented Feb 24, 2025

@llliuxiao It's here: #81

@llliuxiao

@llliuxiao It's here: #81

I tried zero3_offload and the same problem occurs. With zero2 I get this error instead: RuntimeError: torch.cat(): expected a non-empty list of Tensors. I did not modify anything in either of the two json files.

@tcy6
Copy link

tcy6 commented Feb 24, 2025

@llliuxiao Based on my attempts, deleting the following section fixes it:

"optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
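For reference, with that block deleted the optimizer is created by the Trainer from the command-line arguments instead of by DeepSpeed. A ZeRO-2 config without the optimizer section might then look like the following (a minimal sketch of common DeepSpeed keys, not the repo's exact zero2.json):

```json
{
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": true,
        "contiguous_gradients": true
    },
    "bf16": {
        "enabled": "auto"
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto"
}
```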

@llliuxiao

@llliuxiao Based on my attempts, deleting the following section fixes it:

"optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

Thank you very much! This works for zero2!

However, zero2 seems to perform worse than zero3: with vLLM and Qwen2-VL-2B on a 40GB A100, I still run out of memory.
