
could not load weight to vllm after the first training step #132

Open
llliuxiao opened this issue Feb 24, 2025 · 6 comments

Comments

@llliuxiao

Hi! I am training with vLLM on 8 A100s.

After the reward for the first training step is computed, this error appears: "[rank0]: AssertionError: Attempted to load weight (torch.Size([0])) into parameter (torch.Size([1280, 3, 2, 14, 14]))".

The error comes from:

 if self.accelerator.is_main_process:
     llm_model = self.llm.llm_engine.model_executor.driver_worker.model_runner.model
     llm_model.load_weights(state_dict.items())
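For intuition, the shape mismatch in the traceback can be reproduced in isolation. Under ZeRO-3, each rank holds only a shard of every parameter, so a naive `state_dict()` yields 0-sized placeholder tensors for the parameters a rank does not fully own; vLLM's weight loader then compares shapes and raises. (The usual remedy is to gather the full parameters, e.g. with `deepspeed.zero.GatheredParameters`, before calling `load_weights`.) A minimal sketch of the failing check, in plain Python with illustrative shapes and an assert message modeled on the traceback:

```python
# Sketch of the shape check that fires inside vLLM's weight loader when
# it receives a ZeRO-3 placeholder instead of a full tensor. Shapes and
# the message format are illustrative, not vLLM's exact internals.

def load_weight(param_shape, weight_shape):
    # Mirrors the shape assertion in the weight loader.
    assert weight_shape == param_shape, (
        f"Attempted to load weight (torch.Size({list(weight_shape)})) "
        f"into parameter (torch.Size({list(param_shape)}))"
    )

param_shape = (1280, 3, 2, 14, 14)  # e.g. the Qwen2-VL patch-embed conv weight
placeholder = (0,)                  # 0-sized ZeRO-3 placeholder in the state_dict

try:
    load_weight(param_shape, placeholder)
except AssertionError as exc:
    print(exc)
```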

And here is part of my launch script:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
torchrun --nproc_per_node="6" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12345" \
    src/open_r1/embodied_grpo.py \
    --output_dir "ckpts" \
    --model_name_or_path Qwen/Qwen2-VL-2B-Instruct \

I was wondering if you've seen anything like that. Thanks!


tcy6 commented Feb 24, 2025

@llliuxiao This issue is most likely caused by ZeRO-3 TP. I saw someone work around it by changing zero3 to zero2; I don't know whether there is a solution that keeps zero3.


llliuxiao commented Feb 24, 2025

@llliuxiao This issue is most likely caused by ZeRO-3 TP. I saw someone work around it by changing zero3 to zero2; I don't know whether there is a solution that keeps zero3.

Wow! Do you have a link for that? @tcy6


tcy6 commented Feb 24, 2025

@llliuxiao It's here: #81

@llliuxiao

@llliuxiao It's here: #81

I tried zero3_offload and the same problem occurs. With zero2 I get this error instead: RuntimeError: torch.cat(): expected a non-empty list of Tensors. I did not modify anything in either of the two json files.

@tcy6
Copy link

tcy6 commented Feb 24, 2025

@llliuxiao Based on my attempts, deleting the following section fixes it:

"optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
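For reference, with that block deleted the optimizer is created by the Trainer from the command-line arguments instead of by DeepSpeed. A ZeRO-2 config without the optimizer section might then look like the following (a minimal sketch of common DeepSpeed keys, not the repo's exact zero2.json):

```json
{
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": true,
        "contiguous_gradients": true
    },
    "bf16": {
        "enabled": "auto"
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto"
}
```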

@llliuxiao

@llliuxiao Based on my attempts, deleting the following section fixes it:

"optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

Thank you very much! This works for zero2!

However, zero2 seems to perform worse than zero3: with vLLM and Qwen2-VL-2B on a 40GB A100, I still run out of memory.
