could not load weight to vllm after the first training step #132
Comments
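@llliuxiao This problem is probably caused by zero3 TP. I saw that someone worked around it by switching from zero3 to zero2. I don't know whether there is a solution that keeps zero3.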
Wow! Do you have a link to that? @tcy6
@llliuxiao It's here: #81
I tried zero3_offload and hit the same problem. With zero2 I get this error instead: RuntimeError: torch.cat(): expected a non-empty list of Tensors. I did not modify anything in these two JSON files.
@llliuxiao In my tests, deleting the following part was enough to fix it:
Thank you very much! This approach works for zero2! However, zero2 does not seem to perform as well as zero3: with vllm and Qwen2-VL-2B on a 40G A100, I still run out of memory.
Hi! I am training with vllm on 8 A100 GPUs.
After getting the reward information of the first step, this error appears: "[rank0]: AssertionError: Attempted to load weight (torch.Size([0])) into parameter (torch.Size([1280, 3, 2, 14, 14]))".
The error comes from:
And here is part of my shell script:
I was wondering if you've seen anything like that. Thanks!
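For anyone debugging the same assertion, here is a minimal sketch of a pre-check that compares weight shapes before they are handed to vllm. It is not code from this repository: the helper `find_shape_mismatches` and the parameter name are hypothetical, and only the two shapes come from the error message above. It assumes the empty `torch.Size([0])` tensor is a sharded placeholder, as suggested in the comments about zero3.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(source: Dict[str, Shape], target: Dict[str, Shape]) -> List[str]:
    """Return one message per weight whose source shape differs from the target shape."""
    problems = []
    for name, want in target.items():
        got = source.get(name)
        if got is None:
            problems.append(f"{name}: missing from source")
        elif got != want:
            # A zero-sized dimension suggests an empty placeholder rather than real data.
            hint = " (empty placeholder, possibly a sharded parameter)" if 0 in got else ""
            problems.append(f"{name}: got {got}, expected {want}{hint}")
    return problems


# Shapes taken from the error message; the parameter name is hypothetical.
source_shapes = {"visual.patch_embed.proj.weight": (0,)}
target_shapes = {"visual.patch_embed.proj.weight": (1280, 3, 2, 14, 14)}

report = find_shape_mismatches(source_shapes, target_shapes)
for line in report:
    print(line)
```

With PyTorch, the shape dictionaries could be built with something like `{k: tuple(v.shape) for k, v in model.state_dict().items()}` on the training side, and the same on the model that receives the weights.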