Mismatch input type and weight type when training with precision fp16 #260
Comments
Thanks for bringing this up! I will take a closer look later today. I do want to point out that we haven't gotten good performance with pure fp16 training. It may work better to stay in fp32 and use FSDP to shard the model state across your GPUs rather than reducing the precision. |
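To make the trade-off concrete, here is a rough back-of-the-envelope sketch of why sharding helps in fp32. It assumes every parameter is trainable, an Adam-style optimizer with two state tensors per parameter, and a round figure of 1B parameters. It ignores activations, frozen modules, and framework overhead, so treat the numbers as an estimate and not as a measurement of this project. The function name `training_state_gib` is made up for illustration.

```python
def training_state_gib(n_params, n_gpus=1, bytes_per_param=4, optimizer_states=2):
    """Rough per-GPU memory (GiB) for weights, gradients and optimizer state.

    Assumes the full training state is sharded evenly across n_gpus,
    which is the idealized picture of full sharding.
    """
    # One copy for weights, one for gradients, plus the optimizer state tensors.
    copies = 1 + 1 + optimizer_states
    total_bytes = n_params * bytes_per_param * copies
    return total_bytes / n_gpus / 2**30


N_PARAMS = 1_000_000_000  # assumed round figure for a 1B-parameter model

for gpus in (1, 4, 8):
    print(f"{gpus} GPU(s): {training_state_gib(N_PARAMS, gpus):.1f} GiB per GPU")
# 1 GPU(s): 14.9 GiB per GPU
# 4 GPU(s): 3.7 GiB per GPU
# 8 GPU(s): 1.9 GiB per GPU
```

Under these assumptions the unsharded fp32 training state alone is close to the capacity of a 16 GB card, while sharding across several GPUs leaves room for activations.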
Thanks for clarifying. FSDP would be ideal. Still, I have problems training with FSDP. Namely, I am using MPT-1B and it does not have the |
Got it. There is this version of mpt I use for testing if you want to give fsdp a shot before the new refactor is merged. |
Great, thanks for bringing this up. I will give it a try on this model with FSDP. |
I tried FSDP with "mpt-1b-redpajama-200b-hf-style" and it got past the above error. However, I now get a different error in which the shape of the input embeddings (self.transformer.wte.weight) has been altered. I believe it should be a 2-D tensor of shape (:, 2048) rather than a 1-D tensor of shape (25743360,), and this causes a size mismatch when computing the logits. More details below:
|
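A quick arithmetic check on the shapes reported above. This is only a sketch: it uses the two numbers quoted in the comment and nothing else. It shows that the 1-D size is an exact multiple of the hidden dimension, which is consistent with the tensor being a flattened view or shard of the embedding matrix and not corrupted data. Whether FSDP's parameter flattening is actually the cause is not confirmed in this thread. The helper `unflatten` is made up for illustration.

```python
FLAT_NUMEL = 25_743_360  # 1-D size reported in the comment above
HIDDEN_DIM = 2048        # expected second dimension of wte.weight

# If the flat tensor is a flattened (rows, 2048) block, the division is exact.
rows, remainder = divmod(FLAT_NUMEL, HIDDEN_DIM)
print(rows, remainder)  # 12570 0


def unflatten(flat, n_cols):
    """Reshape a flat list into rows of n_cols items (pure-Python stand-in for .view)."""
    if len(flat) % n_cols != 0:
        raise ValueError("flat length is not a multiple of n_cols")
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]


# Tiny demonstration of the same idea on a 6-element vector.
print(unflatten([0, 1, 2, 3, 4, 5], 3))  # [[0, 1, 2], [3, 4, 5]]
```

Since the remainder is zero, the matrix multiply that computes the logits is most likely seeing the embedding weights in flattened form instead of their original 2-D layout.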
Did you resolve this? I get a very similar error while trying to use FSDP with OpenFlamingo 9B:
|
related: #129 (comment) |
Hi, thanks for making this project public.
I am trying to run training with fp16 and get the following error:
With fp32 this error does not occur, but training then fails with an out-of-memory (OOM) error.
Traceback for error when using fp16:
Environment
I am using python 3.9.17 with V100 GPUs.