Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb loss is all NaN #454
Comments
@younesbelkada you're the author of that sample notebook and the keeper of the football dataset on Hugging Face. Any idea what might be causing the loss to go to NaN?
@z3ugma I have the same problem. I started getting NaN loss in the 2nd batch of epoch 0. Have you solved it?
I found an interesting thing. On Google Colab, the loss does not go to NaN. It seems there are still differences between Colab and a local notebook environment.
I also ran into this issue recently when fine-tuning BLIP2, whereas it was working before. I haven't had a chance to pin it down, but it might be a package version issue, with some package introducing a breaking change.
Rolling back to peft=0.5.0 got the BLIP2 example working for me.
@jeffliu-LL which PyTorch version are you using?
PyTorch 2.0.1 with pytorch-cuda 11.8
I will try rolling back to peft 0.5 with CUDA 12.2 and Python 3.11. Will report back.
No, still a problem:
Still shows all
@jeffliu-LL could you post the versions of Python, PyTorch, Transformers, and CUDA from your working environment?
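For anyone comparing a working Colab setup against a failing local one, a small script can collect the versions asked for above. This is only a sketch: the default package list (torch, transformers, peft) is an assumption based on the packages named in this thread, and it reads installed package metadata rather than importing anything.

```python
import platform
from importlib import metadata


def report_versions(packages=("torch", "transformers", "peft")):
    """Return a dict mapping 'python' and each package name to its installed version."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not present in this environment.
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version}")
```

Running it in both environments and diffing the output should show which packages differ. It does not report the CUDA version; if PyTorch is installed, torch.version.cuda gives the CUDA version that build was compiled against.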
Here are the packages from the working Google Colab environment:
Still not working on Python 3.10
Seems like we've hit the same problem :( OS: Windows 10, Torch 2.1.2+cu118
@z3ugma I ran into a similar issue. The loss changes to NaN after epoch 0. Have you fixed it?
@wushandinghua no, I've not yet had success.
Any solutions? I have the same problem.
I changed the type from torch.float16 to torch.float32. This works for me. Hope this will also work for you.
I changed this: pixel_values = batch.pop("pixel_values").to(device, torch.float32), but I still have the same problem.
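One possible reason a float16 to float32 switch can matter is float16's narrow numeric range. Nothing in this thread confirms that overflow is the actual cause here, so treat the following as an illustration of the general effect only. It uses the standard library's half-precision struct format to round-trip plain Python floats; to_float16 is a hypothetical helper, not part of the notebook.

```python
import struct


def to_float16(x):
    """Round-trip a Python float through IEEE 754 half precision (struct format 'e')."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses finite values beyond the float16 range;
        # float16 arithmetic on hardware would give +/-inf instead.
        return float("inf") if x > 0 else float("-inf")


print(to_float16(65504.0))  # the largest finite float16 value survives
print(to_float16(1e5))      # beyond the float16 range: becomes inf
print(to_float16(1e-8))     # below the smallest float16 subnormal: becomes 0.0

overflowed = to_float16(1e5)
print(overflowed - overflowed)  # inf - inf yields nan
```

Once a single inf or nan enters the computation, it propagates through later arithmetic, which is consistent with a loss that stays NaN after it first appears.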
Can you send me a copy of the code you run locally? I also want to try it on my own computer instead of using Google Colab.
Sorry for the late reply.
I tested with the same code as https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb on my computer.
Okay, I will set it up in PyCharm on my own machine to experiment.
peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb
This notebook:
https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb
Trains fine on Google Colab at https://colab.research.google.com/drive/16XbIysCzgpAld7Kd9-xz-23VPWmqdWmW?usp=sharing#scrollTo=upI97XEH6EKe
using Python 3.10.12, Torch 2.1.0
It does not train on my workstation - the loss collapses to NaN after just a few epochs:
My workstation is also using Python 3.10.13, and Torch 2.1.0. What could be causing the loss to be all nan?
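To narrow down where the loss first breaks, it can help to record the loss at every step and find the first non-finite value. A minimal sketch, assuming the per-step losses are collected as plain Python floats (for example via loss.item() in the training loop); first_nonfinite is a hypothetical helper and the history below is made-up data, not output from this issue.

```python
import math


def first_nonfinite(losses):
    """Return the index of the first NaN or inf loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Made-up loss history for illustration only.
history = [2.31, 1.87, float("nan"), float("nan")]
print(first_nonfinite(history))
```

Stopping at that step and inspecting the batch and the model outputs is usually easier than debugging after every value has already turned into NaN.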