I'm encountering a RuntimeError during training, caused by a tensor size mismatch. Below is the traceback for the error:
- PyTorch version: 2.1.0+cu118
- CUDA version: 11.8
- Python version: 3.11.10
- OS: Ubuntu 18.04
```
Traceback (most recent call last):
  File "/media/path/SmartEdit-main/train/DS_MLLMSD11_train.py", line 712, in <module>
    train()
  File "/media/path/SmartEdit-main/train/DS_MLLMSD11_train.py", line 501, in train
    model_.load_pretrain_MLLM_alignment(SD_QFormer_conversation_33tokens=SD_QFormer_conversation_33tokens, LLaVA_00002=LLaVA_00002)
  File "/media/path/SmartEdit-main/model/DS_MLLMSD11_model.py", line 221, in load_pretrain_MLLM_alignment
    self.lm_head.weight.data[-self.config.num_new_tokens:] = LLaMA_lm_haed
    ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The expanded size of the tensor (35) must match the existing size (33) at non-singleton dimension 0. Target sizes: [35, 4096]. Tensor sizes: [33, 4096]
[2024-11-01 11:52:23,929] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 149061
```
The error occurs when assigning the `LLaMA_lm_haed` tensor to `self.lm_head.weight.data[-self.config.num_new_tokens:]`. `num_new_tokens` is set to 35, but `LLaMA_lm_haed` only has 33 rows, so the assignment fails with the dimension mismatch shown above. The two sizes have to match for the copy to succeed.
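To check where the 33 comes from, here is a minimal sketch for inspecting the stage-1 checkpoint that gets loaded (the path is the one from my config below; I don't know the exact key names inside the `.bin` file, so I just print the shape of every tensor it contains):

```python
import torch

# Path taken from the --SD_QFormer_conversation_33tokens argument below.
ckpt_path = "./checkpoints/stage1_CC12M_alignment_7b/embeddings_qformer/checkpoint-50000.bin"

# Load on CPU and print every tensor's shape, so the number of new-token rows
# stored in the checkpoint (33 here) can be compared with config.num_new_tokens (35).
state = torch.load(ckpt_path, map_location="cpu")
for name, value in state.items():
    if torch.is_tensor(value):
        print(name, tuple(value.shape))
```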
I run the training script with the following configuration (notes on my local changes are inlined as comments):
```bash
bash scripts/MLLMSD_7b.sh

# wandb disabled
export WANDB_DISABLED=true

# checkpoint-150000_embeddings_qformer.bin -> checkpoint-50000.bin

deepspeed --include localhost:0 --master_addr 127.0.0.1 --master_port 28457 train/DS_MLLMSD11_train.py \
    --max_steps 5000 \
    --model_name_or_path ./checkpoints/vicuna-7b-v1-1 \
    --LLaVA_00001 "./checkpoints/LLaVA-7B-v1/pytorch_model-00001-of-00002.bin" \
    --LLaVA_00002 "./checkpoints/LLaVA-7B-v1/pytorch_model-00002-of-00002.bin" \
    --LLaVA_model_path "./checkpoints/LLaVA-7B-v1" \
    --sd_qformer_version "v1.1-7b" \
    --unet_ckpt "./checkpoints/InstructDiffusion_diffusers/unet/diffusion_pytorch_model.bin" \
    --bf16 True \
    --tf32 True \
    --output_dir ./checkpoints/stage2_MLLMSD_7b \
    --num_train_epochs 20 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy 'no' \
    --save_strategy 'steps' \
    --save_steps 5000 \
    --save_total_limit 3 \
    --learning_rate 1e-5 \
    --lr_scheduler_type 'cosine' \
    --weight_decay 0. \
    --warmup_ratio 0.001 \
    --logging_steps 1 \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 16 \
    --ddp_find_unused_parameters True \
    --SD_QFormer_conversation_33tokens "./checkpoints/stage1_CC12M_alignment_7b/embeddings_qformer/checkpoint-50000.bin" \
    --InstructPix2PixDataset_path "./dataset/InstructPix2PixCLIPFiltered_HF" \
    --MagicBrushDataset_path "./dataset/MagicBrush_HF" \
    --LLaVADataset_data_path "./dataset/LLaVA/llava_instruct_150k.json" \
    --LLaVADataset_image_folder "./dataset/coco/train2017" \
    --refcoco_path "./dataset/refcoco" \
    --grefcoco_path "./dataset/grefcoco" \
    --coco_image_path "./dataset/coco" \
    --COCOStuff_mask_path "./dataset/cocostuff" \
    --ReasoningEditingDataset_path "./dataset/SyntheticData/SyntheticData_info_new.json" \
    --ReasoningSegmentationDataset_json_path "./dataset/reason_seg/train" \
    --ReasoningSegmentationDataset_image_path "./dataset/reason_seg/train" \
    --ReasoningSegmentationDataset_binary_mask_path "./dataset/reason_seg/train_binary_mask" \
    --deepspeed scripts/zero2_mixed.json
```
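For what it's worth, the mismatch is easy to reproduce in isolation. The sketch below uses the shapes from the traceback (the base vocabulary size of 100 is a made-up toy value). It also shows that copying only the rows that exist avoids the crash, but leaves two new-token rows uninitialized, so I'm not sure it is a valid workaround:

```python
import torch

# Toy reproduction of the failing assignment, using the sizes from the traceback.
hidden = 4096                               # hidden size from the error message
num_new_tokens = 35                         # what self.config.num_new_tokens expects
lm_head = torch.zeros(100 + num_new_tokens, hidden)  # toy lm_head; base vocab size is arbitrary
ckpt_rows = torch.zeros(33, hidden)         # what the stage-1 checkpoint provides

try:
    lm_head[-num_new_tokens:] = ckpt_rows   # RuntimeError: expanded size (35) vs existing size (33)
except RuntimeError as err:
    print(err)

# Copying only the rows that actually exist avoids the crash, but leaves the
# last two new-token rows untouched. Whether that is semantically correct
# depends on how SmartEdit orders its new tokens, so this is only a sketch.
lm_head[-ckpt_rows.shape[0]:] = ckpt_rows
```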
How can I solve this problem?