Note: Our pre-training dataset is inherited from the MDETR paper; you can also refer to their pre-training instructions to prepare the dataset and potentially train the model on other datasets.
- Download the original Flickr30k image dataset from the Flickr30K webpage and update `flickr_img_path` to point to the folder containing the images.
- Download the original Flickr30k Entities annotations from the Flickr30k annotations page and update `flickr_dataset_path` to point to the folder with the annotations.
- Download MDETR's pre-processed annotations from this link and update `flickr_ann_path` to point to the folder with the pre-processed annotations (a sketch of one possible layout follows this list).
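For reference, here is a minimal sketch of how the three paths might map onto a local directory layout. The directory names are hypothetical, and the exact config file or flag through which each path is set depends on this repo's setup:

```bash
# Hypothetical layout only; adjust to wherever you placed each download.
flickr_img_path=/data/flickr30k/flickr30k-images        # original Flickr30k images
flickr_dataset_path=/data/flickr30k/flickr30k_entities  # Flickr30k Entities annotations
flickr_ann_path=/data/flickr30k/mdetr_annotations       # MDETR pre-processed annotations
```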
We trained our model with 8 Nvidia A40 GPUs, each with 48GB of VRAM. Pre-training took 3 days to finish 150k steps.
The pre-training process is logged with WandB.
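If you have not used WandB before, you will need to authenticate once before launching, or you can run offline. Both commands below are standard WandB usage, not specific to this repo:

```bash
# One-time authentication with your WandB API key (standard wandb CLI).
wandb login

# Or disable online syncing entirely; runs are then stored locally.
export WANDB_MODE=offline
```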
The `pretrain_flickr.sh` script contains the exact command we used to pre-train our model:
```bash
bash scripts/pretrain/pretrain_w2bert.sh
```
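For orientation, such a script typically wraps a standard multi-GPU launcher along the lines of the sketch below. This is hypothetical: `main.py` and the flags shown are illustrative placeholders, so consult the actual script for the real entry point and arguments.

```bash
# Illustrative 8-GPU launch (torchrun is PyTorch's standard launcher).
# `main.py` and the flags below are hypothetical placeholders, not this
# repo's actual interface; see scripts/pretrain/pretrain_w2bert.sh.
torchrun --nproc_per_node=8 main.py \
    --dataset_config configs/pretrain.json \
    --output_dir runs/pretrain_flickr
```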
Note:
- For analysis purposes, we also saved a series of checkpoints during the pre-training process. If you don't want to save these checkpoints, you can remove the `--save_for_aoa` flag from the script.
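To locate the flag before editing, something like the following should work, assuming the script path from the command above:

```bash
# Find where --save_for_aoa is passed in the pre-training script.
grep -n "save_for_aoa" scripts/pretrain/pretrain_w2bert.sh
```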