This repo will soon be adapted to the VQG (Visual Question Generation) task.
Install the COCO Python API for data preparation.
Given the VQA dataset's annotations and questions files, `prepare_data.py` generates a dataset file (.txt) in the following format (a parsing sketch follows the list below):
image_name \t question \t answer
- image_name is the image file name from the COCO dataset
- question is a comma-separated sequence of tokens
- answer is a string (label)
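For illustration, such a file can be parsed as follows (a minimal sketch; `parse_helper_file` is a hypothetical name, not part of the repo):

```python
def parse_helper_file(path):
    """Parse 'image_name \t question \t answer' lines (hypothetical helper)."""
    samples = []
    with open(path) as f:
        for line in f:
            image_name, question, answer = line.rstrip("\n").split("\t")
            tokens = question.split(",")  # question is stored comma-separated
            samples.append((image_name, tokens, answer))
    return samples

# e.g. samples = parse_helper_file("./Data/processed/helper_val2014.txt")
```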
Sample Execution:
$ python3 prepare_data.py --balanced_real_images -s val \
-a ./Data/raw/v2_mscoco_val2014_annotations.json \
-q ./Data/raw/v2_OpenEnded_mscoco_val2014_questions.json \
-o ./Data/processed/helper_val2014.txt \
-v ./Data/processed/vocab_count_5_K_1000.pickle -c 5 -K 1000 # vocab flags (for training set)
Stores the dataset file at the output path `-o` and the corresponding vocab file at `-v`.
For validation/test sets, remove the vocabulary flags: `-v`, `-c`, `-K`.
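The exact contents of the vocab pickle are repo-specific; assuming it is a plain pickled object (e.g. word-to-index mappings), it can be inspected like this:

```python
import pickle

# Sketch: load and inspect the generated vocab file. Its internal layout
# (e.g. word-to-index dicts) is an assumption, so inspect it before
# relying on any particular structure.
with open("./Data/processed/vocab_count_5_K_1000.pickle", "rb") as f:
    vocab = pickle.load(f)
print(type(vocab))
```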
The baseline architecture can be summarized as:
Image --> CNN_encoder --> image_embedding
Question --> LSTM_encoder --> question_embedding
(image_embedding * question_embedding) --> MLP_Classifier --> answer_logit
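A minimal PyTorch sketch of this pipeline (the backbone choice, dimensions, and layer sizes are illustrative assumptions, not the repo's exact hyperparameters):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class BaselineVQA(nn.Module):
    """Sketch of the baseline: CNN + LSTM encoders fused by element-wise product."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1024, num_cls=1000):
        super().__init__()
        # CNN_encoder: backbone with final layer mapped to hidden_dim
        cnn = models.resnet18(weights=None)  # load pretrained weights in practice
        cnn.fc = nn.Linear(cnn.fc.in_features, hidden_dim)
        self.cnn_encoder = cnn
        # LSTM_encoder over word embeddings
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # MLP_Classifier on the fused embedding
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_cls),
        )

    def forward(self, image, question):
        img_emb = self.cnn_encoder(image)               # [B, hidden_dim]
        _, (h, _) = self.lstm(self.word_emb(question))  # h: [1, B, hidden_dim]
        ques_emb = h[-1]                                # [B, hidden_dim]
        fused = img_emb * ques_emb                      # element-wise product
        return self.classifier(fused)                   # answer logits
```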
The hierarchical co-attention architecture can be summarized as:
Image --> CNN_encoder --> image_embedding
Question --> Word_Emb --> Phrase_Conv_MaxPool --> Sentence_LSTM --> question_embedding
ParallelCoAttention( image_embedding, question_embedding ) --> MLP_Classifier --> answer_logit
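The parallel co-attention step follows Lu et al. [2]: an affinity matrix between question and image features conditions attention over each modality. A sketch under assumed batch-first shapes (V: image features [B, N, d], Q: question features [B, T, d]; dimensions are illustrative):

```python
import torch
import torch.nn as nn

class ParallelCoAttention(nn.Module):
    """Sketch of parallel co-attention (Lu et al., 2016)."""
    def __init__(self, d, k):
        super().__init__()
        self.W_b = nn.Parameter(torch.randn(d, d) * 0.01)  # affinity weights
        self.W_v = nn.Linear(d, k, bias=False)
        self.W_q = nn.Linear(d, k, bias=False)
        self.w_hv = nn.Linear(k, 1, bias=False)
        self.w_hq = nn.Linear(k, 1, bias=False)

    def forward(self, V, Q):
        # Affinity matrix C = tanh(Q W_b V^T): [B, T, N]
        C = torch.tanh(Q @ self.W_b @ V.transpose(1, 2))
        # Attention maps, each conditioned on the other modality
        H_v = torch.tanh(self.W_v(V) + C.transpose(1, 2) @ self.W_q(Q))  # [B, N, k]
        H_q = torch.tanh(self.W_q(Q) + C @ self.W_v(V))                  # [B, T, k]
        a_v = torch.softmax(self.w_hv(H_v), dim=1)  # [B, N, 1]
        a_q = torch.softmax(self.w_hq(H_q), dim=1)  # [B, T, 1]
        # Attention-weighted image and question embeddings
        v_hat = (a_v * V).sum(dim=1)                # [B, d]
        q_hat = (a_q * Q).sum(dim=1)                # [B, d]
        return v_hat, q_hat
```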
Run the following script for training:
$ python3 main.py --mode train --expt_name K_1000_Attn --expt_dir ./results_log \
--train_img ./Data/raw/train2014 --train_file ./Data/processed/vqa_train2014.txt \
--val_img ./Data/raw/val2014 --val_file ./Data/processed/vqa_val2014.txt \
--vocab_file ./Data/processed/vocab_count_5_K_1000.pickle --save_interval 1000 \
--log_interval 100 --gpu_id 0 --num_epochs 50 --batch_size 160 -K 1000 -lr 1e-4 --opt_lvl 1 --num_workers 6 \
--run_name O1_wrk_6_bs_160 --model attention
Specify `--model_ckpt` (filename.pth) to load a model checkpoint from disk (to resume training or run inference).
Select the architecture with `--model` ('baseline', 'attention').
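For example, to resume the attention model from a saved checkpoint (the checkpoint filename here is illustrative):

$ python3 main.py --mode train --expt_name K_1000_Attn --expt_dir ./results_log \
  --model attention --model_ckpt model_step_10000.pth  # plus the data/vocab flags shown above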
Note: setting num_cls (K) = 2 is equivalent to the 'yes/no' setup; for K > 2, the answer set is open-ended.
- Baseline & HieCoAttn
- VQA w/ BERT
- Attention Visualization
[1] VQA: Visual Question Answering
[2] Hierarchical Question-Image Co-Attention for Visual Question Answering