This is a PyTorch re-implementation of Vahid Kazemi and Ali Elqursh's paper "Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering", based on Cyanogenoid's code.
- preprocess-images.py: Replaced deprecated usage and rebuilt the custom ResNet152 loader without 'Pycaffe'.
- model.py: Although Cyanogenoid focused on improving the model's performance, I rebuilt model.py to follow the paper.
- train.py: TensorBoard logging is now available, so the separate tracker for loss and accuracy is no longer needed. Gradient clipping is also used to prevent exploding gradients.
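To make the gradient clipping note concrete, here is a minimal pure-Python sketch of clipping by L2 norm. It is not the code in train.py: the function name `clip_gradients`, the `1e-6` epsilon, and the `max_norm=5.0` threshold are assumptions chosen for illustration.

```python
import math


def clip_gradients(grads, max_norm):
    """Scale a flat list of gradient values so their L2 norm is at most max_norm.

    Returns (clipped_grads, original_norm). Hypothetical helper for illustration.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only shrink gradients that are too large; never enlarge small ones.
    scale = max_norm / (total_norm + 1e-6)
    if scale < 1.0:
        grads = [g * scale for g in grads]
    return grads, total_norm


# An "exploding" gradient with L2 norm 50 is rescaled to roughly norm 5.
exploding = [30.0, 40.0]
clipped, norm_before = clip_gradients(exploding, max_norm=5.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

In a PyTorch training loop the same effect is usually obtained by calling `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)` between `loss.backward()` and `optimizer.step()`. The threshold used in this repository is not stated here, so treat the value above as a placeholder.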
- Download the COCO dataset and set the JSON and image file paths in config.py.
- Preprocess the images with the 'python preprocess-images.py' command.
- Preprocess the vocabulary (questions and answers) with the 'python preprocess-vocab.py' command.
- Run training and evaluation with the 'python train.py' command.
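Before running the steps above, it can help to confirm that the paths set in config.py actually exist. The sketch below is one way to do that. The setting names and paths in `paths` are hypothetical placeholders, not the real variable names in config.py, and `missing_paths` is a helper invented for this example.

```python
import os

# Hypothetical settings; the real names and locations in config.py may differ.
paths = {
    "questions_json": "vqa/questions.json",
    "answers_json": "vqa/annotations.json",
    "train_images": "mscoco/train2014",
    "val_images": "mscoco/val2014",
}


def missing_paths(paths):
    """Return the names of settings whose path does not exist on disk."""
    return [name for name, path in paths.items() if not os.path.exists(path)]


# Prints the settings that still need to be fixed before preprocessing.
print(missing_paths(paths))
```

An empty list means every configured path was found, so the preprocessing scripts should at least be able to locate their inputs.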
With 5 epochs (1 epoch is about 2,000 iterations)
Because there is no merit in using my own test set, I picked the sample results below from the evaluation set.
Upper left: questions 1, 2, 3. Lower left: question 4.
- Is the food napping on the table?
- What has been to make lights?
- What is the table made of?
- Is this an Spanish town?
- no
- tea kettle
- wood
- no
- yes
- flowers
- wood
- yes
Upper left: questions 1, 2. Lower left: questions 3, 4.
- What is in the top right corner?
- Are there shadows on the sidewalk?
- What is leaning against the house?
- Is it cold outside?
- tree
- yes
- ladder
- yes
- clock
- yes
- fire hydrant
- yes
Upper left: questions 1, 2. Lower left: questions 3, 4.
- Is there a bicycle in this picture?
- How many windows can you see?
- Is the person feeding the birds?
- Is this in a park?
- yes
- 1
- no
- yes
- yes
- 3
- no
- yes
I trained the model for 24 hours (5 epochs).
- It performs well on 'yes/no' questions.
- However, when the question is open-ended (choosing among 3,000 candidate answers), performance is lower.
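One way to check an observation like this is to break accuracy down by answer type. The sketch below uses plain exact-match accuracy, which is a simplification and not necessarily the metric used in train.py. The function name, the type labels, and the toy records are all made up for illustration and are not results from this model.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Compute exact-match accuracy per answer type.

    records: iterable of (answer_type, predicted, ground_truth) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for answer_type, predicted, truth in records:
        total[answer_type] += 1
        correct[answer_type] += int(predicted == truth)
    return {t: correct[t] / total[t] for t in total}


# Toy records invented for this example; not output from the trained model.
toy = [
    ("yes/no", "yes", "yes"),
    ("yes/no", "no", "no"),
    ("other", "wood", "wood"),
    ("other", "clock", "tree"),
]
print(accuracy_by_type(toy))  # {'yes/no': 1.0, 'other': 0.5}
```

A gap between the two numbers would be expected: a 'yes/no' question has only two plausible answers, while an open-ended one has to be picked from the full candidate set.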