Example dataset entry
{"qid": 1, "image_name": "synpic54610.jpg", "image_organ": "HEAD", "answer": "Yes", "answer_type": "CLOSED", "question_type": "PRES", "question": "Are regions of the brain infarcted?", "phrase_type": "freeform"}
The VQA_RAD Image Folder contains the images in .jpg format, whose names are the qid as mentioned in trainset.json.
A pretrained model, the Google News Vectors, is used for word-to-vector conversion, vectorizing all the questions.
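The vectorization step might look roughly like the sketch below, which turns a question into a fixed-length sequence of word vectors. The function question_to_vectors, the tokenization, the lowercasing, the zero vector for unknown words and the padding length are all assumptions for illustration. word_vectors can be any mapping from word to vector; with gensim, the object returned by KeyedVectors.load_word2vec_format(path, binary=True) supports the same lookups. A toy 3-dimensional table stands in here for the 300-dimensional Google News vectors.

```python
import re


def question_to_vectors(question, word_vectors, dim, max_len):
    """Return max_len vectors of size dim for one question.

    Unknown words and padding positions become zero vectors (an assumption).
    """
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    rows = []
    for token in tokens[:max_len]:  # truncate long questions
        if token in word_vectors:
            rows.append(list(word_vectors[token]))
        else:
            rows.append([0.0] * dim)
    while len(rows) < max_len:  # pad short questions
        rows.append([0.0] * dim)
    return rows


# Toy vectors for illustration only; real vectors come from the pretrained model.
toy_vectors = {"brain": [0.1, 0.2, 0.3], "infarcted": [0.4, 0.5, 0.6]}
matrix = question_to_vectors(
    "Are regions of the brain infarcted?", toy_vectors, dim=3, max_len=8
)
print(len(matrix), len(matrix[0]))
```

Note that the Google News vocabulary is case-sensitive, so whether to lowercase the questions is a design choice worth checking against the actual vocabulary.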
atoi (ASCII to integer) first converts each word into an integer, such as 0 for 'yes', 1 for 'no', and so on, so that the data is ready to fit into the model.
itoa (integer to ASCII) maps each integer back to its respective word.
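A minimal sketch of the two mappings, assuming they are plain dictionaries built in order of first appearance. The helper name build_vocab and the sample answers are illustrative only.

```python
def build_vocab(answers):
    """Build atoi (word -> integer) and itoa (integer -> word)."""
    atoi = {}
    for answer in answers:
        if answer not in atoi:
            # Assign the next unused integer, in order of first appearance.
            atoi[answer] = len(atoi)
    itoa = {index: word for word, index in atoi.items()}
    return atoi, itoa


# Illustrative answers, not taken from the dataset.
answers = ["yes", "no", "yes", "axial", "no"]
atoi, itoa = build_vocab(answers)

encoded = [atoi[a] for a in answers]  # words -> integers for training
decoded = [itoa[i] for i in encoded]  # integers -> words for predictions

print(atoi)     # {'yes': 0, 'no': 1, 'axial': 2}
print(encoded)  # [0, 1, 0, 2, 1]
```

At prediction time, itoa applied to the index of the largest softmax output gives the answer word.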
VGG16 pretrained on the ImageNet dataset is used for image preprocessing.
The VGG16 Architecture
We use a Dense network with tanh activation on the preprocessed images.
The question input is passed through an LSTM.
After that, both vectors are concatenated and passed through dense layers, ending in a final layer with a softmax function.
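The two-branch design described above can be sketched with the Keras functional API as follows. This is an illustration under assumptions, not the project's exact model: the function name build_vqa_model, the layer sizes, the 4096-dimensional image feature input (the size of VGG16's fully connected layers), the activation of the hidden dense layer, the optimizer and the loss are all chosen for the example.

```python
def build_vqa_model(image_dim=4096, seq_len=20, embed_dim=300,
                    hidden=256, num_answers=10):
    """Dense(tanh) image branch + LSTM question branch, merged into a softmax."""
    # Imported inside the function so TensorFlow is only needed when building.
    from tensorflow.keras import layers, Model

    # Image branch: precomputed VGG16 features -> Dense with tanh.
    image_in = layers.Input(shape=(image_dim,), name="image_features")
    image_vec = layers.Dense(hidden, activation="tanh")(image_in)

    # Question branch: sequence of word vectors -> LSTM (last hidden state).
    question_in = layers.Input(shape=(seq_len, embed_dim), name="question_vectors")
    question_vec = layers.LSTM(hidden)(question_in)

    # Merge both vectors, then dense layers ending in a softmax over answers.
    merged = layers.Concatenate()([image_vec, question_vec])
    x = layers.Dense(hidden, activation="tanh")(merged)
    out = layers.Dense(num_answers, activation="softmax")(x)

    model = Model(inputs=[image_in, question_in], outputs=out)
    # Integer answer labels (as produced by atoi) fit the sparse loss.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Calling build_vqa_model().summary() prints the layer stack; num_answers would be set to the number of distinct answers in the atoi mapping.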
The built model looks something like this: