Multi-Modal GNN for TextVQA


  1. This project provides code to reproduce the results of Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text on the TextVQA dataset
  2. We are grateful to MMF (formerly Pythia), an excellent VQA codebase provided by Facebook, on which our code is developed
  3. We achieved 32.46% accuracy (ensemble) on the test set of TextVQA

Requirements

  1. PyTorch 1.0.1 (post release)
  2. We performed experiments on a Maxwell Titan X GPU, which has 12GB of GPU memory
  3. See requirements.txt for the required Python packages; the install command is shown below

Let's begin by cloning this repository:

$ git clone https://github.com/ricolike/mmgnn-textvqa.git
$ cd mmgnn-textvqa
$ pip install -r requirements.txt

Data Setup

  1. cached data: To boost data-loading speed under a limited memory size (64G) and to speed up calculation, we cached intermediate dataloader results in storage. Download the data (around 54G, around 120G unzipped), and modify line 11 (fast_dir) in the config to the absolute path where you saved them (see the sketch after this list)
  2. other files: Download the other needed files (vocabulary, OCRs, some backbone parameters) here (less than 1G), and make a soft link named data under the repo root pointing to where you saved them
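A minimal sketch of these two steps, assuming the cached data were unzipped to /path/to/cached_data and the other files to /path/to/other_files (both paths are placeholders, replace them with your own):

$ # line 11 of configs/vqa/textvqa/s_mmgnn.yml should then read, e.g.:
$ #   fast_dir: /path/to/cached_data
$ # and the soft link named "data" is created under the repo root:
$ ln -s /path/to/other_files ./data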

Training

  1. Create a new model folder under ensemble, say foo, and then copy our config into it
$ mkdir -p ensemble/foo
$ cp ./configs/vqa/textvqa/s_mmgnn.yml ./ensemble/foo
  2. Start training; parameters will be saved in ensemble/foo
$ python tools/run.py --tasks vqa --datasets textvqa --model s_mmgnn --config ensemble/foo/s_mmgnn.yml -dev cuda:0 --run_type train
  3. The first run of this repo automatically downloads GloVe into pythia/.vector_cache, so please be patient. If training succeeds, you will find s_mmgnnbar_final.pth in the model folder ensemble/foo
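As a quick sanity check that the checkpoint was written correctly, the following one-liner (plain PyTorch, not part of the repo; it assumes the checkpoint is an ordinary torch.save dict) prints its top-level keys:

$ python -c "import torch; c = torch.load('ensemble/foo/s_mmgnnbar_final.pth', map_location='cpu'); print(list(c.keys()) if isinstance(c, dict) else type(c))"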

Inference

  1. If you want to skip the training procedure, a trained model is provided, on which you can directly run inference
  2. Start inference by running the following command. If it succeeds, you will find three new files generated under the model folder; the two ending with _evalai.p are ready to be submitted to EvalAI to check the results (a quick way to inspect them is sketched after the command)
$ python tools/run.py --tasks vqa --datasets textvqa --model s_mmgnn --config ensemble/bar/s_mmgnn.yml --resume_file <path_to_pth> -dev cuda:0 --run_type all_in_one
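If you want to peek at the predictions before submitting, the following one-liner is a rough sketch that assumes the _evalai.p files are ordinary Python pickles (replace <path_to_evalai_p_file> with the actual generated file name):

$ python -c "import pickle; p = pickle.load(open('<path_to_evalai_p_file>', 'rb')); print(type(p), len(p) if hasattr(p, '__len__') else '')"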

Bibtex

@inproceedings{gao2020multi,
  title={Multi-modal graph neural network for joint reasoning on vision and scene text},
  author={Gao, Difei and Li, Ke and Wang, Ruiping and Shan, Shiguang and Chen, Xilin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={12746--12756},
  year={2020}
}

An attention visualization


Question: "What is the name of the bread sold at the place?"
Answer: "Panera"
(the white box is the predicted answer, the green boxes are the OCR tokens "Panera" attends to, and the red boxes are the visual ROIs "Panera" attends to; box line weight indicates attention strength)
