This repository contains code for building a machine learning model that generates natural-language descriptions of line graphs.
This project uses two pre-existing datasets, FigureQA and Chart-to-Text.
The FigureQA dataset is used to generate a synthetic dataset, while Chart-to-Text is used to generate a natural-language dataset.
To generate the synthetic dataset, place the downloaded folders in data/figureqa
(check the README there), and run
python3 src/synthetic/preprocess.py data/figureqa/X
Flags you can provide:
--unroll-descriptions: By default, if a graph has more than one description, the descriptions are concatenated. This flag unrolls them, creating n rows in captions.csv if there are n > 1 descriptions for a given graph.
--replace-subjects: Replaces subjects in descriptions. For example, "Red is greater than Blue" becomes "<A> is greater than <B>". This flag also adds a subject_map column to captions.csv, so for every plot there is a JSON blob string that maps replacements to original subjects.
--description-limit N: Limits description length in sentences. Cannot be used together with --unroll-descriptions.
--synthetic-config PATH_TO_FILE: Provides a config file for custom question templates and desired question types. An example file is provided (synthetic.default.json). For correct forms, check question_to_description in src/synthetic/preprocess.py. For question IDs, check the keys in question_type_to_id.
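As a rough illustration of what --replace-subjects describes, the sketch below replaces subject names with lettered placeholders and builds a JSON string mapping each placeholder back to its original subject. It is not the code in src/synthetic/preprocess.py: the function name replace_subjects, the use of the bracketed placeholder as the JSON key, and the matching strategy are all assumptions made for the example.

```python
import json
import re
import string


def replace_subjects(description, subjects):
    """Hypothetical helper: swap subject names for <A>, <B>, ... placeholders.

    Returns the rewritten description and a JSON string that maps each
    placeholder back to the original subject (the assumed subject_map format).
    """
    if not subjects:
        return description, json.dumps({})

    # Assign placeholders in the order the subjects are given.
    # zip() stops at 26 subjects, which is enough for a sketch.
    placeholders = {
        subject: f"<{letter}>"
        for subject, letter in zip(subjects, string.ascii_uppercase)
    }

    # Match longer names first so that "Dark Red" wins over "Red".
    # A single regex pass avoids re-replacing text inside a placeholder.
    names = sorted(placeholders, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    replaced = pattern.sub(lambda m: placeholders[m.group(0)], description)

    subject_map = {ph: subject for subject, ph in placeholders.items()}
    return replaced, json.dumps(subject_map)


text, subject_map = replace_subjects("Red is greater than Blue", ["Red", "Blue"])
print(text)         # <A> is greater than <B>
print(subject_map)  # {"<A>": "Red", "<B>": "Blue"}
```

The sketch matches plain substrings, so a subject such as "Red" would also match inside a longer word; a real implementation may need stricter matching.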
The dataset will be placed in data/processed_synthetic/X.
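To make the effect of --unroll-descriptions on captions.csv more concrete, here is a minimal sketch of the two row layouts: one concatenated row per graph by default, or one row per description when unrolled. The column names id and description, the helper names, and the space used to join descriptions are assumptions for illustration; the actual columns written by the preprocessing script may differ.

```python
import csv
import io


def build_rows(graph_id, descriptions, unroll=False):
    """Hypothetical helper: build captions.csv rows for a single graph."""
    if unroll:
        # One row per description: n descriptions give n rows.
        return [{"id": graph_id, "description": d} for d in descriptions]
    # Default behaviour: all descriptions concatenated into a single row.
    return [{"id": graph_id, "description": " ".join(descriptions)}]


def to_csv(rows):
    """Serialise rows to CSV text, using the assumed column names."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["id", "description"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


descriptions = ["Red is greater than Blue.", "Blue is the minimum."]

print(to_csv(build_rows(0, descriptions)))               # header + 1 data row
print(to_csv(build_rows(0, descriptions, unroll=True)))  # header + 2 data rows
```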
To generate the synthetic "question types" dataset, place the downloaded folders in data/figureqa
(check the README there), and run
python3 src/synthetic/preprocess-question-types.py data/figureqa/X
Subject replacement (--replace-subjects), unrolling (--unroll) and the synthetic config flag (--synthetic-config) can be provided, as when generating the normal synthetic dataset.
To generate the natural-language dataset, place the downloaded folders in data/charttotext
(check the README there), and run
python3 src/natural/preprocess.py
The dataset will be placed in data/processed_natural.