line-chart-captioning

This repository contains code for building a machine learning model that generates natural-language descriptions of line graphs.

Dataset

This project uses two pre-existing datasets:

FigureQA (download link)
Chart-to-Text (github link)

The FigureQA dataset is used to generate a synthetic dataset, while Chart-to-Text is used to generate a natural-language dataset.

Generating synthetic dataset

To generate the synthetic dataset, place the downloaded folders in data/figureqa (see the README there) and run:

python3 src/synthetic/preprocess.py data/figureqa/X

Flags you can provide:

--unroll-descriptions: By default, if a plot has more than one description, the descriptions are concatenated. This flag unrolls them instead, creating n rows in captions.csv for a graph that has n > 1 descriptions.
--replace-subjects: Replaces subjects in descriptions with placeholders. For example, "Red is greater than Blue" becomes "<A> is greater than <B>". This flag also adds a subject_map column to captions.csv: for every plot, a JSON string that maps each replacement to its original subject.
--description-limit N: Limits the description length to N sentences. Cannot be combined with --unroll-descriptions.
--synthetic-config PATH_TO_FILE: Provides a config file with custom question templates and the desired question types. An example file is provided (synthetic.default.json). For the expected template forms, see question_to_description in src/synthetic/preprocess.py. For question IDs, see the keys of question_type_to_id.
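
To make the effect of --replace-subjects and --unroll-descriptions concrete, here is a minimal sketch of the behaviour described above. It is not the code in src/synthetic/preprocess.py: the function names replace_subjects and to_rows, the plot ID, and the use of a single space to join concatenated descriptions are all assumptions made for illustration.

```python
import re
import string


def replace_subjects(description, subjects):
    """Swap each subject for a placeholder (<A>, <B>, ...).

    Hypothetical helper: returns the rewritten description and a map from
    placeholder to original subject, in the spirit of the subject_map column.
    """
    placeholders = {
        subject: f"<{letter}>"
        for letter, subject in zip(string.ascii_uppercase, subjects)
    }
    # Match longer names first so "Dark Red" is not split by "Red".
    names = sorted(placeholders, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    replaced = pattern.sub(lambda m: placeholders[m.group(0)], description)
    subject_map = {ph: subject for subject, ph in placeholders.items()}
    return replaced, subject_map


def to_rows(plot_id, descriptions, unroll=False):
    """Build (plot_id, text) rows for one plot.

    Without unrolling, all descriptions end up in one row (joined with a
    space here, which is an assumption). With unrolling, n descriptions
    give n rows.
    """
    if unroll:
        return [(plot_id, d) for d in descriptions]
    return [(plot_id, " ".join(descriptions))]


text, subject_map = replace_subjects("Red is greater than Blue", ["Red", "Blue"])
default_rows = to_rows(0, ["<A> is greater than <B>.", "<B> is the minimum."])
unrolled_rows = to_rows(0, ["<A> is greater than <B>.", "<B> is the minimum."], unroll=True)
```

With these inputs, text is the placeholder form of the sentence, default_rows holds one row, and unrolled_rows holds two.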

The dataset is written to data/processed_synthetic/X.
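
If the dataset was generated with --replace-subjects, the subject_map column can be used to recover the original wording of a caption. The sketch below shows one way to do that, under stated assumptions: the caption column name "caption" is hypothetical (check the header of the generated captions.csv), and an in-memory sample stands in for the real file.

```python
import csv
import io
import json

# Stand-in for captions.csv; the "caption" column name is an assumption.
SAMPLE = '''caption,subject_map
<A> is greater than <B>,"{""<A>"": ""Red"", ""<B>"": ""Blue""}"
'''


def restore_subjects(caption, subject_map_json):
    """Put the original subjects back using the JSON map for this row."""
    subject_map = json.loads(subject_map_json)
    for placeholder, original in subject_map.items():
        caption = caption.replace(placeholder, original)
    return caption


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
restored = [restore_subjects(row["caption"], row["subject_map"]) for row in rows]
```

To use it on real output, replace io.StringIO(SAMPLE) with an open file handle for the generated captions.csv.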

Generating synthetic "question-types" dataset

To generate the synthetic "question-types" dataset, place the downloaded folders in data/figureqa (see the README there) and run:

python3 src/synthetic/preprocess-question-types.py data/figureqa/X

The subject replacement (--replace-subjects), unrolling (--unroll) and synthetic config (--synthetic-config) flags can be provided, just as when generating the normal synthetic dataset.

Generating natural-language dataset

To generate the natural-language dataset, place the downloaded folders in data/charttotext (see the README there) and run:

python3 src/natural/preprocess.py

The dataset is written to data/processed_natural.
