
MELArt. A Multimodal Entity Linking Dataset for Art.

Code for the generation of MELArt, a multimodal entity linking dataset for art.

The code for the baseline experiments and for generating model-specific versions of the dataset can be found here

Prerequisites

Configuration

  1. Create a .env file (you can use .env_sample as a template) and set the Wikimedia API access token and the user agent. A sketch of how the scripts might read these values is shown after this list.
  2. Install the required libraries. The easiest way is to create the provided conda environment from environment.yaml (e.g., conda env create -f environment.yaml).
  3. Install the spaCy English model: python -m spacy download en_core_web_sm
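
A minimal sketch of reading these configuration values with python-dotenv; the variable names are assumptions, so check .env_sample for the names the repository actually expects:

```python
import os

from dotenv import load_dotenv  # provided by the python-dotenv package

load_dotenv()  # reads key=value pairs from .env into the environment

# Hypothetical variable names; the real ones are in .env_sample.
access_token = os.environ["WIKIMEDIA_ACCESS_TOKEN"]
user_agent = os.environ["USER_AGENT"]

# Wikimedia APIs expect an OAuth bearer token and a descriptive
# User-Agent identifying the client.
headers = {
    "Authorization": f"Bearer {access_token}",
    "User-Agent": user_agent,
}
```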

Input files

  1. Download the Artpedia dataset from https://aimagelab.ing.unimore.it/imagelab/page.asp?IdPage=35 and place the artpedia.json file in the input_files/ folder.
  2. Download the Wikidata dump latest-all.json.bz2 from https://dumps.wikimedia.org/wikidatawiki/entities/ (the dump from 2023-03-22 was used to generate MELArt). The file should be placed in, or linked from, input_files/.

You can also avoid using the input_files/ folder by adjusting the paths in the paths.py script.

Dataset generation

Execute the following scripts in order to generate the dataset. Illustrative sketches of the core operation of several steps are shown after the list.

  1. art_merging.py: Matches Artpedia paintings to Wikidata entities using their Wikipedia titles, and extracts painting information from Wikidata.

  2. get_artpedia_depicted.sh: First creates a text file of QIDs to filter the Wikidata dump, and then creates an ndjson file with the information about the depicted entities from the dump.

  3. text_matcher.py: Matches the labels of the depicted entities in the visual and contextual sentences.

  4. get_candidates.py: Gets the candidates for the depicted entities in the visual and contextual sentences using the Wikidata search API. For each candidate, it creates a JSON file with its information in the aux_files/el_candidates folder. It also creates the imgs_url.txt file with the list of entity image URLs.

  5. get_candidate_types.py: Reads each candidate's information, builds the set of all types, and uses the Wikidata API to get the type labels. The type information is stored in the aux_files/candidate_types_dict.json file.

  6. crawl_images.py: Crawls the images from Wikimedia Commons based on the imgs_url.txt file (from get_candidates.py).

  7. filter_candidate_images: Removes the candidate images that correspond to the paintings in MELArt.

  8. combine_curated_annotations: Combines the automatically generated annotations with the manually curated annotations to produce the final dataset in the output_files/melart_annotations.json file.
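
To picture what step 1 does, here is a minimal sketch of resolving a Wikipedia title to a Wikidata QID through Wikipedia's standard pageprops API; it illustrates the approach, not the repository's exact code:

```python
import requests

def title_to_qid(title: str, user_agent: str) -> str | None:
    """Return the Wikidata QID linked to an English Wikipedia title."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "pageprops",
            "ppprop": "wikibase_item",
            "redirects": 1,
            "titles": title,
            "format": "json",
        },
        headers={"User-Agent": user_agent},
        timeout=30,
    )
    resp.raise_for_status()
    for page in resp.json()["query"]["pages"].values():
        return page.get("pageprops", {}).get("wikibase_item")
    return None
```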
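
Step 2 does its filtering with shell tools; the sketch below shows the same idea in Python, assuming the standard Wikidata dump layout (a bz2-compressed JSON array with one entity per line, each line ending in a comma):

```python
import bz2
import json

def filter_dump(dump_path: str, qids: set[str], out_path: str) -> None:
    """Write the dump entities whose QID is in `qids` as ndjson."""
    with bz2.open(dump_path, "rt", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):  # array delimiters, not entities
                continue
            entity = json.loads(line)
            if entity.get("id") in qids:
                dst.write(json.dumps(entity) + "\n")
```

Parsing every line of the full dump is slow; the text-file pre-filter the shell script builds serves the same purpose more cheaply.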
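
Step 3's label matching can be pictured with the spaCy model installed above; the labels and sentence below are illustrative only:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive match

# Example depicted-entity labels; the real ones come from the ndjson file.
labels = ["Virgin Mary", "Christ Child"]
matcher.add("DEPICTED", [nlp.make_doc(label) for label in labels])

doc = nlp("The Virgin Mary holds the Christ Child on her lap.")
for _, start, end in matcher(doc):
    span = doc[start:end]
    print(span.text, span.start_char, span.end_char)
```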
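
Step 4 queries the Wikidata wbsearchentities endpoint; below is a minimal sketch of such a query (the script's exact parameters may differ):

```python
import requests

def search_candidates(mention: str, user_agent: str, limit: int = 10) -> list[dict]:
    """Return Wikidata search hits (QID, label, description) for a mention."""
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": mention,
            "language": "en",
            "limit": limit,
            "format": "json",
        },
        headers={"User-Agent": user_agent},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("search", [])
```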
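
Step 5's label lookup maps to the wbgetentities endpoint, which accepts up to 50 IDs per call; again a sketch rather than the script's exact code:

```python
import requests

def type_labels(qids: list[str], user_agent: str) -> dict[str, str]:
    """Fetch English labels for a batch of at most 50 type QIDs."""
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbgetentities",
            "ids": "|".join(qids[:50]),
            "props": "labels",
            "languages": "en",
            "format": "json",
        },
        headers={"User-Agent": user_agent},
        timeout=30,
    )
    resp.raise_for_status()
    return {
        qid: entity.get("labels", {}).get("en", {}).get("value", "")
        for qid, entity in resp.json().get("entities", {}).items()
    }
```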
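
Step 6 boils down to a download loop over imgs_url.txt; Wikimedia Commons rejects clients without a descriptive User-Agent, so the header from the configuration step matters here. File naming in this sketch is illustrative:

```python
import os
import requests

def crawl_images(url_file: str, out_dir: str, user_agent: str) -> None:
    """Download every image URL listed (one per line) in url_file."""
    os.makedirs(out_dir, exist_ok=True)
    with open(url_file, encoding="utf-8") as f:
        for url in (line.strip() for line in f):
            if not url:
                continue
            resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=60)
            resp.raise_for_status()
            filename = url.rsplit("/", 1)[-1]  # naive naming, for illustration
            with open(os.path.join(out_dir, filename), "wb") as img:
                img.write(resp.content)
```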
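
Step 8 is essentially a merge in which the curated annotations take precedence; the input file names below are hypothetical, and only the output path comes from the script's description:

```python
import json

# Hypothetical input paths; the script's actual inputs may differ.
with open("aux_files/auto_annotations.json", encoding="utf-8") as f:
    automatic = json.load(f)
with open("input_files/curated_annotations.json", encoding="utf-8") as f:
    curated = json.load(f)

merged = {**automatic, **curated}  # curated entries win on key clashes

with open("output_files/melart_annotations.json", "w", encoding="utf-8") as f:
    json.dump(merged, f, indent=2, ensure_ascii=False)
```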
