A Python tool for building large-scale Wikipedia-based Information Retrieval datasets
Currently supported languages: English, French, Spanish, and Italian
- Requirements
- Installation
- Usage
- Details
- Example
- Reproducibility
- Downloads
- More languages
- Citation
- References
- Python 3.6+
- NumPy and SciPy
- pytrec_eval to evaluate the runs
- nltk library to perform stemming and stopword removal in several languages (a preprocessing sketch follows this list)
- Pandas library to save the dataset as dataframes compatible with MatchZoo
- Optional: Rank-BM25 and MatchZoo (see the installation instructions below)
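As a rough illustration of what the nltk dependency is used for, here is a minimal stemming and stopword-removal sketch. It is an assumption about the kind of preprocessing performed, not necessarily wikIR's exact pipeline.

```python
# A rough illustration of the preprocessing nltk is used for
# (stemming + stopword removal); not necessarily wikIR's exact pipeline.
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

nltk.download('stopwords', quiet=True)  # needed once

def preprocess(text, language='english'):
    # SnowballStemmer and the nltk stopword lists cover english, french,
    # spanish and italian, i.e. the supported languages.
    stemmer = SnowballStemmer(language)
    stops = set(stopwords.words(language))
    return [stemmer.stem(tok) for tok in text.lower().split() if tok not in stops]

print(preprocess("Autism is a developmental disorder characterized by difficulties"))
```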
Install wikIR
git clone --recurse-submodules https://github.com/getalp/wikIR.git
cd wikIR
pip install -r requirements.txt
Install Rank-BM25 (optional)
pip install git+ssh://[email protected]/dorianbrown/rank_bm25.git
Install MatchZoo (optional)
git clone https://github.com/NTMC-Community/MatchZoo.git
cd MatchZoo
python setup.py install
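After installation, a quick sanity check (not part of wikIR itself) that the required libraries are importable and that the nltk stopword lists are available:

```python
# Check that the required libraries are importable and that the nltk
# stopword lists used for preprocessing are available.
import importlib

for package in ('numpy', 'scipy', 'nltk', 'pandas', 'pytrec_eval'):
    importlib.import_module(package)
    print(package, 'OK')

for package in ('rank_bm25', 'matchzoo'):
    try:
        importlib.import_module(package)
        print(package, 'OK (optional)')
    except ImportError:
        print(package, 'not installed (optional)')

import nltk
nltk.download('stopwords')  # stopword lists for the supported languages
```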
- Download and extract an XML Wikipedia dump file from here
- Use Wikiextractor to get the text of the Wikipedia pages in a single JSON file, for example:
python wikiextractor/WikiExtractor.py input --output - --bytes 100G --links --quiet --json > output.json
Where input is the XML Wikipedia dump file and output.json is the resulting file in JSON format
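The JSON file produced by WikiExtractor contains one JSON object per line. A minimal sketch for iterating over it (the "id", "url", "title" and "text" keys are WikiExtractor's usual output fields, shown here only for illustration):

```python
# Iterate over the JSON file produced by WikiExtractor: one JSON object
# per line, normally with "id", "url", "title" and "text" fields.
import json

def iter_articles(path='output.json'):
    with open(path, encoding='utf-8') as f:
        for line in f:
            article = json.loads(line)
            yield article['title'], article['text']

# Print the first article as a quick check.
for title, text in iter_articles():
    print(title, '->', text[:80])
    break
```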
- Call our script
python build_wikIR.py [-i,--input] [-o,--output_dir]
[-m,--max_docs] [-d,--len_doc] [-q,--len_query] [-l,--min_len_doc]
[-e,--min_nb_rel_doc] [-v,--validation_part] [-t,--test_part]
[-k,--k] [-i,--title_queries] [-f,--only_first_links]
[-s,--skip_first_sentence] [-c,--lower_cased] [-j,--json]
[-x,--xml] [-b,--bm25] [-r,--random_seed]
arguments:
[-i,--input] The json file produced by wikiextractor
[-o,--output_dir] Directory where the collection will be stored
optional arguments (a programmatic invocation sketch follows this list):
[--language] Language of the input json file
Possible values: 'en','fr','es','it'
Default value: 'en'
[-m,--max_docs] Maximum number of documents in the collection
Default value None
[-d,--len_doc] Number of max tokens in documents
Default value None: all tokens are preserved
[-q,--len_query] Number of max tokens in queries
Default value None: all tokens are preserved
[-l,--min_len_doc] Minimum number of tokens required for an article
to be added to the dataset as a document
Default value 200
[-e,--min_nb_rel_doc] Minimum number of relevant documents required for
a query to be added to the dataset
Default value 5
[-v,--validation_part] Number of queries in the validation set
Default value 1000
[-t,--test_part] Number of queries in the test set
Default value 1000
[-i,--title_queries] If used, queries are built using the title of
the article
If not used, queries are built using the first
sentence of the article
[-f,--only_first_links] If used, only the links in the first sentence of
articles will be used to build qrels
If not used, all links up to len_doc tokens will
be used to build qrels
[-s,--skip_first_sentence] If used, the first sentence of articles is not
used in documents
[-c,--lower_cased] If used, all characters are lowercase
[-j,--json] If used, documents and queries are saved in json
If not used, documents and queries are saved in
CSV as dataframes compatible with MatchZoo
[-x,--xml] If used, documents and queries are saved in xml
format compatible with Terrier IRS
If not used, documents and queries are saved in
CSV as dataframes compatible with MatchZoo
[-b,--bm25] If used, perform and save results of BM25 ranking
model on the collection
[-k,--k] If BM25 is used, indicates the number of documents
per query saved
Default value 100
[-r,--random_seed] Random seed
Default value 27355
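For convenience, the same call can also be issued from Python. A minimal sketch, using the long flag names listed above and placeholder input/output paths:

```python
# Calling build_wikIR.py from Python rather than the shell.
# Long flag names are taken from the argument list above; the input and
# output paths are placeholders.
import subprocess

subprocess.run(
    ['python', 'build_wikIR.py',
     '--input', 'enwiki.json',      # JSON file produced by wikiextractor
     '--output_dir', 'wikIR1k',     # where the collection will be stored
     '--max_docs', '370000',
     '--title_queries',             # queries built from article titles
     '--only_first_links',
     '--skip_first_sentence',
     '--lower_cased',
     '--bm25'],
    check=True,
)
```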
- The data construction process is similar to [1] and [2]
- Articles are used to build the documents (article titles are removed from the documents)
- The title or the first sentence of each article is used to build the queries
- We assign a relevance of 2 if the query and document were extracted from the same article
- We assign a relevance of 1 if there is a link from the document's article to the query's article (this rule is sketched below)
- For example, the document Autism is relevant to the query Developmental disorder.
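A minimal sketch of this relevance rule, with hypothetical helper names used only for illustration:

```python
# Sketch of the relevance rule above. `query_article` is the article a
# query was built from, `doc_article` the article a document was built
# from, and `links` maps an article to the set of articles it links to
# (hypothetical names, used only to illustrate the rule).
def relevance(query_article, doc_article, links):
    if doc_article == query_article:
        return 2  # query and document come from the same article
    if query_article in links[doc_article]:
        return 1  # the document's article links to the query's article
    return 0      # not relevant
```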
Execute the following lines in the wikIR directory
Download the English Wikipedia dump from 01/11/2019
wget https://dumps.wikimedia.org/enwiki/20191101/enwiki-20191101-pages-articles-multistream.xml.bz2
Extract the file
bzip2 -dk enwiki-20191101-pages-articles-multistream.xml.bz2
Use Wikiextractor (ignore the WARNING: Template errors in article)
python wikiextractor/WikiExtractor.py enwiki-20191101-pages-articles-multistream.xml --output - --bytes 100G --links --quiet --json > enwiki.json
Use wikIR builder
python build_wikIR.py --input enwiki.json --output_dir wikIR1k --max_docs 370000 -tfscb
rm enwiki-20191101-pages-articles-multistream.xml.bz2
rm enwiki-20191101-pages-articles-multistream.xml
rm enwiki.json
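At this point the collection has been built in wikIR1k. A minimal sketch for inspecting it with Pandas; the file names (documents.csv, test/queries.csv, test/qrels) and the qrels format are assumptions about the output layout, so adjust them to what build_wikIR.py actually wrote:

```python
# Inspect the collection that was just built. The file names below are
# assumptions about the output layout; adjust them to the actual contents
# of the output directory.
import pandas as pd

documents = pd.read_csv('wikIR1k/documents.csv')
queries = pd.read_csv('wikIR1k/test/queries.csv')
qrels = pd.read_csv('wikIR1k/test/qrels', sep=r'\s+', header=None,
                    names=['query_id', 'iteration', 'doc_id', 'relevance'])

print(len(documents), 'documents,', len(queries), 'test queries')
print(qrels.head())
```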
Train and evaluate neural networks with MatchZoo
python wikIR/matchzoo_experiment.py -c config.json
Display the results in a format compatible with a LaTeX table
python wikIR/display_res.py -c config.json
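To evaluate a saved run directly (for example the BM25 run produced with --bm25) instead of going through MatchZoo, a minimal pytrec_eval sketch, assuming TREC-style qrels and run files and placeholder paths:

```python
# Evaluate a TREC-format run against the qrels using pytrec_eval.
# The file paths and the assumption that both files use the standard
# TREC qrels/run formats are mine.
import pytrec_eval

def read_qrels(path):
    qrels = {}
    with open(path) as f:
        for line in f:
            qid, _, did, rel = line.split()
            qrels.setdefault(qid, {})[did] = int(rel)
    return qrels

def read_run(path):
    run = {}
    with open(path) as f:
        for line in f:
            qid, _, did, _, score, _ = line.split()
            run.setdefault(qid, {})[did] = float(score)
    return run

qrels = read_qrels('wikIR1k/test/qrels')
run = read_run('wikIR1k/test/BM25.res')
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'map', 'ndcg_cut'})
scores = evaluator.evaluate(run)
print('MAP:', sum(s['map'] for s in scores.values()) / len(scores))
```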
The wikIR1k and wikIR59k datasets presented in our paper are available for download
You can download wikIR1k here.
You can download wikIR59k here.
Datasets in more languages are also available:
|                        | English     | French      | Spanish     | Italian     |
|------------------------|-------------|-------------|-------------|-------------|
| title queries          | ENwikIR59k  | FRwikIR14k  | ESwikIR13k  | ITwikIR16k  |
| first sentence queries | ENwikIRS59k | FRwikIRS14k | ESwikIRS13k | ITwikIRS16k |
We propose datasets with short and well-defined queries built from titles (ENwikIR59k, FRwikIR14k) and datasets with long and noisy queries built from first sentences (ENwikIRS59k, FRwikIRS14k) to study the robustness of IR models to noise.
To reproduce the wikIR1k dataset, execute the following lines in the wikIR directory
wget https://dumps.wikimedia.org/enwiki/20191101/enwiki-20191101-pages-articles-multistream.xml.bz2
bzip2 -d enwiki-20191101-pages-articles-multistream.xml.bz2
python wikiextractor/WikiExtractor.py enwiki-20191101-pages-articles-multistream.xml --output - --bytes 100G --links --quiet --json > enwiki.json
rm enwiki-20191101-pages-articles-multistream.xml.bz2
rm enwiki-20191101-pages-articles-multistream.xml
python build_wikIR.py --input enwiki.json --output_dir COLLECTION_PATH/wikIR1k --max_docs 370000 --validation_part 100 --test_part 100 -tfscb
rm enwiki.json
COLLECTION_PATH is the directory where wikIR1k will be stored
To reproduce the wikIR59k dataset, execute the following lines in the wikIR directory
wget https://dumps.wikimedia.org/enwiki/20191101/enwiki-20191101-pages-articles-multistream.xml.bz2
bzip2 -d enwiki-20191101-pages-articles-multistream.xml.bz2
python wikiextractor/WikiExtractor.py enwiki-20191101-pages-articles-multistream.xml --output - --bytes 100G --links --quiet --json > enwiki.json
rm enwiki-20191101-pages-articles-multistream.xml.bz2
rm enwiki-20191101-pages-articles-multistream.xml
python build_wikIR.py --input enwiki.json --output_dir COLLECTION_PATH/wikIR59k --validation_part 1000 --test_part 1000 -tfscb
rm enwiki.json
COLLECTION_PATH is the directory where wikIR59k will be stored
To create both the wikIR1k and wikIR59k datasets, just call the following script
./reproduce_datasets.sh COLLECTION_PATH
COLLECTION_PATH is the directory where the datasets will be stored
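As an optional check after reproduction, a short sketch that reports the query counts of both collections (they should match the --validation_part and --test_part values used above); the queries.csv paths are assumptions about the output layout:

```python
# Optional check of the reproduced collections: the validation/test query
# counts should match the --validation_part and --test_part values above.
# The queries.csv paths are assumptions about the output layout.
import pandas as pd

for name in ('wikIR1k', 'wikIR59k'):
    for split in ('training', 'validation', 'test'):
        path = f'COLLECTION_PATH/{name}/{split}/queries.csv'  # replace COLLECTION_PATH
        print(name, split, len(pd.read_csv(path)), 'queries')
```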
To reproduce our results with MatchZoo models on the dev dataset, call
python matchzoo_experiment.py -c config.json
To compute statistical significance against BM25 with a Student's t-test and Bonferroni correction, and to display the results on the dev dataset, call
python display_res.py -c config.json
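For reference, a minimal sketch of this kind of significance test, namely a paired Student's t-test against BM25 with Bonferroni correction. The per-query scores and model names are placeholders, and display_res.py's actual implementation may differ.

```python
# Paired Student's t-test against BM25 with Bonferroni correction,
# computed on per-query scores (e.g. nDCG). Scores and model names are
# placeholders for illustration only.
from scipy import stats

bm25_scores = [0.31, 0.42, 0.28, 0.55, 0.47]           # per-query metric for BM25
model_scores = {'KNRM': [0.35, 0.44, 0.30, 0.58, 0.49],
                'DUET': [0.33, 0.40, 0.29, 0.54, 0.50]}

n_comparisons = len(model_scores)                       # Bonferroni factor
for name, scores in model_scores.items():
    t_stat, p_value = stats.ttest_rel(scores, bm25_scores)
    corrected_p = min(1.0, p_value * n_comparisons)
    print(f'{name}: t={t_stat:.3f}, corrected p={corrected_p:.4f}')
```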
If you use the wikIR tool or the dataset(s) we provide to produce results for your scientific publication, please cite our paper:
@inproceedings{frej-etal-2020-wikir,
title = "{WIKIR}: A Python Toolkit for Building a Large-scale {W}ikipedia-based {E}nglish Information Retrieval Dataset",
author = "Frej, Jibril and
Schwab, Didier and
Chevallet, Jean-Pierre",
booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://www.aclweb.org/anthology/2020.lrec-1.237",
pages = "1926--1933",
abstract = "Over the past years, deep learning methods allowed for new state-of-the-art results in ad-hoc information retrieval. However such methods usually require large amounts of annotated data to be effective. Since most standard ad-hoc information retrieval datasets publicly available for academic research (e.g. Robust04, ClueWeb09) have at most 250 annotated queries, the recent deep learning models for information retrieval perform poorly on these datasets. These models (e.g. DUET, Conv-KNRM) are trained and evaluated on data collected from commercial search engines not publicly available for academic research which is a problem for reproducibility and the advancement of research. In this paper, we propose WIKIR: an open-source toolkit to automatically build large-scale English information retrieval datasets based on Wikipedia. WIKIR is publicly available on GitHub. We also provide wikIR59k: a large-scale publicly available dataset that contains 59,252 queries and 2,617,003 (query, relevant documents) pairs.",
language = "English",
ISBN = "979-10-95546-34-4",
}
[1] Shota Sasaki, Shuo Sun, Shigehiko Schamoni, Kevin Duh, and Kentaro Inui. 2018. Cross-lingual learning-to-rank with shared representations.
[2] Shigehiko Schamoni, Felix Hieber, Artem Sokolov, and Stefan Riezler. 2014. Learning translational and knowledge-based similarities from relevance rankings for cross-language retrieval.