This package provides tools to download the WikiGraphs dataset [1], collected by pairing each Wikipedia article from WikiText-103 [2] with a knowledge graph (a subgraph from Freebase knowledge graph [3]). The baseline code to reproduce results in [1] is included as well. We hope this can spur more interest in developing models that can generate long text conditioned on graph and generate graphs given text.
Jax, Haiku, Optax, and Jraph are needed for this package. It has been developed and tested on python 3 with the following packages:
- Jax==0.2.13
- Haiku==0.0.5.dev
- Optax==0.0.6
- Jraph==0.0.1.dev
Other packages required can be installed via:
pip install -r requirements.txt
Note: you may need to use pip3
to select pip for python 3 and --user
option
to install the packages to avoid permission issues.
pip install -e .
You can download and unzip the data by running the following command:
bash scripts/download.sh
This will put the downloaded WikiText-103 data in a temporary directory
/tmp/data
with the tokenized WikiText-103 data in /tmp/data/wikitext-103
and
the raw data in /tmp/data/wikitext-103-raw
.
This script will also download our processed Freebase knowledge graph data in a
temporary directory /tmp/data/freebase
.
For WikiText-103, run the following command to generate a vocabulary file:
python scripts/build_vocab.py \
--vocab_file_path=/tmp/data/wikitext-vocab.csv \
--data_dir=/tmp/data/wikitext-103
You can change the default file paths but make sure you make them consistent.
You can run the following command to pair the Freebase graphs with WikiText-103 articles:
python scripts/freebase_preprocess.py \
--freebase_dir=/tmp/data/freebase/max256 \
--output_dir=/tmp/data/wikigraphs/max256
where the freebase_dir
/tmp/data/freebase/max256
is the directory that
contains the Fsreebase graphs, which should have files train.gz
, valid.gz
and test.gz
in it; and output_dir
is the directory that will contain the
generated paired Freebase-WikiText data.
Note: you may need to use python3
to select python 3 if you have both python 2
and 3 on your system.
Given that there are the following number of articles in WikiText-103:
Subset | #articles |
---|---|
Train | 28472* |
Valid | 60 |
Test | 60 |
*Official number is 28475 but we were only able to find 28472 articles in training set.
Our dataset covers around 80% of the WikiText articles:
Max graph size | 256 | 512 | 1024 |
---|---|---|---|
#articles in training set | 23431 | 23718 | 23760 |
Trainining set coverage | 82.3% | 83.3% | 83.5% |
#articles in validation set | 48 | 48 | 48 |
Validation set coverage | 80% | 80% | 80% |
#articles in test set | 43 | 43 | 43 |
Test set coverage | 71.7% | 71.7% | 71.7% |
You can build the vocabulary for the graph data (the max256 version) by running the following command:
python scripts/build_vocab.py \
--vocab_file_path=/tmp/data/graph-vocab.csv \
--data_dir=/tmp/data/wikigraphs \
--version=max256 \
--data_type=graph \
--threshold=15
This gives us a vocabulary of size 31,087, with each token included in the vocabulary appearing at least 15 times.
You also need to build a separate text vocabulary for the WikiGraphs data, as our training set does not cover 100% of WikiText-103.
python scripts/build_vocab.py \
--vocab_file_path=/tmp/data/text-vocab.csv \
--data_dir=/tmp/data/wikigraphs \
--version=max256 \
--data_type=text \
--threshold=3
Here we choose threshold 3 which is also used by the original WikiText-103 data, this gives us a vocabulary size of 238,068, only slightly smaller than the original vocabulary size.
Note that when loading these vocabularies to build tokenizers, our tokenizers
will add a few extra tokens, like <bos>
, <pad>
, so the final vocab size
might be slightly different from the numbers above, depending on which tokenizer
you choose to use.
We only showcase how to build the vocabulary for the max256 version. The above steps can be easily changed for the max512 and max1024 version.
We provide JAX modules to load the WikiGraphs dataset. There are three classes
in wikigraphs/data/paired_dataset.py
:
TextOnlyDataset
: loads only the text part of the WikiGraphs dataBow2TextDataset
: loads text and the paired graph representated as one big bag-of-words (BoW) on all nodes and edges from the graphGraph2TextDataset
: returns text and the paired graph in which each node or edge is represented by a BoW
Different versions of the dataset can be accessed by changing the version
argument in each class. For more detailed usage please refer to
wikigraphs/data/paired_dataset_test.py
. Besides, the original WikiText dataset
can be loaded via the Dataset
class in wikigraphs/data/wikitext.py
.
Note: you may want to change the default data directory if you prefer to place it elsewhere.
To quickly test-run a small model with 1 GPU:
python main.py --model_type=graph2text \
--dataset=freebase2wikitext \
--checkpoint_dir=/tmp/graph2text \
--job_mode=train \
--train_batch_size=2 \
--gnn_num_layers=1 \
--num_gpus=1
To run the default baseline unconditional TransformerXL on Wikigraphs with 8 GPUs:
python main.py --model_type=text \
--dataset=freebase2wikitext \
--checkpoint_dir=/tmp/text \
--job_mode=train \
--train_batch_size=64 \
--gnn_num_layers=1 \
--num_gpus=8
To run the default baseline BoW-based TransformerXL on Wikigraphs with 8 GPUs:
python main.py --model_type=bow2text \
--dataset=freebase2wikitext \
--checkpoint_dir=/tmp/bow2text \
--job_mode=train \
--train_batch_size=64 \
--gnn_num_layers=1 \
--num_gpus=8
To run the default baseline Nodes-only GNN-based TransformerXL on Wikigraphs with 8 GPUs:
python main.py --model_type=bow2text \
--dataset=freebase2wikitext \
--checkpoint_dir=/tmp/bow2text \
--job_mode=train \
--train_batch_size=64 \
--gnn_num_layers=0 \
--num_gpus=8
To run the default baseline GNN-based TransformerXL on Wikigraphs with 8 GPUs:
python main.py --model_type=graph2text \
--dataset=freebase2wikitext \
--checkpoint_dir=/tmp/graph2text \
--job_mode=train \
--train_batch_size=64 \
--gnn_num_layers=1 \
--num_gpus=8
We ran our experiments in the paper using 8 Nvidia V100 GPUs. Reduce the batch size if the model does not fit into memory. To allow for batch parallization for the GNN-based (graph2text) model, we pad graphs to the largest graph in the batch. The full run takes almost 4 days. BoW- and nodes-based models can be trained within 14 hours because there is no additional padding.
To evaluate the model on the validation set (this only uses 1 GPU):
python main.py --model_type=graph2text \
--dataset=freebase2wikitext \
--checkpoint_dir=/tmp/graph2text \
--job_mode=eval \
--eval_subset=valid
To generate 960 samples from the model using the graphs in the validation set (using 8 GPUs):
python main.py --model_type=graph2text \
--dataset=freebase2wikitext \
--checkpoint_dir=/tmp/graph2text \
--job_mode=sample \
--eval_subset=valid \
--num_gpus=8 \
--num_samples=960
To compute the rBLEU score of the generated samples:
python scripts/compute_bleu_score.py --dataset=freebase2wikitext \
--checkpoint_dir=/tmp/graph2text
To compute the retrieval scores:
python main.py --dataset=freebase2wikitext \
--job_mode=retrieve \
--checkpoint_dir=/tmp/graph2text
To cite this work:
@inproceedings{wang2021wikigraphs,
title={WikiGraphs: A Wikipedia Text-Knowledge Graph Paired Dataset},
author={Wang, Luyu and Li, Yujia and Aslan, Ozlem and Vinyals, Oriol},
booktitle={Proceedings of the Graph-Based Methods for Natural Language Processing (TextGraphs)},
pages={67--82},
year={2021}
}
All code copyright 2021 DeepMind Technologies Limited
Code is licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at:
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
WikiGraphs [1] is licensed under the terms of the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.
WikiText-103 data [2] (unchanged) is licensed by Salesforce.com, Inc. under the terms of the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. You can find details about CC BY-SA 4.0 at:
https://creativecommons.org/licenses/by-sa/4.0/legalcode
Freebase data [3] is licensed by Google LLC under the terms of the Creative Commons CC BY 4.0 license. You may obtain a copy of the License at:
https://creativecommons.org/licenses/by/4.0/legalcode
- L. Wang, Y. Li, O. Aslan, and O. Vinyals, "WikiGraphs: a wikipedia - knowledge graph paired dataset", in Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), pages 67-82, 2021.
- S. Merity, C. Xiong, J. Bradbury, and R. Socher, "Pointer sentinel mixture models", arXiv: 1609.07843, 2016.
- K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, "Freebase: a collaboratively created graph database for structuring human knowledge", in Proceedings of ACM SIGMOD international conference on Managementof data, pages 1247–1250, 2008.