Universal Schema embeds text and knowledge base relations together to perform relation extraction and automatic knowledge base population. The typical universal schema model performs matrix factorization where rows are entity pairs and columns are relations.
This code allows you to perform matrix factorization where the column and row embeddings are parameterized by an arbitrary encoder. In the simplest case, a 'standard' matrix factorization would use a lookup table for each encoder. More complex models could use combinations of LSTMs, CNNs, etc. Doing this is as simple as setting the rowEncoder and colEncoder parameters:
th src/UniversalSchema.lua -rowEncoder lookup-table -colEncoder lstm
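For intuition, here is a minimal, hypothetical Torch sketch of the factorization described above with two lookup-table encoders. The variable names and sizes are illustrative and do not come from this repo's code; the repo's actual modules are in src.

```lua
require 'nn'

-- illustrative sizes; these numbers are not taken from the repo
local dim, numRows, numCols = 100, 5000, 2000

-- the row encoder embeds entity pairs and the column encoder embeds relations;
-- either lookup table could be swapped for an LSTM or CNN over relation tokens
local rowEncoder = nn.LookupTable(numRows, dim)
local colEncoder = nn.LookupTable(numCols, dim)

-- a cell of the matrix is scored as the dot product of its row and column embeddings
local scorer = nn.Sequential()
   :add(nn.ParallelTable():add(rowEncoder):add(colEncoder))
   :add(nn.DotProduct())

-- score entity-pair id 17 against relation id 42
local score = scorer:forward({torch.LongTensor{17}, torch.LongTensor{42}})
print(score)
```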
This code was used for the following papers:
Multilingual Relation Extraction using Compositional Universal Schema by Patrick Verga, David Belanger, Emma Strubell, Benjamin Roth, Andrew McCallum. NAACL 2016. (You can download our training data here and some pretrained models here)
Row-less Universal Schema by Patrick Verga and Andrew McCallum. AKBC 2016. (For this code, use the rowless-updates branch)
Generalizing to Unseen Entities and Entity Pairs with Row-less Universal Schema by Patrick Verga, Arvind Neelakantan, Andrew McCallum. EACL 2017. (For this code, use the rowless-updates branch. You can download our entity type data here)
If you use this code, please cite us.
Dependencies:
- Lua 5.1 (may not work with newer versions)
- torch
- nn
- rnn
- optim
- set this environment variable: TH_RELEX_ROOT=/path/to/this/proj
Your entity-relation data should be a 4-column tsv file:
entity1 \t entity2 \t relation \t 1
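For example, a line pairing a Freebase entity pair with a text-pattern relation (see the table further below for more examples):
/m/02k__v \t /m/01y5zy \t $ARG1 lives in the city of $ARG2 \t 1
Then process it with: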
./bin/process/process-data.sh -i your-data -o your-data.torch -v vocab-file
There are other flags you can look at by running ./bin/process/process-data.sh --help
You can also process arbitrary data in 3-column format with the -b flag:
row_value \t col_value \t 1
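For reference, a 3-column line could look like the following; the pairing of an entity id with a KB relation here is purely illustrative and not taken from the repo's data:
/m/02k__v \t /people/person/profession \t 1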
If you want your rows and columns to share the same vocabulary, use the -g flag:
./bin/process/process-data.sh -i your-3column-data -o your-data.torch -v vocab-file -b -g
You can run various Universal Schema models located in src. Check out the various options in CmdArgs.lua
You can train models using the train script bin/train/train-model.sh. The script takes two parameters: a gpu id (-1 for cpu) and a config file. You can run an example base Universal Schema model and evaluate MAP with the following command:
./bin/train/train-model.sh 0 bin/train/configs/examples/uschema-example
MAP will be calculated every kth iteration based on the -evaluateFrequency cmd arg. AP is calculated on a per-column basis and then averaged to get MAP. To calculate MAP for your model, you need to generate one file per test column in the same format as your training data. Unlike the training data, the test data must explicitly include negative examples: negative examples have a 0 in the last column of the file while positive examples have a 1.
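For example, a test file for the column $ARG1 lives in the city of $ARG2 might contain lines like these (the entity pairs shown are illustrative):
/m/02k__v \t /m/01y5zy \t $ARG1 lives in the city of $ARG2 \t 1
/m/09cg6 \t /m/0r297 \t $ARG1 lives in the city of $ARG2 \t 0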
Place all of these files in a directory, test-data-dir for example, and then run the following command:
./bin/process/process-test-data-dir.sh test-data-dir test-data-dir.torch vocab-file
Here vocab-file should be the same vocab file that you generated your training data with.
- This requires setting up Relation Factory and setting $TAC_ROOT=/path/to/relation-factory. Just follow the setup instructions on the Relation Factory github or run:
$TH_RELEX_ROOT/setup-relationfactory.sh
First run: ./setup-tac-eval.sh
We include candidate files for years 2012, 2013, and 2014 as well as config files to evaluate each year.
You can tune thresholds on year 2012 and evaluate on year 2013 with this command:
./bin/tac-evaluation/tune-and-score.sh 2012 2013 trained-model vocab-file.txt gpu-id max-length-seq-to-consider output-dir
You can also download some pretrained models from our paper Multilingual Relation Extraction using Compositional Universal Schema. The download includes a script that will evaluate the models.
You can also use this code to score relations. Here we'll walk through the steps to train a Universal Schema model and use it to score relation candidates.
e1 | e2 | relation | label |
---|---|---|---|
/m/02k__v | /m/01y5zy | $ARG1 lives in the city of $ARG2 | 1 |
/m/09cg6 | /m/0r297 | $ARG2 is a type of $ARG1 | 1 |
/m/02mwx2g | /m/02lmm0_ | /biology/gene_group_membership/gene | 1 |
/m/0hqv6zr | /m/0hqx04q | /medicine/drug_formulation/formulation_of | 1 |
/m/011zd3 | /m/02jknp | /people/person/profession | 1 |
- First, create a training set that combines the KB triples and text relations you care about. For example, generate a file like the one above called train.tsv.
- Next, process that file:
./bin/process/process-data.sh -i train.tsv -o data/train.torch -v vocab-file
- Now we want to train a model. Edit the example lstm config so that TRAIN_FILE points to the data you just processed:
export TRAIN_FILE=data/train.torch
and start the model training: ./bin/train/train-model.sh 0 bin/train/configs/examples/lstm-example
This will save a model to models/lstm-example/*-model every 3 epochs.
- Now we can use this model to perform relation extraction. Generate a candidate file called candidates.tsv. The file should be tab separated with the following form:
entity_1 kb_relation entity_2 doc_info arg1_start_token_idx arg1_end_token_idx arg2_start_token_idx arg2_end_token_idx sentence
A concrete example is:
Barack Obama per:spouse Michelle Obama doc_info 0 2 8 10 Barack Obama was seen yesterday with his wife Michelle Obama .
- Finally, we can score each relation with the following command:
th src/eval/ScoreCandidateFile.lua -candidates candidates.tsv -outFile scored-candidates.tsv -vocabFile vocab-file-tokens.txt -model models/lstm-example/5-model -gpuid 0
This will generate a scored candidate file with the same number of lines, where the sentence is replaced by a score; higher is more probable.
Barack Obama per:spouse Michelle Obama doc_info 0 2 8 10 0.94
Feel free to contact me with questions: [email protected]