Quantifiers and monotonicity (and opposite adjectives) in reasoning tasks
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
source .venv/bin/activate
# Finetuning
python -m sarn.train --output-dir models/bart-mq --log-dir logs/bart-mq facebook/bart-large-mnli data/training.csv
# Inference of two sequences (forwards)
python -m sarn.classify --model models/bart-mq "All dogs jumped over the fence." "Some small dogs jumped over the fence."
# ROC curve (SVG and PDF diagram)
python -m sarn.roc microsoft/deberta-large-mnli data/evaluation.csv
# Model accuracy on dataset
python -m sarn.accuracy models/deberta-mq data/evaluation.csv
# Dataset statistics
python -m sarn.stats data/training.csv
# Language Interpretability Tool
python -m sarn.lit \
--models "facebook/bart-large-mnli" \
"microsoft/deberta-large-mnli" \
"./models/bart-mq" \
"./models/deberta-mq" \
"./models/bart-adj" \
"./models/deberta-adj" \
--datasets "./data/evaluation.csv" "./data/evaluation-adj.csv" \
--cache_dir=cache_dir
As the model, you can specify any valid Hugging Face model (local or remote) that has been finetuned for sequence classification, e.g., facebook/bart-large-mnli, microsoft/deberta-large-mnli, or a local path like models/bart-mq.
# export COLI_USER=<your name>
scp -r ${COLI_USER:?}@last.cl.uni-heidelberg.de:/mnt/semproj/sem_proj20/proj1/models .
for i in ./data/*.csv; do
python3 -m sarn.validate_datasets "$i"
done
See data/README.md for more information.
- data/training.csv: Training dataset for quantifiers and monotonicity in reasoning tasks
- data/evaluation.csv: Evaluation dataset for quantifiers and monotonicity in reasoning tasks
- data/training-adj.csv: Training dataset for quantifiers and monotonicity in reasoning tasks with opposite adjectives
- data/evaluation-adj.csv: Evaluation dataset for quantifiers and monotonicity in reasoning tasks with opposite adjectives
source .venv/bin/activate
# data/training.csv
wget https://github.com/verypluming/MED/raw/master/MED.tsv
python -m sarn.convert.med
wget https://github.com/verypluming/HELP/raw/master/output_en/pmb_train_v1.0.tsv
python -m sarn.convert.help
cat data/med.csv data/help.csv > data/training.csv
# data/evaluation.csv
wget https://nlp.stanford.edu/~wcmac/downloads/fracas.xml
python -m sarn.convert.fracas
# the URL is quoted so that the shell does not treat "?" as a glob character
wget -O diagnostic-full.tsv "https://www.dropbox.com/s/ju7d95ifb072q9f/diagnostic-full.tsv?dl=1"
python -m sarn.convert.superglue
cat data/fracas.csv data/superglue.csv > data/evaluation.csv
# data/training-adj.csv
wget https://github.com/verypluming/MED/raw/master/MED.tsv
python -m sarn.convert.med_adjectives
# manual step here (you may modify sentences where it makes sense):
# - label data/med_adjectives_1.csv by hand (third column)
# - label data/med_adjectives_2.csv by hand (third column)
# - remove fourth column in both files
wget https://nlp.stanford.edu/~wcmac/downloads/fracas.xml
python -m sarn.convert.fracas_adjectives
cat data/med_adjectives_1.csv data/med_adjectives_2.csv data/fracas_adjectives.csv > data/training-adj.csv
python -m sarn.validate_datasets data/training-adj.csv
# data/evaluation-adj.csv
python -m sarn.convert.evaluation_adjectives
# manual step here (you may modify sentences where it makes sense):
# - label data/evaluation-adj.csv by hand (third column)
# - remove fourth column
python -m sarn.validate_datasets data/evaluation-adj.csv
Dataset | avg | median | min | max |
---|---|---|---|---|
data/training.csv | | | | |
Premises | 48.26 | 41 | 5 | 478 |
Hypotheses | 48.93 | 42 | 5 | 478 |
data/evaluation.csv | | | | |
Premises | 79.84 | 58 | 26 | 206 |
Hypotheses | 61.57 | 50 | 26 | 186 |
data/training-adj.csv | | | | |
Premises | 48.93 | 44 | 14 | 212 |
Hypotheses | 50.78 | 46 | 18 | 210 |
data/evaluation-adj.csv | | | | |
Premises | 100.19 | 83 | 25 | 189 |
Hypotheses | 86.62 | 69 | 35 | 189 |
Dataset | avg | median | min | max |
---|---|---|---|---|
data/training.csv | | | | |
Premises | 9.98 | 9 | 2 | 83 |
Hypotheses | 10.10 | 9 | 2 | 83 |
data/evaluation.csv | | | | |
Premises | 13.03 | 10 | 5 | 34 |
Hypotheses | 10.14 | 9 | 5 | 30 |
data/training-adj.csv | | | | |
Premises | 8.89 | 8 | 3 | 29 |
Hypotheses | 9.18 | 9 | 3 | 29 |
data/evaluation-adj.csv | | | | |
Premises | 15.44 | 12 | 5 | 31 |
Hypotheses | 13.47 | 11 | 5 | 31 |
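Length statistics of this kind (avg, median, min, max) can be computed with the Python standard library alone. The following is a minimal sketch and not the code of sarn.stats: the helper names `length_stats`, `char_lengths` and `token_lengths` are hypothetical, and whitespace splitting is an assumed tokenization that may differ from the one used for the tables above.

```python
import statistics


def length_stats(lengths):
    """Return (avg, median, min, max) for a non-empty list of lengths."""
    return (
        round(statistics.mean(lengths), 2),
        statistics.median(lengths),
        min(lengths),
        max(lengths),
    )


def char_lengths(sentences):
    """Length of each sentence in characters."""
    return [len(sentence) for sentence in sentences]


def token_lengths(sentences):
    """Length of each sentence in whitespace-separated tokens (an assumption)."""
    return [len(sentence.split()) for sentence in sentences]


# Toy input taken from the usage example of this README.
premises = [
    "All dogs jumped over the fence.",
    "Some small dogs jumped over the fence.",
]

print("characters:", length_stats(char_lengths(premises)))
print("tokens:", length_stats(token_lengths(premises)))
```

For the real datasets, the sentences would be read from the premise and hypothesis columns of the CSV files first.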
Dataset | total | contradiction | neutral | entailment |
---|---|---|---|---|
data/training.csv | 41'273 | 0 (0.00%) | 20'699 (50.15%) | 20'574 (49.85%) |
data/evaluation.csv | 118 | 15 (12.71%) | 52 (44.07%) | 51 (43.22%) |
data/training-adj.csv | 1'206 | 420 (34.83%) | 749 (62.11%) | 37 (3.07%) |
data/evaluation-adj.csv | 144 | 47 (32.64%) | 84 (58.33%) | 13 (9.03%) |
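The label distribution of a dataset can be derived from its list of labels. This is a rough sketch, assuming the labels are stored as the strings `contradiction`, `neutral` and `entailment`; the functions `label_distribution` and `format_row` are hypothetical and not part of the sarn package.

```python
from collections import Counter

# Assumed label strings, taken from the column names of the table above.
LABELS = ("contradiction", "neutral", "entailment")


def label_distribution(labels):
    """Map each label to (count, percentage) for a non-empty list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (counts[label], 100 * counts[label] / total) for label in LABELS}


def format_row(name, labels):
    """Format one table row, using an apostrophe as thousands separator."""
    total = f"{len(labels):,}".replace(",", "'")
    cells = [
        f"{count:,} ({percentage:.2f}%)".replace(",", "'")
        for count, percentage in label_distribution(labels).values()
    ]
    return " | ".join([name, total, *cells]) + " |"


# Labels rebuilt from the counts listed for data/evaluation.csv.
evaluation_labels = ["contradiction"] * 15 + ["neutral"] * 52 + ["entailment"] * 51
print(format_row("data/evaluation.csv", evaluation_labels))
```

A distribution like the one for data/training.csv, which contains no contradiction examples at all, is worth checking before finetuning, because a model trained on it never sees that class.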
- facebook/bart-large-mnli: Pretrained BART model finetuned on MultiNLI
- microsoft/deberta-large-mnli: Pretrained DeBERTa model finetuned on MultiNLI
- models/bart-mq: finetuned version of facebook/bart-large-mnli on data/training.csv
- models/deberta-mq: finetuned version of microsoft/deberta-large-mnli on data/training.csv
- models/bart-adj: finetuned version of models/bart-mq on data/training-adj.csv
- models/deberta-adj: finetuned version of models/deberta-mq on data/training-adj.csv
source .venv/bin/activate
# facebook/bart-large-mnli and microsoft/deberta-large-mnli will automatically
# be downloaded from huggingface.co when used
# models/bart-mq
python -m sarn.train --output-dir "models/bart-mq" --log-dir "logs/bart-mq" "facebook/bart-large-mnli" "data/training.csv"
# models/deberta-mq
python -m sarn.train --output-dir "models/deberta-mq" --log-dir "logs/deberta-mq" "microsoft/deberta-large-mnli" "data/training.csv"
# models/bart-adj
python -m sarn.train --output-dir "models/bart-adj" --log-dir "logs/bart-adj" "facebook/bart-large-mnli" "data/training-adj.csv"
# models/deberta-adj
python -m sarn.train --output-dir "models/deberta-adj" --log-dir "logs/deberta-adj" "microsoft/deberta-large-mnli" "data/training-adj.csv"
Model | data/evaluation.csv | data/evaluation-adj.csv |
---|---|---|
facebook/bart-large-mnli | 65.25% | 40.97% |
microsoft/deberta-large-mnli | 71.19% | 47.22% |
models/bart-mq | 57.63% | 34.72% |
models/deberta-mq | 61.86% | 34.72% |
models/bart-adj | 45.76% | 58.33% |
models/deberta-adj | 42.37% | 57.64% |
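Accuracy here means the share of examples for which the predicted label equals the gold label. A minimal sketch of that computation follows; the function `accuracy` is hypothetical and only illustrates the metric, it is not the implementation behind sarn.accuracy.

```python
def accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one example is required")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy labels, not taken from any of the datasets.
gold = ["entailment", "neutral", "contradiction", "neutral"]
predicted = ["entailment", "neutral", "neutral", "neutral"]
print(f"{accuracy(gold, predicted):.2%}")
```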
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. [...] AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1). [...] AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.
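The ranking interpretation of AUC can be made concrete by comparing every positive example with every negative example. The sketch below assumes a binary setting (for the three NLI labels, one label would be treated as positive and the other two as negative) and counts ties as half a win. The function `auc_by_ranking` is hypothetical and is not how sarn.roc is known to compute its curves.

```python
def auc_by_ranking(scores, is_positive):
    """Probability that a random positive example is scored above a random negative one."""
    positives = [s for s, y in zip(scores, is_positive) if y]
    negatives = [s for s, y in zip(scores, is_positive) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Toy scores: model confidence that an example belongs to the positive class.
scores = [0.9, 0.8, 0.35, 0.1]
is_positive = [True, False, True, False]
print(auc_by_ranking(scores, is_positive))
```

This pairwise version is quadratic in the number of examples, which is fine for evaluation sets of the size used here; the value equals the area under the ROC curve.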
[Figure: ROC curve diagrams for BART and DeBERTa]