# JBLiMP: Japanese Benchmark of Linguistic Minimal Pairs

JBLiMP is a novel dataset for targeted syntactic evaluations of language models in Japanese. It consists of 331 minimal pairs, created on the basis of acceptability judgments extracted from journal articles in theoretical linguistics. These minimal pairs are grouped into 11 categories, each covering a different linguistic phenomenon.

## Contents

- Minimal pairs before human validation (367 pairs) can be found in `JBLiMP/data/raw`.
- Minimal pairs after human validation (331 pairs) can be found in `JBLiMP/data/validated` (see the loading sketch after this list).
  - To validate the quality of the minimal pairs in JBLiMP, we conducted an acceptability judgment experiment. A minimal pair was removed from JBLiMP whenever the JBLiMP annotation did not match the majority vote of the human annotations.
  - These human-validated minimal pairs were used for the evaluation of our language models.
- Our paper, accepted to Findings of EACL 2023, can be found at `JBLiMP/paper/someya_oseki_2023.pdf`.
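
As a minimal loading sketch in Python: the directory layout and JSON format here are assumptions, not guaranteed by the repository, so adjust the path and parsing to the actual data files. The field names follow the Data Format table below.

```python
import json
from pathlib import Path

# Minimal loading sketch. Assumes the validated pairs are stored as JSON
# files, each holding a list of records with the fields described under
# "Data Format" below; adjust if the layout differs (e.g. one JSONL file).
def load_pairs(data_dir: str = "JBLiMP/data/validated") -> list[dict]:
    pairs = []
    for path in sorted(Path(data_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            pairs.extend(json.load(f))
    return pairs

pairs = load_pairs()
print(len(pairs))                 # expected: 331 validated pairs
print(pairs[0]["good_sentence"])  # the acceptable member of the pair
print(pairs[0]["bad_sentence"])   # the unacceptable member
```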

## Data Format

| Name | Description |
| --- | --- |
| `ID` | ID of the minimal pair |
| `year` | Publication year of the source article |
| `author` | Author(s) of the source article |
| `{good, bad}_num` | Example numbers in the source article |
| `{good, bad}_diacritic` | Acceptability judgments in the source article |
| `{good, bad}_sentence_raw` | Raw example sentences from the source article (some are fragments) |
| `{good, bad}_sentence` | Example sentences in JBLiMP (fragments are completed into full sentences) |
| `{good, bad}_gloss` | Glosses in the source article |
| `{good, bad}_translation` | English translations in the source article |
| `type` | Categorization based on the type of acceptability judgment and on how the sentences were presented in the source article |
| `phenomenon` | Categorization based on linguistic phenomena |
| `phenomenon-2` | Secondary categorization based on linguistic phenomena |
| `paradigm` | Sub-categorization of `phenomenon` |
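
For illustration, a record with these fields might look like the following Python dict. Every value here is invented for illustration; none of it is taken from the actual dataset.

```python
# Hypothetical record; all values below are invented for illustration
# and do not come from JBLiMP itself.
record = {
    "ID": 42,
    "year": 2001,                       # publication year of the source article
    "author": "Tanaka",                 # hypothetical source author
    "good_num": "(3a)",
    "bad_num": "(3b)",
    "good_diacritic": "",               # judged acceptable in the source
    "bad_diacritic": "*",               # judged unacceptable in the source
    "good_sentence_raw": "太郎が本を読んだ。",
    "bad_sentence_raw": "*太郎を本が読んだ。",
    "good_sentence": "太郎が本を読んだ。",
    "bad_sentence": "太郎を本が読んだ。",
    "good_gloss": "Taro-NOM book-ACC read-PST",
    "bad_gloss": "Taro-ACC book-NOM read-PST",
    "good_translation": "'Taro read a book.'",
    "bad_translation": "'Taro read a book.'",
    "type": "single sentence",          # how the examples were presented
    "phenomenon": "argument structure",
    "phenomenon-2": "case marking",
    "paradigm": "case particle swap",
}
```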

## Model Evaluation

We evaluated the syntactic knowledge of several language models on JBLiMP: GPT-2, LSTM, and n-gram language models trained by Kuribayashi et al. (2021). All models achieved comparable accuracies of around 76%, while the human baseline accuracy was 90.90%.

| Model | Overall | Argument Structure | Verbal Agreement | Morphology | Nominal Structure | Ellipsis | Quantifiers | Binding | Island effects | Filler-gap | NPI Licensing | Control/Raising |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Trans-LG | 77.95 | 89.05 | 53.55 | 82.86 | **95.65** | 85.96 | 73.81 | 58.97 | 75.76 | 55.56 | 50.00 | _16.67_ |
| Trans-SM | 76.54 | 89.05 | 44.26 | 82.86 | **97.10** | 89.47 | 71.43 | 46.15 | 84.85 | 55.56 | 75.00 | _0.00_ |
| LSTM | 75.73 | 86.67 | 46.99 | 83.81 | **95.65** | 91.23 | 66.67 | _41.03_ | 87.88 | 44.44 | 66.67 | 50.00 |
| 5-gram | 74.02 | 78.57 | 57.38 | 82.86 | 86.96 | **89.47** | 78.57 | 53.85 | 72.73 | 66.67 | 50.00 | _0.00_ |
| Human | 90.90 | 92.19 | 89.62 | 94.86 | 97.68 | 87.37 | 85.71 | 82.05 | 92.12 | 78.52 | 90.00 | 70.00 |
| Model Ave. | 76.06 | 85.76 | 50.55 | 83.10 | 93.84 | 89.03 | 72.62 | 50.00 | 80.31 | 55.56 | 60.42 | 16.67 |

Accuracies are averaged over 3 different random seeds, except for the 5-gram model and the human baseline. For each model, the number in bold indicates its best score and the number in italics its worst.
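
Evaluation on minimal pairs is typically run by scoring both sentences of each pair with a language model and counting the pair as correct when the acceptable sentence receives the higher probability. The sketch below shows this scheme with a Hugging Face causal LM; the checkpoint name is a placeholder, and this is not the exact evaluation code or the models of Kuribayashi et al. (2021) used in the paper. It reuses `load_pairs` from the sketch above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint: substitute any Japanese causal LM. This is NOT
# the GPT-2 of Kuribayashi et al. (2021) evaluated in the paper.
MODEL_NAME = "your-japanese-gpt2-checkpoint"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability the model assigns to the sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the Hugging Face loss is the mean
        # negative log-likelihood over the n-1 predicted tokens.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

def pair_correct(pair: dict) -> bool:
    # A pair counts as correct when the acceptable sentence is more
    # probable under the model than the unacceptable one.
    return (sentence_logprob(pair["good_sentence"])
            > sentence_logprob(pair["bad_sentence"]))

pairs = load_pairs()  # from the loading sketch above
accuracy = sum(pair_correct(p) for p in pairs) / len(pairs)
print(f"accuracy: {accuracy:.2%}")
```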

## Recommended Citation

Taiga Someya and Yohei Oseki. 2023. JBLiMP: Japanese Benchmark of Linguistic Minimal Pairs. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1581–1594, Dubrovnik, Croatia. Association for Computational Linguistics.

## Licence

Most of the example sentences are extracted from linguistics journals without modification. In most cases, therefore, the copyright of the example sentences remains with the original authors or publishers.
