Add support MasakhaNER v2 dataset #3013

Merged 2 commits on Dec 14, 2022

@stefan-it (Member) commented on Dec 6, 2022:

Hi,

this PR adds support for the recently released version 2 of the MasakhaNER dataset!

A new `version` argument was added; it defaults to "v2". Demo usage:

from flair.datasets import NER_MASAKHANE

v1_all = NER_MASAKHANE(languages="all", version="v1")
v2_all = NER_MASAKHANE(languages="all", version="v2")

Closes #2971.

Unit tests

This PR also adds unit tests for v1 and v2 of the dataset. The tables below compare the number of sentences reported in the paper with the number parsed by this implementation:

Version 1 - bold denotes difference:

| Language | Paper (Train, Dev, Test) | Flair (Train, Dev, Test) |
|----------|--------------------------|--------------------------|
| amh | 1750 / 250 / 500 | 1750 / 250 / 500 |
| hau | 1903 / 272 / 545 | 1903 / 272 / 545 |
| ibo | 2233 / 319 / 638 | 2233 / 319 / **639** |
| kin | 2110 / 301 / 604 | 2110 / 301 / 604 |
| lug | 2003 / 200 / 401 | 2003 / 200 / 401 |
| luo | 644 / 92 / 185 | 644 / 92 / 185 |
| pcm | 2100 / 300 / 600 | 2100 / 300 / 600 |
| swa | 2104 / 300 / 602 | 2104 / 300 / 602 |
| wol | 1871 / 267 / 536 | 1871 / 267 / 536 |
| yor | 2124 / 303 / 608 | 2124 / 303 / 608 |

Version 2 - bold denotes difference:

| Language | Paper (Train, Dev, Test) | Flair (Train, Dev, Test) |
|----------|--------------------------|--------------------------|
| bam | 4462 / 638 / 1274 | 4462 / 638 / 1274 |
| bbj | 3384 / 483 / 966 | 3384 / 483 / 966 |
| ewe | 3505 / 501 / 1001 | 3505 / 501 / 1001 |
| fon | 4343 / 621 / 1240 | 4343 / 621 / 1240 |
| hau | 5716 / 816 / 1633 | 5716 / 816 / 1633 |
| ibo | 7634 / 1090 / 2181 | 7634 / 1090 / 2181 |
| kin | 7825 / 1118 / 2235 | 7825 / 1118 / 2235 |
| lug | 4942 / 706 / 1412 | 4942 / 706 / 1412 |
| mos | 4532 / 648 / 1294 | 4532 / 648 / 1294 |
| pcm | 5646 / 806 / 1613 | 5646 / 806 / 1613 |
| nya | 6250 / 893 / 1785 | 6250 / 893 / 1785 |
| sna | 6207 / 887 / 1773 | 6207 / 887 / 1773 |
| swa | 6593 / 942 / 1883 | 6593 / 942 / 1883 |
| tsn | 3489 / 499 / 996 | 3489 / 499 / 996 |
| twi | 4240 / 605 / 1211 | 4240 / 605 / 1211 |
| wol | 4593 / 656 / 1312 | 4593 / 656 / 1312 |
| xho | 5718 / 817 / 1633 | 5718 / 817 / 1633 |
| yor | 6877 / 983 / 1964 | **6876** / 983 / 1964 |
| zul | 5848 / 836 / 1670 | 5848 / 836 / 1670 |
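
The comparison in the tables above can also be run as a small check. The following is only a sketch and not the unit test code from this PR. The names `PAPER_V2`, `FLAIR_V2`, `find_count_mismatches` and `parsed_counts` are made up for illustration, and only three v2 languages are copied from the table. `parsed_counts` assumes that `languages` accepts a single language code and that the returned corpus exposes `train`, `dev` and `test` splits that support `len()`.

```python
from typing import Dict, List, Tuple

Counts = Tuple[int, int, int]  # (train, dev, test) sentence counts

# Values copied from the v2 comparison table (subset of languages only).
PAPER_V2: Dict[str, Counts] = {
    "bam": (4462, 638, 1274),
    "yor": (6877, 983, 1964),
    "zul": (5848, 836, 1670),
}
FLAIR_V2: Dict[str, Counts] = {
    "bam": (4462, 638, 1274),
    "yor": (6876, 983, 1964),
    "zul": (5848, 836, 1670),
}


def find_count_mismatches(
    paper: Dict[str, Counts], parsed: Dict[str, Counts]
) -> List[Tuple[str, str, int, int]]:
    """Return (language, split, paper_count, parsed_count) for each differing split."""
    mismatches = []
    for language in sorted(paper):
        splits = zip(("train", "dev", "test"), paper[language], parsed[language])
        for split, expected, actual in splits:
            if expected != actual:
                mismatches.append((language, split, expected, actual))
    return mismatches


def parsed_counts(language: str, version: str) -> Counts:
    """Hypothetical helper: load one language with Flair and count sentences.

    Assumes `languages` accepts a single code and that the corpus splits
    support len(). Calling this downloads the dataset.
    """
    from flair.datasets import NER_MASAKHANE  # lazy import, requires flair

    corpus = NER_MASAKHANE(languages=language, version=version)
    return len(corpus.train), len(corpus.dev), len(corpus.test)


if __name__ == "__main__":
    # Only the yor training split differs in the table data above.
    print(find_count_mismatches(PAPER_V2, FLAIR_V2))
```

With the table values above, the check reports a single mismatch: the `yor` training split (6877 in the paper, 6876 parsed).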

@alanakbik (Collaborator) commented:
@stefan-it thanks for adding this!

@alanakbik merged commit 0cd260e into master on Dec 14, 2022
@alanakbik deleted the add-support-maskhaner-v2 branch on December 14, 2022, 10:35