Skip to content

Converting training data set built by lang-uk community to the format supported by Stanza NLP library

Notifications You must be signed in to change notification settings

gawy/stanza-lang-uk

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Ukrainian NER data set conversion to be used by Stanza (Stanford NLP Library)

Purpose of this project is to convert NER data set provided by lang-uk group from Brat Standoff Format to BEIOS format required by Stanford Stanza Library

You can also use this tool convert any other NER data set from BSF to BEIOS format. But mind that it will only convert simple entity tgs (T), no overlapping, relations, events are supported for now.

Usage

  1. Clone lang-uk data set to the folder of your choice (later referred as $SRC_DATASET)
git clone https://github.com/lang-uk/ner-uk
  1. Run conversion script
python bsf_to_beios.py --src_dataset $SRC_DATASET/data

Data will be saved to ../ner-base/ dir. Or you can change this path with --dst argument.

If --split_file is not specified, the script will randomly split the data into train, dev, test sets. Otherwise, data will be processed according to provided file.

Example to convert to iob

python src/bsf_beios/bsf_to_beios.py --split_file "../ner-uk/doc/dev-test-split.txt" -c 'iob' --dst "../"

Stanza training

After obtaining *.bio files you can run Stanza NER training.

Make sure to follow instructions at https://stanfordnlp.github.io/stanza/training.html. There are all sorts of naming gotchas that you want to avoid.

After necessary configuration you will be able to run NER model training

scripts/run_ner.sh Ukrainian-languk

Using trained model

import stanza
nlp = stanza.Pipeline('uk', processors='tokenize,pos,lemma', 
                      ner_model_path='your_path/saved_models/ner/uk_languk_nertagger.pt', 
                      ner_forward_charlm_path="", ner_backward_charlm_path="")
## Recent training results
Training ended with 34000 steps.
Best dev F1 = 84.24, at iteration = 22000

Running tagger in predict mode
Loading data with batch size 32...
41 batches created.
Start evaluation...
Prec.	Rec.	F1
84.58	83.89	84.24
NER tagger score:
uk_languk 84.24

Running tagger in predict mode
Loading data with batch size 32...
37 batches created.
Start evaluation...
Prec.	Rec.	F1
86.86	85.25	86.05
NER tagger score:
uk_languk 86.05

Kudos

"Корпус NER-анотацій українських текстів" by lang-uk is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Based on a work at https://github.com/lang-uk/ner-uk.

About

Converting training data set built by lang-uk community to the format supported by Stanza NLP library

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages