The purpose of this project is to convert the NER data set provided by the lang-uk group from the Brat Standoff Format (BSF) to the BEIOS format required by the Stanford Stanza library.
You can also use this tool to convert any other NER data set from BSF to BEIOS. Note that it only converts simple entity tags (T); overlapping entities, relations, and events are not supported for now.
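To make the mapping concrete, here is a minimal sketch of how simple `T` annotations could be turned into BEIOS labels. It is not the code in `bsf_to_beios.py`: the function names `parse_bsf` and `to_beios`, the whitespace tokenization, and the `PERS`/`LOC` tag names are assumptions chosen for illustration.

```python
import re

def parse_bsf(ann_text):
    """Collect (tag, start, end) from simple entity lines (T...) of a .ann file."""
    entities = []
    for line in ann_text.splitlines():
        parts = line.split("\t")
        if not line.startswith("T") or len(parts) < 2:
            continue  # skip relations, events, notes, blank lines
        if ";" in parts[1]:
            continue  # discontinuous spans are out of scope in this sketch
        tag, start, end = parts[1].split()
        entities.append((tag, int(start), int(end)))
    return entities

def to_beios(text, entities):
    """Label whitespace-separated tokens with B/I/E/S/O tags (hypothetical helper)."""
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    labels = ["O"] * len(tokens)
    for tag, start, end in entities:
        # Indices of tokens whose character span overlaps the entity span
        covered = [i for i, (_, s, e) in enumerate(tokens) if s < end and e > start]
        if not covered:
            continue
        if len(covered) == 1:
            labels[covered[0]] = f"S-{tag}"
        else:
            labels[covered[0]] = f"B-{tag}"
            for i in covered[1:-1]:
                labels[i] = f"I-{tag}"
            labels[covered[-1]] = f"E-{tag}"
    return [(tok, lab) for (tok, _, _), lab in zip(tokens, labels)]

sample_text = "Тарас Шевченко народився в Моринцях"
sample_ann = (
    "T1\tPERS 0 14\tТарас Шевченко\n"
    "T2\tLOC 27 35\tМоринцях\n"
    "R1\tRel Arg1:T1 Arg2:T2\n"  # ignored: not a simple entity
)
tagged = to_beios(sample_text, parse_bsf(sample_ann))
for token, label in tagged:
    print(token, label)
```

A single-token entity gets `S-`, a multi-token entity starts with `B-` and ends with `E-`, and everything outside an entity is `O`.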
- Clone the lang-uk data set to a folder of your choice (referred to below as $SRC_DATASET)
git clone https://github.com/lang-uk/ner-uk
- Run the conversion script
python bsf_to_beios.py --src_dataset $SRC_DATASET/data
By default, data is saved to the `../ner-base/` directory. You can change this path with the `--dst` argument.
If `--split_file` is not specified, the script randomly splits the data into train, dev, and test sets. Otherwise, the data is split according to the provided file.
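As an illustration of the random split, the sketch below shuffles document names with a fixed seed and cuts them into three parts. The function name `random_split`, the 10% dev and 20% test ratios, and the seed are assumptions for this example; the script's actual proportions may differ.

```python
import random

def random_split(doc_names, dev_ratio=0.1, test_ratio=0.2, seed=1234):
    """Split document names into train/dev/test lists (hypothetical ratios)."""
    names = sorted(doc_names)            # stable order before shuffling
    random.Random(seed).shuffle(names)   # seeded so the split is reproducible
    n_test = int(len(names) * test_ratio)
    n_dev = int(len(names) * dev_ratio)
    test = names[:n_test]
    dev = names[n_test:n_test + n_dev]
    train = names[n_test + n_dev:]
    return train, dev, test

example_docs = [f"doc{i}" for i in range(10)]
train_docs, dev_docs, test_docs = random_split(example_docs)
print(len(train_docs), len(dev_docs), len(test_docs))
```

Fixing the seed keeps the split stable across runs, which makes results comparable; a predefined split file serves the same goal more explicitly.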
Example of converting to IOB with a predefined split file:
python src/bsf_beios/bsf_to_beios.py --split_file "../ner-uk/doc/dev-test-split.txt" -c 'iob' --dst "../"
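For reference, the relationship between the two tagging schemes can be sketched as a simple label rewrite: `E-` becomes `I-` and `S-` becomes `B-`. The helper `beios_to_iob` is hypothetical, and it produces IOB2-style labels; whether the script's `iob` mode emits exactly this variant is an assumption.

```python
def beios_to_iob(labels):
    """Rewrite BEIOS labels as IOB2 labels (hypothetical helper)."""
    prefix_map = {"E": "I", "S": "B"}  # B, I and O stay unchanged
    converted = []
    for label in labels:
        if label == "O":
            converted.append(label)
            continue
        prefix, _, entity = label.partition("-")
        converted.append(f"{prefix_map.get(prefix, prefix)}-{entity}")
    return converted

beios_labels = ["B-PERS", "E-PERS", "O", "O", "S-LOC"]
iob_labels = beios_to_iob(beios_labels)
print(iob_labels)
```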
After obtaining the `*.bio` files, you can run Stanza NER training.
Make sure to follow the instructions at https://stanfordnlp.github.io/stanza/training.html. There are all sorts of naming gotchas that you want to avoid.
After the necessary configuration, you can run NER model training:
scripts/run_ner.sh Ukrainian-languk
To load the trained model in a Stanza pipeline:
import stanza
nlp = stanza.Pipeline('uk', processors='tokenize,pos,lemma,ner',
                      ner_model_path='your_path/saved_models/ner/uk_languk_nertagger.pt',
                      ner_forward_charlm_path="", ner_backward_charlm_path="")
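Once the pipeline is constructed, recognized entities are available on the returned document. The sketch below wraps that in a small hypothetical helper, `extract_entities`; it assumes `nlp` is a pipeline with the NER processor loaded and relies on Stanza's `doc.ents` spans exposing `text` and `type`.

```python
def extract_entities(nlp, text):
    """Run a pipeline on text and return (entity text, entity type) pairs."""
    doc = nlp(text)
    return [(ent.text, ent.type) for ent in doc.ents]

# Usage, assuming `nlp` is the pipeline created above:
# for ent_text, ent_type in extract_entities(nlp, "Тарас Шевченко народився в Моринцях."):
#     print(ent_text, ent_type)
```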
## Recent training results
Training ended with 34000 steps.
Best dev F1 = 84.24, at iteration = 22000
Running tagger in predict mode
Loading data with batch size 32...
41 batches created.
Start evaluation...
Prec. Rec. F1
84.58 83.89 84.24
NER tagger score:
uk_languk 84.24
Running tagger in predict mode
Loading data with batch size 32...
37 batches created.
Start evaluation...
Prec. Rec. F1
86.86 85.25 86.05
NER tagger score:
uk_languk 86.05
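As a sanity check on the two evaluation runs above, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported values; the function name `f1_score` is chosen for illustration. Because the logged precision and recall are already rounded, the recomputed figure can differ from the logged F1 by about 0.01.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, on the same scale as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) taken from the two runs in the log
reported = [(84.58, 83.89, 84.24), (86.86, 85.25, 86.05)]
for p, r, f1 in reported:
    print(f"P={p} R={r} reported F1={f1} recomputed F1={f1_score(p, r):.2f}")
```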
"Корпус NER-анотацій українських текстів" by lang-uk is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Based on a work at https://github.com/lang-uk/ner-uk.