# 🤖 Danish Transformers

Transformers constitute the current paradigm within Natural Language Processing (NLP) for a variety of downstream tasks. The number of transformers trained on Danish corpora is limited, which is why the ambition of this repository is to provide the Danish NLP community with alternatives to the already established models. The pretrained models in this repository are trained using 🤗Transformers, and checkpoints are made available at the HuggingFace model hub for both PyTorch and TensorFlow.

## Model Weights

Details on how to use the models can be found by clicking the architecture headers; a short loading sketch is also included below.

** Pretrained using the ELECTRA pretraining approach.
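
As a quick illustration, a checkpoint can be loaded directly with 🤗Transformers. The sketch below assumes a model id under the author's hub namespace; the exact names are listed on the model hub.

```python
# Minimal sketch of loading a checkpoint from the HuggingFace model hub;
# the model id below is an assumption and may differ from the exact hub name.
from transformers import AutoModel, AutoTokenizer

model_id = "sarnikowski/convbert-medium-small-da-cased"  # assumed hub id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)              # PyTorch weights
# TensorFlow: from transformers import TFAutoModel; TFAutoModel.from_pretrained(model_id)

inputs = tokenizer("Der bor mange mennesker i København.", return_tensors="pt")
last_hidden = model(**inputs).last_hidden_state          # contextual embeddings
print(last_hidden.shape)
```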

## Benchmarks

All downstream task benchmarks are evaluated on finetuned versions of the transformer models. The dataset used for benchmarking both NER and POS tagging is the Danish Dependency Treebank UD-DDT. All models were trained for 3 epochs on the train set. All reported scores are averages over N=5 runs with different random seeds for each model, where σ denotes the standard deviation.
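
The exact training scripts are not reproduced here, but the protocol above can be sketched with a Trainer-based setup. The following is a minimal sketch, assuming the DaNE distribution of UD-DDT on the HuggingFace hub (dataset id `dane`, with `tokens`/`ner_tags` columns and train/validation/test splits), an assumed model id, and seqeval for the entity-level F1; none of these identifiers are confirmed by this repository.

```python
# Sketch of the benchmark protocol: finetune for 3 epochs on the DDT train set,
# repeat over 5 random seeds, and report mean/σ of the micro-averaged entity F1.
# Dataset id, column names, model id and the seqeval metric are assumptions.
import numpy as np
from datasets import concatenate_datasets, load_dataset
from seqeval.metrics import f1_score
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments, set_seed)

MODEL_ID = "sarnikowski/convbert-medium-small-da-cased"  # assumed hub id

dataset = load_dataset("dane")  # DaNE: UD-DDT with NER annotations (assumed id)
label_names = dataset["train"].features["ner_tags"].feature.names
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
collator = DataCollatorForTokenClassification(tokenizer)


def tokenize_and_align(batch):
    """Tokenize pre-split words and align word-level NER tags to subwords."""
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        previous, aligned = None, []
        for wid in enc.word_ids(batch_index=i):
            # label only the first subword of each word; mask the rest with -100
            aligned.append(tags[wid] if wid is not None and wid != previous else -100)
            previous = wid
        enc["labels"].append(aligned)
    return enc


def compute_metrics(eval_pred):
    """Micro-averaged entity F1 over the non-masked positions."""
    pred = np.argmax(eval_pred.predictions, axis=-1)
    gold = eval_pred.label_ids
    true_tags = [[label_names[g] for g in row if g != -100] for row in gold]
    pred_tags = [[label_names[p] for p, g in zip(p_row, g_row) if g != -100]
                 for p_row, g_row in zip(pred, gold)]
    return {"f1": f1_score(true_tags, pred_tags)}


encoded = dataset.map(tokenize_and_align, batched=True)
eval_set = concatenate_datasets([encoded["test"], encoded["validation"]])  # test+dev

scores = []
for seed in range(5):  # N=5 random seed runs
    set_seed(seed)
    model = AutoModelForTokenClassification.from_pretrained(
        MODEL_ID, num_labels=len(label_names))
    args = TrainingArguments(output_dir=f"runs/seed_{seed}",
                             num_train_epochs=3, seed=seed)
    trainer = Trainer(model=model, args=args, data_collator=collator,
                      train_dataset=encoded["train"], eval_dataset=eval_set,
                      compute_metrics=compute_metrics)
    trainer.train()
    scores.append(trainer.evaluate()["eval_f1"] * 100)  # report as percentage

print(f"micro F1: {np.mean(scores):.2f} (σ={np.std(scores):.2f})")
```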

### Named Entity Recognition

The table below shows the F1-scores on the combined test+dev set for the entities LOC, ORG, PER and MISC over N=5 runs; a short scoring sketch follows the table.

| Model | Params | LOC | ORG | PER | MISC | Micro AVG |
|---|---|---|---|---|---|---|
| bert-base-multilingual-cased | ~177M | 87.02 | 75.24 | 91.28 | 75.94 | 83.18 (σ=0.81) |
| danish-bert-uncased-v2 | ~110M | 87.40 | 75.43 | 93.92 | 76.21 | 84.19 (σ=0.75) |
| convbert-medium-small-da-cased | ~24.3M | 88.61 | 75.97 | 90.15 | 77.07 | 83.54 (σ=0.55) |
| convbert-small-da-cased | ~12.9M | 85.86 | 71.21 | 89.07 | 73.50 | 80.76 (σ=0.40) |
| electra-small-da-cased | ~13.3M | 86.30 | 70.05 | 88.34 | 71.31 | 79.63 (σ=0.22) |
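
For reference, the entity-level precision/recall/F1 and the micro average shown above can be computed from IOB tag sequences. The toy sketch below uses the seqeval package, which is an assumption (the exact evaluation tooling is not specified here); the tag sequences are illustrative only.

```python
# Toy illustration (not the benchmark data) of per-entity and micro-averaged F1
# over IOB-tagged sequences, using seqeval as an assumed evaluation tool.
from seqeval.metrics import classification_report, f1_score

gold = [["B-PER", "I-PER", "O", "B-LOC", "O"],
        ["B-ORG", "O", "B-MISC", "I-MISC", "O"]]
pred = [["B-PER", "I-PER", "O", "B-LOC", "O"],
        ["B-ORG", "O", "O", "B-MISC", "O"]]

print(classification_report(gold, pred, digits=2))  # per-entity P/R/F1
print("micro F1:", f1_score(gold, pred))            # micro average, as in the table
```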

### Part-of-speech Tagging

The table below shows the F1-scores on the combined test+dev set over N=5 runs.

| Model | Params | Micro AVG |
|---|---|---|
| bert-base-multilingual-cased | ~177M | 97.42 (σ=0.09) |
| danish-bert-uncased-v2 | ~110M | 98.08 (σ=0.05) |
| convbert-medium-small-da-cased | ~24.3M | 97.92 (σ=0.03) |
| convbert-small-da-cased | ~12.9M | 97.32 (σ=0.03) |
| electra-small-da-cased | ~13.3M | 97.42 (σ=0.05) |

## Data

The custom Danish corpora used for pretraining were created from the following sources:

All characters in the corpus were transliterated to ASCII, with the exception of æøåÆØÅ. Sources containing web-crawled data were cleaned of overrepresented NSFW ads and commercials. The final dataset consists of 14,483,456 precomputed tensors of length 256.
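
As an illustration of the transliteration step, the sketch below keeps æøåÆØÅ and maps every other non-ASCII character to its closest ASCII form using only the standard library; the actual cleaning pipeline may differ.

```python
# Minimal sketch of ASCII transliteration that preserves æøåÆØÅ; the actual
# corpus-cleaning pipeline is not published here and may differ.
import unicodedata

KEEP = set("æøåÆØÅ")


def to_ascii(text: str) -> str:
    out = []
    for ch in text:
        if ch in KEEP or ord(ch) < 128:
            out.append(ch)
            continue
        # decompose accented characters (e.g. é -> e + combining accent)
        # and keep only the ASCII part of the decomposition
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed if ord(c) < 128))
    return "".join(out)


print(to_ascii("Århus, crème brûlée på caféen"))  # -> "Århus, creme brulee på cafeen"
```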

## References

## Cite this work

To cite this work, please use:

```bibtex
@inproceedings{danish-transformers,
  title = {Danish Transformers},
  author = {Tamimi-Sarnikowski, Philip},
  year = {2020},
  publisher = {{GitHub}},
  url = {https://github.com/sarnikowski}
}
```

## License


This work is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).