Transformers constitute the current paradigm in Natural Language Processing (NLP) for a wide range of downstream tasks. The number of transformers trained on Danish corpora is limited, which is why the ambition of this repository is to provide the Danish NLP community with alternatives to already established models. The pretrained models in this repository are trained using 🤗Transformers, and checkpoints are made available on the Hugging Face model hub for both PyTorch and TensorFlow.
Details on how to use the models can be found by clicking the architecture headers.
convbert-small-da-cased
: 12-layer, 256-hidden, 4-heads**

convbert-medium-small-da-cased
: 12-layer, 384-hidden, 6-heads**

electra-small-discriminator-da-256-cased
: 12-layer, 256-hidden, 4-heads

electra-small-generator-da-256-cased
: 12-layer, 64-hidden, 1-heads

** Pretrained using the ELECTRA pretraining approach.
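As a minimal sketch, a checkpoint can be loaded directly through 🤗Transformers. The hub identifier below is an assumption based on the repository namespace; substitute the model you actually want to use.

```python
# Minimal sketch: load a pretrained checkpoint and encode a Danish sentence.
# The hub identifier is an assumption based on the repository namespace.
from transformers import AutoModel, AutoTokenizer

model_name = "sarnikowski/convbert-medium-small-da-cased"  # assumed hub id

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("Der bor mange mennesker i København.", return_tensors="pt")
outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```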
All downstream task benchmarks are evaluated on finetuned versions of the transformer models.
The dataset used for benchmarking both NER and POS tagging is the Danish Dependency Treebank (UD-DDT).
All models were trained for 3 epochs on the train set.
All reported scores are averages over N=5 runs with different random seeds for each model, where σ refers to the standard deviation.
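The sketch below only illustrates how such numbers are aggregated; the per-seed scores are placeholders, and the sample standard deviation is assumed, since the source does not specify which variant σ denotes.

```python
# Illustrative only: aggregate micro-avg F1 over N=5 seed runs as mean (σ=std).
import statistics

seed_f1_scores = [83.1, 84.0, 83.4, 82.9, 83.6]  # placeholder scores, one per seed

mean_f1 = statistics.mean(seed_f1_scores)
sigma = statistics.stdev(seed_f1_scores)  # sample standard deviation (assumed)

print(f"{mean_f1:.2f} (σ={sigma:.2f})")
```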
The table below shows F1-scores on the combined test+dev set for the entities LOC, ORG, PER and MISC, averaged over N=5 runs.
Model | Params | LOC | ORG | PER | MISC | Micro AVG |
---|---|---|---|---|---|---|
bert-base-multilingual-cased | ~177M | 87.02 | 75.24 | 91.28 | 75.94 | 83.18 (σ=0.81) |
danish-bert-uncased-v2 | ~110M | 87.40 | 75.43 | 93.92 | 76.21 | 84.19 (σ=0.75) |
+++++++++++++++++++++++++++ | +++++ | ++++ | ++++ | ++++ | ++++ | +++++++++++ |
convbert-medium-small-da-cased | ~24.3M | 88.61 | 75.97 | 90.15 | 77.07 | 83.54 (σ=0.55) |
convbert-small-da-cased | ~12.9M | 85.86 | 71.21 | 89.07 | 73.50 | 80.76 (σ=0.40) |
electra-small-da-cased | ~13.3M | 86.30 | 70.05 | 88.34 | 71.31 | 79.63 (σ=0.22) |
The table below shows F1-scores on the combined test+dev set, averaged over N=5 runs.
Model | Params | Micro AVG |
---|---|---|
bert-base-multilingual-cased | ~177M | 97.42 (σ=0.09) |
danish-bert-uncased-v2 | ~110M | 98.08 (σ=0.05) |
+++++++++++++++++++++++++++ | +++++ | +++++++++++ |
convbert-medium-small-da-cased | ~24.3M | 97.92 (σ=0.03) |
convbert-small-da-cased | ~12.9M | 97.32 (σ=0.03) |
electra-small-da-cased | ~13.3M | 97.42 (σ=0.05) |
The custom Danish corpus used for pretraining was created from the following sources:
- OSCAR ~9.5GB
- Leipzig Danish corpora ~1.5GB
- Wikipedia Monolingual Corpora ~1.5GB
- OPUS ~3GB
- DaNewsroom ~2GB
All characters in the corpus were transliterated to ASCII, with the exception of æøåÆØÅ.
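A rough sketch of this kind of transliteration is shown below; the exact procedure used for the released corpus is not documented here, so this is only an approximation that keeps the Danish letters and strips diacritics from everything else.

```python
# Approximate transliteration to ASCII, keeping the Danish letters æøåÆØÅ.
import unicodedata

KEEP = set("æøåÆØÅ")

def transliterate(text: str) -> str:
    out = []
    for ch in text:
        if ch in KEEP or ord(ch) < 128:
            out.append(ch)
        else:
            # Decompose accented characters and drop whatever is still non-ASCII.
            decomposed = unicodedata.normalize("NFKD", ch)
            out.append(decomposed.encode("ascii", "ignore").decode("ascii"))
    return "".join(out)

print(transliterate("Él hedder Müller og bor på Fyn"))  # "El hedder Muller og bor på Fyn"
```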
Sources containing web-crawled data were cleaned of overrepresented NSFW ads and commercials.
The final dataset consists of 14,483,456 precomputed tensors of length 256.
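A minimal sketch of producing fixed-length inputs of 256 tokens is shown below. The actual pretraining pipeline (which presumably concatenates and chunks documents rather than padding each one) is not documented here, and the tokenizer identifier is an assumption.

```python
# Minimal sketch: encode raw text into fixed-length (256-token) tensors.
from transformers import AutoTokenizer

# Assumed hub identifier; replace with the tokenizer matching your model.
tokenizer = AutoTokenizer.from_pretrained("sarnikowski/convbert-medium-small-da-cased")

encoded = tokenizer(
    ["Første dokument i korpusset.", "Andet dokument i korpusset."],
    max_length=256,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)

print(encoded["input_ids"].shape)  # (num_examples, 256)
```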
- Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. 2020. ConvBERT: Improving BERT with Span-based Dynamic Convolution
- Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
- Pedro Javier Ortiz Suárez, Laurent Romary, Benoît Sagot. 2020. A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages
- Daniel Varab, Natalie Schluter. 2020. DaNewsroom: A Large-scale Danish Summarisation Dataset
- Rasmus Hvingelby, Amalie B. Pauli, Maria Barrett, Christina Rosted, Lasse M. Lidegaard and Anders Søgaard. 2020. DaNE: A Named Entity Resource for Danish
To cite this work, please use:
@inproceedings{danish-transformers,
title = {Danish Transformers},
author = {Tamimi-Sarnikowski, Philip},
year = {2020},
publisher = {{GitHub}},
url = {https://github.com/sarnikowski}
}
This work is licensed under a Creative Commons Attribution 4.0 International License.