37 changes: 37 additions & 0 deletions model_cards/distilbert-base-cased-README.md
@@ -1,3 +1,40 @@
---
language: en
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
---

# DistilBERT base model (cased)

This model is a distilled version of the [BERT base model](https://huggingface.co/bert-base-cased).
It was introduced in [this paper](https://arxiv.org/abs/1910.01108).
The code for the distillation process can be found
[here](https://github.com/huggingface/transformers/tree/master/examples/distillation).
This model is cased: it does make a difference between english and English.

All the details about the pre-training, the intended uses, limitations and potential biases are the same as for [DistilBERT-base-uncased](https://huggingface.co/distilbert-base-uncased).
We highly encourage you to check it out if you want to know more.
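
A minimal usage sketch with the `fill-mask` pipeline, mirroring the usage examples in the DistilBERT-base-uncased card:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='distilbert-base-cased')
>>> unmasker("Hello I'm a [MASK] model.")
```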

## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

GLUE test results:

| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
| | 81.5 | 87.8 | 88.2 | 90.4 | 47.2 | 85.5 | 85.6 | 60.6 |

### BibTeX entry and citation info

```bibtex
@article{Sanh2019DistilBERTAD,
title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
author={Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
journal={ArXiv},
year={2019},
volume={abs/1910.01108}
}
```
5 changes: 5 additions & 0 deletions model_cards/distilbert-base-cased-distilled-squad-README.md
@@ -6,3 +6,8 @@ metrics:
- squad
license: apache-2.0
---

# DistilBERT base cased distilled SQuAD

This model is a checkpoint of [DistilBERT-base-cased](https://huggingface.co/distilbert-base-cased), fine-tuned using (a second step of) knowledge distillation on SQuAD v1.1.
This model reaches an F1 score of 87.1 on the dev set (for comparison, the BERT bert-base-cased version reaches an F1 score of 88.7).
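
A minimal usage sketch, assuming the standard `question-answering` pipeline from `transformers` (the question and context below are purely illustrative):

```python
>>> from transformers import pipeline
>>> question_answerer = pipeline('question-answering', model='distilbert-base-cased-distilled-squad')
>>> question_answerer(
...     question="What was DistilBERT distilled from?",
...     context="DistilBERT is a distilled version of BERT: smaller, faster, cheaper and lighter.",
... )
```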
31 changes: 31 additions & 0 deletions model_cards/distilbert-base-multilingual-cased-README.md
@@ -1,4 +1,35 @@
---
language: multilingual
license: apache-2.0
datasets:
- wikipedia
---

# DistilBERT base multilingual model (cased)

This model is a distilled version of the [BERT base multilingual model](https://huggingface.co/bert-base-multilingual-cased). The code for the distillation process can be found
[here](https://github.com/huggingface/transformers/tree/master/examples/distillation). This model is cased: it does make a difference between english and English.

The model is trained on the concatenation of Wikipedia in 104 different languages listed [here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages).
The model has 6 layers, a hidden dimension of 768 and 12 attention heads, totaling 134M parameters (compared to 177M parameters for mBERT-base).
On average, DistilmBERT is twice as fast as mBERT-base.
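
A minimal loading sketch, assuming the generic `AutoTokenizer` / `AutoModelForMaskedLM` classes from `transformers`:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-multilingual-cased")

# Works on any of the 104 training languages; [MASK] is the model's mask token
inputs = tokenizer("Bonjour, je suis un modèle [MASK].", return_tensors="pt")
outputs = model(**inputs)
```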

We encourage you to check out the [BERT base multilingual model](https://huggingface.co/bert-base-multilingual-cased) to learn more about usage, limitations and potential biases.

| Model | English | Spanish | Chinese | German | Arabic | Urdu |
| :---: | :---: | :---: | :---: | :---: | :---: | :---:|
| mBERT base cased (computed) | 82.1 | 74.6 | 69.1 | 72.3 | 66.4 | 58.5 |
| mBERT base uncased (reported)| 81.4 | 74.3 | 63.8 | 70.5 | 62.1 | 58.3 |
| DistilmBERT | 78.2 | 69.1 | 64.0 | 66.3 | 59.1 | 54.7 |

### BibTeX entry and citation info

```bibtex
@article{Sanh2019DistilBERTAD,
title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
author={Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
journal={ArXiv},
year={2019},
volume={abs/1910.01108}
}
```
10 changes: 5 additions & 5 deletions model_cards/distilbert-base-uncased-README.md
@@ -10,7 +10,7 @@ datasets:

# DistilBERT base model (uncased)

This model is a distilled version of the [BERT base mode](https://huggingface.co/distilbert-base-uncased). It was
This model is a distilled version of the [BERT base model](https://huggingface.co/bert-base-uncased). It was
introduced in [this paper](https://arxiv.org/abs/1910.01108). The code for the distillation process can be found
[here](https://github.com/huggingface/transformers/tree/master/examples/distillation). This model is uncased: it does
not make a difference between english and English.
@@ -102,7 +102,7 @@ output = model(encoded_input)

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
predictions. It also inherits some of
[the bias of its teacher model](https://huggingface.co/bert-base-uncased#limitations-and-bias).

```python
>>> from transformers import pipeline
@@ -196,9 +196,9 @@ When fine-tuned on downstream tasks, this model achieves the following results:

GLUE test results:

| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
| | 82.2 | 88.5 | 89.2 | 91.3 | 51.3 | 85.8 | 87.5 | 59.9 | 77.0 |
| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
| | 82.2 | 88.5 | 89.2 | 91.3 | 51.3 | 85.8 | 87.5 | 59.9 |


### BibTeX entry and citation info
6 changes: 6 additions & 0 deletions model_cards/distilbert-base-uncased-distilled-squad-README.md
@@ -1,4 +1,5 @@
---
language: en
datasets:
- squad
widget:
@@ -8,3 +9,8 @@ widget:
context: "The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain \"Amazonas\" in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species."
license: apache-2.0
---

# DistilBERT base uncased distilled SQuAD

This model is a checkpoint of [DistilBERT-base-uncased](https://huggingface.co/distilbert-base-uncased), fine-tuned using (a second step of) knowledge distillation on SQuAD v1.1.
This model reaches an F1 score of 86.9 on the dev set (for comparison, the BERT bert-base-uncased version reaches an F1 score of 88.5).
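
A minimal sketch of running the model directly for extractive question answering, assuming the generic `AutoTokenizer` / `AutoModelForQuestionAnswering` classes from `transformers` (the greedy span selection below is a simplified version of what the `question-answering` pipeline does):

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-distilled-squad")
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased-distilled-squad")

question = "Which continent is the Amazon rainforest in?"
context = "The Amazon rainforest is a moist broadleaf forest that covers most of the Amazon basin of South America."

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Greedily pick the most likely start and end positions and decode that span
start = outputs.start_logits.argmax().item()
end = outputs.end_logits.argmax().item()
answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])
```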
@@ -1,3 +1,19 @@
---
language: en
license: apache-2.0
datasets:
- sst-2
---

# DistilBERT base uncased finetuned SST-2

This model is a checkpoint of [DistilBERT-base-uncased](https://huggingface.co/distilbert-base-uncased), fine-tuned on SST-2.
This model reaches an accuracy of 91.3 on the dev set (for comparison, the BERT bert-base-uncased version reaches an accuracy of 92.7).

## Fine-tuning hyper-parameters

The following hyper-parameters were used for fine-tuning (see the sketch after the list for one way they could map onto a `transformers` training setup):

- learning_rate = 1e-5
- batch_size = 32
- warmup = 600
- max_seq_length = 128
- num_train_epochs = 3.0
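
As a rough, hypothetical sketch of how these hyper-parameters could be expressed with the `transformers` `Trainer` API (an illustration, not the exact script used to produce this checkpoint; the batch size is interpreted as per-device and `warmup` as warmup steps):

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # binary sentiment labels for SST-2
)

# max_seq_length is applied at tokenization time rather than in TrainingArguments;
# the GLUE SST-2 split stores the text under the "sentence" column
def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

training_args = TrainingArguments(
    output_dir="distilbert-base-uncased-finetuned-sst-2",
    learning_rate=1e-5,
    per_device_train_batch_size=32,   # batch_size = 32
    warmup_steps=600,                 # warmup = 600
    num_train_epochs=3.0,
)
```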
11 changes: 11 additions & 0 deletions model_cards/distilgpt2-README.md
@@ -1,10 +1,21 @@
---
language: en
tags:
- exbert

license: apache-2.0
datasets:
- openwebtext
---

# DistilGPT2

DistilGPT2 is an English-language model pre-trained with the supervision of [GPT2](https://huggingface.co/gpt2) (the smallest version of GPT2) on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset. The model has 6 layers, a hidden dimension of 768 and 12 attention heads, totaling 82M parameters (compared to 124M parameters for GPT2). On average, DistilGPT2 is twice as fast as GPT2.

On the [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark, GPT2 reaches a perplexity of 16.3 on the test set, compared to 21.1 for DistilGPT2 (after fine-tuning on the train set).

We encourage you to check out [GPT2](https://huggingface.co/gpt2) to learn more about usage, limitations and potential biases.
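
A minimal generation sketch, assuming the standard `text-generation` pipeline from `transformers` (generated text will vary from run to run unless you fix the seed):

```python
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='distilgpt2')
>>> set_seed(42)
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=3)
```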

<a href="https://huggingface.co/exbert/?model=distilgpt2">
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>
40 changes: 40 additions & 0 deletions model_cards/distilroberta-base-README.md
@@ -1,10 +1,50 @@
---
language: en
tags:
- exbert

license: apache-2.0
datasets:
- openwebtext
---

# DistilRoBERTa base model

This model is a distilled version of the [RoBERTa-base model](https://huggingface.co/roberta-base). It follows the same training procedure as [DistilBERT](https://huggingface.co/distilbert-base-uncased).
The code for the distillation process can be found [here](https://github.com/huggingface/transformers/tree/master/examples/distillation).
This model is case-sensitive: it makes a difference between english and English.

The model has 6 layers, a hidden dimension of 768 and 12 attention heads, totaling 82M parameters (compared to 125M parameters for RoBERTa-base).
On average, DistilRoBERTa is twice as fast as RoBERTa-base.

We encourage you to check out the [RoBERTa-base model](https://huggingface.co/roberta-base) to learn more about usage, limitations and potential biases.
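
A minimal usage sketch with the `fill-mask` pipeline (note that, like RoBERTa, this model uses `<mask>` as its mask token):

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='distilroberta-base')
>>> unmasker("Hello I'm a <mask> model.")
```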

## Training data

DistilRoBERTa was pre-trained on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset (roughly 4 times less training data than was used for the teacher RoBERTa model).

## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

GLUE test results:

| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
| | 84.0 | 89.4 | 90.8 | 92.5 | 59.3 | 88.3 | 86.6 | 67.9 |

### BibTeX entry and citation info

```bibtex
@article{Sanh2019DistilBERTAD,
title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
author={Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
journal={ArXiv},
year={2019},
volume={abs/1910.01108}
}
```

<a href="https://huggingface.co/exbert/?model=distilroberta-base">
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>