diff --git a/model_cards/distilbert-base-cased-README.md b/model_cards/distilbert-base-cased-README.md
index 154df8298fab..184ee3acc4ec 100644
--- a/model_cards/distilbert-base-cased-README.md
+++ b/model_cards/distilbert-base-cased-README.md
@@ -1,3 +1,40 @@
 ---
+language: en
 license: apache-2.0
+datasets:
+- bookcorpus
+- wikipedia
 ---
+
+# DistilBERT base model (cased)
+
+This model is a distilled version of the [BERT base model](https://huggingface.co/bert-base-cased).
+It was introduced in [this paper](https://arxiv.org/abs/1910.01108).
+The code for the distillation process can be found
+[here](https://github.com/huggingface/transformers/tree/master/examples/distillation).
+This model is cased: it does make a difference between english and English.
+
+All the details on the pre-training, uses, limitations and potential biases are the same as for [DistilBERT-base-uncased](https://huggingface.co/distilbert-base-uncased).
+We highly encourage you to check it out if you want to know more.
+
+## Evaluation results
+
+When fine-tuned on downstream tasks, this model achieves the following results:
+
+GLUE test results:
+
+| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
+|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
+| | 81.5 | 87.8 | 88.2 | 90.4 | 47.2 | 85.5 | 85.6 | 60.6 |
+
+### BibTeX entry and citation info
+
+```bibtex
+@article{Sanh2019DistilBERTAD,
+  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
+  author={Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
+  journal={ArXiv},
+  year={2019},
+  volume={abs/1910.01108}
+}
+```
diff --git a/model_cards/distilbert-base-cased-distilled-squad-README.md b/model_cards/distilbert-base-cased-distilled-squad-README.md
index e053e3a594ae..2f92ff7ae9e7 100644
--- a/model_cards/distilbert-base-cased-distilled-squad-README.md
+++ b/model_cards/distilbert-base-cased-distilled-squad-README.md
@@ -6,3 +6,8 @@ metrics:
 - squad
 license: apache-2.0
 ---
+
+# DistilBERT base cased distilled SQuAD
+
+This model is a fine-tuned checkpoint of [DistilBERT-base-cased](https://huggingface.co/distilbert-base-cased), fine-tuned using (a second step of) knowledge distillation on SQuAD v1.1.
+This model reaches an F1 score of 87.1 on the dev set (for comparison, the BERT bert-base-cased version reaches an F1 score of 88.7).
diff --git a/model_cards/distilbert-base-multilingual-cased-README.md b/model_cards/distilbert-base-multilingual-cased-README.md
index 6db12d45e518..2fa58c2575a7 100644
--- a/model_cards/distilbert-base-multilingual-cased-README.md
+++ b/model_cards/distilbert-base-multilingual-cased-README.md
@@ -1,4 +1,35 @@
 ---
 language: multilingual
 license: apache-2.0
+datasets:
+- wikipedia
 ---
+
+# DistilBERT base multilingual model (cased)
+
+This model is a distilled version of the [BERT base multilingual model](https://huggingface.co/bert-base-multilingual-cased). The code for the distillation process can be found
+[here](https://github.com/huggingface/transformers/tree/master/examples/distillation). This model is cased: it does make a difference between english and English.
+
+The model is trained on the concatenation of Wikipedia in 104 different languages listed [here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages).
+The model has 6 layers, 768 hidden dimensions and 12 heads, totaling 134M parameters (compared to 177M parameters for mBERT-base).
+On average, DistilmBERT is twice as fast as mBERT-base.
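+
+The model can be used directly with a pipeline for masked language modeling; the snippet below is a minimal sketch (the example sentence is only illustrative):
+
+```python
+>>> from transformers import pipeline
+
+>>> # Load the multilingual checkpoint directly from the Hub
+>>> unmasker = pipeline("fill-mask", model="distilbert-base-multilingual-cased")
+
+>>> # The tokenizer uses the WordPiece [MASK] token; the top predictions are returned as a list of dicts
+>>> unmasker("Hello, I'm a [MASK] model.")
+```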
+
+We encourage you to check out the [BERT base multilingual model](https://huggingface.co/bert-base-multilingual-cased) to learn more about usage, limitations and potential biases.
+
+Here are the results on the test sets for 6 of the languages available in XNLI:
+
+| Model | English | Spanish | Chinese | German | Arabic | Urdu |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---:|
+| mBERT base cased (computed) | 82.1 | 74.6 | 69.1 | 72.3 | 66.4 | 58.5 |
+| mBERT base uncased (reported)| 81.4 | 74.3 | 63.8 | 70.5 | 62.1 | 58.3 |
+| DistilmBERT | 78.2 | 69.1 | 64.0 | 66.3 | 59.1 | 54.7 |
+
+### BibTeX entry and citation info
+
+```bibtex
+@article{Sanh2019DistilBERTAD,
+  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
+  author={Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
+  journal={ArXiv},
+  year={2019},
+  volume={abs/1910.01108}
+}
+```
diff --git a/model_cards/distilbert-base-uncased-README.md b/model_cards/distilbert-base-uncased-README.md
index 2fda2b283e5a..9b4358201c6b 100644
--- a/model_cards/distilbert-base-uncased-README.md
+++ b/model_cards/distilbert-base-uncased-README.md
@@ -10,7 +10,7 @@ datasets:
 
 # DistilBERT base model (uncased)
 
-This model is a distilled version of the [BERT base mode](https://huggingface.co/distilbert-base-uncased). It was
+This model is a distilled version of the [BERT base model](https://huggingface.co/bert-base-uncased). It was
 introduced in [this paper](https://arxiv.org/abs/1910.01108). The code for the distillation process can be found
 [here](https://github.com/huggingface/transformers/tree/master/examples/distillation). This model is uncased: it does
 not make a difference between english and English.
@@ -102,7 +102,7 @@ output = model(encoded_input)
 
 Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
 predictions. It also inherits some of
-[the bias of its teacher model](https://huggingface.co/bert-base-uncased#limitations-and-bias).
+[the bias of its teacher model](https://huggingface.co/bert-base-uncased#limitations-and-bias).
 
 ```python
 >>> from transformers import pipeline
@@ -196,9 +196,9 @@ When fine-tuned on downstream tasks, this model achieves the following results:
 
 Glue test results:
 
-| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
-|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
-| | 82.2 | 88.5 | 89.2 | 91.3 | 51.3 | 85.8 | 87.5 | 59.9 | 77.0 |
+| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
+|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
+| | 82.2 | 88.5 | 89.2 | 91.3 | 51.3 | 85.8 | 87.5 | 59.9 |
 
 ### BibTeX entry and citation info
 
diff --git a/model_cards/distilbert-base-uncased-distilled-squad-README.md b/model_cards/distilbert-base-uncased-distilled-squad-README.md
index 228cdf13aaec..6765229e6280 100644
--- a/model_cards/distilbert-base-uncased-distilled-squad-README.md
+++ b/model_cards/distilbert-base-uncased-distilled-squad-README.md
@@ -1,4 +1,5 @@
 ---
+language: en
 datasets:
 - squad
 widget:
@@ -8,3 +9,8 @@ widget:
   context: "The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain \"Amazonas\" in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species."
 license: apache-2.0
 ---
+
+# DistilBERT base uncased distilled SQuAD
+
+This model is a fine-tuned checkpoint of [DistilBERT-base-uncased](https://huggingface.co/distilbert-base-uncased), fine-tuned using (a second step of) knowledge distillation on SQuAD v1.1.
+This model reaches an F1 score of 86.9 on the dev set (for comparison, the BERT bert-base-uncased version reaches an F1 score of 88.5).
diff --git a/model_cards/distilbert-base-uncased-finetuned-sst-2-english-README.md b/model_cards/distilbert-base-uncased-finetuned-sst-2-english-README.md
index 154df8298fab..d33b5862630e 100644
--- a/model_cards/distilbert-base-uncased-finetuned-sst-2-english-README.md
+++ b/model_cards/distilbert-base-uncased-finetuned-sst-2-english-README.md
@@ -1,3 +1,19 @@
 ---
+language: en
 license: apache-2.0
+datasets:
+- sst-2
 ---
+
+# DistilBERT base uncased finetuned SST-2
+
+This model is a fine-tuned checkpoint of [DistilBERT-base-uncased](https://huggingface.co/distilbert-base-uncased), fine-tuned on SST-2.
+This model reaches an accuracy of 91.3 on the dev set (for comparison, the BERT bert-base-uncased version reaches an accuracy of 92.7).
+
+## Fine-tuning hyper-parameters
+
+- learning_rate = 1e-5
+- batch_size = 32
+- warmup = 600
+- max_seq_length = 128
+- num_train_epochs = 3.0
diff --git a/model_cards/distilgpt2-README.md b/model_cards/distilgpt2-README.md
index caa244e364ac..41e1a5a1e758 100644
--- a/model_cards/distilgpt2-README.md
+++ b/model_cards/distilgpt2-README.md
@@ -1,10 +1,21 @@
 ---
+language: en
 tags:
 - exbert
 license: apache-2.0
+datasets:
+- openwebtext
 ---
+# DistilGPT2
+
+DistilGPT2 is an English language model pretrained with the supervision of [GPT2](https://huggingface.co/gpt2) (the smallest version of GPT2) on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset. The model has 6 layers, 768 hidden dimensions and 12 heads, totaling 82M parameters (compared to 124M parameters for GPT2). On average, DistilGPT2 is two times faster than GPT2.
+
+On the [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark, GPT2 reaches a perplexity of 16.3 on the test set, compared to 21.1 for DistilGPT2 (after fine-tuning on the train set).
+
+We encourage you to check out [GPT2](https://huggingface.co/gpt2) to learn more about usage, limitations and potential biases.
+
diff --git a/model_cards/distilroberta-base-README.md b/model_cards/distilroberta-base-README.md
index 6c518b4522a5..18bbbb860874 100644
--- a/model_cards/distilroberta-base-README.md
+++ b/model_cards/distilroberta-base-README.md
@@ -1,10 +1,50 @@
 ---
+language: en
 tags:
 - exbert
 license: apache-2.0
+datasets:
+- openwebtext
 ---
+# DistilRoBERTa base model
+
+This model is a distilled version of the [RoBERTa-base model](https://huggingface.co/roberta-base). It follows the same training procedure as [DistilBERT](https://huggingface.co/distilbert-base-uncased).
+The code for the distillation process can be found [here](https://github.com/huggingface/transformers/tree/master/examples/distillation).
+This model is case-sensitive: it makes a difference between english and English.
+
+The model has 6 layers, 768 hidden dimensions and 12 heads, totaling 82M parameters (compared to 125M parameters for RoBERTa-base).
+On average, DistilRoBERTa is twice as fast as RoBERTa-base.
+
+We encourage you to check out the [RoBERTa-base model](https://huggingface.co/roberta-base) to learn more about usage, limitations and potential biases.
+
+## Training data
+
+DistilRoBERTa was pre-trained on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset (roughly 4 times less training data than the teacher RoBERTa was trained on).
+
+## Evaluation results
+
+When fine-tuned on downstream tasks, this model achieves the following results:
+
+GLUE test results:
+
+| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
+|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
+| | 84.0 | 89.4 | 90.8 | 92.5 | 59.3 | 88.3 | 86.6 | 67.9 |
+
+### BibTeX entry and citation info
+
+```bibtex
+@article{Sanh2019DistilBERTAD,
+  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
+  author={Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
+  journal={ArXiv},
+  year={2019},
+  volume={abs/1910.01108}
+}
+```
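+
+As with the other DistilBERT checkpoints, the model can be used with a masked-language-modeling pipeline. The snippet below is a minimal sketch (the example sentence is only illustrative; RoBERTa-style tokenizers use `<mask>` rather than `[MASK]`):
+
+```python
+>>> from transformers import pipeline
+
+>>> # Load DistilRoBERTa directly from the Hub
+>>> unmasker = pipeline("fill-mask", model="distilroberta-base")
+
+>>> # The top predictions for the <mask> position are returned as a list of dicts
+>>> unmasker("Hello I'm a <mask> model.")
+```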