We provide documentation on how to download and prepare the GLUE and SuperGLUE benchmarks.
These benchmarks share the common goal of providing a robust set of downstream tasks for evaluating the performance of NLP models.
In essence, these NLP tasks share a similar structure, which raises the question: can we design a model that solves all of these tasks at once? BERT did a good job of unifying the way text data is featurized: we extract two types of embeddings, one for the whole sentence and one for each token in the sentence. Later, in T5, the authors proposed to convert every task into a text-to-text problem. However, tasks such as sentence similarity or named-entity recognition are difficult to cast as text-to-text, because they involve real values or entity spans that are hard to encode as raw text.
In GluonNLP, we propose a unified way to tackle these NLP problems. We convert these datasets into tables. Each column in a table holds 1) raw text, 2) an entity or list of entities associated with the raw text, or 3) a numerical value or list of numerical values. In addition, we keep a metadata object that describes 1) the relationships among columns and 2) certain properties of the columns.
All tasks used in these general benchmarks are converted to this format.
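As a concrete illustration, the sketch below builds such a table for a WiC-style word-sense task, where the two marked words are stored as character spans rather than raw text. The column names, span encoding, and metadata fields here are hypothetical and only illustrate the idea, not GluonNLP's exact schema.

```python
import pandas as pd

# Hypothetical example of the tabular representation: each row pairs two
# sentences, marks one entity (a character span) in each, and stores a label.
df = pd.DataFrame({
    "sentence1": ["The bank raised its interest rates."],
    "sentence2": ["We had a picnic on the river bank."],
    "entities1": [(4, 8)],    # character span of "bank" in sentence1 (assumed encoding)
    "entities2": [(29, 33)],  # character span of "bank" in sentence2 (assumed encoding)
    "label": [0],             # 0 = different sense, 1 = same sense
})

# Hypothetical metadata describing column types and the relationships among columns.
metadata = {
    "sentence1": {"type": "text"},
    "sentence2": {"type": "text"},
    "entities1": {"type": "entity", "parent": "sentence1"},
    "entities2": {"type": "entity", "parent": "sentence2"},
    "label": {"type": "categorical", "num_classes": 2},
}
```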
The details of the GLUE benchmark are described in the GLUE paper.
To obtain the datasets, run:
nlp_data prepare_glue --benchmark glue
There will be multiple task folders. All data are converted into pandas dataframes, plus an additional `metadata.json` object.
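A minimal loading sketch is shown below; the exact file names and on-disk format depend on the version of the preparation script, so treat the paths here as assumptions and check the generated task folder.

```python
import json
import pandas as pd

# Assumed layout of a prepared task folder (e.g. glue/sts); the actual file
# names produced by `nlp_data prepare_glue` may differ.
task_dir = "glue/sts"

train_df = pd.read_parquet(f"{task_dir}/train.parquet")  # hypothetical file name
with open(f"{task_dir}/metadata.json") as f:
    metadata = json.load(f)

print(train_df.columns.tolist())  # e.g. ['sentence1', 'sentence2', 'score']
print(metadata)
```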
Here are the details of the datasets:
Dataset | #Train | #Dev | #Test | Columns | Task | Metrics | Domain |
---|---|---|---|---|---|---|---|
CoLA | 8.5k | 1k | 1k | sentence, label | acceptability (0 / 1) | Matthews corr. | misc. |
SST-2 | 67k | 872 | 1.8k | sentence, label | sentiment | acc. | movie reviews |
MRPC | 3.7k | 408 | 1.7k | sentence1, sentence2, label | paraphrase | acc./F1 | news |
STS-B | 5.7k | 1.5k | 1.4k | sentence1, sentence2, score | sentence similarity | Pearson/Spearman corr. | misc. |
QQP | 364k | 40k | 391k | sentence1, sentence2, label | paraphrase | acc./F1 | social QA questions |
MNLI | 393k | 9.8k(m) / 9.8k(mm) | 9.8k(m) / 9.8k(mm) | sentence1, sentence2, genre, label | NLI | matched acc./mismatched acc. | misc. |
QNLI | 105k | 5.4k | 5.4k | question, sentence, label | QA/NLI | acc. | Wikipedia |
RTE | 2.5k | 227 | 3k | sentence1, sentence2, label | NLI | acc. | news, Wikipedia |
WNLI | 634 | 71 | 146 | sentence1, sentence2, label | NLI | acc. | fiction books |
In addition, GLUE includes a diagnostic task that analyzes a system's performance on a broad range of linguistic phenomena; it is best described in GLUE Diagnostic. The diagnostic dataset is based on Natural Language Inference (NLI), and you should run the model trained on MNLI on this dataset; a sketch of computing its metric follows the table below.
Dataset | #Samples | Columns | Metrics |
---|---|---|---|
Diagnostic | 1104 | semantics, predicate, logic, knowledge, domain, premise, hypothesis, label | Matthews corr. |
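Since the diagnostic set is scored with Matthews correlation, a minimal sketch of the evaluation step (using scikit-learn, with hypothetical prediction arrays standing in for real model output) looks like this:

```python
from sklearn.metrics import matthews_corrcoef

# y_true: gold NLI labels of the diagnostic set; y_pred: predictions from a
# model trained on MNLI. Both are hypothetical placeholders here.
y_true = [0, 1, 2, 1, 0, 2]
y_pred = [0, 1, 2, 0, 0, 2]

print(matthews_corrcoef(y_true, y_pred))
```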
In addition, we provide the SNLI dataset, which is recommended as an auxiliary data source when training on MNLI, following the recommendation in GLUE. A sketch of combining the two training sets follows the table below.
Dataset | #Train | #Test | Columns | Task | Metrics | Domain |
---|---|---|---|---|---|---|
SNLI | 549k | 20k | sentence1, sentence2, label | NLI | acc. | misc. |
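Below is a minimal sketch of mixing the two training sets with pandas, assuming both tasks have already been prepared and share the sentence1/sentence2/label columns (the file paths and names are hypothetical):

```python
import pandas as pd

# Hypothetical paths to the prepared dataframes.
mnli_train = pd.read_parquet("glue/mnli/train.parquet")
snli_train = pd.read_parquet("glue/snli/train.parquet")

# Keep only the columns shared by both tasks (MNLI additionally has `genre`),
# then concatenate into one larger training set.
columns = ["sentence1", "sentence2", "label"]
combined = pd.concat([mnli_train[columns], snli_train[columns]], ignore_index=True)
print(len(combined))
```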
The details of the SuperGLUE benchmark are described in the SuperGLUE paper.
To obtain the datasets, run:
nlp_data prepare_glue --benchmark superglue
Dataset | #Train | #Dev | #Test | Columns | Task | Metrics | Domain |
---|---|---|---|---|---|---|---|
BoolQ | 9.4k | 3.3k | 3.2k | passage, question, label | QA | acc. | Google queries, Wikipedia |
CB | 250 | 57 | 250 | premise, hypothesis, label | NLI | acc./F1 | various |
COPA | 400 | 100 | 500 | premise, choice1, choice2, question, label | QA | acc. | blogs, photography encyclopedia |
MultiRC* | 5.1k (27k) | 953 (4.8k) | 1.8k (9.7k) | passage, question, answer, label | QA | F1/EM | various |
ReCoRD | 101k | 10k | 10k | source, text, entities, query, answers | QA | F1/EM | news |
RTE | 2.5k | 278 | 3k | premise, hypothesis, label | NLI | acc. | news, Wikipedia |
WiC | 6k | 638 | 1.4k | sentence1, sentence2, entities1, entities2, label | WSD | acc. | WordNet, VerbNet, Wiktionary |
WSC | 554 | 104 | 146 | text, entities, label | coref. | acc. | fiction books |
*Note that for MultiRC, we enumerate all (passage, question, answer) triplets in the dataset; the number of samples in this expanded format is given in parentheses.
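The expansion can be reproduced with a short sketch like the one below, assuming the raw MultiRC JSON structure of a passage containing a list of questions, each with a list of candidate answers (the field names follow the original release but should be treated as assumptions here):

```python
import pandas as pd

def expand_multirc(examples):
    """Flatten nested MultiRC examples into (passage, question, answer, label) rows."""
    rows = []
    for ex in examples:
        passage = ex["passage"]["text"]
        for q in ex["passage"]["questions"]:
            for a in q["answers"]:
                rows.append({
                    "passage": passage,
                    "question": q["question"],
                    "answer": a["text"],
                    "label": a.get("label"),  # absent in the unlabeled test split
                })
    return pd.DataFrame(rows)
```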
Similar to GLUE, SuperGLUE has two diagnostic tasks that analyze a system's performance on a broad range of linguistic phenomena. For more details, see SuperGLUE Diagnostic.
Dataset | #Samples | Columns | Metrics |
---|---|---|---|
Winogender | 356 | hypothesis, premise, label | acc. |
Broadcoverage | 1104 | label, sentence1, sentence2, logic | Matthews corr. |
We also provide a script to download a series of text classification datasets for benchmarking purposes. We select classical datasets that are also used in Character-level Convolutional Networks for Text Classification (NeurIPS 2015) and Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing (arXiv 2020).
Dataset | #Train | #Test | Columns | Metrics |
---|---|---|---|---|
AG | 120,000 | 7,600 | content, label | acc |
IMDB | 25,000 | 25,000 | content, label | acc |
DBpedia | 560,000 | 70,000 | content, label | acc |
Yelp2 | 560,000 | 38,000 | content, label | acc |
Yelp5 | 650,000 | 50,000 | content, label | acc |
Amazon2 | 3,600,000 | 400,000 | content, label | acc |
Amazon5 | 3,000,000 | 650,000 | content, label | acc |
To obtain the datasets, run:
nlp_data prepare_text_classification -t all