Machine translated multilingual STS benchmark dataset.
This dataset contains machine translations of the STSbenchmark dataset into several languages, together with the English original. The translations were produced with deepl.com.
- Available languages are: de, en, es, fr, it, ja, nl, pl, pt, ru, zh
- Dataset splits are called: train, dev, test
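
As a small illustration, the language codes and split names listed above can be combined to enumerate every subset of the dataset. This is only a sketch: the names `LANGUAGES`, `SPLITS`, and `all_subsets` are illustrative and not part of the dataset.

```python
from itertools import product

# Language codes and split names as listed above.
LANGUAGES = ["de", "en", "es", "fr", "it", "ja", "nl", "pl", "pt", "ru", "zh"]
SPLITS = ["train", "dev", "test"]


def all_subsets():
    """Return every (language, split) combination as a list of tuples."""
    return list(product(LANGUAGES, SPLITS))
```

Calling `all_subsets()` gives one tuple per language and split, which can be handy when looping over the whole dataset.
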
The dataset can be used to train sentence embedding models such as T-Systems-onsite/cross-en-de-roberta-sentence-transformer.
Please open an issue if you have questions or want to report problems.
This dataset provides pairs of sentences and a score of their similarity.
| score | 2 example sentences | explanation |
|---|---|---|
| 5 | The bird is bathing in the sink. Birdie is washing itself in the water basin. | The two sentences are completely equivalent, as they mean the same thing. |
| 4 | Two boys on a couch are playing video games. Two boys are playing a video game. | The two sentences are mostly equivalent, but some unimportant details differ. |
| 3 | John said he is considered a witness but not a suspect. “He is not a suspect anymore.” John said. | The two sentences are roughly equivalent, but some important information differs/missing. |
| 2 | They flew out of the nest in groups. They flew into the nest together. | The two sentences are not equivalent, but share some details. |
| 1 | The woman is playing the violin. The young lady enjoys listening to the guitar. | The two sentences are not equivalent, but are on the same topic. |
| 0 | The black dog is running through the snow. A race car driver is driving his car through the mud. | The two sentences are completely dissimilar. |
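
The scores in the table run from 0 (completely dissimilar) to 5 (completely equivalent). Some training setups expect similarity labels between 0 and 1 instead, for example when the label is compared against a cosine similarity. A minimal sketch of that rescaling, assuming all scores lie in the range shown in the table; `normalize_score` is a hypothetical helper name and not part of the dataset:

```python
def normalize_score(score, max_score=5.0):
    """Rescale a similarity score from [0, max_score] to [0, 1]."""
    if not 0.0 <= score <= max_score:
        raise ValueError(f"score {score} is outside [0, {max_score}]")
    return score / max_score
```

Whether rescaling is needed at all depends on the loss function used for training.
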
- folder `raw-data`: the raw data as translated with deepl.com
- folder `data`: the converted data with the columns sentence1, sentence2, similarity_score
- `convert.py`: script to convert the data from `raw-data` to `data`
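
The files in the data folder can be read with Python's csv module:

    import csv

    # filepath must point to one of the files in the data folder.
    with open(filepath, newline="", encoding="utf-8") as csvfile:
        # Column names are passed explicitly instead of being read from the file.
        csv_dict_reader = csv.DictReader(
            csvfile,
            dialect="excel",
            fieldnames=["sentence1", "sentence2", "similarity_score"],
        )
        for row in csv_dict_reader:
            print(row)
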
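
`csv.DictReader` yields every field as a string, so the similarity score has to be converted before it can be used as a number. A minimal sketch, assuming the same three-column layout as in the snippet above; `load_pairs` is a hypothetical helper name and not part of the repository:

```python
import csv


def load_pairs(filepath):
    """Read one data file into a list of (sentence1, sentence2, score) tuples."""
    pairs = []
    with open(filepath, newline="", encoding="utf-8") as csvfile:
        reader = csv.DictReader(
            csvfile,
            dialect="excel",
            fieldnames=["sentence1", "sentence2", "similarity_score"],
        )
        for row in reader:
            # DictReader returns strings; convert the score to a float.
            pairs.append(
                (row["sentence1"], row["sentence2"], float(row["similarity_score"]))
            )
    return pairs
```

Returning plain tuples keeps the sketch free of extra dependencies; the list can then be handed to whatever training code is in use.
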
| Language | 1st train | 1000th train | last train | 1st dev | 1000th dev | last dev | 1st test | 1000th test | last test |
|---|---|---|---|---|---|---|---|---|---|
| de | ok | ok | ok | ok | ok | ok | ok | ok | ok |
| en | ok | ok | ok | ok | ok | ok | ok | ok | ok |
| es | | | | | | | | | |
| fr | | | | | | | | | |
| it | | | | | | | | | |
| ja | | | | | | | | | |
| nl | ok | ok | partially English | ok | ok | ok | ok | ok | poor grammar |
| pl | | | | | | | | | |
| pt | | | | | | | | | |
| ru | | | | | | | | | |
| zh | | | | | | | | | |
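
The column headings suggest that the first, the 1000th, and the last sentence pair of each split were inspected by hand. Assuming that reading is right, the rows to inspect could be picked as sketched here. `spot_check_rows` is a hypothetical helper, not part of the repository, and works on any list of already loaded rows:

```python
def spot_check_rows(rows):
    """Return the first, the 1000th, and the last row of a split.

    The 1000th entry is None when the split has fewer than 1000 rows.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    thousandth = rows[999] if len(rows) >= 1000 else None
    return {"first": rows[0], "1000th": thousandth, "last": rows[-1]}
```
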