Merged
22 changes: 22 additions & 0 deletions README.md
@@ -379,6 +379,28 @@ results = mteb.load_results(models=models, tasks=tasks)
df = results_to_dataframe(results)
```

</details>


<details>
<summary> Annotate Contamination in the training data of a model </summary>

### Annotate Contamination

Have you found contamination in the training data of a model? Please let us know, either by opening an issue or, ideally, by submitting a PR
annotating the training datasets of the model:

```py
model_w_contamination = ModelMeta(
    name="model-with-contamination",
    ...
    training_datasets={"ArguAna":  # name of the dataset within MTEB
                       ["test"]},  # the splits that have been trained on
    ...
)
```


</details>

<details>
7 changes: 4 additions & 3 deletions mteb/model_meta.py
@@ -72,8 +72,9 @@ class ModelMeta(BaseModel):
in the Latin script.
use_instructions: Whether the model uses instructions E.g. for prompt-based models. This also include models that require a specific format for
input such as "query: {document}" or "passage: {document}".
zero_shot_benchmarks: A list of benchmarks on which the model has been evaluated in a zero-shot setting. By default we assume that all models
are evaluated non-zero-shot unless specified otherwise.
training_datasets: A dictionary of datasets that the model was trained on. Names should match the task names as they appear in `mteb`, for example
{"ArguAna": ["test"]} if the model is trained on the ArguAna test set. This field is used to determine whether a model generalizes zero-shot to
a benchmark, as well as to mark dataset contamination.
adapted_from: Name of the model from which this model is adapted from. For quantizations, fine-tunes, long doc extensions, etc.
superseded_by: Name of the model that supersedes this model, e.g. nvidia/NV-Embed-v2 supersedes v1.
"""
@@ -97,7 +98,7 @@ class ModelMeta(BaseModel):
reference: STR_URL | None = None
similarity_fn_name: DISTANCE_METRICS | None = None
use_instructions: bool | None = None
zero_shot_benchmarks: list[str] | None = None
training_datasets: dict[str, list[str]] | None = None
Collaborator:

Will the dataset name be validated somehow, e.g. must match with one of the existing dataset names within MTEB?

If so, where should that validation be?

Member:

I think this is hard to validate because there might be datasets intended only for training that overlap with some tasks

Collaborator:

@Samoed By validation, I'm referring to whether the entered dataset name is from the list of tasks in MTEB. Right now, this field accepts any dataset name, e.g. "ABCDE", but such a dataset does not exist in MTEB, and I feel that should be an invalid entry in the model meta. Could you elaborate on what "that overlap with some tasks" means?

Member:

For example, if a model is trained on the NLLB dataset, some bitext mining datasets might potentially leak into the training data, but it's hard to determine which ones. Similarly, if a model is trained on Wikipedia, many datasets could have overlapping examples in the training data. Identifying these overlaps would require extensive testing to pinpoint where the leaks occur.

Contributor Author:

It would be great to have validation, though I could imagine that people would also add datasets here that are not in mteb.

adapted_from: str | None = None
superseded_by: str | None = None
