
[v2] Create AnySTS #2599

Merged

Samoed merged 8 commits into v2.0.0 from refactor_sts on May 3, 2025

Conversation

@Samoed (Member) commented Apr 30, 2025

Code Quality

  • Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

  • Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

  • New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
  • Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

Adding datasets checklist

Reason for dataset addition: ...

  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models achieve close to perfect scores) nor random (both models achieve close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
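The stratified subsampling mentioned in the checklist can be sketched as follows. This is a stdlib-only illustration of the idea behind mteb's self.stratified_subsampling() helper; the function name and signature below are hypothetical and do not reflect mteb's actual API.

```python
import random
from collections import defaultdict

# Hypothetical stand-in for mteb's self.stratified_subsampling() helper;
# names and signature here are illustrative, not mteb's actual API.
def stratified_subsample(examples, labels, n_samples, seed=42):
    """Keep roughly n_samples examples while preserving label proportions."""
    by_label = defaultdict(list)
    for ex, lab in zip(examples, labels):
        by_label[lab].append(ex)
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    total = len(examples)
    subsample = []
    for group in by_label.values():
        # allocate each label a share proportional to its frequency
        k = max(1, round(n_samples * len(group) / total))
        subsample.extend(rng.sample(group, min(k, len(group))))
    return subsample

data = list(range(100))
labels = [i % 2 for i in data]  # two balanced classes
sub = stratified_subsample(data, labels, n_samples=10)
print(len(sub))  # 10 (5 per class)
```

Subsampling per label rather than over the whole pool keeps rare classes represented, which is why the checklist recommends the stratified variant for large datasets.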

Adding a model checklist

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision)
  • I have tested the implementation works on a representative set of tasks.

Copilot AI (Contributor) left a comment

Pull Request Overview

This pull request migrates the STS tasks from the VisualSTS-specific abstraction to a more generic AnySTS abstraction. Key changes include:

  • Replacing inheritance from AbsTaskVisualSTS with AbsTaskAnySTS across task files.
  • Replacing the VisualSTSEvaluator with the new AnySTSEvaluator in both evaluators and evaluation initialization.
  • Updating the dataloader creation functions to support both image and text modalities.

Reviewed Changes

Copilot reviewed 58 out of 58 changed files in this pull request and generated no comments.

Summary per file:

  • mteb/tasks/Image/VisualSTS/multilingual/STSBenchmarkMultilingualVisualSTS.py: changed base class from AbsTaskVisualSTS to AbsTaskAnySTS.
  • mteb/tasks/Image/VisualSTS/multilingual/STS17MultilingualVisualSTS.py: changed base class from AbsTaskVisualSTS to AbsTaskAnySTS.
  • mteb/tasks/Image/VisualSTS/en/*.py: updated all task files to use AbsTaskAnySTS and reordered all lists.
  • mteb/evaluation/evaluators/__init__.py & mteb/evaluation/evaluators/Image/*.py: removed VisualSTSEvaluator and updated to use AnySTSEvaluator.
  • mteb/evaluation/evaluators/AnySTSEvaluator.py: renamed and updated the evaluator implementation to work with the new dataloader API.
  • mteb/create_dataloaders.py: introduced a create_dataloader function for unified image/text dataloader creation.
  • mteb/abstasks/*.py: replaced references to VisualSTS with the generic AnySTS approach.
Comments suppressed due to low confidence (1)

mteb/evaluation/evaluators/AnySTSEvaluator.py:59

  • The term 'manhatten' appears to be misspelled; consider renaming it to 'manhattan' for consistency.
manhatten_pearson, _ = pearsonr(self.gold_scores, manhattan_distances)

@isaac-chung (Collaborator) left a comment

Very nice! I don't have much to add. Might want to spot check 1 text and 1 visual STS task each to confirm that scores didn't change.

@Samoed Samoed marked this pull request as ready for review May 1, 2025 06:50
@Samoed (Member, Author) commented May 1, 2025

Results openai/clip-vit-base-patch16:

task_name        main_score (PR)  main_score (v2)  main_score (main)
STS16VisualSTS   0.689111         0.689135         0.689135

I think the difference is caused by pearson = pearsonr(self.gold_scores, similarity_scores). Previously, the score may have been computed incorrectly from the full (statistic, p-value) tuple that pearsonr returns, rather than from just the correlation coefficient.
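To make the suspected bug concrete: scipy.stats.pearsonr returns a (statistic, p-value) pair, so keeping its whole return value where a scalar correlation is expected silently changes downstream arithmetic. A stdlib-only sketch of the failure mode, where pearsonr_like is a hypothetical stand-in that mimics only the return shape:

```python
# Hypothetical stand-in mimicking scipy.stats.pearsonr's return shape:
# a (correlation, p-value) pair rather than a bare float.
def pearsonr_like(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy), 0.05  # dummy p-value for illustration

gold = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.9, 3.2, 3.9]

buggy = pearsonr_like(gold, pred)     # whole (corr, p) tuple kept by mistake
fixed, _ = pearsonr_like(gold, pred)  # only the correlation coefficient

print(isinstance(buggy, tuple), round(fixed, 3))  # True 0.993
```

Unpacking with `corr, _ = pearsonr(...)` (or taking the statistic attribute) avoids carrying the p-value into the reported score, which would explain the small discrepancy in the third decimal place above.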

Results minishlab/potion-base-2M:

task_name      main_score (PR)  main_score (main)
STSBenchmark   0.72808          0.72808

@KennethEnevoldsen (Contributor) left a comment

Looks very solid, not much to add. Great simplification.

The Copilot comment on manhatten/Manhattan is correct though; will you fix that?

@Samoed Samoed requested a review from isaac-chung May 3, 2025 09:54
@Samoed Samoed merged commit e595082 into v2.0.0 May 3, 2025
8 checks passed
@Samoed Samoed deleted the refactor_sts branch May 3, 2025 11:46

4 participants