
[v2] Create AnySTS #2599

Merged

Samoed merged 8 commits into v2.0.0 from refactor_sts on May 3, 2025

Conversation

@Samoed (Member) commented Apr 30, 2025

Code Quality

  • Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

  • Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

  • New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
  • Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

Adding datasets checklist

Reason for dataset addition: ...

  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models achieve close to perfect scores) nor random (both models achieve close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
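The stratified subsampling mentioned in the checklist can be sketched as follows. This is a stdlib-only illustration of the idea behind mteb's self.stratified_subsampling() helper; the function name and signature below are hypothetical and do not reflect mteb's actual API.

```python
import random
from collections import defaultdict

# Hypothetical stand-in for mteb's self.stratified_subsampling() helper;
# names and signature here are illustrative, not mteb's actual API.
def stratified_subsample(examples, labels, n_samples, seed=42):
    """Keep roughly n_samples examples while preserving label proportions."""
    by_label = defaultdict(list)
    for ex, lab in zip(examples, labels):
        by_label[lab].append(ex)
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    total = len(examples)
    subsample = []
    for group in by_label.values():
        # allocate each label a share proportional to its frequency
        k = max(1, round(n_samples * len(group) / total))
        subsample.extend(rng.sample(group, min(k, len(group))))
    return subsample

data = list(range(100))
labels = [i % 2 for i in data]  # two balanced classes
sub = stratified_subsample(data, labels, n_samples=10)
print(len(sub))  # 10 (5 per class)
```

Subsampling per label rather than over the whole pool keeps rare classes represented, which is why the checklist recommends the stratified variant for large datasets.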

Adding a model checklist

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision)
  • I have tested the implementation works on a representative set of tasks.

Copilot AI (Contributor) left a comment

Pull Request Overview

This pull request migrates the STS tasks from the VisualSTS-specific abstraction to a more generic AnySTS abstraction. Key changes include:

  • Replacing inheritance from AbsTaskVisualSTS with AbsTaskAnySTS across task files.
  • Replacing the VisualSTSEvaluator with the new AnySTSEvaluator in both evaluators and evaluation initialization.
  • Updating the dataloader creation functions to support both image and text modalities.

Reviewed Changes

Copilot reviewed 58 out of 58 changed files in this pull request and generated no comments.

Summary per file:

  • mteb/tasks/Image/VisualSTS/multilingual/STSBenchmarkMultilingualVisualSTS.py: changed base class from AbsTaskVisualSTS to AbsTaskAnySTS.
  • mteb/tasks/Image/VisualSTS/multilingual/STS17MultilingualVisualSTS.py: changed base class from AbsTaskVisualSTS to AbsTaskAnySTS.
  • mteb/tasks/Image/VisualSTS/en/*.py: updated all task files to use AbsTaskAnySTS and reordered all lists.
  • mteb/evaluation/evaluators/__init__.py & mteb/evaluation/evaluators/Image/*.py: removed VisualSTSEvaluator and updated to use AnySTSEvaluator.
  • mteb/evaluation/evaluators/AnySTSEvaluator.py: renamed and updated the evaluator implementation to work with the new dataloader API.
  • mteb/create_dataloaders.py: introduced a create_dataloader function for unified image/text dataloader creation.
  • mteb/abstasks/*.py: replaced references to VisualSTS with the generic AnySTS approach.
Comments suppressed due to low confidence (1)

mteb/evaluation/evaluators/AnySTSEvaluator.py:59

  • The term 'manhatten' appears to be misspelled; consider renaming it to 'manhattan' for consistency.
manhatten_pearson, _ = pearsonr(self.gold_scores, manhattan_distances)

@isaac-chung (Collaborator) left a comment

Very nice! I don't have much to add. Might want to spot check 1 text and 1 visual STS task each to confirm that scores didn't change.

@Samoed Samoed marked this pull request as ready for review May 1, 2025 06:50
@Samoed (Member, Author) commented May 1, 2025

Results openai/clip-vit-base-patch16:

task_name        main_score (PR)  main_score (v2)  main_score (main)
STS16VisualSTS   0.689111         0.689135         0.689135

I think the difference is caused by pearson = pearsonr(self.gold_scores, similarity_scores). Previously, the score may have been computed incorrectly from the full (statistic, p-value) tuple that pearsonr returns, rather than from just the correlation coefficient.
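To make the suspected bug concrete: scipy.stats.pearsonr returns a (statistic, p-value) pair, so keeping its whole return value where a scalar correlation is expected silently changes downstream arithmetic. A stdlib-only sketch of the failure mode, where pearsonr_like is a hypothetical stand-in that mimics only the return shape:

```python
# Hypothetical stand-in mimicking scipy.stats.pearsonr's return shape:
# a (correlation, p-value) pair rather than a bare float.
def pearsonr_like(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy), 0.05  # dummy p-value for illustration

gold = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.9, 3.2, 3.9]

buggy = pearsonr_like(gold, pred)     # whole (corr, p) tuple kept by mistake
fixed, _ = pearsonr_like(gold, pred)  # only the correlation coefficient

print(isinstance(buggy, tuple), round(fixed, 3))  # True 0.993
```

Unpacking with `corr, _ = pearsonr(...)` (or taking the statistic attribute) avoids carrying the p-value into the reported score, which would explain the small discrepancy in the third decimal place above.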

Results minishlab/potion-base-2M:

task_name      main_score (PR)  main_score (main)
STSBenchmark   0.72808          0.72808

@KennethEnevoldsen (Contributor) left a comment

Looks very solid, not much to add. Great simplification.

The Copilot comment on manhatten/Manhattan is correct though; will you fix that?

@Samoed Samoed requested a review from isaac-chung May 3, 2025 09:54
@Samoed Samoed merged commit e595082 into v2.0.0 May 3, 2025
8 checks passed
@Samoed Samoed deleted the refactor_sts branch May 3, 2025 11:46

4 participants