[v2] introduce AbsAnyClassification and refactor mieb classification#2537

Merged
Samoed merged 21 commits into v2.0.0 from refactor_mieb_classification on May 3, 2025
Conversation


@Samoed Samoed commented Apr 11, 2025

I've created AbsAnyClassification for classification tasks by merging AbsTaskClassification and AbsTaskImageClassification, and updated the evaluator to support both modalities.

Closes #2432

Results for openai/clip-vit-base-patch16. There is a small mismatch because I've changed the seeds a bit.

| task_name | main_score (PR) | main_score (main) | eval_time (main) |
|---|---|---|---|
| CIFAR10 | 0.92078 | 0.91886 | 151.892 |
| MNIST | 0.92706 | 0.9183 | 143.583 |

Results for minishlab/potion-base-2M

| task_name | main_score (PR) | main_score (main) |
|---|---|---|
| AmazonCounterfactualClassification | 0.643178 | 0.643178 |
| Banking77Classification | 0.652857 | 0.652857 |
| EmotionClassification | 0.40185 | 0.40185 |
| ImdbClassification | 0.715456 | 0.715456 |
| ToxicConversationsClassification | 0.650537 | 0.650537 |
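As a rough illustration of what a merged task class could look like, here is a minimal, self-contained sketch. The class and attribute names follow the discussion (`values_column_name`, `n_experiments`, `samples_per_label`), but this is a hypothetical outline, not the actual mteb implementation:

```python
import random

# Hypothetical sketch of a merged classification task supporting any modality.
# Attribute names follow the PR discussion; this is NOT the real mteb code.
class AbsTaskAnyClassification:
    values_column_name: str = "text"   # set to "image" for image tasks
    label_column_name: str = "label"
    n_experiments: int = 10            # text-task defaults (see discussion below)
    samples_per_label: int = 8

    def undersample_data(self, values, labels, seed=42):
        """Pick at most `samples_per_label` examples per label for one few-shot
        training experiment, regardless of whether values are texts or images."""
        rng = random.Random(seed)
        order = list(range(len(labels)))
        rng.shuffle(order)
        counts, sel_values, sel_labels = {}, [], []
        for i in order:
            if counts.get(labels[i], 0) < self.samples_per_label:
                counts[labels[i]] = counts.get(labels[i], 0) + 1
                sel_values.append(values[i])
                sel_labels.append(labels[i])
        return sel_values, sel_labels
```

Because the sampling only touches indices and labels, the same code path serves text and image columns alike; only `values_column_name` changes per task.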

Code Quality

  • Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

  • Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

  • New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
  • Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

Adding datasets checklist

Reason for dataset addition: ...

  • I have run the following models on the task (adding the results to the pr). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
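The stratified-subsampling item above can be sketched as follows. This is a hypothetical, standalone stand-in for mteb's `self.stratified_subsampling()` (the function name and signature here are illustrative), showing the idea of downsampling while roughly preserving the label distribution:

```python
import random
from collections import defaultdict

def stratified_subsample(examples, labels, n_samples=2048, seed=42):
    """Downsample to about n_samples while keeping label proportions.
    Hypothetical stand-in for mteb's self.stratified_subsampling()."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex, y in zip(examples, labels):
        by_label[y].append(ex)
    frac = n_samples / len(labels)
    out = []
    for y, exs in by_label.items():
        # keep roughly the same fraction of each label, at least one example
        k = max(1, round(len(exs) * frac))
        for ex in rng.sample(exs, min(k, len(exs))):
            out.append((ex, y))
    return out
```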

Adding a model checklist

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision)
  • I have tested the implementation works on a representative set of tasks.

@Samoed Samoed requested review from KennethEnevoldsen, Copilot and isaac-chung and removed request for KennethEnevoldsen, Copilot and isaac-chung April 12, 2025 11:19
Contributor

Copilot AI left a comment


Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

mteb/abstasks/Image/AbsTaskImageClassification.py:63

  • Ensure that the new 'values_column_name' attribute is consistently documented across the class, updating any outdated references (like in comments or docstrings) that mention 'image_column_name'.
values_column_name: str = "image"

@Samoed Samoed marked this pull request as ready for review April 12, 2025 11:32
Collaborator

@isaac-chung isaac-chung left a comment


Thanks for the effort. However, this is not what we discussed in #2078 (comment). I'd prefer not to have more layers of abstraction in a refactor, to make it easier to maintain and newcomers to understand. I believe @KennethEnevoldsen shared the same view as well.

@Samoed
Member Author

Samoed commented Apr 12, 2025

But you also said: "In my head, e.g. a common AbsTaskAnyClassification could support both text and image columns, AND any other modalities that we would introduce, like audio." I can merge AbsTaskImageClassification with AbsTaskClassification and leave only one AbsTaskAnyClassification.

@isaac-chung
Collaborator

I can merge AbsTaskImageClassification with AbsTaskClassification and leave only one AbsTaskAnyClassification

Sounds good to me 😸 thanks!

@Samoed
Member Author

Samoed commented Apr 12, 2025

But there will be a slight change in scores: e.g. ImageClassification uses n_experiments: int = 5 and samples_per_label: int = 16, while the Classification task uses samples_per_label: int = 8 and n_experiments: int = 10. So I'm not sure how to handle it other than as it is now, because currently it's almost AbsTaskAnyClassification.

@isaac-chung
Collaborator

Good find. Maybe we can do the following:

  • for the "Any" default, follow that of the text tasks (as there are a lot of them)
  • for the MIEB classification tasks, we patch at the task level to use n_experiments: int = 5 and samples_per_label: int = 16
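The patching approach suggested above could look like this minimal sketch. The class names are illustrative, not the actual mteb classes; the point is that the shared "Any" base carries the text-task defaults and MIEB image tasks override them at the task level:

```python
# Hypothetical sketch of task-level patching; names are illustrative only.
class AbsTaskAnyClassification:
    # "Any" defaults follow the text tasks, since there are many more of them
    n_experiments: int = 10
    samples_per_label: int = 8

class MIEBImageClassification(AbsTaskAnyClassification):
    # MIEB image tasks patch the defaults at the task level to keep the
    # original ImageClassification settings (and hence comparable scores)
    n_experiments = 5
    samples_per_label = 16
```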

@Samoed
Member Author

Samoed commented Apr 12, 2025

@isaac-chung I've changed to AbsAnyClassification task

@Samoed Samoed requested a review from isaac-chung April 12, 2025 12:56
Collaborator

@isaac-chung isaac-chung left a comment


Amazing work so far, thanks for putting this together. Got a few questions. The rest looks good 🚀

@Samoed Samoed requested a review from isaac-chung April 13, 2025 06:43
@isaac-chung isaac-chung changed the title [v2] Refactor mieb classification [v2] introduce AbsAnyClassification and refactor mieb classification Apr 13, 2025
Collaborator

@isaac-chung isaac-chung left a comment


Looks good! I feel it's good to merge. @KennethEnevoldsen feel free to still review post-hoc

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment


This looks great!

A few things I would like to clarify, but overall I think it looks very good.

@Samoed Samoed added the v2 label Apr 17, 2025
@Samoed Samoed mentioned this pull request Apr 30, 2025
17 tasks
@KennethEnevoldsen
Contributor

I am totally fine with merging this, but tests seem to fail.

@Samoed
Member Author

Samoed commented May 2, 2025

Yes, the issue was that I hadn’t updated the test to use the new stats format. I've now updated it, so you can clearly see how it will look. If this format looks good to you, I can update everything to use the new nested statistics format consistently. @KennethEnevoldsen

@Samoed Samoed merged commit e01e8b1 into v2.0.0 May 3, 2025
8 checks passed
@Samoed Samoed deleted the refactor_mieb_classification branch May 3, 2025 08:48
@isaac-chung isaac-chung linked an issue Jul 6, 2025 that may be closed by this pull request


Development

Successfully merging this pull request may close these issues.

Refactor MIEB classification

4 participants