[v2] introduce AbsAnyClassification and refactor mieb classification#2537

Merged
Samoed merged 21 commits into v2.0.0 from refactor_mieb_classification on May 3, 2025
Conversation


@Samoed Samoed commented Apr 11, 2025

I've created AbsAnyClassification for classification tasks by merging AbsTaskClassification and AbsTaskImageClassification, and updated the evaluator to support both modalities.

Closes #2432

Results for openai/clip-vit-base-patch16. There is a small mismatch because I've changed the seeds a bit.

| task_name | main_score (PR) | main_score (main) | eval_time (main) |
|---|---|---|---|
| CIFAR10 | 0.92078 | 0.91886 | 151.892 |
| MNIST | 0.92706 | 0.9183 | 143.583 |

Results for minishlab/potion-base-2M

| task_name | main_score (PR) | main_score (main) |
|---|---|---|
| AmazonCounterfactualClassification | 0.643178 | 0.643178 |
| Banking77Classification | 0.652857 | 0.652857 |
| EmotionClassification | 0.40185 | 0.40185 |
| ImdbClassification | 0.715456 | 0.715456 |
| ToxicConversationsClassification | 0.650537 | 0.650537 |
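As a rough illustration of what a merged task class could look like, here is a minimal, self-contained sketch. The class and attribute names follow the discussion (`values_column_name`, `n_experiments`, `samples_per_label`), but this is a hypothetical outline, not the actual mteb implementation:

```python
import random

# Hypothetical sketch of a merged classification task supporting any modality.
# Attribute names follow the PR discussion; this is NOT the real mteb code.
class AbsTaskAnyClassification:
    values_column_name: str = "text"   # set to "image" for image tasks
    label_column_name: str = "label"
    n_experiments: int = 10            # text-task defaults (see discussion below)
    samples_per_label: int = 8

    def undersample_data(self, values, labels, seed=42):
        """Pick at most `samples_per_label` examples per label for one few-shot
        training experiment, regardless of whether values are texts or images."""
        rng = random.Random(seed)
        order = list(range(len(labels)))
        rng.shuffle(order)
        counts, sel_values, sel_labels = {}, [], []
        for i in order:
            if counts.get(labels[i], 0) < self.samples_per_label:
                counts[labels[i]] = counts.get(labels[i], 0) + 1
                sel_values.append(values[i])
                sel_labels.append(labels[i])
        return sel_values, sel_labels
```

Because the sampling only touches indices and labels, the same code path serves text and image columns alike; only `values_column_name` changes per task.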

Code Quality

  • Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

  • Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

  • New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
  • Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

Adding datasets checklist

Reason for dataset addition: ...

  • I have run the following models on the task (adding the results to the pr). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
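The stratified-subsampling item above can be sketched as follows. This is a hypothetical, standalone stand-in for mteb's `self.stratified_subsampling()` (the function name and signature here are illustrative), showing the idea of downsampling while roughly preserving the label distribution:

```python
import random
from collections import defaultdict

def stratified_subsample(examples, labels, n_samples=2048, seed=42):
    """Downsample to about n_samples while keeping label proportions.
    Hypothetical stand-in for mteb's self.stratified_subsampling()."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex, y in zip(examples, labels):
        by_label[y].append(ex)
    frac = n_samples / len(labels)
    out = []
    for y, exs in by_label.items():
        # keep roughly the same fraction of each label, at least one example
        k = max(1, round(len(exs) * frac))
        for ex in rng.sample(exs, min(k, len(exs))):
            out.append((ex, y))
    return out
```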

Adding a model checklist

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision)
  • I have tested the implementation works on a representative set of tasks.

@Samoed Samoed requested review from KennethEnevoldsen, Copilot and isaac-chung and removed request for KennethEnevoldsen, Copilot and isaac-chung April 12, 2025 11:19
Contributor

Copilot AI left a comment


Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

mteb/abstasks/Image/AbsTaskImageClassification.py:63

  • Ensure that the new 'values_column_name' attribute is consistently documented across the class, updating any outdated references (like in comments or docstrings) that mention 'image_column_name'.
values_column_name: str = "image"

@Samoed Samoed marked this pull request as ready for review April 12, 2025 11:32
Collaborator

@isaac-chung isaac-chung left a comment


Thanks for the effort. However, this is not what we discussed in #2078 (comment). I'd prefer not to have more layers of abstraction in a refactor, to make it easier to maintain and newcomers to understand. I believe @KennethEnevoldsen shared the same view as well.

@Samoed
Member Author

Samoed commented Apr 12, 2025

But you also said: "In my head, e.g. a common AbsTaskAnyClassification could support both text and image columns, AND any other modalities that we would introduce, like audio." I can merge AbsTaskImageClassification with AbsTaskClassification and leave only one AbsTaskAnyClassification.

@isaac-chung
Collaborator

I can merge AbsTaskImageClassification with AbsTaskClassification and leave only one AbsTaskAnyClassification

Sounds good to me 😸 thanks!

@Samoed
Member Author

Samoed commented Apr 12, 2025

But there will be a slight change in scores: e.g. ImageClassification uses n_experiments: int = 5 and samples_per_label: int = 16, while the Classification task uses samples_per_label: int = 8 and n_experiments: int = 10. So I'm not sure how to handle it other than as it is now, because currently it's almost AbsTaskAnyClassification.

@isaac-chung
Collaborator

Good find. Maybe we can do the following:

  • for the "Any" default, follow that of the text tasks (as there are a lot of them)
  • for the MIEB classification tasks, we patch at the task level to use n_experiments: int = 5 and samples_per_label: int = 16
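The patching approach suggested above could look like this minimal sketch. The class names are illustrative, not the actual mteb classes; the point is that the shared "Any" base carries the text-task defaults and MIEB image tasks override them at the task level:

```python
# Hypothetical sketch of task-level patching; names are illustrative only.
class AbsTaskAnyClassification:
    # "Any" defaults follow the text tasks, since there are many more of them
    n_experiments: int = 10
    samples_per_label: int = 8

class MIEBImageClassification(AbsTaskAnyClassification):
    # MIEB image tasks patch the defaults at the task level to keep the
    # original ImageClassification settings (and hence comparable scores)
    n_experiments = 5
    samples_per_label = 16
```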

@Samoed
Member Author

Samoed commented Apr 12, 2025

@isaac-chung I've changed to AbsAnyClassification task

@Samoed Samoed requested a review from isaac-chung April 12, 2025 12:56
Collaborator

@isaac-chung isaac-chung left a comment


Amazing work so far, thanks for putting this together. Got a few questions. The rest looks good 🚀

@Samoed Samoed requested a review from isaac-chung April 13, 2025 06:43
@isaac-chung isaac-chung changed the title [v2] Refactor mieb classification [v2] introduce AbsAnyClassification and refactor mieb classification Apr 13, 2025
Collaborator

@isaac-chung isaac-chung left a comment


Looks good! I feel it's good to merge. @KennethEnevoldsen feel free to still review post-hoc

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment


This looks great!

A few things I would like to clarify, but overall I think it looks very good.

@Samoed Samoed added the v2 label Apr 17, 2025
@Samoed Samoed mentioned this pull request Apr 30, 2025
17 tasks
@KennethEnevoldsen
Contributor

I am totally fine with merging this, but tests seem to fail.

@Samoed
Member Author

Samoed commented May 2, 2025

Yes, the issue was that I hadn’t updated the test to use the new stats format. I've now updated it, so you can clearly see how it will look. If this format looks good to you, I can update everything to use the new nested statistics format consistently. @KennethEnevoldsen

@Samoed Samoed merged commit e01e8b1 into v2.0.0 May 3, 2025
8 checks passed
@Samoed Samoed deleted the refactor_mieb_classification branch May 3, 2025 08:48
@isaac-chung isaac-chung linked an issue Jul 6, 2025 that may be closed by this pull request


Development

Successfully merging this pull request may close these issues.

Refactor MIEB classification

4 participants