dataset: add human tasks and benchmark #3214

Merged
isaac-chung merged 29 commits into main from human_tasks on Oct 2, 2025

Conversation

@Samoed (Member) commented Sep 25, 2025

Cherry-picked commits from #3213 from @AdnanElAssadi56

@Samoed Samoed mentioned this pull request Sep 25, 2025
@isaac-chung (Collaborator) commented:

I'll add a benchmark object as well
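For context, a rough sketch of what such a benchmark object might look like. The import path and `Benchmark` fields are assumed from files touched later in this thread (`mteb/benchmarks/benchmark.py`), the name "HUME(v1)" is taken from a later commit, and the task list is a stand-in:

```py
import mteb
from mteb.benchmarks.benchmark import Benchmark  # assumed import path

# Illustrative sketch only; the task list and description are placeholders,
# not what was merged in this PR.
hume_v1 = Benchmark(
    name="HUME(v1)",
    tasks=mteb.get_tasks(tasks=["Banking77Classification"]),  # stand-in task
    description="Human-annotated task subsets for estimating a human baseline.",
    reference="https://github.com/embeddings-benchmark/mteb/pull/3214",
    citation=None,
)
```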

@isaac-chung isaac-chung changed the title from "Add human tasks" to "Add human tasks and benchmark" Sep 27, 2025
@KennethEnevoldsen (Contributor) left a comment:

Looks good - a few minor changes:

1. Should we add it to the leaderboard as well?

2. Should we add a "human" model (otherwise the scores won't appear on the leaderboard)? It should probably be filtered out by default (see the sketch after this list).

3. Items 1) and 2) are probably for another PR.
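For illustration, a rough sketch of the "human" pseudo-model idea from point 2: an entry with no encoder to load, so human scores have a model to attach to. The `ModelMeta` field set is assumed from entries elsewhere in this log (e.g. `memory_usage_mb`, `training_datasets`), and every value is a placeholder:

```py
from mteb.model_meta import ModelMeta  # assumed import path

# Hypothetical "human baseline" entry; there is no model to load, so the
# loader stays None and scores would be attached to results directly.
human_baseline = ModelMeta(
    name="mteb/human",  # assumed name, not necessarily the merged one
    revision="1",
    release_date="2025-09-25",
    languages=None,
    loader=None,
    n_parameters=None,
    memory_usage_mb=None,
    embed_dim=None,
    max_tokens=None,
    license=None,
    open_weights=None,
    public_training_code=None,
    public_training_data=None,
    framework=[],
    similarity_fn_name=None,
    use_instructions=None,
    training_datasets=None,
)
```

Filtering it out of the leaderboard by default could then be a simple check on the model name.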

@Samoed (Member, Author) commented Sep 29, 2025

> should we add a "human" model (otherwise the scores won't appear on the leaderboard). Should probably be filtered out by default.

Yes, I wanted to do it like this.

> should we add it to the leaderboard as well

I will add it after adding the "human" model.

Samoed and others added 2 commits September 29, 2025 12:21
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
```py
# Combine: subsampled train + human test
self.dataset = DatasetDict(
    {
        "train": original_with_train["test"],  # This becomes our training data
```

@Samoed (Member, Author) commented on this diff:

@AdnanElAssadi56 Why do we use test split from original dataset as train?
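For readers skimming the diff, a minimal self-contained sketch of the pattern being questioned (the dataset ids are hypothetical and the subsampling step is omitted, matching the truncated excerpt above):

```py
from datasets import DatasetDict, load_dataset

# Hypothetical dataset ids, purely to illustrate the excerpt: the original
# dataset's "test" split is reused as "train", and the human-annotated
# examples become the new "test" split.
original_with_train = load_dataset("org/original-dataset")  # hypothetical id
human = load_dataset("org/human-annotated-subset")          # hypothetical id

combined = DatasetDict(
    {
        "train": original_with_train["test"],  # original test split reused as train
        "test": human["test"],                 # human-labelled evaluation split
    }
)
```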

@Samoed Samoed changed the title from "Add human tasks and benchmark" to "dataset: add human tasks and benchmark" Sep 29, 2025
@Samoed Samoed added the "new benchmark" label Sep 29, 2025
@AdnanElAssadi56 (Contributor) left a comment:

Looks good to me! I guess we'll stick with the results from here and update the paper.

@isaac-chung (Collaborator) commented:

@AdnanElAssadi56 results on the LB are aggregated differently. Instead, we can explore a custom aggregation (needed in MIEB as well), as @Samoed suggested.
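The custom aggregation later landed as "aggregate by subset" (#3255, further down this page). A rough standalone sketch of the idea, with an assumed score layout rather than mteb's real result objects:

```py
from statistics import mean

# Hypothetical layout: task -> subset -> main score.
scores = {
    "TaskA": {"subset1": 0.61, "subset2": 0.55},
    "TaskB": {"subset1": 0.70},
}

# Average within each task over its subsets first, then across tasks, so
# tasks with many subsets don't dominate the benchmark score.
per_task = {task: mean(subsets.values()) for task, subsets in scores.items()}
benchmark_score = mean(per_task.values())
print(per_task, benchmark_score)  # {'TaskA': 0.58, 'TaskB': 0.7} 0.64
```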

@Samoed (Member, Author) commented Oct 2, 2025

We can merge this, but without adding to the leaderboard

@isaac-chung (Collaborator) commented:

> We can merge this, but without adding to the leaderboard

Yeah. Let's do that in a separate PR.

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
@isaac-chung isaac-chung enabled auto-merge (squash) October 2, 2025 05:40
@isaac-chung isaac-chung merged commit 48a01fc into main Oct 2, 2025
11 checks passed
@isaac-chung isaac-chung deleted the human_tasks branch October 2, 2025 06:15
KennethEnevoldsen added a commit that referenced this pull request Oct 6, 2025
* fix: Correct metadata for ArguAna dataset (#3202)

* Update tasks & benchmarks tables

* 1.38.57

Automatically generated by python-semantic-release

* model: Add BMRetriever (#3195)

* model: Add BMRetriever

* Update mteb/models/bmretriever_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/bmretriever_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* fix: remove trust_remote_code option

* feat: implement BMRetrieverWrapper based on InstructSentenceTransformerWrapper

* refactor: update training datasets for bmretriever

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Revert "Ci: test out GH models with welcoming new comers" (#3206)

Revert "Ci: test out GH models with welcoming new comers (#3112)"

This reverts commit 73a35e0.

* model: Add Codefuse models (#3205)

* add codefuse models

* add codefuse models

* Update codefuse_models.py

* lint codefuse.py

* fix(models): ensure prompt_type is passed to format_instruction (#3216)

* 1.38.58

Automatically generated by python-semantic-release

* Adding Cohere's output_dimension and embedding_type parameter (#3204)

* Adding Cohere's output_dimension and embedding_type parameter
Cohere's embed-v4 binary and int8

* Correcting due to comments

* dataset: add swedish cpc patent classifications to mteb (#3072)

* feat: add swedish cpc patent classifications to mteb

* fix: formatting and init imports

* fix: update mteb task according to feedback

* fix: perform citation and code formatting

* fix: add train and test split for both datasets

* fix: AttributeError in ColPaliEngineWrapper similarity method (#3177)

* fix: delete kwargs for similarity score in ColPaliEngineWrapper for method behavior

* chore: fix colpali_models similarity handle device

* Update tasks & benchmarks tables

* 1.38.59

Automatically generated by python-semantic-release

* fix: prevent EOS token truncation (#3218)

* fix(models): prevent EOS token truncation for BMRetriever

* refactor(models): refactor tokenizer setup in `InstructSentenceTransformerWrapper`

* fix(models): correct eos token handling in `BMRetrieverWrapper`

* 1.38.60

Automatically generated by python-semantic-release

* Update giga embeddings (#3210)

* update giga embeddings

* update giga embeddings

* 3b-september-2025

* fixed

* lint

* Update mteb/models/ru_sentence_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* change revision due to flash-attn dependency

* change apply_instruction_to_passages

---------

Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>

* fix: Refactor split create_tables into static Benchmark methods (#3126)

* feat - Split create_tables into static Benchmark methods

* feat - format

* Update mteb/leaderboard/table.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* feat - remove search query; take benchmark result as input; address the circular import

* feat - format

* Update mteb/benchmarks/benchmark.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update mteb/benchmarks/benchmark.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* feat - use to_dataframe; clean table.py; move create_table

* feat - fix circular import

* feat - clean-up

* feat - format

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* 1.38.61

Automatically generated by python-semantic-release

* Extending the RTEB benchmark (#3223)

Adding another voyageai model

* Update tasks & benchmarks tables

* model: New qzmodel (#3211)

* Update qzhou_models.py

* Update qzhou_models.py

* reformat script code

* Update configuration

* According to our new decision, the model name has been changed to "QZhou-Embedding-Zh".

* Fix variable naming issues.

* model: Update Youtu embedding model (#3227)

* add youtu models

* add a blank line

* fix the optional dependencies and lint the code

* remove unused dependencies and reformat

* revise prompt_type

* update youtu_models

---------

Co-authored-by: springxchen <springxchen@tencent.com>

* dataset: Add Software Issue Localization Datasets (#3178)

* add software issue localization datasets

* add software issue localization datasets

* update and add multilingual datasets

* fix citation format issues

* Update mteb/tasks/Reranking/eng/SWEbenchVerifiedReranking.py

* fix linting issues

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update tasks & benchmarks tables

* feat: Officially include RTEB in the leaderboard (#3222)

* feat - adjust Rteb's Benchmark

* feat - add blank

* fix menu names

* Update mteb/leaderboard/benchmark_selector.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* moving around tasks

* fix: Update RTEB summary columns (#3226)

* feat - filter_by_privacy

* feat - add new fields for rteb part

* feat - getattr

* feat - adjust privacy filter logic

* feat - enhance summary table column renaming and add 'is_public' field mapping

* fix: remove unused 'is_public' attribute from TaskResult

---------

Co-authored-by: Yongbin Choi <whybe.choi@gmail.com>
Co-authored-by: semantic-release <semantic-release>
Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com>
Co-authored-by: Atheer <atheer2104@protonmail.com>
Co-authored-by: Yong woo Song <ywsong.dev@kakao.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Egor <31567312+ekolodin@users.noreply.github.com>
Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: smile <smile@pinai.io>
Co-authored-by: ethan <smiletoye@gmail.com>

* removed show_rteb args

* avoid defining function where we can just use the metadata

* minor fixes

* minor fixes

* fix: Correct logic for filtering public tasks in ModelResult class (#3230)

Co-authored-by: ethan <smiletoye@gmail.com>

---------

Co-authored-by: q275343119 <275343119@qq.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: 笑尿伊人 <44760272+q275343119@users.noreply.github.com>
Co-authored-by: Yongbin Choi <whybe.choi@gmail.com>
Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com>
Co-authored-by: Atheer <atheer2104@protonmail.com>
Co-authored-by: Yong woo Song <ywsong.dev@kakao.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Egor <31567312+ekolodin@users.noreply.github.com>
Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>
Co-authored-by: smile <smile@pinai.io>
Co-authored-by: ethan <smiletoye@gmail.com>

* Update tasks & benchmarks tables

* 1.39.0

Automatically generated by python-semantic-release

* fix: Add submission references for RTEB (#3233)

* fix: Add rteb submission references and improve descriptions.

* Added evaluation request

* added field for tasks

* 1.39.1

Automatically generated by python-semantic-release

* dataset: add human tasks and benchmark (#3214)

* Human Subsets Tasks

* Fixed Multilingual Classification Subset

* linting

* fix citations format

* make lint

* fix tests

* remove human folder

* fix relative imports

* add adapted_from for all human subsets

* fix pydantic errors

* add benchmark object

* make benchmark discoverable

* bibtex test

* Apply suggestion

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* Apply suggestions from code review

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* rename & reupload

* upd tests

* upd tests again

* add model

* add benchmark to leaderboard

* change branch of leaderboard

* remove branch of load data

* fix model meta path

* make mteb importable

* update repo

* Update mteb/benchmarks/benchmarks/benchmarks.py

* Update mteb/leaderboard/benchmark_selector.py

* Update mteb/load_results/load_results.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

---------

Co-authored-by: Adnan El Assadi <aassadi22@ku.edu.tr>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: AdnanElAssadi56 <115242814+AdnanElAssadi56@users.noreply.github.com>

* Update tasks & benchmarks tables

* Remove 'HUME(v1)' from leaderboard benchmark (#3236)

* Remove 'HUME(v1)' from leaderboard benchmark

* lint

* docs: Update adding benchmark documentation (#3229)

* update adding_a_benchmark.md documentation

* fix numbers

* fix: Further specified macro-language code for Norwegian (#3228)

* fix: Further specified macro-language code for Norwegian

"nor" is a macro-language code that covers bokmål and nynorsk (both norwegian), but this means that these datasets will be missed if using "nob" or "nno". Specifying it like this should allow this.

* furhter specified macro language "nor"
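A small sketch of what specifying the codes explicitly could look like; the `eval_langs` field name and the `lang-Script` format are assumed from codes like "dan-Latn" appearing elsewhere in this log:

```py
# Hypothetical task-metadata fragment: list both Norwegian written standards
# instead of the macro-language code "nor", so filtering by either code
# still finds the dataset.
eval_langs = ["nob-Latn", "nno-Latn"]  # Bokmål and Nynorsk
```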

* Update tasks & benchmarks tables

* 1.39.2

Automatically generated by python-semantic-release

* fix max tokens (#3243)

* fix models

* fix imports

* fix task import

* reupload HUME tasks

* reupload SWE tasks

* add stats

* fix python39 transformers compatibility (#3254)

* fix python39 transformers

* fix

* Aggregate by subset for HUMEv1 (#3255)

aggregate by subset for HUMEv1

* Update tasks & benchmarks tables

* Fix AbsTaskTextRegression task (#3257)

Fix AbsTaskTextRegression

* Added Japanese to Retrieval (#3252)

* feat - add Japanese

* feat - use mteb.get_benchmark

* fix - 3.9 test error

* Revert "fix - 3.9 test error"

This reverts commit 6bfee53.

* fix - 3.9 test error

* Update tasks & benchmarks tables

* fix bm25 on small datasets (#3261)

---------

Co-authored-by: Yongbin Choi <whybe.choi@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: semantic-release <semantic-release>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Geralt <94539084+Geralt-Targaryen@users.noreply.github.com>
Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com>
Co-authored-by: Atheer <atheer2104@protonmail.com>
Co-authored-by: Yong woo Song <ywsong.dev@kakao.com>
Co-authored-by: Egor <31567312+ekolodin@users.noreply.github.com>
Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>
Co-authored-by: 笑尿伊人 <44760272+q275343119@users.noreply.github.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Peter <51702222+PennyYu123@users.noreply.github.com>
Co-authored-by: spring-quan <38248619+spring-quan@users.noreply.github.com>
Co-authored-by: springxchen <springxchen@tencent.com>
Co-authored-by: Tarun Suresh <68882529+tarsur909@users.noreply.github.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: q275343119 <275343119@qq.com>
Co-authored-by: smile <smile@pinai.io>
Co-authored-by: ethan <smiletoye@gmail.com>
Co-authored-by: Adnan El Assadi <aassadi22@ku.edu.tr>
Co-authored-by: AdnanElAssadi56 <115242814+AdnanElAssadi56@users.noreply.github.com>
Co-authored-by: Niklas <n.muennighoff@gmail.com>
Co-authored-by: Alexey Vatolin <vatolinalex@gmail.com>
Samoed added a commit that referenced this pull request Oct 22, 2025
* model: add image support for jina embeddings v4 (#2893)

* feat: unify text and image embeddings for all tasks

* fix: uniform batch size

* fix: update error message

* fix: update code task

* fix: update max length

* fix: apply review suggestions

* model: add kalm_models (kalm-emb-v2) ModelMeta (new PR) (#2889)

* feat: add KaLM_Embedding_X_0605 in kalm_models

* Update kalm_models.py for lint format

* kalm-emb-v2

* kalm-emb-v2

* kalm-emb-v2

* kalm-emb-v2

* kalm-emb-v2

---------

Co-authored-by: xinshuohu <xinshuohu@tencent.com>
Co-authored-by: Xinshuo Hu <yanshek.woo@gmail.com>

* Add Classification Evaluator unit test (#2838)

* Adding Classification Evaluator test

* Modifications due to the comments

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Modifications due to the comments

* Modifications due to the comments

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* fix: update colpali engine models (#2905)

* adding vidore benchmarks

* fix typo

* clean vidore names + per lang eval

* lint

* vidore names

* bibtex fix

* fix revision

* vidore v2 citation

* update citation format and fix per-language mappings

* lint: citations

* typo citations

* fix revisions

* lint

* fix colnomic3b revision

* fix colqwen2.5 revision + latest repo version

* fix query augmentation tokens

* colsmol revision

* 1.38.35

Automatically generated by python-semantic-release

* Evaluator tests (#2910)

* Adding Classification Evaluator test

* Modifications due to the comments

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Modifications due to the comments

* Modifications due to the comments

* Adding STSEvaluator and SummarizationEvaluator tests

* Correcting due to the comments

* Correcting due to the comments

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Classification dataset cleaning (#2900)

* Classification dataset cleaning

* Update pull request number

* Fix metadata test

* fix formatting

* add script for cleaning

* Update tasks & benchmarks tables

* dataset: Add JapaneseSentimentClassification (#2913)

Add JapaneseSentimentClassification

* Update tasks & benchmarks tables

* fix: change `passage` prompt to `document`  (#2912)

* change document to passage

* fix prompt names

* fix kwargs check

* fix default prompt

* 1.38.36

Automatically generated by python-semantic-release

* model: Add OpenSearch inf-free sparse encoding models (#2903)

add opensearch inf-free models

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* dataset: add BarExamQA dataset (#2916)

* Add BarExamQA retrieval task

* ran linter

* updated details

* updated details

* fixed subtype name

* fixed changes

* ran linter again

* Use `mteb.get_model` in adding_a_dataset.md (#2922)

Update adding_a_dataset.md

* fix: specify revision for opensearch (#2919)

specify revision for opensearch

* 1.38.37

Automatically generated by python-semantic-release

* Update the link for gemini-embedding-001 (#2928)

* fix: replace with passage (#2934)

* fix: Only import SparseEncoder once sentence-transformer version have been checked (#2940)

* fix: Only import SparseEncoder once sentence-transformer version have been checked

fixes #2936

* Update mteb/models/opensearch_neural_sparse_models.py

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* fix: Prevent incorrectly passing "selector_state" to `get_benchmark` (#2939)

The leaderboard would have (silent) errors where `get_benchmark` led to a KeyError due to "selector_state" being passed as a default value. Setting `DEFAULT_BENCMARK_NAME` as the value solves this issue.

* docs: Update adding_a_dataset.md (#2947)

* docs: Update adding_a_dataset.md

* Update docs/adding_a_dataset.md

* ci: bump semantic release

* 1.38.38

Automatically generated by python-semantic-release

* dataset: Add BSARD v2, fixing the data loading issues of v1 (#2935)

* BSARD loader fixed

* BSARDv2 metadata fixed

* Update mteb/tasks/Retrieval/fra/BSARDRetrieval.py

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update tasks & benchmarks tables

* dataset: add GovReport dataset (#2953)

* Added govreport task

* Updated description

* dataset: add BillSum datasets (#2943)

* Added BillSum datasets

* fixed billsumca

* Updated BillSumCA description

* Updated BillSumUS description

* Update mteb/tasks/Retrieval/eng/BillSumCA.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update mteb/tasks/Retrieval/eng/BillSumUS.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* lint

* lint

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Update tasks & benchmarks tables

* fix: Add new benchmark beRuSciBench along with AbsTaskTextRegression (#2716)

* Add RuSciBench

* fix bitext mining lang

* Add regression task

* fix init

* add missing files

* Improve description

* Add superseded_by

* fix lint

* Update regression task to match with v2

* Add stratified_subsampling for regression task

* Add boostrap for regression task

* Rename task class, add model as evaluator argument

* fix import

* fix import 2

* fixes

* fix

* Rename regression model protocol

* Update tasks & benchmarks tables

* 1.38.39

Automatically generated by python-semantic-release

* qzhou-embedding model_meta & implementation (#2975)

* qzhou-embedding model_meta & implementation

* Update qzhou_models.py

* Update qzhou_models.py

Processing todo items (Add default instruction)

* Update qzhou_models.py

correct bge datalist

* Update qzhou_models.py

correct 'public_training_data'

* Update qzhou_models.py

* Update qzhou_models.py

* Update qzhou_models.py

* Update qzhou_models.py

* Update mteb/models/qzhou_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/qzhou_models.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* format qzhou_models.py for ruff check

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* model: Add Voyage 3.5 model configuration (#3005)

Add Voyage 3.5 model configuration

- Add voyage_3_5 ModelMeta with 1024 embed dimensions and 32000 max tokens
- Set release date to 2025-01-21 with revision 1
- Configure for cosine similarity with instruction support
- Include standard Voyage training datasets reference

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Claude <noreply@anthropic.com>
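Once registered, a quick sanity check might look like the following; `mteb.get_model_meta` is used elsewhere in this log, but the registered name and the exact field names for Voyage 3.5 are assumptions:

```py
import mteb

# Hypothetical lookup; the registered model name is assumed.
meta = mteb.get_model_meta("voyageai/voyage-3.5")
print(meta.embed_dim, meta.max_tokens)  # expected 1024 and 32000 per the commit message
```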

* model: BAAI/bge-m3-unsupervised Model (#3007)

* Add BAAI/bge-m3-unsupervised Model
(BAAI/bge_m3_retromae is commented out - the details are correct, but loading the model fails for me, so I commented it out)

* Remove the commented retromae model

---------

Co-authored-by: fzowl <zoltan@voyageai.com>

* lint: Correcting lint errors (#3004)

* Adding Classification Evaluator test

* Modifications due to the comments

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Modifications due to the comments

* Modifications due to the comments

* Correcting the lint errors

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* dataset: Added 50 Vietnamese dataset from vn-mteb (#2964)

* [ADD] 50 vietnamese dataset from vn-mteb

* [UPDATE] task metadata

* [UPDATE] import dependencies

* [UPDATE] task metadata, bibtex citation

* [UPDATE-TEST] test_model_meta

* [UPDATE] sample_creation to machine-translated and LM verified

* [ADD] sample creation machine-translated and LM verified

* [REMOVE] default fields metadata in Classification tasks

* Update tasks & benchmarks tables

* model: Add Cohere embed-v4.0 model support (#3006)

* Add Cohere embed-v4.0 model support

- Add text-only embed-v4.0 model in cohere_models.py
- Add multimodal embed-v4.0 model in cohere_v.py
- Support configurable dimensions (256, 512, 1024, 1536)
- Support 128,000 token context length
- Support multimodal embedding (text, images, mixed PDFs)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Add Cohere embed-v4.0 model support

Update cohere_v.py and cohere_models.py to include the new embed-v4.0 model with proper configuration and integration.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Add OpenAI models with 512 dimension (#3008)

* Add OpenAI/text-embedding-3-small (512 dim)
Add OpenAI/text-embedding-3-large (512 dim)

* Correcting due to comments

---------

Co-authored-by: fzowl <zoltan@voyageai.com>

* Standardise task names and fix citation formatting (#3026)

fixes for name formatting

* Update tasks & benchmarks tables

* fix: Add missing training sets for qzhou (#3023)

* Supplement missing training sets

* reformat code

* Reorganize the data list format

* update qzhou_model meta

* 1.38.40

Automatically generated by python-semantic-release

* model: Add samilpwc_models meta (#3028)

* model: Add samilpwc_models meta

* Fix: Remove CONST

* Fix: Reformat File

* Update: model revision

* model: Add granite-vision-embedding model  (#3029)

* Add files via upload

* Address review comments

* Address review comments

* ruff format

* Update mteb/models/granite_vision_embedding_models.py

* lint error fix

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* fix: incorrect revision for SNLRetrieval (#3033)

The provided revision doesn't seem to be present on
adrlau/navjordj-SNL_summarization_copy.

Replacing it with the latest revision.

* dataset: Add HumanEvalRetrieval task (#3022)

* Add HumanEvalRetrieval dataset

* Fix TaskMetadata structure and remove descriptive_stats

- Use TaskMetadata class instead of dict
- Remove descriptive_stats as requested in PR review
- Add date field and proper import structure

* Fix dataset path and use verified metadata

- Change path from zeroshot/humaneval-embedding-benchmark to embedding-benchmark/HumanEval
- Use actual description from HuggingFace dataset page
- Remove fabricated citation and reference
- Remove revision field that was incorrect
- Reference HuggingFace dataset page instead of arxiv

* Add correct revision hash to HumanEval

- Add revision hash: ed1f48a for reproducibility

* Fix HumanEval metadata validation

- Add date field for metadata completeness
- Add bibtex_citation field (empty string)
- Required for TaskMetadata validation to pass
- Should resolve PR test failure

* Address reviewer feedback

- Remove trust_remote_code parameter as requested
- Add revision parameter to load_dataset() calls for consistency
- Use metadata revision hash in dataset loading for reproducibility

* Fix field names in HumanEval dataset loading

Changed query_id/corpus_id to query-id/corpus-id to match actual dataset format.

* Fix deprecated metadata_dict usage

Use self.metadata.dataset instead of self.metadata_dict for v2.0 compatibility.

* Fix data structure for MTEB compatibility

- Organize data by splits as expected by MTEB retrieval tasks
- Convert scores to integers for pytrec_eval compatibility

* Address PR feedback for HumanEval dataset

- Add descriptive statistics using calculate_metadata_metrics()
- Enhance metadata description with dataset structure details
- Add complete BibTeX citation for original paper
- Update to full commit hash revision
- Add python-Code language tag for programming language
- Explain retrieval task formulation clearly

* Fix BibTeX citation formatting for HumanEvalRetrieval

- Update citation to match bibtexparser formatting requirements
- Fields now in alphabetical order with lowercase names
- Proper trailing commas and indentation

* Update tasks & benchmarks tables

* 1.38.41

Automatically generated by python-semantic-release

* ci: reduce parallel runs for when checking if a dataset exists (#3035)

The hope is that this will prevent many of the current [errors](https://github.com/embeddings-benchmark/mteb/actions/runs/17019125199/job/48245690831)

* ci: Updating rerun delays to prevent false positives errors

* ci: Updating rerun delays to prevent false positives errors

* model: Add GreenNode Vietnamese Embedding models (#2994)

* [ADD] 50 vietnamese dataset from vn-mteb

* [UPDATE] task metadata

* [UPDATE] import dependencies

* [UPDATE] task metadata, bibtex citation

* [UPDATE-TEST] test_model_meta

* [UPDATE] sample_creation to machine-translated and LM verified

* [ADD] sample creation machine-translated and LM verified

* [ADD] Vietnamese Embedding models

* [REMOVE] default fields metadata in Classification tasks

* [UPDATE] model to vi-vn language specific file

* [FIX] lint

* [FIX] model loader

* model: add granite-embedding-english R2 models (#3050)

* fix: Updated revision for jina-embeddings-v4 (#3046)

* fix: jinav4 revision

Signed-off-by: admin <bo.wang@jina.ai>

* change revision instead of removing it

Signed-off-by: admin <bo.wang@jina.ai>

---------

Signed-off-by: admin <bo.wang@jina.ai>
Co-authored-by: admin <bo.wang@jina.ai>

* 1.38.42

Automatically generated by python-semantic-release

* Fix 3 VN-MTEB Pair Classification tasks (#3053)

* [ADD] 50 vietnamese dataset from vn-mteb

* [UPDATE] task metadata

* [UPDATE] import dependencies

* [UPDATE] task metadata, bibtex citation

* [UPDATE-TEST] test_model_meta

* [UPDATE] sample_creation to machine-translated and LM verified

* [ADD] sample creation machine-translated and LM verified

* [ADD] Vietnamese Embedding models

* [REMOVE] default fields metadata in Classification tasks

* [UPDATE] model to vi-vn language specific file

* [FIX] lint

* [FIX] model loader

* [FIX] VN-MTEB 3 datasets PairClassification rename column

* dataset: Add mbpp retrieval (#3037)

* Add MBPP retrieval task

- Code retrieval task based on 378 Python programming problems
- Natural language queries matched to Python code implementations
- Uses python-Code evaluation language for code-specific metrics
- Includes proper citations and descriptive statistics

* Add MBPPRetrieval to imports

* Add descriptive statistics for MBPPRetrieval

* Reformatting

* Reformatting

* Update tasks & benchmarks tables

* dataset: Added wikisql retrieval (#3039)

* Add WikiSQL retrieval task

- Code retrieval task based on WikiSQL natural language to SQL dataset
- Natural language questions matched to SQL query implementations
- Uses sql-Code evaluation language for SQL-specific metrics
- Includes proper citations and descriptive statistics

* Add WikiSQLRetrieval to imports

* Add descriptive statistics for WikiSQLRetrieval

* Reformatting

* Reformatting

* Reformatting, correcting the revision

* Update tasks & benchmarks tables

* ci: Temporarily limit pytrec version to "pytrec-eval-terrier>=0.5.6, <0.5.8" to prevent errors

try to fix CI

* fix MBPPRetrieval revision (#3055)

Update MBPPRetrieval.py

Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>

* fix: Add VN-MTEB benchmark and Leaderboard (#2995)

* [ADD] 50 vietnamese dataset from vn-mteb

* [UPDATE] task metadata

* [UPDATE] import dependencies

* [UPDATE] task metadata, bibtex citation

* [UPDATE-TEST] test_model_meta

* [UPDATE] sample_creation to machine-translated and LM verified

* [ADD] sample creation machine-translated and LM verified

* [ADD] VN-MTEB benchmark and leaderboard

* [FIX] wrong benchmark name

* [REMOVE] default fields metadata in Classification tasks

* Update tasks & benchmarks tables

* 1.38.43

Automatically generated by python-semantic-release

* Add hc3finance retrieval (#3041)

* Add HC3Finance retrieval task

- Financial retrieval task based on HC3 Finance dataset
- Financial questions matched to human and AI-generated content
- Covers financial explanations, analysis, and educational content
- Includes proper citations and descriptive statistics

* Add HC3FinanceRetrieval to imports

* Add descriptive statistics for HC3FinanceRetrieval

* Reformatting

* Reformatting, correcting the revision

* Update mteb/tasks/Retrieval/eng/HC3FinanceRetrieval.py

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Add finqa retrieval (#3042)

* Add FinQA retrieval task

- Financial numerical reasoning retrieval task based on FinQA dataset
- Numerical financial questions matched to relevant document data
- Covers earnings reports with tables and quantitative financial data
- Includes proper citations and descriptive statistics

* Add FinQARetrieval to imports

* Add descriptive statistics for FinQARetrieval

* Reformatting

* Reformatting

* Update mteb/tasks/Retrieval/eng/FinQARetrieval.py

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Update tasks & benchmarks tables

* Add FinanceBenchRetrieval task (#3044)

* Add FinanceBenchRetrieval

* Update mteb/tasks/Retrieval/eng/FinanceBenchRetrieval.py

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Update tasks & benchmarks tables

* Add FreshStackRetrieval task (#3043)

* Add FreshStackRetrieval

* Reformatting, correcting the revision

* Dataset correction

* Update tasks & benchmarks tables

* dataset: Add ds1000 retrieval (#3038)

* Add DS1000 retrieval task

- Code retrieval task based on 1,000 data science programming problems
- Natural language queries matched to Python data science code
- Uses python-Code evaluation language for code-specific metrics
- Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries

* Add DS1000Retrieval to imports

* Add descriptive statistics for DS1000Retrieval

* Reformatting

* Reformatting

* Update tasks & benchmarks tables

* Add ChatDoctorRetrieval (#3045)

* Add ChatDoctorRetrieval

* Reformatting, correcting the revision

* Correct the dataset citation

* Correcting due to comments

* Update tasks & benchmarks tables

* Correcting the (new) DS1000 dataset's revision (#3063)

* Add DS1000 retrieval task

- Code retrieval task based on 1,000 data science programming problems
- Natural language queries matched to Python data science code
- Uses python-Code evaluation language for code-specific metrics
- Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries

* Add DS1000Retrieval to imports

* Add descriptive statistics for DS1000Retrieval

* Reformatting

* Reformatting

* Add DS1000Retrieval task implementation

* dataset: Add JinaVDR (#2942)

* feat: added jinavdr benchmark

* feat: added description for jinavdr

* feat: fixed licenses and added bibtex

* feat: made jinav4 compatible with vidore benchmark

* feat: corrected query numbers

* feat: removed print

* feat: added max pixel argument for jina models

* feat: score calculation on cpu

* feat: adjust jina model for new mteb code

* feat: code cleanup

* feat: corrected bibtex

* feat: make colpali run with jinavdr

* feat: fixed comments

* feat: better reference and fixed comments

* feat: added date for tasks

* feat: fixed missing metadata and bibtex

* feat: added descriptions per dataset

* Update tasks & benchmarks tables

* model: Add CoDi-Embedding-V1 (#3054)

* add codiemb-minicpm

* replace codiemb_minicpm with codi_model

* Update mteb/models/codi_model.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/codi_model.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/codi_model.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* update code

* update code

* reformat

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* fix: ensure that there are always relevant docs attached to query (#3058)

* fix: ensure that there are always relevant docs attached to query

Here is a brief test that it doesn't influence scores:
```py
t1 = mteb.get_task("TwitterHjerneRetrieval")
meta = mteb.get_model_meta("minishlab/potion-base-2M")

eval = mteb.MTEB(tasks=[t1])
res = eval.run(model=meta.load_model())

# before fix:
res[0].get_score()  # np.float64(0.02837)
res[0].scores
before_fix = {
    "train": [
        {
            "ndcg_at_1": 0.02597,
            "ndcg_at_3": 0.02213,
            "ndcg_at_5": 0.0262,
            "ndcg_at_10": 0.02837,
            "ndcg_at_20": 0.04548,
            "ndcg_at_100": 0.13527,
            "ndcg_at_1000": 0.24507,
            "map_at_1": 0.00866,
            "map_at_3": 0.01317,
            "map_at_5": 0.0149,
            "map_at_10": 0.01562,
            "map_at_20": 0.01898,
            "map_at_100": 0.02968,
            "map_at_1000": 0.03841,
            "recall_at_1": 0.00866,
            "recall_at_3": 0.02056,
            "recall_at_5": 0.02922,
            "recall_at_10": 0.03355,
            "recall_at_20": 0.08268,
            "recall_at_100": 0.43766,
            "recall_at_1000": 1.0,
            "precision_at_1": 0.02597,
            "precision_at_3": 0.02165,
            "precision_at_5": 0.01818,
            "precision_at_10": 0.01039,
            "precision_at_20": 0.01234,
            "precision_at_100": 0.01481,
            "precision_at_1000": 0.0034,
            "mrr_at_1": 0.025974,
            "mrr_at_3": 0.041126,
            "mrr_at_5": 0.04632,
            "mrr_at_10": 0.048485,
            "mrr_at_20": 0.058356,
            "mrr_at_100": 0.070186,
            "mrr_at_1000": 0.071349,
            "nauc_ndcg_at_1_max": 0.33969,
            "nauc_ndcg_at_1_std": -0.202864,
            "nauc_ndcg_at_1_diff1": -0.127,
            "nauc_ndcg_at_3_max": 0.409376,
            "nauc_ndcg_at_3_std": -0.039352,
            "nauc_ndcg_at_3_diff1": -0.022816,
            "nauc_ndcg_at_5_max": 0.250499,
            "nauc_ndcg_at_5_std": -0.115263,
            "nauc_ndcg_at_5_diff1": -0.057017,
            "nauc_ndcg_at_10_max": 0.238696,
            "nauc_ndcg_at_10_std": -0.138396,
            "nauc_ndcg_at_10_diff1": -0.045287,
            "nauc_ndcg_at_20_max": 0.154456,
            "nauc_ndcg_at_20_std": -0.070635,
            "nauc_ndcg_at_20_diff1": 0.074499,
            "nauc_ndcg_at_100_max": -0.005753,
            "nauc_ndcg_at_100_std": -0.074738,
            "nauc_ndcg_at_100_diff1": -0.005851,
            "nauc_ndcg_at_1000_max": 0.109439,
            "nauc_ndcg_at_1000_std": -0.089797,
            "nauc_ndcg_at_1000_diff1": -0.021634,
            "nauc_map_at_1_max": 0.33969,
            "nauc_map_at_1_std": -0.202864,
            "nauc_map_at_1_diff1": -0.127,
            "nauc_map_at_3_max": 0.385244,
            "nauc_map_at_3_std": -0.080638,
            "nauc_map_at_3_diff1": -0.060991,
            "nauc_map_at_5_max": 0.294871,
            "nauc_map_at_5_std": -0.119069,
            "nauc_map_at_5_diff1": -0.06234,
            "nauc_map_at_10_max": 0.285698,
            "nauc_map_at_10_std": -0.132856,
            "nauc_map_at_10_diff1": -0.055015,
            "nauc_map_at_20_max": 0.236619,
            "nauc_map_at_20_std": -0.100673,
            "nauc_map_at_20_diff1": -0.002619,
            "nauc_map_at_100_max": 0.15345,
            "nauc_map_at_100_std": -0.138888,
            "nauc_map_at_100_diff1": -0.02257,
            "nauc_map_at_1000_max": 0.171402,
            "nauc_map_at_1000_std": -0.134644,
            "nauc_map_at_1000_diff1": -0.034477,
            "nauc_recall_at_1_max": 0.33969,
            "nauc_recall_at_1_std": -0.202864,
            "nauc_recall_at_1_diff1": -0.127,
            "nauc_recall_at_3_max": 0.375072,
            "nauc_recall_at_3_std": -0.009643,
            "nauc_recall_at_3_diff1": -0.089168,
            "nauc_recall_at_5_max": 0.147691,
            "nauc_recall_at_5_std": -0.128654,
            "nauc_recall_at_5_diff1": -0.084259,
            "nauc_recall_at_10_max": 0.141055,
            "nauc_recall_at_10_std": -0.165932,
            "nauc_recall_at_10_diff1": -0.060966,
            "nauc_recall_at_20_max": 0.043863,
            "nauc_recall_at_20_std": -0.028374,
            "nauc_recall_at_20_diff1": 0.157575,
            "nauc_recall_at_100_max": -0.157183,
            "nauc_recall_at_100_std": -0.019437,
            "nauc_recall_at_100_diff1": 0.013395,
            # "nauc_recall_at_1000_max": nan,
            # "nauc_recall_at_1000_std": nan,
            # "nauc_recall_at_1000_diff1": nan,
            "nauc_precision_at_1_max": 0.33969,
            "nauc_precision_at_1_std": -0.202864,
            "nauc_precision_at_1_diff1": -0.127,
            "nauc_precision_at_3_max": 0.406318,
            "nauc_precision_at_3_std": 0.007031,
            "nauc_precision_at_3_diff1": -0.034709,
            "nauc_precision_at_5_max": 0.178131,
            "nauc_precision_at_5_std": -0.112493,
            "nauc_precision_at_5_diff1": -0.045535,
            "nauc_precision_at_10_max": 0.167897,
            "nauc_precision_at_10_std": -0.150626,
            "nauc_precision_at_10_diff1": -0.027811,
            "nauc_precision_at_20_max": 0.081428,
            "nauc_precision_at_20_std": -0.042304,
            "nauc_precision_at_20_diff1": 0.17278,
            "nauc_precision_at_100_max": -0.150619,
            "nauc_precision_at_100_std": 0.016133,
            "nauc_precision_at_100_diff1": -0.065571,
            "nauc_precision_at_1000_max": -0.017244,
            "nauc_precision_at_1000_std": 0.046614,
            "nauc_precision_at_1000_diff1": -0.028258,
            "nauc_mrr_at_1_max": 0.33969,
            "nauc_mrr_at_1_std": -0.202864,
            "nauc_mrr_at_1_diff1": -0.127,
            "nauc_mrr_at_3_max": 0.409511,
            "nauc_mrr_at_3_std": -0.064671,
            "nauc_mrr_at_3_diff1": -0.01911,
            "nauc_mrr_at_5_max": 0.319584,
            "nauc_mrr_at_5_std": -0.103546,
            "nauc_mrr_at_5_diff1": -0.025109,
            "nauc_mrr_at_10_max": 0.309614,
            "nauc_mrr_at_10_std": -0.117564,
            "nauc_mrr_at_10_diff1": -0.019691,
            "nauc_mrr_at_20_max": 0.262976,
            "nauc_mrr_at_20_std": -0.092222,
            "nauc_mrr_at_20_diff1": 0.024507,
            "nauc_mrr_at_100_max": 0.256052,
            "nauc_mrr_at_100_std": -0.094249,
            "nauc_mrr_at_100_diff1": 0.012432,
            "nauc_mrr_at_1000_max": 0.260112,
            "nauc_mrr_at_1000_std": -0.098845,
            "nauc_mrr_at_1000_diff1": 0.009697,
            "main_score": 0.02837,
            "hf_subset": "default",
            "languages": ["dan-Latn"],
        }
    ]
}

# with update:
res[0].get_score()  # np.float64(0.02837)
with_fix = res[0].scores  # verbatim identical to the before_fix dict above

# check
with_fix == before_fix  # True
```

* restructure

* format

* relax pytrec versions

* fix incorrect parsing

* 1.38.44

Automatically generated by python-semantic-release

* Correcting the JINA models with SentenceTransformerWrapper (#3071)

* ci: Add stale workflow (#3066)

* add stale workflow

* add permissions

* add bug label to bug issue template

* revert bug issue and only look at more info needed issues

* more accurate name

* override default

* fix: open_clip package validation (#3073)

* 1.38.45

Automatically generated by python-semantic-release

* fix: Update revision for  qzhou models (#3069)

* 1.38.46

Automatically generated by python-semantic-release

* Fix the reference link for CoDi-Embedding-V1 (#3075)

Fix reference link

* fix: Add beta version of RTEB related benchmarks (#3048)

* Add RTEB related benchmarks

* Add RTEB related benchmarks

* Correcting the task names in the RTEB benchmarks

* Update mteb/leaderboard/benchmark_selector.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Adding the CURE dataset to RTEB benchmarks

* Use the right language subset

* Fix broken finance icon URL in RTEB benchmarks

Replace broken libre-finance-dollar.svg with working libre-gui-price-tag.svg
Validated all icon URLs and confirmed accessibility compliance

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

* Add the rteb_benchmarks to the BENCHMARK_REGISTRY

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* 1.38.47

Automatically generated by python-semantic-release

* fix: run `ruff check` on all files during ci (#3086)

* fix: run `ruff check` on all files during ci

* format

* 1.38.48

Automatically generated by python-semantic-release

* Move dev to dependency groups (#3088)

add dependency groups

* fix: Improving validate_task_to_prompt_name logs and error messages (#3079)

* Improving validate_task_to_prompt_name logs and error messages

* linter fixes

* Adding None prompts tests

* Update test_benchmark_sentence_transformer

* Update mteb/leaderboard/benchmark_selector.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* fix: duplicate mteb multilingual variables (#3080)

* fix benchmark naming

* format

* lint

* Update tasks & benchmarks tables

* model: mdbr-leaf models (#3081)

* added MDBR leaf models

* fixed revision for mdbr-leaf-ir

* added model prompts

* updated training datasets

* fixed linting

* lotte task reference

---------

Co-authored-by: Robin Vujanic <robin.vujanic@mongodb.com>

* 1.38.49

Automatically generated by python-semantic-release

* CI: Set upper limit for xdist version  (#3098)

* Commentout bibtex formatting

* Remove `-n auto`

* get back bibtex

* try limiting versions

* revert coverage

* revert coverage

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* Combine Plots and Tables into a Single Tab (#3047)

* feat - Combine Plots and Tables into a Single Tab #3009

* feat - Resize the plot to make it more readable

* feat - Remove the (radar chart)

* feat - Add a comment stating that it only shows the Top 5 models in the table.

* feat - adjust layout

* Update mteb/leaderboard/app.py

* format

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

* fix: Updating the default batch size calculation in the voyage models (#3091)

* 1.38.50

Automatically generated by python-semantic-release

* fix: Add @classmethod for @field_validators in TaskMetadata  (#3100)

* Align task prompt dict with `PromptType` (#3101)

* align task prompt dict with `PromptType`

* use value instead of enum

* 1.38.51

Automatically generated by python-semantic-release

* model: Add ModelMeta for OrdalieTech/Solon-embeddings-mini-beta-1.1 (#3090)

* Add ModelMeta for OrdalieTech/Solon-embeddings-mini-beta-1.1

* Add training_datasets (common_corpus, fineweb, wiki_fr, private LLM-synth)

* Format with ruff + add loader per review

* Apply ruff format/fixes

* Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Register OrdalieTech/Solon-embeddings-mini-beta-1.1 in overview (ModelMeta + loader)

* Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* fix import

* Add memory_usage_mb=808.0 and required fields to ModelMeta

* Fix 210 million parameters

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* fix: Allow closed datasets (#3059)

* - Added an include_private parameter to the get_tasks() function that defaults to False
  - This ensures that by default, tests only run on public datasets
  - Tests can explicitly set include_private=True when needed to test private datasets

  - Added is_public: bool | None = None field to TaskMetadata
  - The field is optional and defaults to None (treated as public)
  - Updated the is_filled() method to exclude is_public from required fields
  - Added documentation

* Correcting due to comments

* Update mteb/abstasks/TaskMetadata.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update mteb/overview.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Removing the not used filter_tasks_by_privacy function

* Correcting due to comments

* Correcting due to comments

* Correcting due to comments

* Removing the test case

* Rename the include_private parameter to exclude_private

* Rename the include_private parameter to exclude_private

* Add private tasks tests

* Add private tasks tests

* Update tests/test_tasks/test_private_tasks.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Add private tasks tests

* Add private tasks tests

* Add private tasks tests

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
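
A minimal usage sketch of the resulting API, based only on the commit messages above (exact signatures may differ):

```python
import mteb

# By default private/closed datasets are excluded (exclude_private=True),
# so tests and users only see public tasks unless they opt in.
public_tasks = mteb.get_tasks(task_types=["Retrieval"])

# Opting in to private datasets; assumes the `exclude_private` flag described above.
all_tasks = mteb.get_tasks(task_types=["Retrieval"], exclude_private=False)

# `is_public` is optional: None is treated the same as True (public).
closed_tasks = [t for t in all_tasks if t.metadata.is_public is False]
```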

* 1.38.52

Automatically generated by python-semantic-release

* Ci: test out GH models with welcoming newcomers (#3112)

test out GH models with welcoming newcomers

* ci: Dataset check on new PR (#3103)

* add dataset check on new PR

* add extract datasets

* run as module

* update startswith

* update workflow name

* add GitPython

* export var

* same shell session

* address review comments

* add to docs to say what this script does

* add docs
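
A rough sketch of what such a dataset check could look like with GitPython (the ref names and repo layout here are assumptions, not the actual script):

```python
from git import Repo  # GitPython, added as a dependency in this PR

repo = Repo(".")
# Files changed on the PR branch relative to main.
changed = repo.git.diff("origin/main...HEAD", name_only=True).splitlines()
task_files = [f for f in changed if f.startswith("mteb/tasks/")]
print("Task files to smoke-test:", task_files)
```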

* model: add Youtu-Embedding-V1 (#3115)

* add youtu models

* add a blank line

* fix the optional dependencies and lint the code

* remove unused dependencies and reformat

* revise prompt_type

---------

Co-authored-by: springxchen <springxchen@tencent.com>

* fix: add voyage quantization models (#3092)

* Adding quantization support

* Update mteb/models/voyage_models.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update mteb/model_meta.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/model_meta.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Simplifying the quantization/output_dtype

* Update mteb/model_meta.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
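
For context, Voyage exposes quantization through an output dtype on its embed call; an illustrative request (a sketch, not mteb's wrapper code):

```python
import voyageai

vo = voyageai.Client()  # assumes VOYAGE_API_KEY is set in the environment
result = vo.embed(
    ["an example document"],
    model="voyage-3-large",
    output_dtype="int8",  # quantized output, e.g. "float", "int8", "binary"
)
print(len(result.embeddings[0]))
```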

* 1.38.53

Automatically generated by python-semantic-release

* model: EmbeddingGemma 300M (#3129)

* model: EmbeddingGemma 300M

* Add license and revision

* fix: Add dedicated display for RTEB benchmark results (#3089)

* feat - remove special filtering, keep zero-shot, keep borda rank

* feat - remove get_rteb_benchmark.py

* feat - delete get_rteb_benchmark.py;RTEB_BENCHMARK_ENTRIES changes

* feat -format

* Update mteb/load_results/benchmark_results.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update tasks & benchmarks tables

* 1.38.54

Automatically generated by python-semantic-release

* dataset: Add Dapfam patent retrieval tasks (#2946)

* chore: add 'Patent retrieval' subtype to TaskMetadata

* feat(retrieval): add DAPFAM patent retrieval tasks (+18 variants)

* Dapfam patent retrieval PR #2946: refactor DAPFAM tasks (explicit classes, license, metadata, custom definition explanation ...)

* Dapfam patent retrieval PR #2946: refactor DAPFAM tasks (explicit classes, license, metadata, custom definition explanation ...)

* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Changes:

- Added the possibility to opt in or out of quantization through the "quantize" argument.
- Added the possibility to compute the raw dot product without normalization (to reproduce the paper method, the "similarity" argument should be "cosine").
- Removed an unnecessary function and overhauled the task descriptions to be clearer.

* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py

* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Changes made:
- Overhauled task descriptions as well as naming to conform with the naming scheme of mteb retrieval tasks.
- Similarity is now computed using the similarity function of the passed model.
- Changed the optional quantization method to conform with the sentence-transformers similarity function.

To reproduce the paper metrics, one can use the following snippet:

```python
import mteb
from sentence_transformers import SentenceTransformer

model_name = "Snowflake/snowflake-arctic-embed-m-v2.0"
model = (
    SentenceTransformer(
        model_name,
        model_kwargs={"torch_dtype": "float16"},
        trust_remote_code=True,
    )
    .cuda()
    .eval()
)

tasks = mteb.get_tasks(
    tasks=[
        "DAPFAMInTitlAbsToTitlAbsClmRetrieval",
        "DAPFAMAllTitlAbsToTitlAbsClmRetrieval",
        "DAPFAMOutTitlAbsToTitlAbsClmRetrieval",
        # add the other 3 remaining tasks ...
    ]
)

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(
    model,
    output_folder=f"mteb_res/{model_name}",
    quantize=True,  # if set to False or not set, the obtained ndcg@10 and map@10 will be ~0.001 higher
    encode_kwargs={"batch_size": 32},
)
```

* changed default value of quantization to false

* added the import to all DAPFAM tasks; tested that it works; verified compliance with the checklist

* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* added revision numbers to all dataset loading operations as well as the metadata itself

* intermediate changes, refresh local branch

* intermediate changes, refresh local branch again

* scale back to standard evaluation with empty set exclusion, various cosmetic/formatting changes

* minor cosmetic/formatting changes

* fixed main metric to be ndcg_at_100 as in the paper

* removed old code artifacts from previous versions

* read appropriate loading arguments from task metadata, remove unnecessary class attribute

* reformat bibtex (remark on the assertion: it tries to match a literal string instead of bibtex formatting, a format inconsistent with the arXiv default), fixed metadata, parameters read from task metadata, all tests passed

* refactor data loading to read from metadata class attributes

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update tasks & benchmarks tables

* Align max tokens (#3172)

* Correct the VoyageAI model's batch creation/batch size calculation (#3185)

Correct the batch creation
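
Batching against an embedding API is typically constrained by both a token budget and a batch-size cap; a generic sketch of the kind of calculation being corrected (the limits are placeholders, not Voyage's actual values):

```python
from collections.abc import Callable, Iterable, Iterator

def make_batches(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 100_000,  # placeholder token budget per request
    max_batch: int = 1_000,     # placeholder batch-size cap
) -> Iterator[list[str]]:
    batch: list[str] = []
    used = 0
    for text in texts:
        n = count_tokens(text)
        # Close the current batch before either limit would be exceeded.
        if batch and (used + n > max_tokens or len(batch) >= max_batch):
            yield batch
            batch, used = [], 0
        batch.append(text)
        used += n
    if batch:
        yield batch
```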

* dataset: Adding JapaneseCode1Retrieval as the first non-public dataset (#3168)

* Adding JapaneseCode1Retrieval as the first non-public dataset

* Transformed dataset

* Adding as private dataset to tests

* Correct the private task test

* Use the sample dataset as a reference

* Use the sample dataset as a reference

* fix ds loading

* allow on forks

* upd action

* remove paths

* try to trigger ci

* add ref

* add permissions

* remove paths

* add paths back

* get back to pull request

* rollback action

* Trying to resolve the token/secret problem

* Trying to resolve the token/secret problem

* Update dataset_loading_pr.yml

* Update dataset_loading_pr.yml

* Try the latest datasets package (worked for me)

* Try the latest datasets package (worked for me)

* Try the latest datasets package (worked for me)

* (last?) try

* (last?) try

* (last?) try

* Reverting the changes

* Exclude the private datasets from tests

* Apply suggestions from code review

---------

Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Solomatin Roman <samoed.roman@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* fix: add version check for `embeddinggemma-300m` (#3189)

add version check

* dataset: Added a set of closed datasets (#3186)

* Add 12 more closed datasets
Extend the RTEB benchmarks

* trust_remote_code

* trust_remote_code

* Enabling JapaneseCode1Retrieval in the RTEB benchmarks

* Add closed datasets as private tasks

* Correct due to the comment

* Update tasks & benchmarks tables

* fix: Edit ack & sponsors (#3187)

* dataset: Update FaMTEB to Version 2 (#3157)

* Update benchmark to version 2

* make others in benchmark selector one line code

* small changes

* update a few tasks metadata

* update faintent license with correct form

* remove redundant trust remote codes

* fix hardnegatives revision

* make lint

* fix errors

* apply suggestions

* fix citation problem

* add PR link to benchmark desc

* remove duplicate dataset names in mcinext_models

* update prompts

---------

Co-authored-by: mehran <mehan.sarmadi16@gmail.com>

* Update tasks & benchmarks tables

* 1.38.55

Automatically generated by python-semantic-release

* fix: Add conflicting dependencies to toml (#3191)

fix conflict dependencies

* 1.38.56

Automatically generated by python-semantic-release

* fix: Correct metadata for ArguAna dataset (#3202)

* Update tasks & benchmarks tables

* 1.38.57

Automatically generated by python-semantic-release

* model: Add BMRetriever (#3195)

* model: Add BMRetriever

* Update mteb/models/bmretriever_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/bmretriever_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* fix: remove trust_remote_code option

* feat: implement BMRetrieverWrapper based on InstructSentenceTransformerWrapper

* refactor: update training datasets for bmretriever

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Revert "Ci: test out GH models with welcoming new comers" (#3206)

Revert "Ci: test out GH models with welcoming new comers (#3112)"

This reverts commit 73a35e0bb02e61108d50385f4c43fd7d1b16e984.

* model: Add Codefuse models (#3205)

* add codefuse models

* add codefuse models

* Update codefuse_models.py

* lint codefuse.py

* fix(models): ensure prompt_type is passed to format_instruction (#3216)

* 1.38.58

Automatically generated by python-semantic-release

* Adding Cohere's output_dimension and embedding_type parameter (#3204)

* Adding Cohere's output_dimension and embedding_type parameter
Cohere's embed-v4 binary and int8

* Correcting due to comments
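
An illustrative call against Cohere's v2 SDK showing both parameters (a sketch of the underlying API; mteb wires these through its model wrapper):

```python
import cohere

co = cohere.ClientV2()  # assumes COHERE_API_KEY is set
resp = co.embed(
    texts=["an example document"],
    model="embed-v4.0",
    input_type="search_document",
    output_dimension=512,      # truncated output dimension
    embedding_types=["int8"],  # or "float", "binary", ...
)
```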

* dataset: add swedish cpc patent classifications to mteb (#3072)

* feat: add swedish cpc patent classifications to mteb

* fix: formatting and init imports

* fix: update mteb task according to feedback

* fix: perform citation and code formatting

* fix: add train and test split for both datasets

* fix: AttributeError in ColPaliEngineWrapper similarity method (#3177)

* fix: delete kwargs for similarity score in ColPaliEngineWrapper for method behavior

* chore: fix colpali_models similarity device handling

* Update tasks & benchmarks tables

* 1.38.59

Automatically generated by python-semantic-release

* fix: prevent EOS token truncation (#3218)

* fix(models): prevent EOS token truncation for BMRetriever

* refactor(models): refactor tokenizer setup in `InstructSentenceTransformerWrapper`

* fix(models): correct eos token handling in `BMRetrieverWrapper`
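
The usual shape of such a fix is to reserve room for the EOS token before truncation and re-append it if the tokenizer dropped it; an illustrative sketch (not the actual wrapper code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BMRetriever/BMRetriever-410M")

def encode_with_eos(text: str, max_length: int = 512) -> list[int]:
    # Truncate one token early so the EOS token always fits.
    ids = tokenizer(text, truncation=True, max_length=max_length - 1)["input_ids"]
    if not ids or ids[-1] != tokenizer.eos_token_id:
        ids.append(tokenizer.eos_token_id)
    return ids
```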

* 1.38.60

Automatically generated by python-semantic-release

* Update giga embeddings (#3210)

* update giga embeddings

* update giga embeddings

* 3b-september-2025

* fixed

* lint

* Update mteb/models/ru_sentence_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* change revision due to flash-attn dependency

* change apply_instruction_to_passages

---------

Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>

* fix: Refactor split create_tables into static Benchmark methods (#3126)

* feat - Split create_tables into static Benchmark methods

* feat - format

* Update mteb/leaderboard/table.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* feat - remove search query; take benchmark result as input; address the circular import

* feat - format

* Update mteb/benchmarks/benchmark.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update mteb/benchmarks/benchmark.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* feat - use to_dataframe; clean table.py; move create_table

* feat - fix circular import

* feat - clean-up

* feat - format

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
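
Roughly, the refactor moves table building onto `Benchmark` as static methods that take a benchmark result and build on `to_dataframe()`; a simplified sketch (column names are illustrative):

```python
import pandas as pd

class Benchmark:  # simplified stand-in for mteb's Benchmark class
    @staticmethod
    def create_summary_table(benchmark_results) -> pd.DataFrame:
        # `to_dataframe()` gives one row per (model, task); aggregate per model.
        df = benchmark_results.to_dataframe()
        return df.groupby("model_name", as_index=False)["score"].mean()
```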

* 1.38.61

Automatically generated by python-semantic-release

* Extending the RTEB benchmark (#3223)

Adding another voyageai model

* Update tasks & benchmarks tables

* model: New qzmodel (#3211)

* Update qzhou_models.py

* Update qzhou_models.py

* reformat script code

* Update configuration

* According to our new decision, the model name has been changed to "QZhou-Embedding-Zh".

* Fix variable naming issues.

* model: Update Youtu embedding model (#3227)

* add youtu models

* add a blank line

* fix the optional dependencies and lint the code

* remove unused dependencies and reformat

* revise prompt_type

* update youtu_models

---------

Co-authored-by: springxchen <springxchen@tencent.com>

* dataset: Add Software Issue Localization Datasets (#3178)

* add software issue localization datasets

* add software issue localization datasets

* update and add multilingual datasets

* fix citation format issues

* Update mteb/tasks/Reranking/eng/SWEbenchVerifiedReranking.py

* fix linting issues

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update tasks & benchmarks tables

* feat: Officially include RTEB in the leaderboard (#3222)

* feat - adjust Rteb's Benchmark

* feat - add blank

* fix menu names

* Update mteb/leaderboard/benchmark_selector.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* moving around tasks

* fix: Update RTEB summary columns (#3226)

* feat - filter_by_privacy

* feat - add new fields for rteb part

* feat - getattr

* feat - adjust privacy filter logic

* feat - enhance summary table column renaming and add 'is_public' field mapping

* fix: remove unused 'is_public' attribute from TaskResult

---------

Co-authored-by: Yongbin Choi <whybe.choi@gmail.com>
Co-authored-by: semantic-release <semantic-release>
Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com>
Co-authored-by: Atheer <atheer2104@protonmail.com>
Co-authored-by: Yong woo Song <ywsong.dev@kakao.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Egor <31567312+ekolodin@users.noreply.github.com>
Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: smile <smile@pinai.io>
Co-authored-by: ethan <smiletoye@gmail.com>

* removed show_rteb args

* avoid defining function where we can just use the metadata

* minor fixes

* minor fixes

* fix: Correct logic for filtering public tasks in ModelResult class (#3230)

Co-authored-by: ethan <smiletoye@gmail.com>

---------

Co-authored-by: q275343119 <275343119@qq.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: 笑尿伊人 <44760272+q275343119@users.noreply.github.com>
Co-authored-by: Yongbin Choi <whybe.choi@gmail.com>
Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com>
Co-authored-by: Atheer <atheer2104@protonmail.com>
Co-authored-by: Yong woo Song <ywsong.dev@kakao.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Egor <31567312+ekolodin@users.noreply.github.com>
Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>
Co-authored-by: smile <smile@pinai.io>
Co-authored-by: ethan <smiletoye@gmail.com>
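
The privacy filtering described in the commits above plausibly reduces to something like this (a sketch; the `getattr` default mirrors the "None means public" convention):

```python
def filter_by_privacy(task_results: list, exclude_private: bool = True) -> list:
    # A result is public unless its task explicitly says otherwise
    # (is_public=None is treated as public).
    if not exclude_private:
        return task_results
    return [r for r in task_results if getattr(r, "is_public", True) is not False]
```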

* Update tasks & benchmarks tables

* 1.39.0

Automatically generated by python-semantic-release

* fix: Add submission references for RTEB (#3233)

* fix: Add rteb submission references and improve descriptions.

* Added evaluation request

* added field for tasks

* 1.39.1

Automatically generated by python-semantic-release

* dataset: add human tasks and benchmark (#3214)

* Human Subsets Tasks

* Fixed Multilingual Classification Subset

* linting

* fix citations format

* make lint

* fix tests

* remove human folder

* fix relative imports

* add adapted_from for all human subsets

* fix pydantic errors

* add benchmark object

* make benchmark discoverable

* bibtex test

* Apply suggestion

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* Apply suggestions from code review

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* rename & reupload

* upd tests

* upd tests again

* add model

* add benchmark to leaderboard

* change branch of leaderboard

* remove branch of load data

* fix model meta path

* make mteb importable

* update repo

* Update mteb/benchmarks/benchmarks/benchmarks.py

* Update mteb/leaderboard/benchmark_selector.py

* Update mteb/load_results/load_results.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

---------

Co-authored-by: Adnan El Assadi <aassadi22@ku.edu.tr>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: AdnanElAssadi56 <115242814+AdnanElAssadi56@users.noreply.github.com>

* Update tasks & benchmarks tables

* Remove 'HUME(v1)' from leaderboard benchmark (#3236)

* Remove 'HUME(v1)' from leaderboard benchmark

* lint

* docs: Update adding benchmark documentation (#3229)

* update adding_a_benchmark.md documentation

* fix numbers

* fix: Further specified macro-language code for Norwegian (#3228)

* fix: Further specified macro-language code for Norwegian

"nor" is a macro-language code that covers bokmål and nynorsk (both norwegian), but this means that these datasets will be missed if using "nob" or "nno". Specifying it like this should allow this.

* further specified macro-language "nor"
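
In task-metadata terms, that means listing both specific codes rather than the macro-language (values here are illustrative):

```python
# Before: the macro-language code hides the dataset from "nob"/"nno" filters.
eval_langs = ["nor-Latn"]

# After: both Norwegian written standards are matched explicitly.
eval_langs = ["nob-Latn", "nno-Latn"]
```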

* Update tasks & benchmarks tables

* 1.39.2

Automatically generated by python-semantic-release

* fix max tokens (#3243)

* fix python39 transformers compatibility (#3254)

* fix python39 transformers

* fix

* Aggregate by subset for HUMEv1 (#3255)

aggregate by subset for HUMEv1
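
Subset-level aggregation averages within each subset first and then across subsets, so large subsets do not dominate a task's score; a generic sketch:

```python
def aggregate_by_subset(scores_per_subset: dict[str, list[float]]) -> float:
    # Mean within each subset, then mean of the subset means.
    subset_means = [sum(s) / len(s) for s in scores_per_subset.values()]
    return sum(subset_means) / len(subset_means)
```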

* Update tasks & benchmarks tables

* Fix AbsTaskTextRegression task (#3257)

Fix AbsTaskTextRegression

* Added Japanese to Retrieval (#3252)

* feat - add Japanese

* feat - use mteb.get_benchmark

* fix - 3.9 test error

* Revert "fix - 3.9 test error"

This reverts commit 6bfee53cff48304cc22d8248aa275dcc9e385475.

* fix - 3.9 test error

* Update tasks & benchmarks tables

* fix bm25 on small datasets (#3261)

* fix: Move zero-shot percentage calculation to the end of summary (#3231)

* Refactor: Move zero-shot percentage calculation to the end of summary table creation, which only applies to the RTEB table.

* Update RTEB benchmark name from "RTEB(beta)" to "RTEB" for consistency in display.

* feat - RTEB(beta)

* feat - remove Zero-shot

---------

Co-authored-by: ethan <smiletoye@gmail.com>

* model: Add ReasonIR (#3221)

* model: Add ReasonIR

* Update mteb/models/reasonir_model.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/reasonir_model.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* update n_parameters of ReasonIR

Co-authored-by: Niklas <n.muennighoff@gmail.com>

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Niklas <n.muennighoff@gmail.com>

* fix: Only pin model name and rank (#3263)

Currently we pin 3 columns, which makes the table hard or impossible to view on phones. The 3rd column is also no longer guaranteed, as the RTEB leaderboard does not use the zero-shot column.

* 1.39.3

Automatically generated by python-semantic-release

* fix: resolve flash-attention dependency issue (#3265)

* fix: Only pin model name and rank

Currently we pin 3 columns, which makes the table hard or impossible to view on phones. The 3rd column is also no longer guaranteed, as the RTEB leaderboard does not use the zero-shot column.

* fix: resolve flash-attention dependency issue

This has been tested and works.

Fixes #3240: resolve flash-attention dependency issues.

* 1.39.4

Automatically generated by python-semantic-release

* fix: Add retry and token counting in Cohere models (#3253)

* Retry and token counting in Cohere models

* Retry and token counting in Cohere models

* Retry and token counting in Cohere models

---------

Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
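
An illustrative retry-with-backoff loop of the kind this adds (a hypothetical helper; the real wrapper's limits and error handling may differ):

```python
import time

def embed_with_retry(client, texts: list[str], max_retries: int = 5):
    for attempt in range(max_retries):
        try:
            # Assumes Cohere's v2 client; token usage can be read from the
            # response metadata for counting against rate limits.
            return client.embed(
                texts=texts, model="embed-v4.0", input_type="search_document"
            )
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2**attempt)  # exponential backoff: 1s, 2s, 4s, ...
```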

* 1.39.5

Automatically generated by python-semantic-release

* Align MIEB leaderboards with paper (#3272)

* sort by mean task type and use pure rank for MIEB LBs

* lint

* rename task type column for readability

* fix: add prompt for MIRACLRetrievalHardNegatives (#3266)

* add prompt for MIRACLRetrievalHardNegatives

* add `MIRACLRetrievalHardNegatives.v2`

* Update mteb/tasks/Retrieval/multilingual/MIRACLRetrieval.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* move common metadata to dict

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update tasks & benchmarks tables

* Add Regression task mock (#3271)

* 1.39.6

Automatically generated by python-semantic-release

* fix: Change language for task SlovakMovieReviewSentimentClassification (#3296)

* Update tasks & benchmarks tables

* 1.39.7

Automatically generated by python-semantic-release

* Add english code retriever model (#3302)

* Add en code retriever model

* fix model_name

* Update mteb/models/en_code_retriever.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* correct lint

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* docs: fix typos in `docs/adding_a_benchmark.md` (#3344)

* BREAKING: v2.0.0 (#1433)

* [v2] Merge…