dataset: add human tasks and benchmark #3214
Conversation
mteb/tasks/Classification/multilingual/MultilingualSentimentClassificationHumanSubset.py
mteb/tasks/Reranking/eng/human/Robust04InstructionRerankingHumanSubset.py
mteb/tasks/Classification/eng/EmotionClassificationHumanSubset.py
I'll add a benchmark object as well
KennethEnevoldsen left a comment
Looks good - a few minor changes:
1. should we add it to the leaderboard as well?
2. should we add a "human" model (otherwise the scores won't appear on the leaderboard)? Should probably be filtered out by default.

1) and 2) are probably for another PR
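A rough sketch of the default-filtering idea (the model name and function are hypothetical, not the final implementation):

```python
# Hypothetical sketch: hide the "human" pseudo-model from leaderboard
# tables unless explicitly requested.
HUMAN_MODEL_NAME = "mteb/human"  # assumed placeholder name


def filter_leaderboard_models(model_names: list[str], show_human: bool = False) -> list[str]:
    """Drop the human baseline by default so it doesn't rank against real models."""
    if show_human:
        return model_names
    return [name for name in model_names if name != HUMAN_MODEL_NAME]
```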
mteb/tasks/Classification/eng/EmotionClassificationHumanSubset.py
Yes, I wanted to do it like this. I will add it after adding the "human" model.
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
```py
# Combine: subsampled train + human test
self.dataset = DatasetDict(
    {
        "train": original_with_train["test"],  # This becomes our training data
```
@AdnanElAssadi56 Why do we use the test split from the original dataset as train?
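For context, a hedged reconstruction of the pattern being asked about (dataset repository names are illustrative): the original task's test split is reused as training data, while the human-annotated subset becomes the new test split.

```python
from datasets import DatasetDict, load_dataset

original = load_dataset("mteb/emotion")            # assumed source dataset
human = load_dataset("mteb/emotion-human-subset")  # assumed human-annotated subset

dataset = DatasetDict(
    {
        "train": original["test"],  # original test split becomes the training data
        "test": human["test"],      # the human subset is what actually gets evaluated
    }
)
```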
AdnanElAssadi56 left a comment
Looks good to me! I guess we'll stick with the results from here and update the paper.
@AdnanElAssadi56 results on the LB are aggregated differently. Instead, we can explore a custom aggregation (needed in MIEB as well), as @Samoed suggested.
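One possible shape for such a custom aggregation (a sketch of the idea, not what MTEB currently does): average within each hf_subset first, then average the subset means, so no single subset dominates the task score.

```python
from statistics import mean

def aggregate_by_subset(subset_scores: dict[str, list[float]]) -> float:
    """Mean of per-subset means; `subset_scores` maps hf_subset -> main scores."""
    return mean(mean(scores) for scores in subset_scores.values())
```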
We can merge this, but without adding it to the leaderboard.
Yeah. Let's do that in a separate PR.
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* fix: Correct metadata for ArguAna dataset (#3202)
* Update tasks & benchmarks tables
* 1.38.57
Automatically generated by python-semantic-release
* model: Add BMRetriever (#3195)
* model: Add BMRetriever
* Update mteb/models/bmretriever_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/bmretriever_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* fix: remove trust_remote_code option
* feat: implement BMRetrieverWrapper based on InstructSentenceTransformerWrapper
* refactor: update training datasets for bmretriever
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Revert "Ci: test out GH models with welcoming newcomers" (#3206)
Revert "Ci: test out GH models with welcoming newcomers (#3112)"
This reverts commit 73a35e0.
* model: Add Codefuse models (#3205)
* add codefuse models
* add codefuse models
* Update codefuse_models.py
* lint codefuse.py
* fix(models): ensure prompt_type is passed to format_instruction (#3216)
* 1.38.58
Automatically generated by python-semantic-release
* Adding Cohere's output_dimension and embedding_type parameter (#3204)
* Adding Cohere's output_dimension and embedding_type parameter
Cohere's embed-v4 binary and int8
* Correcting due to comments
* dataset: add swedish cpc patent classifications to mteb (#3072)
* feat: add swedish cpc patent classifications to mteb
* fix: formatting and init imports
* fix: update mteb task according to feedback
* fix: perform citation and code formatting
* fix: add train and test split for both datasets
* fix: AttributeError in ColPaliEngineWrapper similarity method (#3177)
* fix: delete kwargs for similarity score in ColPaliEngineWrapper for method behavior
* chore: fix colpali_models similarity handle device
* Update tasks & benchmarks tables
* 1.38.59
Automatically generated by python-semantic-release
* fix: prevent EOS token truncation (#3218)
* fix(models): prevent EOS token truncation for BMRetriever
* refactor(models): refactor tokenizer setup in `InstructSentenceTransformerWrapper`
* fix(models): correct eos token handling in `BMRetrieverWrapper`
* 1.38.60
Automatically generated by python-semantic-release
* Update giga embeddings (#3210)
* update giga embeddings
* update giga embeddings
* 3b-september-2025
* fixed
* lint
* Update mteb/models/ru_sentence_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* change revision due to flash-attn dependency
* change apply_instruction_to_passages
---------
Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>
* fix: Refactor split create_tables into static Benchmark methods (#3126)
* feat - Split create_tables into static Benchmark methods
* feat - format
* Update mteb/leaderboard/table.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* feat - remove search query; take benchmark result as input; address the circular import
* feat - format
* Update mteb/benchmarks/benchmark.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/benchmarks/benchmark.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* feat - use to_dataframe; clean table.py; move create_table
* feat - fix circular import
* feat - clean-up
* feat - format
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* 1.38.61
Automatically generated by python-semantic-release
* Extending the RTEB benchmark (#3223)
Adding another voyageai model
* Update tasks & benchmarks tables
* model: New qzmodel (#3211)
* Update qzhou_models.py
* Update qzhou_models.py
* reformat script code
* Update configuration
* According to our new decision, the model name has been changed to "QZhou-Embedding-Zh".
* Fix variable naming issues.
* model: Update Youtu embedding model (#3227)
* add youtu models
* add a blank line
* fix the optional dependencies and lint the code
* remove unused dependencies and reformat
* revise prompt_type
* update youtu_models
---------
Co-authored-by: springxchen <springxchen@tencent.com>
* dataset: Add Software Issue Localization Datasets (#3178)
* add software issue localization datasets
* add software issue localization datasets
* update and add multilingual datasets
* fix citation format issues
* Update mteb/tasks/Reranking/eng/SWEbenchVerifiedReranking.py
* fix linting issues
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update tasks & benchmarks tables
* feat: Officially include RTEB in the leaderboard (#3222)
* feat - adjust Rteb's Benchmark
* feat - add blank
* fix menu names
* Update mteb/leaderboard/benchmark_selector.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* moving around tasks
* fix: Update RTEB summary columns (#3226)
* feat - filter_by_privacy
* feat - add new fields for rteb part
* feat - getattr
* feat - adjust privacy filter logic
* feat - enhance summary table column renaming and add 'is_public' field mapping
* fix: remove unused 'is_public' attribute from TaskResult
---------
Co-authored-by: Yongbin Choi <whybe.choi@gmail.com>
Co-authored-by: semantic-release <semantic-release>
Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com>
Co-authored-by: Atheer <atheer2104@protonmail.com>
Co-authored-by: Yong woo Song <ywsong.dev@kakao.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Egor <31567312+ekolodin@users.noreply.github.com>
Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: smile <smile@pinai.io>
Co-authored-by: ethan <smiletoye@gmail.com>
* removed show_rteb args
* avoid defining function where we can just use the metadata
* minor fixes
* minor fixes
* fix: Correct logic for filtering public tasks in ModelResult class (#3230)
Co-authored-by: ethan <smiletoye@gmail.com>
---------
Co-authored-by: q275343119 <275343119@qq.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: 笑尿伊人 <44760272+q275343119@users.noreply.github.com>
Co-authored-by: Yongbin Choi <whybe.choi@gmail.com>
Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com>
Co-authored-by: Atheer <atheer2104@protonmail.com>
Co-authored-by: Yong woo Song <ywsong.dev@kakao.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Egor <31567312+ekolodin@users.noreply.github.com>
Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>
Co-authored-by: smile <smile@pinai.io>
Co-authored-by: ethan <smiletoye@gmail.com>
* Update tasks & benchmarks tables
* 1.39.0
Automatically generated by python-semantic-release
* fix: Add submission references for RTEB (#3233)
* fix: Add rteb submission references and improve descriptions.
* Added evaluation request
* added field for tasks
* 1.39.1
Automatically generated by python-semantic-release
* dataset: add human tasks and benchmark (#3214)
* Human Subsets Tasks
* Fixed Multilingual Classification Subset
* linting
* fix citations format
* make lint
* fix tests
* remove human folder
* fix relative imports
* add adapted_from for all human subsets
* fix pydantic errors
* add benchmark object
* make benchmark discoverable
* bibtex test
* Apply suggestion
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Apply suggestions from code review
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* rename & reupload
* upd tests
* upd tests again
* add model
* add benchmark to leaderboard
* change branch of leaderboard
* remove branch of load data
* fix model meta path
* make mteb importable
* update repo
* Update mteb/benchmarks/benchmarks/benchmarks.py
* Update mteb/leaderboard/benchmark_selector.py
* Update mteb/load_results/load_results.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
---------
Co-authored-by: Adnan El Assadi <aassadi22@ku.edu.tr>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: AdnanElAssadi56 <115242814+AdnanElAssadi56@users.noreply.github.com>
* Update tasks & benchmarks tables
* Remove 'HUME(v1)' from leaderboard benchmark (#3236)
* Remove 'HUME(v1)' from leaderboard benchmark
* lint
* docs: Update adding benchmark documentation (#3229)
* update adding_a_benchmark.md documentation
* fix numbers
* fix: Further specified macro-language code for Norwegian (#3228)
* fix: Further specified macro-language code for Norwegian
"nor" is a macro-language code that covers bokmål and nynorsk (both Norwegian), but this means that these datasets will be missed if using "nob" or "nno". Specifying it like this should avoid that.
* further specified macro language "nor"
* Update tasks & benchmarks tables
* 1.39.2
Automatically generated by python-semantic-release
* fix max tokens (#3243)
* fix models
* fix imports
* fix task import
* reupload HUME tasks
* reupload SWE tasks
* add stats
* fix python39 transformers compatibility (#3254)
* fix python39 transformers
* fix
* Aggregate by subset for HUMEv1 (#3255)
aggregate by subset for HUMEv1
* Update tasks & benchmarks tables
* Fix AbsTaskTextRegression task (#3257)
Fix AbsTaskTextRegression
* Added Japanese to Retrieval (#3252)
* feat - add Japanese
* feat - use mteb.get_benchmark
* fix - 3.9 test error
* Revert "fix - 3.9 test error"
This reverts commit 6bfee53.
* fix - 3.9 test error
* Update tasks & benchmarks tables
* fix bm25 on small datasets (#3261)
---------
Co-authored-by: Yongbin Choi <whybe.choi@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: semantic-release <semantic-release>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Geralt <94539084+Geralt-Targaryen@users.noreply.github.com>
Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com>
Co-authored-by: Atheer <atheer2104@protonmail.com>
Co-authored-by: Yong woo Song <ywsong.dev@kakao.com>
Co-authored-by: Egor <31567312+ekolodin@users.noreply.github.com>
Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>
Co-authored-by: 笑尿伊人 <44760272+q275343119@users.noreply.github.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Peter <51702222+PennyYu123@users.noreply.github.com>
Co-authored-by: spring-quan <38248619+spring-quan@users.noreply.github.com>
Co-authored-by: springxchen <springxchen@tencent.com>
Co-authored-by: Tarun Suresh <68882529+tarsur909@users.noreply.github.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: q275343119 <275343119@qq.com>
Co-authored-by: smile <smile@pinai.io>
Co-authored-by: ethan <smiletoye@gmail.com>
Co-authored-by: Adnan El Assadi <aassadi22@ku.edu.tr>
Co-authored-by: AdnanElAssadi56 <115242814+AdnanElAssadi56@users.noreply.github.com>
Co-authored-by: Niklas <n.muennighoff@gmail.com>
Co-authored-by: Alexey Vatolin <vatolinalex@gmail.com>
* model: add image support for jina embeddings v4 (#2893)
* feat: unify text and image embeddings for all tasks
* fix: uniform batch size
* fix: update error message
* fix: update code task
* fix: update max length
* fix: apply review suggestions
* model: add kalm_models (kalm-emb-v2) ModelMeta (new PR) (#2889)
* feat: add KaLM_Embedding_X_0605 in kalm_models
* Update kalm_models.py for lint format
* kalm-emb-v2
* kalm-emb-v2
* kalm-emb-v2
* kalm-emb-v2
* kalm-emb-v2
---------
Co-authored-by: xinshuohu <xinshuohu@tencent.com>
Co-authored-by: Xinshuo Hu <yanshek.woo@gmail.com>
* Add Classification Evaluator unit test (#2838)
* Adding Classification Evaluator test
* Modifications due to the comments
* Update tests/test_evaluators/test_ClassificationEvaluator.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update tests/test_evaluators/test_ClassificationEvaluator.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Modifications due to the comments
* Modifications due to the comments
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* fix: update colpali engine models (#2905)
* adding vidore benchmarks
* fix typo
* clean vidore names + per lang eval
* lint
* vidore names
* bibtex fix
* fix revision
* vidore v2 citation
* update citation format and fix per-language mappings
* lint: citations
* typo citations
* fix revisions
* lint
* fix colnomic3b revision
* fix colqwen2.5 revision + latest repo version
* fix query augmentation tokens
* colsmol revision
* 1.38.35
Automatically generated by python-semantic-release
* Evaluator tests (#2910)
* Adding Classification Evaluator test
* Modifications due to the comments
* Update tests/test_evaluators/test_ClassificationEvaluator.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update tests/test_evaluators/test_ClassificationEvaluator.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Modifications due to the comments
* Modifications due to the comments
* Adding STSEvaluator and SummarizationEvaluator tests
* Correcting due to the comments
* Correcting due to the comments
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Classification dataset cleaning (#2900)
* Classification dataset cleaning
* Update pull request number
* Fix metadata test
* fix formatting
* add script for cleaning
* Update tasks & benchmarks tables
* dataset: Add JapaneseSentimentClassification (#2913)
Add JapaneseSentimentClassification
* Update tasks & benchmarks tables
* fix: change `passage` prompt to `document` (#2912)
* change document to passage
* fix prompt names
* fix kwargs check
* fix default prompt
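An illustration of the renamed prompt key (the prompt strings here are invented; only the key names follow the change):

```python
# Per-model prompts now key on "document" where they previously used "passage".
model_prompts = {
    "query": "Represent this query for retrieval: ",
    "document": "Represent this document for retrieval: ",
}
```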
* 1.38.36
Automatically generated by python-semantic-release
* model: Add OpenSearch inf-free sparse encoding models (#2903)
add opensearch inf-free models
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* dataset: add BarExamQA dataset (#2916)
* Add BarExamQA retrieval task
* ran linter
* updated details
* updated details
* fixed subtype name
* fixed changes
* ran linter again
* Use `mteb.get_model` in adding_a_dataset.md (#2922)
Update adding_a_dataset.md
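The documented pattern, roughly (the task name is borrowed from this PR purely for illustration):

```python
import mteb

model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
tasks = mteb.get_tasks(tasks=["EmotionClassificationHumanSubset"])
results = mteb.MTEB(tasks=tasks).run(model)
```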
* fix: specify revision for opensearch (#2919)
specify revision for opensearch
* 1.38.37
Automatically generated by python-semantic-release
* Update the link for gemini-embedding-001 (#2928)
* fix: replace with passage (#2934)
* fix: Only import SparseEncoder once sentence-transformer version have been checked (#2940)
* fix: Only import SparseEncoder once sentence-transformer version have been checked
fixes #2936
* Update mteb/models/opensearch_neural_sparse_models.py
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* fix: Prevent incorrectly passing "selector_state" to `get_benchmark` (#2939)
The leaderboard would have silent errors where `get_benchmark` led to a KeyError because "selector_state" was passed as a default value. Setting `DEFAULT_BENCMARK_NAME` as the value solves this issue.
* docs: Update adding_a_dataset.md (#2947)
* docs: Update adding_a_dataset.md
* Update docs/adding_a_dataset.md
* ci: bump semantic release
* 1.38.38
Automatically generated by python-semantic-release
* dataset: Add BSARD v2, fixing the data loading issues of v1 (#2935)
* BSARD loader fixed
* BSARDv2 metadata fixed
* Update mteb/tasks/Retrieval/fra/BSARDRetrieval.py
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update tasks & benchmarks tables
* dataset: add GovReport dataset (#2953)
* Added govreport task
* Updated description
* dataset: add BillSum datasets (#2943)
* Added BillSum datasets
* fixed billsumca
* Updated BillSumCA description
* Updated BillSumUS description
* Update mteb/tasks/Retrieval/eng/BillSumCA.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/BillSumUS.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* lint
* lint
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Update tasks & benchmarks tables
* fix: Add new benchmark beRuSciBench along with AbsTaskTextRegression (#2716)
* Add RuSciBench
* fix bitext mining lang
* Add regression task
* fix init
* add missing files
* Improve description
* Add superseded_by
* fix lint
* Update regression task to match with v2
* Add stratified_subsampling for regression task
* Add bootstrap for regression task
* Rename task class, add model as evaluator argument
* fix import
* fix import 2
* fixes
* fix
* Rename regression model protocol
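A sketch of what bootstrapped scoring for a regression task can look like (illustrative only, not the repo's exact implementation):

```python
import numpy as np

def bootstrap_mae(y_true, y_pred, n_boot: int = 1000, seed: int = 42):
    """Resample predictions with replacement to estimate MAE mean and spread."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    maes = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        maes.append(float(np.abs(y_true[idx] - y_pred[idx]).mean()))
    return float(np.mean(maes)), float(np.std(maes))
```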
* Update tasks & benchmarks tables
* 1.38.39
Automatically generated by python-semantic-release
* qzhou-embedding model_meta & implementation (#2975)
* qzhou-embedding model_meta & implementation
* Update qzhou_models.py
* Update qzhou_models.py
Processing todo items (add default instruction)
* Update qzhou_models.py
correct bge datalist
* Update qzhou_models.py
correct 'public_training_data'
* Update qzhou_models.py
* Update qzhou_models.py
* Update qzhou_models.py
* Update qzhou_models.py
* Update mteb/models/qzhou_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/qzhou_models.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* format qzhou_models.py for ruff check
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* model: Add Voyage 3.5 model configuration (#3005)
Add Voyage 3.5 model configuration
- Add voyage_3_5 ModelMeta with 1024 embed dimensions and 32000 max tokens
- Set release date to 2025-01-21 with revision 1
- Configure for cosine similarity with instruction support
- Include standard Voyage training datasets reference
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-authored-by: Claude <noreply@anthropic.com>
* model: BAAI/bge-m3-unsupervised Model (#3007)
* Add BAAI/bge-m3-unsupervised Model
(BAAI/bge_m3_retromae is commented out - the details are proper, but it fails when loading the model for me, so I commented it out)
* Remove the commented retromae model
---------
Co-authored-by: fzowl <zoltan@voyageai.com>
* lint: Correcting lint errors (#3004)
* Adding Classification Evaluator test
* Modifications due to the comments
* Update tests/test_evaluators/test_ClassificationEvaluator.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update tests/test_evaluators/test_ClassificationEvaluator.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Modifications due to the comments
* Modifications due to the comments
* Correcting the lint errors
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* dataset: Added 50 Vietnamese datasets from vn-mteb (#2964)
* [ADD] 50 Vietnamese datasets from vn-mteb
* [UPDATE] task metadata
* [UPDATE] import dependencies
* [UPDATE] task metadata, bibtex citation
* [UPDATE-TEST] test_model_meta
* [UPDATE] sample_creation to machine-translated and LM verified
* [ADD] sample creation machine-translated and LM verified
* [REMOVE] default fields metadata in Classification tasks
* Update tasks & benchmarks tables
* model: Add Cohere embed-v4.0 model support (#3006)
* Add Cohere embed-v4.0 model support
- Add text-only embed-v4.0 model in cohere_models.py
- Add multimodal embed-v4.0 model in cohere_v.py
- Support configurable dimensions (256, 512, 1024, 1536)
- Support 128,000 token context length
- Support multimodal embedding (text, images, mixed PDFs)
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add Cohere embed-v4.0 model support
Update cohere_v.py and cohere_models.py to include the new embed-v4.0 model with proper configuration and integration.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
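A hedged usage sketch of the knobs described above (argument names as I recall Cohere's v2 SDK; treat them as assumptions, not verified against the SDK):

```python
import cohere

co = cohere.ClientV2()  # assumes COHERE_API_KEY is set in the environment

resp = co.embed(
    model="embed-v4.0",
    texts=["quarterly revenue grew 12%"],
    input_type="search_document",
    output_dimension=1024,        # configurable: 256 / 512 / 1024 / 1536
    embedding_types=["float"],
)
```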
* Add OpenAI models with 512 dimension (#3008)
* Add OpenAI/text-embedding-3-small (512 dim)
Add OpenAI/text-embedding-3-large (512 dim)
* Correcting due to comments
---------
Co-authored-by: fzowl <zoltan@voyageai.com>
* Standardise task names and fix citation formatting (#3026)
fixes for name formatting
* Update tasks & benchmarks tables
* fix: Add missing training sets for qzhou (#3023)
* Supplement missing training sets
* reformat code
* Reorganize the data list format
* update qzhou_model meta
* 1.38.40
Automatically generated by python-semantic-release
* model: Add samilpwc_models meta (#3028)
* model: Add samilpwc_models meta
* Fix: Remove CONST
* Fix: Reformat File
* Update: model revision
* model: Add granite-vision-embedding model (#3029)
* Add files via upload
* Address review comments
* Address review comments
* ruff format
* Update mteb/models/granite_vision_embedding_models.py
* lint error fix
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* fix: incorrect revision for SNLRetrieval (#3033)
The provided revision doesn't seem to be present on adrlau/navjordj-SNL_summarization_copy.
Replacing it with the latest revision.
* dataset: Add HumanEvalRetrieval task (#3022)
* Add HumanEvalRetrieval dataset
* Fix TaskMetadata structure and remove descriptive_stats
- Use TaskMetadata class instead of dict
- Remove descriptive_stats as requested in PR review
- Add date field and proper import structure
* Fix dataset path and use verified metadata
- Change path from zeroshot/humaneval-embedding-benchmark to embedding-benchmark/HumanEval
- Use actual description from HuggingFace dataset page
- Remove fabricated citation and reference
- Remove revision field that was incorrect
- Reference HuggingFace dataset page instead of arxiv
* Add correct revision hash to HumanEval
- Add revision hash: ed1f48a for reproducibility
* Fix HumanEval metadata validation
- Add date field for metadata completeness
- Add bibtex_citation field (empty string)
- Required for TaskMetadata validation to pass
- Should resolve PR test failure
* Address reviewer feedback
- Remove trust_remote_code parameter as requested
- Add revision parameter to load_dataset() calls for consistency
- Use metadata revision hash in dataset loading for reproducibility
* Fix field names in HumanEval dataset loading
Changed query_id/corpus_id to query-id/corpus-id to match actual dataset format.
* Fix deprecated metadata_dict usage
Use self.metadata.dataset instead of self.metadata_dict for v2.0 compatibility.
* Fix data structure for MTEB compatibility
- Organize data by splits as expected by MTEB retrieval tasks
- Convert scores to integers for pytrec_eval compatibility
* Address PR feedback for HumanEval dataset
- Add descriptive statistics using calculate_metadata_metrics()
- Enhance metadata description with dataset structure details
- Add complete BibTeX citation for original paper
- Update to full commit hash revision
- Add python-Code language tag for programming language
- Explain retrieval task formulation clearly
* Fix BibTeX citation formatting for HumanEvalRetrieval
- Update citation to match bibtexparser formatting requirements
- Fields now in alphabetical order with lowercase names
- Proper trailing commas and indentation
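The loading fixes above, condensed into a sketch (the split and the "score" column name are assumptions; the path, revision, and renamed id columns follow the commit messages):

```python
from datasets import load_dataset

qrels = load_dataset(
    "embedding-benchmark/HumanEval",
    split="test",
    revision="ed1f48a",  # pinned short hash; the metadata stores the full hash
)
relevant_docs: dict[str, dict[str, int]] = {}
for row in qrels:
    # columns were renamed query_id/corpus_id -> query-id/corpus-id,
    # and scores are cast to int for pytrec_eval compatibility
    relevant_docs.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
```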
* Update tasks & benchmarks tables
* 1.38.41
Automatically generated by python-semantic-release
* ci: reduce parallel runs for when checking if a dataset exists (#3035)
The hope is that this will prevent many of the current [errors](https://github.com/embeddings-benchmark/mteb/actions/runs/17019125199/job/48245690831)
* ci: Updating rerun delays to prevent false positive errors
* ci: Updating rerun delays to prevent false positive errors
* model: Add GreenNode Vietnamese Embedding models (#2994)
* [ADD] 50 Vietnamese datasets from vn-mteb
* [UPDATE] task metadata
* [UPDATE] import dependencies
* [UPDATE] task metadata, bibtex citation
* [UPDATE-TEST] test_model_meta
* [UPDATE] sample_creation to machine-translated and LM verified
* [ADD] sample creation machine-translated and LM verified
* [ADD] Vietnamese Embedding models
* [REMOVE] default fields metadata in Classification tasks
* [UPDATE] model to vi-vn language specific file
* [FIX] lint
* [FIX] model loader
* model: add granite-embedding-english R2 models (#3050)
* fix: Updated revision for jina-embeddings-v4 (#3046)
* fix: jinav4 revision
Signed-off-by: admin <bo.wang@jina.ai>
* change revision instead of removing it
Signed-off-by: admin <bo.wang@jina.ai>
---------
Signed-off-by: admin <bo.wang@jina.ai>
Co-authored-by: admin <bo.wang@jina.ai>
* 1.38.42
Automatically generated by python-semantic-release
* Fix 3 VN-MTEB Pair Classification tasks (#3053)
* [ADD] 50 Vietnamese datasets from vn-mteb
* [UPDATE] task metadata
* [UPDATE] import dependencies
* [UPDATE] task metadata, bibtex citation
* [UPDATE-TEST] test_model_meta
* [UPDATE] sample_creation to machine-translated and LM verified
* [ADD] sample creation machine-translated and LM verified
* [ADD] Vietnamese Embedding models
* [REMOVE] default fields metadata in Classification tasks
* [UPDATE] model to vi-vn language specific file
* [FIX] lint
* [FIX] model loader
* [FIX] VN-MTEB 3 datasets PairClassification rename column
* dataset: Add mbpp retrieval (#3037)
* Add MBPP retrieval task
- Code retrieval task based on 378 Python programming problems
- Natural language queries matched to Python code implementations
- Uses python-Code evaluation language for code-specific metrics
- Includes proper citations and descriptive statistics
* Add MBPPRetrieval to imports
* Add descriptive statistics for MBPPRetrieval
* Reformatting
* Reformatting
* Update tasks & benchmarks tables
* dataset: Added wikisql retrieval (#3039)
* Add WikiSQL retrieval task
- Code retrieval task based on WikiSQL natural language to SQL dataset
- Natural language questions matched to SQL query implementations
- Uses sql-Code evaluation language for SQL-specific metrics
- Includes proper citations and descriptive statistics
* Add WikiSQLRetrieval to imports
* Add descriptive statistics for WikiSQLRetrieval
* Reformatting
* Reformatting
* Reformatting, correcting the revision
* Update tasks & benchmarks tables
* ci: Temporarily limit pytrec version to "pytrec-eval-terrier>=0.5.6, <0.5.8" to prevent errors
try to fix CI
* fix MBPPRetrieval revision (#3055)
Update MBPPRetrieval.py
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
* fix: Add VN-MTEB benchmark and Leaderboard (#2995)
* [ADD] 50 Vietnamese datasets from vn-mteb
* [UPDATE] task metadata
* [UPDATE] import dependencies
* [UPDATE] task metadata, bibtex citation
* [UPDATE-TEST] test_model_meta
* [UPDATE] sample_creation to machine-translated and LM verified
* [ADD] sample creation machine-translated and LM verified
* [ADD] VN-MTEB benchmark and leaderboard
* [FIX] wrong benchmark name
* [REMOVE] default fields metadata in Classification tasks
* Update tasks & benchmarks tables
* 1.38.43
Automatically generated by python-semantic-release
* Add hc3finance retrieval (#3041)
* Add HC3Finance retrieval task
- Financial retrieval task based on HC3 Finance dataset
- Financial questions matched to human and AI-generated content
- Covers financial explanations, analysis, and educational content
- Includes proper citations and descriptive statistics
* Add HC3FinanceRetrieval to imports
* Add descriptive statistics for HC3FinanceRetrieval
* Reformatting
* Reformatting, correcting the revision
* Update mteb/tasks/Retrieval/eng/HC3FinanceRetrieval.py
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Add finqa retrieval (#3042)
* Add FinQA retrieval task
- Financial numerical reasoning retrieval task based on FinQA dataset
- Numerical financial questions matched to relevant document data
- Covers earnings reports with tables and quantitative financial data
- Includes proper citations and descriptive statistics
* Add FinQARetrieval to imports
* Add descriptive statistics for FinQARetrieval
* Reformatting
* Reformatting
* Update mteb/tasks/Retrieval/eng/FinQARetrieval.py
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Update tasks & benchmarks tables
* Add FinanceBenchRetrieval task (#3044)
* Add FinanceBenchRetrieval
* Update mteb/tasks/Retrieval/eng/FinanceBenchRetrieval.py
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Update tasks & benchmarks tables
* Add FreshStackRetrieval task (#3043)
* Add FreshStackRetrieval
* Reformatting, correcting the revision
* Dataset correction
* Update tasks & benchmarks tables
* dataset: Add ds1000 retrieval (#3038)
* Add DS1000 retrieval task
- Code retrieval task based on 1,000 data science programming problems
- Natural language queries matched to Python data science code
- Uses python-Code evaluation language for code-specific metrics
- Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries
* Add DS1000Retrieval to imports
* Add descriptive statistics for DS1000Retrieval
* Reformatting
* Reformatting
* Update tasks & benchmarks tables
* Add ChatDoctorRetrieval (#3045)
* Add ChatDoctorRetrieval
* Reformatting, correcting the revision
* Correct the dataset citation
* Correcting due to comments
* Update tasks & benchmarks tables
* Correcting the (new) DS1000 dataset's revision (#3063)
* Add DS1000 retrieval task
- Code retrieval task based on 1,000 data science programming problems
- Natural language queries matched to Python data science code
- Uses python-Code evaluation language for code-specific metrics
- Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries
* Add DS1000Retrieval to imports
* Add descriptive statistics for DS1000Retrieval
* Reformatting
* Reformatting
* Add DS1000Retrieval task implementation
* dataset: Add JinaVDR (#2942)
* feat: added jinavdr benchmark
* feat: added description for jinavdr
* feat: fixed licenses and added bibtex
* feat: made jinav4 compatible with vidore benchmark
* feat: corrected query numbers
* feat: removed print
* feat: added max pixel argument for jina models
* feat: score calculation on cpu
* feat: adjust jina model for new mteb code
* feat: code cleanup
* feat: corrected bibtex
* feat: make colpali run with jinavdr
* feat: fixed comments
* feat: better reference and fixed comments
* feat: added date for tasks
* feat: fixed missing metadata and bibtex
* feat: added descriptions per dataset
* Update tasks & benchmarks tables
* model: Add CoDi-Embedding-V1 (#3054)
* add codiemb-minicpm
* replace codiemb_minicpm with codi_model
* Update mteb/models/codi_model.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/codi_model.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/codi_model.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* update code
* update code
* reformat
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* fix: ensure that there are always relevant docs attached to query (#3058)
* fix: ensure that there are always relevant docs attached to query
Here is a brief test showing that it doesn't influence scores:
```py
import mteb

t1 = mteb.get_task("TwitterHjerneRetrieval")
meta = mteb.get_model_meta("minishlab/potion-base-2M")
eval = mteb.MTEB(tasks=[t1])
res = eval.run(model=meta.load_model())
# before fix:
res[0].get_score() # np.float64(0.02837)
res[0].scores
before_fix = {
"train": [
{
"ndcg_at_1": 0.02597,
"ndcg_at_3": 0.02213,
"ndcg_at_5": 0.0262,
"ndcg_at_10": 0.02837,
"ndcg_at_20": 0.04548,
"ndcg_at_100": 0.13527,
"ndcg_at_1000": 0.24507,
"map_at_1": 0.00866,
"map_at_3": 0.01317,
"map_at_5": 0.0149,
"map_at_10": 0.01562,
"map_at_20": 0.01898,
"map_at_100": 0.02968,
"map_at_1000": 0.03841,
"recall_at_1": 0.00866,
"recall_at_3": 0.02056,
"recall_at_5": 0.02922,
"recall_at_10": 0.03355,
"recall_at_20": 0.08268,
"recall_at_100": 0.43766,
"recall_at_1000": 1.0,
"precision_at_1": 0.02597,
"precision_at_3": 0.02165,
"precision_at_5": 0.01818,
"precision_at_10": 0.01039,
"precision_at_20": 0.01234,
"precision_at_100": 0.01481,
"precision_at_1000": 0.0034,
"mrr_at_1": 0.025974,
"mrr_at_3": 0.041126,
"mrr_at_5": 0.04632,
"mrr_at_10": 0.048485,
"mrr_at_20": 0.058356,
"mrr_at_100": 0.070186,
"mrr_at_1000": 0.071349,
"nauc_ndcg_at_1_max": 0.33969,
"nauc_ndcg_at_1_std": -0.202864,
"nauc_ndcg_at_1_diff1": -0.127,
"nauc_ndcg_at_3_max": 0.409376,
"nauc_ndcg_at_3_std": -0.039352,
"nauc_ndcg_at_3_diff1": -0.022816,
"nauc_ndcg_at_5_max": 0.250499,
"nauc_ndcg_at_5_std": -0.115263,
"nauc_ndcg_at_5_diff1": -0.057017,
"nauc_ndcg_at_10_max": 0.238696,
"nauc_ndcg_at_10_std": -0.138396,
"nauc_ndcg_at_10_diff1": -0.045287,
"nauc_ndcg_at_20_max": 0.154456,
"nauc_ndcg_at_20_std": -0.070635,
"nauc_ndcg_at_20_diff1": 0.074499,
"nauc_ndcg_at_100_max": -0.005753,
"nauc_ndcg_at_100_std": -0.074738,
"nauc_ndcg_at_100_diff1": -0.005851,
"nauc_ndcg_at_1000_max": 0.109439,
"nauc_ndcg_at_1000_std": -0.089797,
"nauc_ndcg_at_1000_diff1": -0.021634,
"nauc_map_at_1_max": 0.33969,
"nauc_map_at_1_std": -0.202864,
"nauc_map_at_1_diff1": -0.127,
"nauc_map_at_3_max": 0.385244,
"nauc_map_at_3_std": -0.080638,
"nauc_map_at_3_diff1": -0.060991,
"nauc_map_at_5_max": 0.294871,
"nauc_map_at_5_std": -0.119069,
"nauc_map_at_5_diff1": -0.06234,
"nauc_map_at_10_max": 0.285698,
"nauc_map_at_10_std": -0.132856,
"nauc_map_at_10_diff1": -0.055015,
"nauc_map_at_20_max": 0.236619,
"nauc_map_at_20_std": -0.100673,
"nauc_map_at_20_diff1": -0.002619,
"nauc_map_at_100_max": 0.15345,
"nauc_map_at_100_std": -0.138888,
"nauc_map_at_100_diff1": -0.02257,
"nauc_map_at_1000_max": 0.171402,
"nauc_map_at_1000_std": -0.134644,
"nauc_map_at_1000_diff1": -0.034477,
"nauc_recall_at_1_max": 0.33969,
"nauc_recall_at_1_std": -0.202864,
"nauc_recall_at_1_diff1": -0.127,
"nauc_recall_at_3_max": 0.375072,
"nauc_recall_at_3_std": -0.009643,
"nauc_recall_at_3_diff1": -0.089168,
"nauc_recall_at_5_max": 0.147691,
"nauc_recall_at_5_std": -0.128654,
"nauc_recall_at_5_diff1": -0.084259,
"nauc_recall_at_10_max": 0.141055,
"nauc_recall_at_10_std": -0.165932,
"nauc_recall_at_10_diff1": -0.060966,
"nauc_recall_at_20_max": 0.043863,
"nauc_recall_at_20_std": -0.028374,
"nauc_recall_at_20_diff1": 0.157575,
"nauc_recall_at_100_max": -0.157183,
"nauc_recall_at_100_std": -0.019437,
"nauc_recall_at_100_diff1": 0.013395,
# "nauc_recall_at_1000_max": nan,
# "nauc_recall_at_1000_std": nan,
# "nauc_recall_at_1000_diff1": nan,
"nauc_precision_at_1_max": 0.33969,
"nauc_precision_at_1_std": -0.202864,
"nauc_precision_at_1_diff1": -0.127,
"nauc_precision_at_3_max": 0.406318,
"nauc_precision_at_3_std": 0.007031,
"nauc_precision_at_3_diff1": -0.034709,
"nauc_precision_at_5_max": 0.178131,
"nauc_precision_at_5_std": -0.112493,
"nauc_precision_at_5_diff1": -0.045535,
"nauc_precision_at_10_max": 0.167897,
"nauc_precision_at_10_std": -0.150626,
"nauc_precision_at_10_diff1": -0.027811,
"nauc_precision_at_20_max": 0.081428,
"nauc_precision_at_20_std": -0.042304,
"nauc_precision_at_20_diff1": 0.17278,
"nauc_precision_at_100_max": -0.150619,
"nauc_precision_at_100_std": 0.016133,
"nauc_precision_at_100_diff1": -0.065571,
"nauc_precision_at_1000_max": -0.017244,
"nauc_precision_at_1000_std": 0.046614,
"nauc_precision_at_1000_diff1": -0.028258,
"nauc_mrr_at_1_max": 0.33969,
"nauc_mrr_at_1_std": -0.202864,
"nauc_mrr_at_1_diff1": -0.127,
"nauc_mrr_at_3_max": 0.409511,
"nauc_mrr_at_3_std": -0.064671,
"nauc_mrr_at_3_diff1": -0.01911,
"nauc_mrr_at_5_max": 0.319584,
"nauc_mrr_at_5_std": -0.103546,
"nauc_mrr_at_5_diff1": -0.025109,
"nauc_mrr_at_10_max": 0.309614,
"nauc_mrr_at_10_std": -0.117564,
"nauc_mrr_at_10_diff1": -0.019691,
"nauc_mrr_at_20_max": 0.262976,
"nauc_mrr_at_20_std": -0.092222,
"nauc_mrr_at_20_diff1": 0.024507,
"nauc_mrr_at_100_max": 0.256052,
"nauc_mrr_at_100_std": -0.094249,
"nauc_mrr_at_100_diff1": 0.012432,
"nauc_mrr_at_1000_max": 0.260112,
"nauc_mrr_at_1000_std": -0.098845,
"nauc_mrr_at_1000_diff1": 0.009697,
"main_score": 0.02837,
"hf_subset": "default",
"languages": ["dan-Latn"],
}
]
}
# with update:
res[0].get_score() # np.float64(0.02837)
res[0].scores
with_fix = {
"train": [
{
"ndcg_at_1": 0.02597,
"ndcg_at_3": 0.02213,
"ndcg_at_5": 0.0262,
"ndcg_at_10": 0.02837,
"ndcg_at_20": 0.04548,
"ndcg_at_100": 0.13527,
"ndcg_at_1000": 0.24507,
"map_at_1": 0.00866,
"map_at_3": 0.01317,
"map_at_5": 0.0149,
"map_at_10": 0.01562,
"map_at_20": 0.01898,
"map_at_100": 0.02968,
"map_at_1000": 0.03841,
"recall_at_1": 0.00866,
"recall_at_3": 0.02056,
"recall_at_5": 0.02922,
"recall_at_10": 0.03355,
"recall_at_20": 0.08268,
"recall_at_100": 0.43766,
"recall_at_1000": 1.0,
"precision_at_1": 0.02597,
"precision_at_3": 0.02165,
"precision_at_5": 0.01818,
"precision_at_10": 0.01039,
"precision_at_20": 0.01234,
"precision_at_100": 0.01481,
"precision_at_1000": 0.0034,
"mrr_at_1": 0.025974,
"mrr_at_3": 0.041126,
"mrr_at_5": 0.04632,
"mrr_at_10": 0.048485,
"mrr_at_20": 0.058356,
"mrr_at_100": 0.070186,
"mrr_at_1000": 0.071349,
"nauc_ndcg_at_1_max": 0.33969,
"nauc_ndcg_at_1_std": -0.202864,
"nauc_ndcg_at_1_diff1": -0.127,
"nauc_ndcg_at_3_max": 0.409376,
"nauc_ndcg_at_3_std": -0.039352,
"nauc_ndcg_at_3_diff1": -0.022816,
"nauc_ndcg_at_5_max": 0.250499,
"nauc_ndcg_at_5_std": -0.115263,
"nauc_ndcg_at_5_diff1": -0.057017,
"nauc_ndcg_at_10_max": 0.238696,
"nauc_ndcg_at_10_std": -0.138396,
"nauc_ndcg_at_10_diff1": -0.045287,
"nauc_ndcg_at_20_max": 0.154456,
"nauc_ndcg_at_20_std": -0.070635,
"nauc_ndcg_at_20_diff1": 0.074499,
"nauc_ndcg_at_100_max": -0.005753,
"nauc_ndcg_at_100_std": -0.074738,
"nauc_ndcg_at_100_diff1": -0.005851,
"nauc_ndcg_at_1000_max": 0.109439,
"nauc_ndcg_at_1000_std": -0.089797,
"nauc_ndcg_at_1000_diff1": -0.021634,
"nauc_map_at_1_max": 0.33969,
"nauc_map_at_1_std": -0.202864,
"nauc_map_at_1_diff1": -0.127,
"nauc_map_at_3_max": 0.385244,
"nauc_map_at_3_std": -0.080638,
"nauc_map_at_3_diff1": -0.060991,
"nauc_map_at_5_max": 0.294871,
"nauc_map_at_5_std": -0.119069,
"nauc_map_at_5_diff1": -0.06234,
"nauc_map_at_10_max": 0.285698,
"nauc_map_at_10_std": -0.132856,
"nauc_map_at_10_diff1": -0.055015,
"nauc_map_at_20_max": 0.236619,
"nauc_map_at_20_std": -0.100673,
"nauc_map_at_20_diff1": -0.002619,
"nauc_map_at_100_max": 0.15345,
"nauc_map_at_100_std": -0.138888,
"nauc_map_at_100_diff1": -0.02257,
"nauc_map_at_1000_max": 0.171402,
"nauc_map_at_1000_std": -0.134644,
"nauc_map_at_1000_diff1": -0.034477,
"nauc_recall_at_1_max": 0.33969,
"nauc_recall_at_1_std": -0.202864,
"nauc_recall_at_1_diff1": -0.127,
"nauc_recall_at_3_max": 0.375072,
"nauc_recall_at_3_std": -0.009643,
"nauc_recall_at_3_diff1": -0.089168,
"nauc_recall_at_5_max": 0.147691,
"nauc_recall_at_5_std": -0.128654,
"nauc_recall_at_5_diff1": -0.084259,
"nauc_recall_at_10_max": 0.141055,
"nauc_recall_at_10_std": -0.165932,
"nauc_recall_at_10_diff1": -0.060966,
"nauc_recall_at_20_max": 0.043863,
"nauc_recall_at_20_std": -0.028374,
"nauc_recall_at_20_diff1": 0.157575,
"nauc_recall_at_100_max": -0.157183,
"nauc_recall_at_100_std": -0.019437,
"nauc_recall_at_100_diff1": 0.013395,
# "nauc_recall_at_1000_max": nan,
# "nauc_recall_at_1000_std": nan,
# "nauc_recall_at_1000_diff1": nan,
"nauc_precision_at_1_max": 0.33969,
"nauc_precision_at_1_std": -0.202864,
"nauc_precision_at_1_diff1": -0.127,
"nauc_precision_at_3_max": 0.406318,
"nauc_precision_at_3_std": 0.007031,
"nauc_precision_at_3_diff1": -0.034709,
"nauc_precision_at_5_max": 0.178131,
"nauc_precision_at_5_std": -0.112493,
"nauc_precision_at_5_diff1": -0.045535,
"nauc_precision_at_10_max": 0.167897,
"nauc_precision_at_10_std": -0.150626,
"nauc_precision_at_10_diff1": -0.027811,
"nauc_precision_at_20_max": 0.081428,
"nauc_precision_at_20_std": -0.042304,
"nauc_precision_at_20_diff1": 0.17278,
"nauc_precision_at_100_max": -0.150619,
"nauc_precision_at_100_std": 0.016133,
"nauc_precision_at_100_diff1": -0.065571,
"nauc_precision_at_1000_max": -0.017244,
"nauc_precision_at_1000_std": 0.046614,
"nauc_precision_at_1000_diff1": -0.028258,
"nauc_mrr_at_1_max": 0.33969,
"nauc_mrr_at_1_std": -0.202864,
"nauc_mrr_at_1_diff1": -0.127,
"nauc_mrr_at_3_max": 0.409511,
"nauc_mrr_at_3_std": -0.064671,
"nauc_mrr_at_3_diff1": -0.01911,
"nauc_mrr_at_5_max": 0.319584,
"nauc_mrr_at_5_std": -0.103546,
"nauc_mrr_at_5_diff1": -0.025109,
"nauc_mrr_at_10_max": 0.309614,
"nauc_mrr_at_10_std": -0.117564,
"nauc_mrr_at_10_diff1": -0.019691,
"nauc_mrr_at_20_max": 0.262976,
"nauc_mrr_at_20_std": -0.092222,
"nauc_mrr_at_20_diff1": 0.024507,
"nauc_mrr_at_100_max": 0.256052,
"nauc_mrr_at_100_std": -0.094249,
"nauc_mrr_at_100_diff1": 0.012432,
"nauc_mrr_at_1000_max": 0.260112,
"nauc_mrr_at_1000_std": -0.098845,
"nauc_mrr_at_1000_diff1": 0.009697,
"main_score": 0.02837,
"hf_subset": "default",
"languages": ["dan-Latn"],
}
]
}
# check
with_fix == before_fix  # True
```
* restructure
* format
* relax pytrec versions
* fix incorrect parsing
* 1.38.44
Automatically generated by python-semantic-release
* Correcting the JINA models with SentenceTransformerWrapper (#3071)
* ci: Add stale workflow (#3066)
* add stale workflow
* add permissions
* add bug label to bug issue template
* revert bug issue and only look at "more info needed" issues
* more accurate name
* override default
* fix: open_clip package validation (#3073)
* 1.38.45
Automatically generated by python-semantic-release
* fix: Update revision for qzhou models (#3069)
* 1.38.46
Automatically generated by python-semantic-release
* Fix the reference link for CoDi-Embedding-V1 (#3075)
Fix reference link
* fix: Add beta version of RTEB related benchmarks (#3048)
* Add RTEB related benchmarks
* Add RTEB related benchmarks
* Correcting the task names in the RTEB benchmarks
* Update mteb/leaderboard/benchmark_selector.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Adding the CURE dataset to RTEB benchmarks
* Use the right language subset
* Fix broken finance icon URL in RTEB benchmarks
Replace broken libre-finance-dollar.svg with working libre-gui-price-tag.svg
Validated all icon URLs and confirmed accessibility compliance
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* 1.38.47
Automatically generated by python-semantic-release
* fix: run `ruff check` on all files during ci (#3086)
* fix: run `ruff check` on all files during ci
* format
* 1.38.48
Automatically generated by python-semantic-release
* Move dev to dependency groups (#3088)
add dependency groups
* fix: Improving validate_task_to_prompt_name logs and error messages (#3079)
* Improving validate_task_to_prompt_name logs and error messages
* linter fixes
* Adding None prompts tests
* Update test_benchmark_sentence_transformer
* Update mteb/leaderboard/benchmark_selector.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* fix: duplicate mteb multilingual variables (#3080)
* fix benchmark naming
* format
* lint
* Update tasks & benchmarks tables
* model: mdbr-leaf models (#3081)
* added MDBR leaf models
* fixed revision for mdbr-leaf-ir
* added model prompts
* updated training datasets
* fixed linting
* lotte task reference
---------
Co-authored-by: Robin Vujanic <robin.vujanic@mongodb.com>
* 1.38.49
Automatically generated by python-semantic-release
* CI: Set upper limit for xdist version (#3098)
* Comment out bibtex formatting
* Remove `-n auto`
* get back bibtex
* try limiting versions
* revert coverage
* revert coverage
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Combine Plots and Tables into a Single (#3047)
* feat - Combine Plots and Tables into a Single Tab #3009
* feat - Resize the plot to make it more readable
* feat - Remove the (radar chart)
* feat - Add a comment stating that it only shows the Top 5 models in the table.
* feat - adjust layout
* Update mteb/leaderboard/app.py
* format
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* fix: Updating the default batch size calculation in the voyage models (#3091)
* 1.38.50
Automatically generated by python-semantic-release
* fix: Add @classmethod for @field_validators in TaskMetadata (#3100)
* Align task prompt dict with `PromptType` (#3101)
* align task prompt dict with `PromptType`
* use value instead of enum
* 1.38.51
Automatically generated by python-semantic-release
* model: Add ModelMeta for OrdalieTech/Solon-embeddings-mini-beta-1.1 (#3090)
* Add ModelMeta for OrdalieTech/Solon-embeddings-mini-beta-1.1
* Add training_datasets (common_corpus, fineweb, wiki_fr, private LLM-synth)
* Format with ruff + add loader per review
* Apply ruff format/fixes
* Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Register OrdalieTech/Solon-embeddings-mini-beta-1.1 in overview (ModelMeta + loader)
* Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* fix import
* Add memory_usage_mb=808.0 and required fields to ModelMeta
* Fix 210 million parameters
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* fix: Allow closed datasets (#3059)
* - Added an include_private parameter to the get_tasks() function that defaults to False
- This ensures that by default, tests only run on public datasets
- Tests can explicitly set include_private=True when needed to test private datasets
- Added is_public: bool | None = None field to TaskMetadata
- The field is optional and defaults to None (treated as public)
- Updated the is_filled() method to exclude is_public from required fields
- Added documentation
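In code, the described behaviour looks roughly like this (a sketch; the parameter was later renamed to exclude_private, and the snippet reflects that final name):

```python
import mteb

public_tasks = mteb.get_tasks(exclude_private=True)   # default: public datasets only
all_tasks = mteb.get_tasks(exclude_private=False)     # also include closed datasets

# TaskMetadata gains an optional flag; None is treated as public:
#     is_public: bool | None = None
```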
* Correcting due to comments
* Update mteb/abstasks/TaskMetadata.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/overview.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Removing the not used filter_tasks_by_privacy function
* Correcting due to comments
* Correcting due to comments
* Correcting due to comments
* Removing the test case
* Rename the include_private parameter to exclude_private
* Rename the include_private parameter to exclude_private
* Add private tasks tests
* Add private tasks tests
* Update tests/test_tasks/test_private_tasks.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Add private tasks tests
* Add private tasks tests
* Add private tasks tests
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* 1.38.52
Automatically generated by python-semantic-release
* Ci: test out GH models with welcoming newcomers (#3112)
test out GH models with welcoming newcomers
* ci: Dataset check on new PR (#3103)
* add dataset check on new PR
* add extract datasets
* run as module
* update startswith
* update workflow name
* add GitPython
* export var
* same shell session
* address review comments
* add to docs to say what this script does
* add docs
* model: add Youtu-Embedding-V1 (#3115)
* add youtu models
* add a blank line
* fix the optional dependencies and lint the code
* remove unused dependencies and reformat
* revise prompt_type
---------
Co-authored-by: springxchen <springxchen@tencent.com>
* fix: add voyage quantization models (#3092)
* Adding quantization support
* Update mteb/models/voyage_models.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/model_meta.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/model_meta.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Simplifying the quantization/output_dtype
* Update mteb/model_meta.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* 1.38.53
Automatically generated by python-semantic-release
* model: EmbeddingGemma 300M (#3129)
* model: EmbeddingGemma 300M
* Add license and revision
* fix: Add dedicated display for RTEB benchmark results (#3089)
* feat - remove special filtering, keep zero-shot, keep borda rank
* feat - remove get_rteb_benchmark.py
* feat - delete get_rteb_benchmark.py;RTEB_BENCHMARK_ENTRIES changes
* feat -format
* Update mteb/load_results/benchmark_results.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update tasks & benchmarks tables
* 1.38.54
Automatically generated by python-semantic-release
* dataset: Add Dapfam patent retrieval tasks (#2946)
* chore: add 'Patent retrieval' subtype to TaskMetadata
* feat(retrieval): add DAPFAM patent retrieval tasks (+18 variants)
* Dapfam patent retrieval PR #2946 : refactor DAPFAM tasks (explicit classes, license, metadata, custom definition explanation ...)
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Changes:
- Added the possibility to opt in or out of quantization through the "quantize" argument.
- Added the possibility to compute a raw dot product without normalization (to reproduce the paper method, the "similarity" argument should be "cosine").
- Removed an unnecessary function and overhauled the task descriptions to be clearer.
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Changes made:
- Overhauled task descriptions as well as naming to conform with the naming scheme of mteb retrieval tasks.
- Similarity is now computed using the similarity function of the passed model.
- Changed optional quantization method to conform with sentence transformers similarity function.
To reproduce the paper metrics, one can use the following snippet:
```python
import mteb
from sentence_transformers import SentenceTransformer

model_name = "Snowflake/snowflake-arctic-embed-m-v2.0"
model = SentenceTransformer(
    model_name,
    model_kwargs={"torch_dtype": "float16"},
    trust_remote_code=True,
).cuda().eval()

tasks = mteb.get_tasks(tasks=[
    "DAPFAMInTitlAbsToTitlAbsClmRetrieval",
    "DAPFAMAllTitlAbsToTitlAbsClmRetrieval",
    "DAPFAMOutTitlAbsToTitlAbsClmRetrieval",
    # add the other 3 remaining tasks ...
])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(
    model,
    output_folder=f"mteb_res/{model_name}",
    quantize=True,  # if False or unset, the obtained ndcg@10 and map@10 will be ~0.001 higher
    encode_kwargs={"batch_size": 32},
)
```
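A minimal sketch of the similarity change described above, using the standard sentence-transformers API (the texts are placeholders):
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v2.0", trust_remote_code=True)
queries = model.encode(["query patent title and abstract"])
candidates = model.encode(["candidate patent title, abstract and claims"])
# Scores come from the model's own similarity function (cosine by default),
# not from a hard-coded dot product.
scores = model.similarity(queries, candidates)
```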
* changed default value of quantization to false
* added the import to all DAPFAM tasks; tested that the tasks work; verified compliance with the checklist
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* added revision numbers to all dataset loading operations as well as the metadata itself
* intermediate changes, refresh local branch
* intermediate changes, refresh local branch again
* scale back to standard evaluation with empty set exclusion, various cosmetic/formatting changes
* minor cosmetic/formatting changes
* fixed main metric to be ndcg_at_100 as in the paper
* removed old code artifacts from previous versions
* read appropriate loading arguments from task metadata, remove unnecessary class attribute
* reformat bibtex (remark on the assertion: it matches a literal string instead of bibtex formatting, which is inconsistent with the arXiv default), fixed metadata, parameters read from task metadata, all tests passed
* refactor data loading to read from metadata class attributes
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update tasks & benchmarks tables
* Align max tokens (#3172)
* Correct the VoyageAI model's batch creation/batch size calculation (#3185)
Correct the batch creation
* dataset: Adding JapaneseCode1Retrieval as the first non-public dataset (#3168)
* Adding JapaneseCode1Retrieval as the first non-public dataset
* Transformed dataset
* Adding as private dataset to tests
* Correct the private task test
* Use the sample dataset as a reference
* Use the sample dataset as a reference
* fix ds loading
* allow on forks
* upd action
* remove paths
* try to trigger ci
* add ref
* add permissions
* remove paths
* add paths back
* get back to pull request
* rollback action
* Trying to resolve the token/secret problem
* Trying to resolve the token/secret problem
* Update dataset_loading_pr.yml
* Update dataset_loading_pr.yml
* Try the latest datasets package (worked for me)
* Try the latest datasets package (worked for me)
* Try the latest datasets package (worked for me)
* (last?) try
* (last?) try
* (last?) try
* Reverting the changes
* Exclude the private datasets from tests
* Apply suggestions from code review
---------
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Solomatin Roman <samoed.roman@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* fix: add version check for `embeddinggemma-300m` (#3189)
add version check
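A minimal sketch of such a version guard, assuming a `packaging`-based comparison; the minimum version shown is a placeholder, not the PR's actual pin:
```python
from packaging.version import Version

import transformers

MIN_TRANSFORMERS = "4.46.0"  # placeholder minimum, not the PR's actual pin

# Refuse to load the model on an incompatible transformers release.
if Version(transformers.__version__) < Version(MIN_TRANSFORMERS):
    raise ImportError(
        f"embeddinggemma-300m requires transformers>={MIN_TRANSFORMERS}, "
        f"found {transformers.__version__}"
    )
```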
* dataset: Added a set of closed datasets (#3186)
* Add 12 more closed datasets
Extend the RTEB benchmarks
* trust_remote_code
* trust_remote_code
* Enabling JapaneseCode1Retrieval in the RTEB benchmarks
* Add closed datasets as private tasks
* Correct due to the comment
* Update tasks & benchmarks tables
* fix: Edit ack & sponsors (#3187)
* dataset: Update FaMTEB to Version 2 (#3157)
* Update benchmark to version 2
* make others in benchmark selector one line code
* small changes
* update a few tasks metadata
* update faintent license with correct form
* remove redundant trust remote codes
* fix hardnegatives revision
* make lint
* fix errors
* apply suggestions
* fix citation problem
* add PR link to benchmark desc
* remove duplicate dataset names in mcinext_models
* update prompts
---------
Co-authored-by: mehran <mehan.sarmadi16@gmail.com>
* Update tasks & benchmarks tables
* 1.38.55
Automatically generated by python-semantic-release
* fix: Add conflicting dependencies to toml (#3191)
fix conflict dependencies
* 1.38.56
Automatically generated by python-semantic-release
* fix: Correct metadata for ArguAna dataset (#3202)
* Update tasks & benchmarks tables
* 1.38.57
Automatically generated by python-semantic-release
* model: Add BMRetriever (#3195)
* model: Add BMRetriever
* Update mteb/models/bmretriever_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/bmretriever_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* fix: remove trust_remote_code option
* feat: implement BMREtrieverWrapper based on InstructSentenceTransformerWrapper
* refactor: update training datasets for bmretriever
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Revert "Ci: test out GH models with welcoming new comers" (#3206)
Revert "Ci: test out GH models with welcoming new comers (#3112)"
This reverts commit 73a35e0bb02e61108d50385f4c43fd7d1b16e984.
* model: Add Codefuse models (#3205)
* add codefuse models
* add codefuse models
* Update codefuse_models.py
* lint codefuse.py
* fix(models): ensure prompt_type is passed to format_instruction (#3216)
* 1.38.58
Automatically generated by python-semantic-release
* Adding Cohere's output_dimension and embedding_type parameter (#3204)
* Adding Cohere's output_dimension and embedding_type parameter
Cohere's embed-v4 binary and int8
* Correcting due to comments
* dataset: add swedish cpc patent classifications to mteb (#3072)
* feat: add swedish cpc patent classifications to mteb
* fix: formatting and init imports
* fix: update mteb task according to feedback
* fix: perform citation and code formatting
* fix: add train and test split for both datasets
* fix: AttributeError in ColPaliEngineWrapper similarity method (#3177)
* fix: delete kwargs for similarity score in ColPaliEngineWrapper for method behavior
* chore: fix colpali_models similarity handle device
* Update tasks & benchmarks tables
* 1.38.59
Automatically generated by python-semantic-release
* fix: prevent EOS token truncation (#3218)
* fix(models): prevent EOS token truncation for BMRetriever
* refactor(models): refactor tokenizer setup in `InstructSentenceTransformerWrapper`
* fix(models): correct eos token handling in `BMRetrieverWrapper`
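A minimal sketch of the pattern behind these fixes, assuming the problem is that naive truncation cuts off the appended EOS token (the checkpoint name and length budget are placeholders):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BMRetriever/BMRetriever-410M")  # placeholder checkpoint

max_length = 512  # placeholder token budget
ids = tokenizer(
    "a very long biomedical passage ...",
    truncation=True,
    max_length=max_length - 1,  # leave room for EOS
)["input_ids"]
ids.append(tokenizer.eos_token_id)  # EOS can no longer be truncated away
```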
* 1.38.60
Automatically generated by python-semantic-release
* Update giga embeddings (#3210)
* update giga embeddings
* update giga embeddings
* 3b-september-2025
* fixed
* lint
* Update mteb/models/ru_sentence_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* change revision due to flash-attn dependency
* change apply_instruction_to_passages
---------
Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>
* fix: Refactor split create_tables into static Benchmark methods (#3126)
* feat - Split create_tables into static Benchmark methods
* feat - format
* Update mteb/leaderboard/table.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* feat - remove search query; take benchmark result as input; address the circular import
* feat - format
* Update mteb/benchmarks/benchmark.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/benchmarks/benchmark.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* feat - use to_dataframe; clean table.py; move create_table
* feat - fix circular import
* feat - clean-up
* feat - format
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* 1.38.61
Automatically generated by python-semantic-release
* Extending the RTEB benchmark (#3223)
Adding another voyageai model
* Update tasks & benchmarks tables
* model: New qzmodel (#3211)
* Update qzhou_models.py
* Update qzhou_models.py
* reformat script code
* Update configuration
* According to our new decision, the model name has been changed to "QZhou-Embedding-Zh".
* Fix variable naming issues.
* model: Update Youtu embedding model (#3227)
* add youtu models
* add a blank line
* fix the optional dependencies and lint the code
* remove unused dependencies and reformat
* revise prompt_type
* update youtu_models
---------
Co-authored-by: springxchen <springxchen@tencent.com>
* dataset: Add Software Issue Localization Datasets (#3178)
* add software issue localization datasets
* add software issue localization datasets
* update and add multilingual datasets
* fix citation format issues
* Update mteb/tasks/Reranking/eng/SWEbenchVerifiedReranking.py
* fix linting issues
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update tasks & benchmarks tables
* feat: Officially include RTEB in the leaderboard (#3222)
* feat - adjust Rteb's Benchmark
* feat - add blank
* fix menu names
* Update mteb/leaderboard/benchmark_selector.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* moving around tasks
* fix: Update RTEB summary columns (#3226)
* feat - filter_by_privacy
* feat - add new fields for rteb part
* feat - getattr
* feat - adjust privacy filter logic
* feat - enhance summary table column renaming and add 'is_public' field mapping
* fix: remove unused 'is_public' attribute from TaskResult
---------
Co-authored-by: Yongbin Choi <whybe.choi@gmail.com>
Co-authored-by: semantic-release <semantic-release>
Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com>
Co-authored-by: Atheer <atheer2104@protonmail.com>
Co-authored-by: Yong woo Song <ywsong.dev@kakao.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Egor <31567312+ekolodin@users.noreply.github.com>
Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: smile <smile@pinai.io>
Co-authored-by: ethan <smiletoye@gmail.com>
* removed show_rteb args
* avoid defining a function where we can just use the metadata
* minor fixes
* minor fixes
* fix: Correct logic for filtering public tasks in ModelResult class (#3230)
Co-authored-by: ethan <smiletoye@gmail.com>
---------
Co-authored-by: q275343119 <275343119@qq.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: 笑尿伊人 <44760272+q275343119@users.noreply.github.com>
Co-authored-by: Yongbin Choi <whybe.choi@gmail.com>
Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com>
Co-authored-by: Atheer <atheer2104@protonmail.com>
Co-authored-by: Yong woo Song <ywsong.dev@kakao.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Egor <31567312+ekolodin@users.noreply.github.com>
Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>
Co-authored-by: smile <smile@pinai.io>
Co-authored-by: ethan <smiletoye@gmail.com>
* Update tasks & benchmarks tables
* 1.39.0
Automatically generated by python-semantic-release
* fix: Add submission references for RTEB (#3233)
* fix: Add rteb submission references and improve descriptions.
* Added evaluation request
* added field for tasks
* 1.39.1
Automatically generated by python-semantic-release
* dataset: add human tasks and benchmark (#3214)
* Human Subsets Tasks
* Fixed Multilingual Classification Subset
* linting
* fix citations format
* make lint
* fix tests
* remove human folder
* fix relative imports
* add adapted_from for all human subsets
* fix pydantic errors
* add benchmark object
* make benchmark discoverable
* bibtex test
* Apply suggestion
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Apply suggestions from code review
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* rename & reupload
* upd tests
* upd tests again
* add model
* add benchmark to leaderboard
* change branch of leaderboard
* remove branch of load data
* fix model meta path
* make mteb importable
* update repo
* Update mteb/benchmarks/benchmarks/benchmarks.py
* Update mteb/leaderboard/benchmark_selector.py
* Update mteb/load_results/load_results.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
---------
Co-authored-by: Adnan El Assadi <aassadi22@ku.edu.tr>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: AdnanElAssadi56 <115242814+AdnanElAssadi56@users.noreply.github.com>
* Update tasks & benchmarks tables
* Remove 'HUME(v1)' from leaderboard benchmark (#3236)
* Remove 'HUME(v1)' from leaderboard benchmark
* lint
* docs: Update adding benchmark documentation (#3229)
* update adding_a_benchmark.md documentation
* fix numbers
* fix: Further specified macro-language code for Norwegian (#3228)
* fix: Further specified macro-language code for Norwegian
"nor" is a macro-language code that covers bokmål and nynorsk (both norwegian), but this means that these datasets will be missed if using "nob" or "nno". Specifying it like this should allow this.
* further specified macro-language "nor"
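A minimal sketch of the intended effect, assuming the `languages` filter of `mteb.get_tasks` (the printed counts are illustrative):
```python
import mteb

# After the fix, Norwegian datasets tagged with the macro code "nor" should
# also surface when filtering by the sub-language codes.
tasks_nob = mteb.get_tasks(languages=["nob"])  # Bokmål
tasks_nno = mteb.get_tasks(languages=["nno"])  # Nynorsk
print(len(tasks_nob), len(tasks_nno))
```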
* Update tasks & benchmarks tables
* 1.39.2
Automatically generated by python-semantic-release
* fix max tokens (#3243)
* fix python39 transformers compatibility (#3254)
* fix python39 transformers
* fix
* Aggregate by subset for HUMEv1 (#3255)
aggregate by subset for HUMEv1
* Update tasks & benchmarks tables
* Fix AbsTaskTextRegression task (#3257)
Fix AbsTaskTextRegression
* Added Japanese to Retrieval (#3252)
* feat - add Japanese
* feat - use mteb.get_benchmark
* fix - 3.9 test error
* Revert "fix - 3.9 test error"
This reverts commit 6bfee53cff48304cc22d8248aa275dcc9e385475.
* fix - 3.9 test error
* Update tasks & benchmarks tables
* fix bm25 on small datasets (#3261)
* fix: Move zero-shot percentage calculation to the end of summary (#3231)
* Refactor: Move zero-shot percentage calculation to the end of summary table creation, which only applies to the RTEB table.
* Update RTEB benchmark name from "RTEB(beta)" to "RTEB" for consistency in display.
* feat - RTEB(beta)
* feat - remove Zero-shot
---------
Co-authored-by: ethan <smiletoye@gmail.com>
* model: Add ReasonIR (#3221)
* model: Add ReasonIR
* Update mteb/models/reasonir_model.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/reasonir_model.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* update n_parameters of ReasonIR
Co-authored-by: Niklas <n.muennighoff@gmail.com>
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Niklas <n.muennighoff@gmail.com>
* fix: Only pin model name and rank (#3263)
Currently we pin 3 columns, which makes the table hard or impossible to view on phones. The 3rd column is also no longer guaranteed, as the RTEB leaderboard does not use the zero-shot column
* 1.39.3
Automatically generated by python-semantic-release
* fix: resolve flash-attention dependency issue (#3265)
* fix: resolve flash-attention dependency issue
This has been tested and works.
Fixes #3240
* 1.39.4
Automatically generated by python-semantic-release
* fix: Add retry and token counting in Cohere models (#3253)
* Retry and token counting in Cohere models
* Retry and token counting in Cohere models
* Retry and token counting in Cohere models
---------
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
* 1.39.5
Automatically generated by python-semantic-release
* Align MIEB leaderboards with paper (#3272)
* sort by mean task type and use pure rank for MIEB LBs
* lint
* rename task type column for readability
* fix: add prompt for MIRACLRetrievalHardNegatives (#3266)
* add prompt for MIRACLRetrievalHardNegatives
* add `MIRACLRetrievalHardNegatives.v2`
* Update mteb/tasks/Retrieval/multilingual/MIRACLRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* move common metadata to dict
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update tasks & benchmarks tables
* Add Regression task mock (#3271)
* 1.39.6
Automatically generated by python-semantic-release
* fix: Change language for task SlovakMovieReviewSentimentClassification (#3296)
* Update tasks & benchmarks tables
* 1.39.7
Automatically generated by python-semantic-release
* Add english code retriever model (#3302)
* Add en code retriever model
* fix model_name
* Update mteb/models/en_code_retriever.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* correct lint
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* docs: fix typos in `docs/adding_a_benchmark.md` (#3344)
* BREAKING: v2.0.0 (#1433)
* [v2] Merge…
Cherry-picked commits from #3213 from @AdnanElAssadi56