Add english code retriever model #3302
Merged
Samoed merged 4 commits into embeddings-benchmark:main on Oct 10, 2025
Conversation
Samoed
approved these changes
Oct 9, 2025
Member
Samoed
left a comment
Could you submit results of your model to the results repository?
Samoed
reviewed
Oct 9, 2025
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Contributor
Author
Hi! I've submitted the results and sent PR #296.
Samoed
added a commit
that referenced
this pull request
Oct 12, 2025
* fix: Move zero-shot percentage calculation to the end of summary (#3231)
* Refactor: Move zero-shot percentage calculation to the end of summary table creation, which only applies to the RTEB table.
* Update RTEB benchmark name from "RTEB(beta)" to "RTEB" for consistency in display.
* feat - RTEB(beta)
* feat - remove Zero-shot
---------
Co-authored-by: ethan <smiletoye@gmail.com>
* model: Add ReasonIR (#3221)
* model: Add ReasonIR
* Update mteb/models/reasonir_model.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/reasonir_model.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* update n_parameters of ReasonIR
Co-authored-by: Niklas <n.muennighoff@gmail.com>
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Niklas <n.muennighoff@gmail.com>
* fix: Only pin model name and rank (#3263)
Currently we pin 3 columns, which makes the table hard or impossible to view on phones. The 3rd column is also no longer guaranteed, as the RTEB leaderboard does not use the zero-shot column.
* 1.39.3
Automatically generated by python-semantic-release
* fix: resolve flash-attention dependency issue (#3265)
* fix: Only pin model name and rank
Currently we pin 3 columns, which makes the table hard or impossible to view on phones. The 3rd column is also no longer guaranteed, as the RTEB leaderboard does not use the zero-shot column.
* fix: resolve flash-attention dependency issue
This has been tested and works.
* fixed: Resolve flash-attention dependency issues
Fixes #3240
* 1.39.4
Automatically generated by python-semantic-release
* fix: Add retry and token counting in Cohere models (#3253)
* Retry and token counting in Cohere models
* Retry and token counting in Cohere models
* Retry and token counting in Cohere models
---------
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
* 1.39.5
Automatically generated by python-semantic-release
* Align MIEB leaderboards with paper (#3272)
* sort by mean task type and use pure rank for MIEB LBs
* lint
* rename task type column for readability
* fix: add prompt for MIRACLRetrievalHardNegatives (#3266)
* add prompt for MIRACLRetrievalHardNegatives
* add `MIRACLRetrievalHardNegatives.v2`
* Update mteb/tasks/Retrieval/multilingual/MIRACLRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* move common metadata to dict
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update tasks & benchmarks tables
* Add Regression task mock (#3271)
* 1.39.6
Automatically generated by python-semantic-release
* fix: Change language for task SlovakMovieReviewSentimentClassification (#3296)
* Update tasks & benchmarks tables
* 1.39.7
Automatically generated by python-semantic-release
* Add english code retriever model (#3302)
* Add en code retriever model
* fix model_name
* Update mteb/models/en_code_retriever.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* correct lint
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* updates after merge
* fix tests
* fix tests
---------
Co-authored-by: 笑尿伊人 <44760272+q275343119@users.noreply.github.com>
Co-authored-by: ethan <smiletoye@gmail.com>
Co-authored-by: Yongbin Choi <whybe.choi@gmail.com>
Co-authored-by: Niklas <n.muennighoff@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: semantic-release <semantic-release>
Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Настя Шахматова <90767498+Drozhzhinastya@users.noreply.github.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Alexey Vatolin <vatolinalex@gmail.com>
Co-authored-by: Andrej Ridzik <andrej.ridzik@gmail.com>
Co-authored-by: fedor28 <37560717+fedor28@users.noreply.github.com>
mrshu
pushed a commit
to slovak-nlp/mteb
that referenced
this pull request
Oct 14, 2025
* Add en code retriever model
* fix model_name
* Update mteb/models/en_code_retriever.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* correct lint
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Samoed
added a commit
that referenced
this pull request
Oct 22, 2025
* model: add image support for jina embeddings v4 (#2893)
* feat: unify text and image embeddings for all tasks
* fix: uniform batch size
* fix: update error message
* fix: update code task
* fix: update max length
* fix: apply review suggestions
* model: add kalm_models (kalm-emb-v2) ModelMeta (new PR) (#2889)
* feat: add KaLM_Embedding_X_0605 in kalm_models
* Update kalm_models.py for lint format
* kalm-emb-v2
* kalm-emb-v2
* kalm-emb-v2
* kalm-emb-v2
* kalm-emb-v2
---------
Co-authored-by: xinshuohu <xinshuohu@tencent.com>
Co-authored-by: Xinshuo Hu <yanshek.woo@gmail.com>
* Add Classification Evaluator unit test (#2838)
* Adding Classification Evaluator test
* Modifications due to the comments
* Update tests/test_evaluators/test_ClassificationEvaluator.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update tests/test_evaluators/test_ClassificationEvaluator.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Modifications due to the comments
* Modifications due to the comments
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* fix: update colpali engine models (#2905)
* adding vidore benchmarks
* fix typo
* clean vidore names + per lang eval
* lint
* vidore names
* bibtex fix
* fix revision
* vidore v2 citation
* update citation format and fix per-language mappings
* lint: citations
* typo citations
* fix revisions
* lint
* fix colnomic3b revision
* fix colqwen2.5 revision + latest repo version
* fix query augmentation tokens
* colsmol revision
* 1.38.35
Automatically generated by python-semantic-release
* Evaluator tests (#2910)
* Adding Classification Evaluator test
* Modifications due to the comments
* Update tests/test_evaluators/test_ClassificationEvaluator.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update tests/test_evaluators/test_ClassificationEvaluator.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Modifications due to the comments
* Modifications due to the comments
* Adding STSEvaluator and SummarizationEvaluator tests
* Correcting due to the comments
* Correcting due to the comments
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Classification dataset cleaning (#2900)
* Classification dataset cleaning
* Update pull request number
* Fix metadata test
* fix formatting
* add script for cleaning
* Update tasks & benchmarks tables
* dataset: Add JapaneseSentimentClassification (#2913)
Add JapaneseSentimentClassification
* Update tasks & benchmarks tables
* fix: change `passage` prompt to `document` (#2912)
* change document to passage
* fix prompt names
* fix kwargs check
* fix default prompt
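The prompt lookup this change describes (task-specific prompt first, then a default) can be sketched roughly as follows; the function and dict here are illustrative, not mteb's actual implementation:

```python
def select_prompt(prompts: dict, task_name: str, prompt_type: str, default: str = "") -> str:
    """Look up a prompt by '{task}-{type}', then task name, then prompt type."""
    for key in (f"{task_name}-{prompt_type}", task_name, prompt_type):
        if key in prompts:
            return prompts[key]
    return default

# With the renamed "document" key, a retrieval task resolves to the document prompt:
prompts = {"document": "Represent the document:", "query": "Represent the query:"}
print(select_prompt(prompts, "NFCorpus", "document"))  # Represent the document:
```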
* 1.38.36
Automatically generated by python-semantic-release
* model: Add OpenSearch inf-free sparse encoding models (#2903)
add opensearch inf-free models
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* dataset: add BarExamQA dataset (#2916)
* Add BarExamQA retrieval task
* ran linter
* updated details
* updated details
* fixed subtype name
* fixed changes
* ran linter again
* Use `mteb.get_model` in adding_a_dataset.md (#2922)
Update adding_a_dataset.md
* fix: specify revision for opensearch (#2919)
specify revision for opensearch
* 1.38.37
Automatically generated by python-semantic-release
* Update the link for gemini-embedding-001 (#2928)
* fix: replace with passage (#2934)
* fix: Only import SparseEncoder once sentence-transformer version have been checked (#2940)
* fix: Only import SparseEncoder once sentence-transformer version have been checked
fixes #2936
* Update mteb/models/opensearch_neural_sparse_models.py
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* fix: Prevent incorrectly passing "selector_state" to `get_benchmark` (#2939)
The leaderboard would have silent errors where `get_benchmark` led to a KeyError because "selector_state" was being passed as a default value. Setting `DEFAULT_BENCMARK_NAME` as the value solves this issue.
* docs: Update adding_a_dataset.md (#2947)
* docs: Update adding_a_dataset.md
* Update docs/adding_a_dataset.md
* ci: bump semantic release
* 1.38.38
Automatically generated by python-semantic-release
* dataset: Add BSARD v2, fixing the data loading issues of v1 (#2935)
* BSARD loader fixed
* BSARDv2 metadata fixed
* Update mteb/tasks/Retrieval/fra/BSARDRetrieval.py
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update tasks & benchmarks tables
* dataset: add GovReport dataset (#2953)
* Added govreport task
* Updated description
* dataset: add BillSum datasets (#2943)
* Added BillSum datasets
* fixed billsumca
* Updated BillSumCA description
* Updated BillSumUS description
* Update mteb/tasks/Retrieval/eng/BillSumCA.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/BillSumUS.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* lint
* lint
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Update tasks & benchmarks tables
* fix: Add new benchmark beRuSciBench along with AbsTaskTextRegression (#2716)
* Add RuSciBench
* fix bitext mining lang
* Add regression task
* fix init
* add missing files
* Improve description
* Add superseded_by
* fix lint
* Update regression task to match with v2
* Add stratified_subsampling for regression task
* Add bootstrap for regression task
* Rename task class, add model as evaluator argument
* fix import
* fix import 2
* fixes
* fix
* Rename regression model protocol
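The bootstrap step added above can be sketched as follows; this is a generic illustration of bootstrapping a regression metric, not the actual mteb implementation:

```python
import random

def bootstrap_mean_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Resample with replacement, compute each resample's mean, and take
    empirical percentiles as a (1 - alpha) confidence interval."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi

lo, hi = bootstrap_mean_ci([0.71, 0.69, 0.74, 0.70, 0.72])
```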
* Update tasks & benchmarks tables
* 1.38.39
Automatically generated by python-semantic-release
* qzhou-embedding model_meta & implementation (#2975)
* qzhou-embedding model_meta & implementation
* Update qzhou_models.py
* Update qzhou_models.py
Processing todo items (add default instruction)
* Update qzhou_models.py
correct bge datalist
* Update qzhou_models.py
correct 'public_training_data'
* Update qzhou_models.py
* Update qzhou_models.py
* Update qzhou_models.py
* Update qzhou_models.py
* Update mteb/models/qzhou_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/qzhou_models.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* format qzhou_models.py for ruff check
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* model: Add Voyage 3.5 model configuration (#3005)
Add Voyage 3.5 model configuration
- Add voyage_3_5 ModelMeta with 1024 embed dimensions and 32000 max tokens
- Set release date to 2025-01-21 with revision 1
- Configure for cosine similarity with instruction support
- Include standard Voyage training datasets reference
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-authored-by: Claude <noreply@anthropic.com>
* model: BAAI/bge-m3-unsupervised Model (#3007)
* Add BAAI/bge-m3-unsupervised Model
(BAAI/bge_m3_retromae is commented out: the details are correct, but the model fails to load for me, so I commented it out)
* Remove the commented retromae model
---------
Co-authored-by: fzowl <zoltan@voyageai.com>
* lint: Correcting lint errors (#3004)
* Adding Classification Evaluator test
* Modifications due to the comments
* Update tests/test_evaluators/test_ClassificationEvaluator.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update tests/test_evaluators/test_ClassificationEvaluator.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Modifications due to the comments
* Modifications due to the comments
* Correcting the lint errors
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* dataset: Added 50 Vietnamese dataset from vn-mteb (#2964)
* [ADD] 50 Vietnamese datasets from vn-mteb
* [UPDATE] task metadata
* [UPDATE] import dependencies
* [UPDATE] task metadata, bibtex citation
* [UPDATE-TEST] test_model_meta
* [UPDATE] sample_creation to machine-translated and LM verified
* [ADD] sample creation machine-translated and LM verified
* [REMOVE] default fields metadata in Classification tasks
* Update tasks & benchmarks tables
* model: Add Cohere embed-v4.0 model support (#3006)
* Add Cohere embed-v4.0 model support
- Add text-only embed-v4.0 model in cohere_models.py
- Add multimodal embed-v4.0 model in cohere_v.py
- Support configurable dimensions (256, 512, 1024, 1536)
- Support 128,000 token context length
- Support multimodal embedding (text, images, mixed PDFs)
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add Cohere embed-v4.0 model support
Update cohere_v.py and cohere_models.py to include the new embed-v4.0 model with proper configuration and integration.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
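The configurable-dimension support listed above can be illustrated with a small validation helper; the names are hypothetical, not Cohere's actual client API:

```python
# Dimensions listed in the commit message for embed-v4.0.
SUPPORTED_DIMS = (256, 512, 1024, 1536)

def check_output_dimension(dim: int) -> int:
    """Reject any embedding dimension the model does not support."""
    if dim not in SUPPORTED_DIMS:
        raise ValueError(f"embed-v4.0 supports dimensions {SUPPORTED_DIMS}, got {dim}")
    return dim
```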
* Add OpenAI models with 512 dimension (#3008)
* Add OpenAI/text-embedding-3-small (512 dim)
Add OpenAI/text-embedding-3-large (512 dim)
* Correcting due to comments
---------
Co-authored-by: fzowl <zoltan@voyageai.com>
* Standardise task names and fix citation formatting (#3026)
fixes for name formatting
* Update tasks & benchmarks tables
* fix: Add missing training sets for qzhou (#3023)
* Supplement missing training sets
* reformat code
* Reorganize the data list format
* update qzhou_model meta
* 1.38.40
Automatically generated by python-semantic-release
* model: Add samilpwc_models meta (#3028)
* model: Add samilpwc_models meta
* Fix: Remove CONST
* Fix: Reformat File
* Update: model revision
* model: Add granite-vision-embedding model (#3029)
* Add files via upload
* Address review comments
* Address review comments
* ruff format
* Update mteb/models/granite_vision_embedding_models.py
* lint error fix
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* fix: incorrect revision for SNLRetrieval (#3033)
The provided revision doesn't seem to be present on adrlau/navjordj-SNL_summarization_copy; replacing it with the latest revision.
* dataset: Add HumanEvalRetrieval task (#3022)
* Add HumanEvalRetrieval dataset
* Fix TaskMetadata structure and remove descriptive_stats
- Use TaskMetadata class instead of dict
- Remove descriptive_stats as requested in PR review
- Add date field and proper import structure
* Fix dataset path and use verified metadata
- Change path from zeroshot/humaneval-embedding-benchmark to embedding-benchmark/HumanEval
- Use actual description from HuggingFace dataset page
- Remove fabricated citation and reference
- Remove revision field that was incorrect
- Reference HuggingFace dataset page instead of arxiv
* Add correct revision hash to HumanEval
- Add revision hash: ed1f48a for reproducibility
* Fix HumanEval metadata validation
- Add date field for metadata completeness
- Add bibtex_citation field (empty string)
- Required for TaskMetadata validation to pass
- Should resolve PR test failure
* Address reviewer feedback
- Remove trust_remote_code parameter as requested
- Add revision parameter to load_dataset() calls for consistency
- Use metadata revision hash in dataset loading for reproducibility
* Fix field names in HumanEval dataset loading
Changed query_id/corpus_id to query-id/corpus-id to match actual dataset format.
* Fix deprecated metadata_dict usage
Use self.metadata.dataset instead of self.metadata_dict for v2.0 compatibility.
* Fix data structure for MTEB compatibility
- Organize data by splits as expected by MTEB retrieval tasks
- Convert scores to integers for pytrec_eval compatibility
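The two data-format fixes above (hyphenated `query-id`/`corpus-id` field names, split-keyed organisation, and integer scores for pytrec_eval) can be sketched as; the function name is illustrative:

```python
def to_pytrec_qrels(rows):
    """Group flat rows into {query-id: {corpus-id: int(score)}}, the
    nested mapping pytrec_eval expects (relevance scores must be ints)."""
    qrels = {}
    for row in rows:
        qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels

rows = [
    {"query-id": "q1", "corpus-id": "d1", "score": "1"},
    {"query-id": "q1", "corpus-id": "d2", "score": "0"},
]
print(to_pytrec_qrels(rows))  # {'q1': {'d1': 1, 'd2': 0}}
```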
* Address PR feedback for HumanEval dataset
- Add descriptive statistics using calculate_metadata_metrics()
- Enhance metadata description with dataset structure details
- Add complete BibTeX citation for original paper
- Update to full commit hash revision
- Add python-Code language tag for programming language
- Explain retrieval task formulation clearly
* Fix BibTeX citation formatting for HumanEvalRetrieval
- Update citation to match bibtexparser formatting requirements
- Fields now in alphabetical order with lowercase names
- Proper trailing commas and indentation
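The bibtexparser-style normalisation described above (lowercase field names in alphabetical order, trailing commas) can be sketched as:

```python
def normalise_bibtex_fields(fields: dict) -> str:
    """Lowercase field names, sort them alphabetically, and emit each
    field indented with a trailing comma."""
    pairs = sorted((k.lower(), v) for k, v in fields.items())
    return "\n".join(f"  {k} = {{{v}}}," for k, v in pairs)

out = normalise_bibtex_fields({"Year": "2021", "author": "Chen et al.", "Title": "Evaluating LLMs"})
print(out)
```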
* Update tasks & benchmarks tables
* 1.38.41
Automatically generated by python-semantic-release
* ci: reduce parallel runs for when checking if a dataset exists (#3035)
The hope is that this will prevent many of the current [errors](https://github.com/embeddings-benchmark/mteb/actions/runs/17019125199/job/48245690831)
* ci: Update rerun delays to prevent false-positive errors
* ci: Update rerun delays to prevent false-positive errors
* model: Add GreenNode Vietnamese Embedding models (#2994)
* [ADD] 50 Vietnamese datasets from vn-mteb
* [UPDATE] task metadata
* [UPDATE] import dependencies
* [UPDATE] task metadata, bibtex citation
* [UPDATE-TEST] test_model_meta
* [UPDATE] sample_creation to machine-translated and LM verified
* [ADD] sample creation machine-translated and LM verified
* [ADD] Vietnamese Embedding models
* [REMOVE] default fields metadata in Classification tasks
* [UPDATE] model to vi-vn language specific file
* [FIX] lint
* [FIX] model loader
* model: add granite-embedding-english R2 models (#3050)
* fix: Updated revision for jina-embeddings-v4 (#3046)
* fix: jinav4 revision
Signed-off-by: admin <bo.wang@jina.ai>
* change revision instead of removing it
Signed-off-by: admin <bo.wang@jina.ai>
---------
Signed-off-by: admin <bo.wang@jina.ai>
Co-authored-by: admin <bo.wang@jina.ai>
* 1.38.42
Automatically generated by python-semantic-release
* Fix 3 VN-MTEB Pair Classification tasks (#3053)
* [ADD] 50 Vietnamese datasets from vn-mteb
* [UPDATE] task metadata
* [UPDATE] import dependencies
* [UPDATE] task metadata, bibtex citation
* [UPDATE-TEST] test_model_meta
* [UPDATE] sample_creation to machine-translated and LM verified
* [ADD] sample creation machine-translated and LM verified
* [ADD] Vietnamese Embedding models
* [REMOVE] default fields metadata in Classification tasks
* [UPDATE] model to vi-vn language specific file
* [FIX] lint
* [FIX] model loader
* [FIX] VN-MTEB 3 datasets PairClassification rename column
* dataset: Add mbpp retrieval (#3037)
* Add MBPP retrieval task
- Code retrieval task based on 378 Python programming problems
- Natural language queries matched to Python code implementations
- Uses python-Code evaluation language for code-specific metrics
- Includes proper citations and descriptive statistics
* Add MBPPRetrieval to imports
* Add descriptive statistics for MBPPRetrieval
* Reformatting
* Reformatting
* Update tasks & benchmarks tables
* dataset: Added wikisql retrieval (#3039)
* Add WikiSQL retrieval task
- Code retrieval task based on WikiSQL natural language to SQL dataset
- Natural language questions matched to SQL query implementations
- Uses sql-Code evaluation language for SQL-specific metrics
- Includes proper citations and descriptive statistics
* Add WikiSQLRetrieval to imports
* Add descriptive statistics for WikiSQLRetrieval
* Reformatting
* Reformatting
* Reformatting, correcting the revision
* Update tasks & benchmarks tables
* ci: Temporarily limit pytrec version to "pytrec-eval-terrier>=0.5.6, <0.5.8" to prevent errors
try to fix CI
* fix MBPPRetrieval revision (#3055)
Update MBPPRetrieval.py
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
* fix: Add VN-MTEB benchmark and Leaderboard (#2995)
* [ADD] 50 Vietnamese datasets from vn-mteb
* [UPDATE] task metadata
* [UPDATE] import dependencies
* [UPDATE] task metadata, bibtex citation
* [UPDATE-TEST] test_model_meta
* [UPDATE] sample_creation to machine-translated and LM verified
* [ADD] sample creation machine-translated and LM verified
* [ADD] VN-MTEB benchmark and leaderboard
* [FIX] wrong benchmark name
* [REMOVE] default fields metadata in Classification tasks
* Update tasks & benchmarks tables
* 1.38.43
Automatically generated by python-semantic-release
* Add hc3finance retrieval (#3041)
* Add HC3Finance retrieval task
- Financial retrieval task based on HC3 Finance dataset
- Financial questions matched to human and AI-generated content
- Covers financial explanations, analysis, and educational content
- Includes proper citations and descriptive statistics
* Add HC3FinanceRetrieval to imports
* Add descriptive statistics for HC3FinanceRetrieval
* Reformatting
* Reformatting, correcting the revision
* Update mteb/tasks/Retrieval/eng/HC3FinanceRetrieval.py
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Add finqa retrieval (#3042)
* Add FinQA retrieval task
- Financial numerical reasoning retrieval task based on FinQA dataset
- Numerical financial questions matched to relevant document data
- Covers earnings reports with tables and quantitative financial data
- Includes proper citations and descriptive statistics
* Add FinQARetrieval to imports
* Add descriptive statistics for FinQARetrieval
* Reformatting
* Reformatting
* Update mteb/tasks/Retrieval/eng/FinQARetrieval.py
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Update tasks & benchmarks tables
* Add FinanceBenchRetrieval task (#3044)
* Add FinanceBenchRetrieval
* Update mteb/tasks/Retrieval/eng/FinanceBenchRetrieval.py
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Update tasks & benchmarks tables
* Add FreshStackRetrieval task (#3043)
* Add FreshStackRetrieval
* Reformatting, correcting the revision
* Dataset correction
* Update tasks & benchmarks tables
* dataset: Add ds1000 retrieval (#3038)
* Add DS1000 retrieval task
- Code retrieval task based on 1,000 data science programming problems
- Natural language queries matched to Python data science code
- Uses python-Code evaluation language for code-specific metrics
- Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries
* Add DS1000Retrieval to imports
* Add descriptive statistics for DS1000Retrieval
* Reformatting
* Reformatting
* Update tasks & benchmarks tables
* Add ChatDoctorRetrieval (#3045)
* Add ChatDoctorRetrieval
* Reformatting, correcting the revision
* Correct the dataset citation
* Correcting due to comments
* Update tasks & benchmarks tables
* Correcting the (new) DS1000 dataset's revision (#3063)
* Add DS1000 retrieval task
- Code retrieval task based on 1,000 data science programming problems
- Natural language queries matched to Python data science code
- Uses python-Code evaluation language for code-specific metrics
- Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries
* Add DS1000Retrieval to imports
* Add descriptive statistics for DS1000Retrieval
* Reformatting
* Reformatting
* Add DS1000Retrieval task implementation
* dataset: Add JinaVDR (#2942)
* feat: added jinavdr benchmark
* feat: added description for jinavdr
* feat: fixed licenses and added bibtex
* feat: made jinav4 compatible with vidore benchmark
* feat: corrected query numbers
* feat: removed print
* feat: added max pixel argument for jina models
* feat: score calculation on cpu
* feat: adjust jina model for new mteb code
* feat: code cleanup
* feat: corrected bibtex
* feat: make colpali run with jinavdr
* feat: fixed comments
* feat: better reference and fixed comments
* feat: added date for tasks
* feat: fixed missing metadata and bibtex
* feat: added descriptions per dataset
* Update tasks & benchmarks tables
* model: Add CoDi-Embedding-V1 (#3054)
* add codiemb-minicpm
* replace codiemb_minicpm with codi_model
* Update mteb/models/codi_model.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/codi_model.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/codi_model.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* update code
* update code
* reformat
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* fix: ensure that there are always relevant docs attached to query (#3058)
* fix: ensure that there are always relevant docs attached to query
Here is a brief test showing that it doesn't influence scores:
```py
t1 = mteb.get_task("TwitterHjerneRetrieval")
meta = mteb.get_model_meta("minishlab/potion-base-2M")
eval = mteb.MTEB(tasks=[t1])
res = eval.run(model=meta.load_model())
# before fix:
res[0].get_score() # np.float64(0.02837)
res[0].scores
before_fix = {
"train": [
{
"ndcg_at_1": 0.02597,
"ndcg_at_3": 0.02213,
"ndcg_at_5": 0.0262,
"ndcg_at_10": 0.02837,
"ndcg_at_20": 0.04548,
"ndcg_at_100": 0.13527,
"ndcg_at_1000": 0.24507,
"map_at_1": 0.00866,
"map_at_3": 0.01317,
"map_at_5": 0.0149,
"map_at_10": 0.01562,
"map_at_20": 0.01898,
"map_at_100": 0.02968,
"map_at_1000": 0.03841,
"recall_at_1": 0.00866,
"recall_at_3": 0.02056,
"recall_at_5": 0.02922,
"recall_at_10": 0.03355,
"recall_at_20": 0.08268,
"recall_at_100": 0.43766,
"recall_at_1000": 1.0,
"precision_at_1": 0.02597,
"precision_at_3": 0.02165,
"precision_at_5": 0.01818,
"precision_at_10": 0.01039,
"precision_at_20": 0.01234,
"precision_at_100": 0.01481,
"precision_at_1000": 0.0034,
"mrr_at_1": 0.025974,
"mrr_at_3": 0.041126,
"mrr_at_5": 0.04632,
"mrr_at_10": 0.048485,
"mrr_at_20": 0.058356,
"mrr_at_100": 0.070186,
"mrr_at_1000": 0.071349,
"nauc_ndcg_at_1_max": 0.33969,
"nauc_ndcg_at_1_std": -0.202864,
"nauc_ndcg_at_1_diff1": -0.127,
"nauc_ndcg_at_3_max": 0.409376,
"nauc_ndcg_at_3_std": -0.039352,
"nauc_ndcg_at_3_diff1": -0.022816,
"nauc_ndcg_at_5_max": 0.250499,
"nauc_ndcg_at_5_std": -0.115263,
"nauc_ndcg_at_5_diff1": -0.057017,
"nauc_ndcg_at_10_max": 0.238696,
"nauc_ndcg_at_10_std": -0.138396,
"nauc_ndcg_at_10_diff1": -0.045287,
"nauc_ndcg_at_20_max": 0.154456,
"nauc_ndcg_at_20_std": -0.070635,
"nauc_ndcg_at_20_diff1": 0.074499,
"nauc_ndcg_at_100_max": -0.005753,
"nauc_ndcg_at_100_std": -0.074738,
"nauc_ndcg_at_100_diff1": -0.005851,
"nauc_ndcg_at_1000_max": 0.109439,
"nauc_ndcg_at_1000_std": -0.089797,
"nauc_ndcg_at_1000_diff1": -0.021634,
"nauc_map_at_1_max": 0.33969,
"nauc_map_at_1_std": -0.202864,
"nauc_map_at_1_diff1": -0.127,
"nauc_map_at_3_max": 0.385244,
"nauc_map_at_3_std": -0.080638,
"nauc_map_at_3_diff1": -0.060991,
"nauc_map_at_5_max": 0.294871,
"nauc_map_at_5_std": -0.119069,
"nauc_map_at_5_diff1": -0.06234,
"nauc_map_at_10_max": 0.285698,
"nauc_map_at_10_std": -0.132856,
"nauc_map_at_10_diff1": -0.055015,
"nauc_map_at_20_max": 0.236619,
"nauc_map_at_20_std": -0.100673,
"nauc_map_at_20_diff1": -0.002619,
"nauc_map_at_100_max": 0.15345,
"nauc_map_at_100_std": -0.138888,
"nauc_map_at_100_diff1": -0.02257,
"nauc_map_at_1000_max": 0.171402,
"nauc_map_at_1000_std": -0.134644,
"nauc_map_at_1000_diff1": -0.034477,
"nauc_recall_at_1_max": 0.33969,
"nauc_recall_at_1_std": -0.202864,
"nauc_recall_at_1_diff1": -0.127,
"nauc_recall_at_3_max": 0.375072,
"nauc_recall_at_3_std": -0.009643,
"nauc_recall_at_3_diff1": -0.089168,
"nauc_recall_at_5_max": 0.147691,
"nauc_recall_at_5_std": -0.128654,
"nauc_recall_at_5_diff1": -0.084259,
"nauc_recall_at_10_max": 0.141055,
"nauc_recall_at_10_std": -0.165932,
"nauc_recall_at_10_diff1": -0.060966,
"nauc_recall_at_20_max": 0.043863,
"nauc_recall_at_20_std": -0.028374,
"nauc_recall_at_20_diff1": 0.157575,
"nauc_recall_at_100_max": -0.157183,
"nauc_recall_at_100_std": -0.019437,
"nauc_recall_at_100_diff1": 0.013395,
# "nauc_recall_at_1000_max": nan,
# "nauc_recall_at_1000_std": nan,
# "nauc_recall_at_1000_diff1": nan,
"nauc_precision_at_1_max": 0.33969,
"nauc_precision_at_1_std": -0.202864,
"nauc_precision_at_1_diff1": -0.127,
"nauc_precision_at_3_max": 0.406318,
"nauc_precision_at_3_std": 0.007031,
"nauc_precision_at_3_diff1": -0.034709,
"nauc_precision_at_5_max": 0.178131,
"nauc_precision_at_5_std": -0.112493,
"nauc_precision_at_5_diff1": -0.045535,
"nauc_precision_at_10_max": 0.167897,
"nauc_precision_at_10_std": -0.150626,
"nauc_precision_at_10_diff1": -0.027811,
"nauc_precision_at_20_max": 0.081428,
"nauc_precision_at_20_std": -0.042304,
"nauc_precision_at_20_diff1": 0.17278,
"nauc_precision_at_100_max": -0.150619,
"nauc_precision_at_100_std": 0.016133,
"nauc_precision_at_100_diff1": -0.065571,
"nauc_precision_at_1000_max": -0.017244,
"nauc_precision_at_1000_std": 0.046614,
"nauc_precision_at_1000_diff1": -0.028258,
"nauc_mrr_at_1_max": 0.33969,
"nauc_mrr_at_1_std": -0.202864,
"nauc_mrr_at_1_diff1": -0.127,
"nauc_mrr_at_3_max": 0.409511,
"nauc_mrr_at_3_std": -0.064671,
"nauc_mrr_at_3_diff1": -0.01911,
"nauc_mrr_at_5_max": 0.319584,
"nauc_mrr_at_5_std": -0.103546,
"nauc_mrr_at_5_diff1": -0.025109,
"nauc_mrr_at_10_max": 0.309614,
"nauc_mrr_at_10_std": -0.117564,
"nauc_mrr_at_10_diff1": -0.019691,
"nauc_mrr_at_20_max": 0.262976,
"nauc_mrr_at_20_std": -0.092222,
"nauc_mrr_at_20_diff1": 0.024507,
"nauc_mrr_at_100_max": 0.256052,
"nauc_mrr_at_100_std": -0.094249,
"nauc_mrr_at_100_diff1": 0.012432,
"nauc_mrr_at_1000_max": 0.260112,
"nauc_mrr_at_1000_std": -0.098845,
"nauc_mrr_at_1000_diff1": 0.009697,
"main_score": 0.02837,
"hf_subset": "default",
"languages": ["dan-Latn"],
}
]
}
# with update:
res[0].get_score() # np.float64(0.02837)
res[0].scores
with_fix = {
"train": [
{
"ndcg_at_1": 0.02597,
"ndcg_at_3": 0.02213,
"ndcg_at_5": 0.0262,
"ndcg_at_10": 0.02837,
"ndcg_at_20": 0.04548,
"ndcg_at_100": 0.13527,
"ndcg_at_1000": 0.24507,
"map_at_1": 0.00866,
"map_at_3": 0.01317,
"map_at_5": 0.0149,
"map_at_10": 0.01562,
"map_at_20": 0.01898,
"map_at_100": 0.02968,
"map_at_1000": 0.03841,
"recall_at_1": 0.00866,
"recall_at_3": 0.02056,
"recall_at_5": 0.02922,
"recall_at_10": 0.03355,
"recall_at_20": 0.08268,
"recall_at_100": 0.43766,
"recall_at_1000": 1.0,
"precision_at_1": 0.02597,
"precision_at_3": 0.02165,
"precision_at_5": 0.01818,
"precision_at_10": 0.01039,
"precision_at_20": 0.01234,
"precision_at_100": 0.01481,
"precision_at_1000": 0.0034,
"mrr_at_1": 0.025974,
"mrr_at_3": 0.041126,
"mrr_at_5": 0.04632,
"mrr_at_10": 0.048485,
"mrr_at_20": 0.058356,
"mrr_at_100": 0.070186,
"mrr_at_1000": 0.071349,
"nauc_ndcg_at_1_max": 0.33969,
"nauc_ndcg_at_1_std": -0.202864,
"nauc_ndcg_at_1_diff1": -0.127,
"nauc_ndcg_at_3_max": 0.409376,
"nauc_ndcg_at_3_std": -0.039352,
"nauc_ndcg_at_3_diff1": -0.022816,
"nauc_ndcg_at_5_max": 0.250499,
"nauc_ndcg_at_5_std": -0.115263,
"nauc_ndcg_at_5_diff1": -0.057017,
"nauc_ndcg_at_10_max": 0.238696,
"nauc_ndcg_at_10_std": -0.138396,
"nauc_ndcg_at_10_diff1": -0.045287,
"nauc_ndcg_at_20_max": 0.154456,
"nauc_ndcg_at_20_std": -0.070635,
"nauc_ndcg_at_20_diff1": 0.074499,
"nauc_ndcg_at_100_max": -0.005753,
"nauc_ndcg_at_100_std": -0.074738,
"nauc_ndcg_at_100_diff1": -0.005851,
"nauc_ndcg_at_1000_max": 0.109439,
"nauc_ndcg_at_1000_std": -0.089797,
"nauc_ndcg_at_1000_diff1": -0.021634,
"nauc_map_at_1_max": 0.33969,
"nauc_map_at_1_std": -0.202864,
"nauc_map_at_1_diff1": -0.127,
"nauc_map_at_3_max": 0.385244,
"nauc_map_at_3_std": -0.080638,
"nauc_map_at_3_diff1": -0.060991,
"nauc_map_at_5_max": 0.294871,
"nauc_map_at_5_std": -0.119069,
"nauc_map_at_5_diff1": -0.06234,
"nauc_map_at_10_max": 0.285698,
"nauc_map_at_10_std": -0.132856,
"nauc_map_at_10_diff1": -0.055015,
"nauc_map_at_20_max": 0.236619,
"nauc_map_at_20_std": -0.100673,
"nauc_map_at_20_diff1": -0.002619,
"nauc_map_at_100_max": 0.15345,
"nauc_map_at_100_std": -0.138888,
"nauc_map_at_100_diff1": -0.02257,
"nauc_map_at_1000_max": 0.171402,
"nauc_map_at_1000_std": -0.134644,
"nauc_map_at_1000_diff1": -0.034477,
"nauc_recall_at_1_max": 0.33969,
"nauc_recall_at_1_std": -0.202864,
"nauc_recall_at_1_diff1": -0.127,
"nauc_recall_at_3_max": 0.375072,
"nauc_recall_at_3_std": -0.009643,
"nauc_recall_at_3_diff1": -0.089168,
"nauc_recall_at_5_max": 0.147691,
"nauc_recall_at_5_std": -0.128654,
"nauc_recall_at_5_diff1": -0.084259,
"nauc_recall_at_10_max": 0.141055,
"nauc_recall_at_10_std": -0.165932,
"nauc_recall_at_10_diff1": -0.060966,
"nauc_recall_at_20_max": 0.043863,
"nauc_recall_at_20_std": -0.028374,
"nauc_recall_at_20_diff1": 0.157575,
"nauc_recall_at_100_max": -0.157183,
"nauc_recall_at_100_std": -0.019437,
"nauc_recall_at_100_diff1": 0.013395,
# "nauc_recall_at_1000_max": nan,
# "nauc_recall_at_1000_std": nan,
# "nauc_recall_at_1000_diff1": nan,
"nauc_precision_at_1_max": 0.33969,
"nauc_precision_at_1_std": -0.202864,
"nauc_precision_at_1_diff1": -0.127,
"nauc_precision_at_3_max": 0.406318,
"nauc_precision_at_3_std": 0.007031,
"nauc_precision_at_3_diff1": -0.034709,
"nauc_precision_at_5_max": 0.178131,
"nauc_precision_at_5_std": -0.112493,
"nauc_precision_at_5_diff1": -0.045535,
"nauc_precision_at_10_max": 0.167897,
"nauc_precision_at_10_std": -0.150626,
"nauc_precision_at_10_diff1": -0.027811,
"nauc_precision_at_20_max": 0.081428,
"nauc_precision_at_20_std": -0.042304,
"nauc_precision_at_20_diff1": 0.17278,
"nauc_precision_at_100_max": -0.150619,
"nauc_precision_at_100_std": 0.016133,
"nauc_precision_at_100_diff1": -0.065571,
"nauc_precision_at_1000_max": -0.017244,
"nauc_precision_at_1000_std": 0.046614,
"nauc_precision_at_1000_diff1": -0.028258,
"nauc_mrr_at_1_max": 0.33969,
"nauc_mrr_at_1_std": -0.202864,
"nauc_mrr_at_1_diff1": -0.127,
"nauc_mrr_at_3_max": 0.409511,
"nauc_mrr_at_3_std": -0.064671,
"nauc_mrr_at_3_diff1": -0.01911,
"nauc_mrr_at_5_max": 0.319584,
"nauc_mrr_at_5_std": -0.103546,
"nauc_mrr_at_5_diff1": -0.025109,
"nauc_mrr_at_10_max": 0.309614,
"nauc_mrr_at_10_std": -0.117564,
"nauc_mrr_at_10_diff1": -0.019691,
"nauc_mrr_at_20_max": 0.262976,
"nauc_mrr_at_20_std": -0.092222,
"nauc_mrr_at_20_diff1": 0.024507,
"nauc_mrr_at_100_max": 0.256052,
"nauc_mrr_at_100_std": -0.094249,
"nauc_mrr_at_100_diff1": 0.012432,
"nauc_mrr_at_1000_max": 0.260112,
"nauc_mrr_at_1000_std": -0.098845,
"nauc_mrr_at_1000_diff1": 0.009697,
"main_score": 0.02837,
"hf_subset": "default",
"languages": ["dan-Latn"],
}
]
}
# check
with_fix == before_fix # True
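For a retrieval TaskResult like the one above, `main_score` is simply the task's main metric (here `ndcg_at_10`); a minimal sketch of that consistency check (abbreviated dict, values copied from above; the helper name is illustrative):

```python
# Minimal sketch: verify that the reported main_score of a retrieval
# TaskResult matches its main metric entry (dict abbreviated from above).
scores = {
    "ndcg_at_10": 0.02837,
    "main_score": 0.02837,
    "hf_subset": "default",
    "languages": ["dan-Latn"],
}

def main_score_matches(s: dict, metric: str = "ndcg_at_10") -> bool:
    """Check that main_score equals the task's declared main metric."""
    return abs(s["main_score"] - s[metric]) < 1e-9

print(main_score_matches(scores))  # True
```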
* restructure
* format
* relax pytrec versions
* fix incorrect parsing
* 1.38.44
Automatically generated by python-semantic-release
* Correcting the JINA models with SentenceTransformerWrapper (#3071)
* ci: Add stale workflow (#3066)
* add stale workflow
* add permissions
* add bug label to bug issue template
* revert bug issue and only look at more info needed issues
* more accurate name
* override default
* fix: open_clip package validation (#3073)
* 1.38.45
Automatically generated by python-semantic-release
* fix: Update revision for qzhou models (#3069)
* 1.38.46
Automatically generated by python-semantic-release
* Fix the reference link for CoDi-Embedding-V1 (#3075)
Fix reference link
* fix: Add beta version of RTEB related benchmarks (#3048)
* Add RTEB related benchmarks
* Add RTEB related benchmarks
* Correcting the task names in the RTEB benchmarks
* Update mteb/leaderboard/benchmark_selector.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Adding the CURE dataset to RTEB benchmarks
* Use the right language subset
* Fix broken finance icon URL in RTEB benchmarks
Replace broken libre-finance-dollar.svg with working libre-gui-price-tag.svg
Validated all icon URLs and confirmed accessibility compliance
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* 1.38.47
Automatically generated by python-semantic-release
* fix: run `ruff check` on all files during ci (#3086)
* fix: run `ruff check` on all files during ci
* format
* 1.38.48
Automatically generated by python-semantic-release
* Move dev to dependency groups (#3088)
add dependency groups
* fix: Improving validate_task_to_prompt_name logs and error messages (#3079)
* Improving validate_task_to_prompt_name logs and error messages
* linter fixes
* Adding None prompts tests
* Update test_benchmark_sentence_transformer
* Update mteb/leaderboard/benchmark_selector.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* fix: duplicate mteb multilingual variables (#3080)
* fix benchmark naming
* format
* lint
* Update tasks & benchmarks tables
* model: mdbr-leaf models (#3081)
* added MDBR leaf models
* fixed revision for mdbr-leaf-ir
* added model prompts
* updated training datasets
* fixed linting
* lotte task reference
---------
Co-authored-by: Robin Vujanic <robin.vujanic@mongodb.com>
* 1.38.49
Automatically generated by python-semantic-release
* CI: Set upper limit for xdist version (#3098)
* Commentout bibtex formatting
* Remove `-n auto`
* get back bibtex
* try limiting versions
* revert coverage
* revert coverage
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Combine Plots and Tables into a Single (#3047)
* feat - Combine Plots and Tables into a Single Tab #3009
* feat - Resize the plot to make it more readable
* feat - Remove the (radar chart)
* feat - Add a comment stating that it only shows the Top 5 models in the table.
* feat - adjust layout
* Update mteb/leaderboard/app.py
* format
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* fix: Updating the default batch size calculation in the voyage models (#3091)
* 1.38.50
Automatically generated by python-semantic-release
* fix: Add @classmethod for @field_validators in TaskMetadata (#3100)
* Align task prompt dict with `PromptType` (#3101)
* align task prompt dict with `PromptType`
* use value instead of enum
* 1.38.51
Automatically generated by python-semantic-release
* model: Add ModelMeta for OrdalieTech/Solon-embeddings-mini-beta-1.1 (#3090)
* Add ModelMeta for OrdalieTech/Solon-embeddings-mini-beta-1.1
* Add training_datasets (common_corpus, fineweb, wiki_fr, private LLM-synth)
* Format with ruff + add loader per review
* Apply ruff format/fixes
* Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Register OrdalieTech/Solon-embeddings-mini-beta-1.1 in overview (ModelMeta + loader)
* Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* fix import
* Add memory_usage_mb=808.0 and required fields to ModelMeta
* Fix parameter count: 210 million
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* fix: Allow closed datasets (#3059)
* - Added an include_private parameter to the get_tasks() function that defaults to False
- This ensures that by default, tests only run on public datasets
- Tests can explicitly set include_private=True when needed to test private datasets
- Added is_public: bool | None = None field to TaskMetadata
- The field is optional and defaults to None (treated as public)
- Updated the is_filled() method to exclude is_public from required fields
- Added documentation
* - Added an include_private parameter to the get_tasks() function that defaults to False
- This ensures that by default, tests only run on public datasets
- Tests can explicitly set include_private=True when needed to test private datasets
- Added is_public: bool | None = None field to TaskMetadata
- The field is optional and defaults to None (treated as public)
- Updated the is_filled() method to exclude is_public from required fields
- Added documentation
* Correcting due to comments
* Update mteb/abstasks/TaskMetadata.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/overview.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Removing the not used filter_tasks_by_privacy function
* Correcting due to comments
* Correcting due to comments
* Correcting due to comments
* Removing the test case
* Rename the include_private parameter to exclude_private
* Rename the include_private parameter to exclude_private
* Add private tasks tests
* Add private tasks tests
* Update tests/test_tasks/test_private_tasks.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Add private tasks tests
* Add private tasks tests
* Add private tasks tests
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
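The filtering behaviour described above can be sketched as follows (a minimal sketch assuming a simplified task-metadata shape, not the actual `mteb.get_tasks()` implementation): `is_public=None` is treated as public, and private tasks are dropped unless explicitly requested.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskMeta:
    name: str
    is_public: Optional[bool] = None  # None is treated as public

def get_tasks(tasks: list, exclude_private: bool = True) -> list:
    """Drop private tasks by default; is_public=None counts as public."""
    if not exclude_private:
        return tasks
    return [t for t in tasks if t.is_public is not False]

all_tasks = [TaskMeta("PublicTask"), TaskMeta("ClosedTask", is_public=False)]
print([t.name for t in get_tasks(all_tasks)])         # ['PublicTask']
print([t.name for t in get_tasks(all_tasks, False)])  # ['PublicTask', 'ClosedTask']
```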
* 1.38.52
Automatically generated by python-semantic-release
* Ci: test out GH models with welcoming new comers (#3112)
test out GH models with welcoming new comers
* ci: Dataset check on new PR (#3103)
* add dataset check on new PR
* add extract datasets
* run as module
* update startswith
* update workflow name
* add GitPython
* export var
* same shell session
* address review comments
* add to docs to say what this script does
* add docs
* model: add Youtu-Embedding-V1 (#3115)
* add youtu models
* add a blank line
* fix the optional dependencies and lint the code
* remove unused dependencies and reformat
* revise prompt_type
---------
Co-authored-by: springxchen <springxchen@tencent.com>
* fix: add voyage quantization models (#3092)
* Adding quantization support
* Update mteb/models/voyage_models.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/model_meta.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/model_meta.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Simplifying the quantization/output_dtype
* Update mteb/model_meta.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* 1.38.53
Automatically generated by python-semantic-release
* model: EmbeddingGemma 300M (#3129)
* model: EmbeddingGemma 300M
* Add license and revision
* fix: Add dedicated display for RTEB benchmark results (#3089)
* feat - remove special filtering, keep zero-shot, keep borda rank
* feat - remove get_rteb_benchmark.py
* feat - delete get_rteb_benchmark.py;RTEB_BENCHMARK_ENTRIES changes
* feat -format
* Update mteb/load_results/benchmark_results.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update tasks & benchmarks tables
* 1.38.54
Automatically generated by python-semantic-release
* dataset: Add Dapfam patent retrieval tasks (#2946)
* chore: add 'Patent retrieval' subtype to TaskMetadata
* feat(retrieval): add DAPFAM patent retrieval tasks (+18 variants)
* Dapfam patent retrieval PR #2946 : refactor DAPFAM tasks (explicit classes, license, metadata, custom definition explanation ...)
* Dapfam patent retrieval PR #2946 : refactor DAPFAM tasks (explicit classes, license, metadata, custom definition explanation ...)
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Changes:
- Added the option to opt in or out of quantization via the "quantize" argument.
- Added the option to compute the raw dot product without normalization (to reproduce the paper method, the "similarity" argument should be "cosine").
- Removed an unnecessary function and overhauled the task descriptions to be clearer.
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Changes made:
- Overhauled task descriptions and naming to conform with the naming scheme of mteb retrieval tasks.
- Similarity is now computed using the similarity function of the passed model.
- Changed the optional quantization method to conform with the sentence-transformers similarity function.
To reproduce the paper metrics, one can use the following snippet:
```python
import mteb
from sentence_transformers import SentenceTransformer

model_name = "Snowflake/snowflake-arctic-embed-m-v2.0"
model = (
    SentenceTransformer(
        model_name,
        model_kwargs={"torch_dtype": "float16"},
        trust_remote_code=True,
    )
    .cuda()
    .eval()
)
tasks = mteb.get_tasks(
    tasks=[
        "DAPFAMInTitlAbsToTitlAbsClmRetrieval",
        "DAPFAMAllTitlAbsToTitlAbsClmRetrieval",
        "DAPFAMOutTitlAbsToTitlAbsClmRetrieval",
        # add the other 3 remaining tasks ...
    ]
)
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(
    model,
    output_folder=f"mteb_res/{model_name}",
    quantize=True,  # if False or unset, the obtained ndcg@10 and map@10 will be ~0.001 higher
    encode_kwargs={"batch_size": 32},
)
```
* changed default value of quantization to false
* added the import to all DAPFAM tasks; tested that it works; verified compliance with the checklist
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* added revision numbers to all dataset loading operations as well as the metadata itself
* intermediate changes, refresh local branch
* intermediate changes, refresh local branch again
* scale back to standard evaluation with empty set exclusion, various cosmetic/formatting changes
* minor cosmetic/formatting changes
* fixed main metric to be ndcg_at_100 as in the paper
* removed old code artifacts from previous versions
* read appropriate loading arguments from task metadata, remove unecessary class attribute
* reformat bibtex (note: the assertion matches the literal string rather than the bibtex formatting, which is inconsistent with the arXiv default); fixed metadata; parameters read from task metadata; all tests passed
* refactor data loading to read from metadata class attributes
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update tasks & benchmarks tables
* Align max tokens (#3172)
* Correct the VoyageAI model's batch creation/batch size calculation (#3185)
Correct the batch creation
* dataset: Adding JapaneseCode1Retrieval as the first non-public dataset (#3168)
* Adding JapaneseCode1Retrieval as the first non-public dataset
* Transformed dataset
* Adding as private dataset to tests
* Correct the private task test
* Use the sample dataset as a reference
* Use the sample dataset as a reference
* fix ds loading
* allow on forks
* upd aciton
* remove paths
* try to trigger ci
* add ref
* add permissions
* remove paths
* add paths back
* get back to pull request
* rollback action
* Trying to resolve the token/secret problem
* Trying to resolve the token/secret problem
* Update dataset_loading_pr.yml
* Update dataset_loading_pr.yml
* Try the latest datasets package (worked for me)
* Try the latest datasets package (worked for me)
* Try the latest datasets package (worked for me)
* (last?) try
* (last?) try
* (last?) try
* Reverting the changes
* Exclude the private datasets from tests
* Apply suggestions from code review
---------
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Solomatin Roman <samoed.roman@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* fix: add version check for `embeddinggemma-300m` (#3189)
add version check
* dataset: Added a set of closed datasets (#3186)
* Add 12 more closed datasets
Extend the RTEB benchmarks
* trust_remote_code
* trust_remote_code
* Enabling JapaneseCode1Retrieval in the RTEB benchmarks
* Add closed datasets as private tasks
* Correct due to the comment
* Update tasks & benchmarks tables
* fix: Edit ack & sponsors (#3187)
* dataset: Update FaMTEB to Version 2 (#3157)
* Update benchmark to version 2
* make others in benchmark selector one line code
* small changes
* update a few tasks metadata
* update faintent license with correct form
* remove redundant trust remote codes
* fix hardnegatives revision
* make lint
* fix errors
* apply suggestions
* fix citation problem
* add PR link to benchmark desc
* remove duplicate dataset names in mcinext_models
* update prompts
---------
Co-authored-by: mehran <mehan.sarmadi16@gmail.com>
* Update tasks & benchmarks tables
* 1.38.55
Automatically generated by python-semantic-release
* fix: Add conflicting dependencies to toml (#3191)
fix conflict dependencies
* 1.38.56
Automatically generated by python-semantic-release
* fix: Correct metadata for ArguAna dataset (#3202)
* Update tasks & benchmarks tables
* 1.38.57
Automatically generated by python-semantic-release
* model: Add BMRetriever (#3195)
* model: Add BMRetriever
* Update mteb/models/bmretriever_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/bmretriever_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* fix: remove trust_remote_code option
* feat: implement BMREtrieverWrapper based on InstructSentenceTransformerWrapper
* refactor: update training datasets for bmretriever
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Revert "Ci: test out GH models with welcoming new comers" (#3206)
Revert "Ci: test out GH models with welcoming new comers (#3112)"
This reverts commit 73a35e0bb02e61108d50385f4c43fd7d1b16e984.
* model: Add Codefuse models (#3205)
* add codefuse models
* add codefuse models
* Update codefuse_models.py
* lint codefuse.py
* fix(models): ensure prompt_type is passed to format_instruction (#3216)
* 1.38.58
Automatically generated by python-semantic-release
* Adding Cohere's output_dimension and embedding_type parameter (#3204)
* Adding Cohere's output_dimension and embedding_type parameter
Cohere's embed-v4 binary and int8
* Correcting due to comments
* dataset: add swedish cpc patent classifications to mteb (#3072)
* feat: add swedish cpc patent classifications to mteb
* fix: formatting and init imports
* fix: update mteb task according to feedback
* fix: perform citation and code formatting
* fix: add train and test split for both datasets
* fix: AttributeError in ColPaliEngineWrapper similarity method (#3177)
* fix: delete kwargs for similarity score in ColPaliEngineWrapper for method behavior
* chore: fix colpali_models similarity handle device
* Update tasks & benchmarks tables
* 1.38.59
Automatically generated by python-semantic-release
* fix: prevent EOS token truncation (#3218)
* fix(models): prevent EOS token truncation for BMRetriever
* refactor(models): refactor tokenizer setup in `InstructSentenceTransformerWrapper`
* fix(models): correct eos token handling in `BMRetrieverWrapper`
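The truncation issue fixed above can be illustrated with a minimal sketch (hypothetical token ids, not the actual `InstructSentenceTransformerWrapper` code): when an input exceeds the model's maximum length, the final slot must be reserved for the EOS token, otherwise it is silently cut off.

```python
def truncate_with_eos(token_ids: list, max_len: int, eos_id: int) -> list:
    """Truncate to max_len while guaranteeing the sequence still ends with EOS."""
    if len(token_ids) <= max_len:
        # Append EOS if it is missing and there is room for it.
        if token_ids and token_ids[-1] != eos_id and len(token_ids) < max_len:
            return token_ids + [eos_id]
        return token_ids
    # Naive truncation (token_ids[:max_len]) would drop EOS;
    # reserve the last slot for it instead.
    return token_ids[: max_len - 1] + [eos_id]

ids = list(range(10)) + [2]          # 2 = hypothetical EOS id
print(truncate_with_eos(ids, 8, 2))  # [0, 1, 2, 3, 4, 5, 6, 2]
```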
* 1.38.60
Automatically generated by python-semantic-release
* Update giga embeddings (#3210)
* update giga embeddings
* update giga embeddings
* 3b-september-2025
* fixed
* lint
* Update mteb/models/ru_sentence_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* change revision due to flash-attn dependency
* change apply_instruction_to_passages
---------
Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>
* fix: Refactor split create_tables into static Benchmark methods (#3126)
* feat - Split create_tables into static Benchmark methods
* feat - format
* Update mteb/leaderboard/table.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* feat - remove search query; take benchmark result as input; address the circular import
* feat - format
* Update mteb/benchmarks/benchmark.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/benchmarks/benchmark.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* feat - use to_dataframe; clean table.py; move create_table
* feat - fix circular import
* feat - clean-up
* feat - format
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* 1.38.61
Automatically generated by python-semantic-release
* Extending the RTEB benchmark (#3223)
Adding another voyageai model
* Update tasks & benchmarks tables
* model: New qzmodel (#3211)
* Update qzhou_models.py
* Update qzhou_models.py
* reformat script code
* Update configuration
* According to our new decision, the model name has been changed to "QZhou-Embedding-Zh".
* Fix variable naming issues.
* model: Update Youtu embedding model (#3227)
* add youtu models
* add a blank line
* fix the optional dependencies and lint the code
* remove unused dependencies and reformat
* revise prompt_type
* update youtu_models
---------
Co-authored-by: springxchen <springxchen@tencent.com>
* dataset: Add Software Issue Localization Datasets (#3178)
* add software issue localization datasets
* add software issue localization datasets
* update and add multilingual datasets
* fix citation format issues
* Update mteb/tasks/Reranking/eng/SWEbenchVerifiedReranking.py
* fix linting issues
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update tasks & benchmarks tables
* feat: Officially include RTEB in the leaderboard (#3222)
* feat - adjust Rteb's Benchmark
* feat - add blank
* fix menu names
* Update mteb/leaderboard/benchmark_selector.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* moving around tasks
* fix: Update RTEB summary columns (#3226)
* feat - filter_by_privacy
* feat - add new fields for rteb part
* feat - getattr
* feat - adjust privacy filter logic
* feat - enhance summary table column renaming and add 'is_public' field mapping
* fix: remove unused 'is_public' attribute from TaskResult
---------
Co-authored-by: Yongbin Choi <whybe.choi@gmail.com>
Co-authored-by: semantic-release <semantic-release>
Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com>
Co-authored-by: Atheer <atheer2104@protonmail.com>
Co-authored-by: Yong woo Song <ywsong.dev@kakao.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Egor <31567312+ekolodin@users.noreply.github.com>
Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: smile <smile@pinai.io>
Co-authored-by: ethan <smiletoye@gmail.com>
* removed show_rteb args
* avoid defining function where we can just use the metadata
* minor fixes
* minor fixes
* fix: Correct logic for filtering public tasks in ModelResult class (#3230)
Co-authored-by: ethan <smiletoye@gmail.com>
---------
Co-authored-by: q275343119 <275343119@qq.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: 笑尿伊人 <44760272+q275343119@users.noreply.github.com>
Co-authored-by: Yongbin Choi <whybe.choi@gmail.com>
Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com>
Co-authored-by: Atheer <atheer2104@protonmail.com>
Co-authored-by: Yong woo Song <ywsong.dev@kakao.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Egor <31567312+ekolodin@users.noreply.github.com>
Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>
Co-authored-by: smile <smile@pinai.io>
Co-authored-by: ethan <smiletoye@gmail.com>
* Update tasks & benchmarks tables
* 1.39.0
Automatically generated by python-semantic-release
* fix: Add submission references for RTEB (#3233)
* fix: Add rteb submission references and improve descriptions.
* Added evaluation request
* added field for tasks
* 1.39.1
Automatically generated by python-semantic-release
* dataset: add human tasks and benchmark (#3214)
* Human Subsets Tasks
* Fixed Multilingual Classification Subset
* linting
* fix citations format
* make lint
* fix tests
* remove human folder
* fix relative imports
* add adapted_from for all human subsets
* fix pydantic errors
* add benchmark object
* make benchmark discoverable
* bibtex test
* Apply suggestion
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Apply suggestions from code review
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* rename & reupload
* upd tests
* upd tests again
* add model
* add benchmark to leaderboard
* change branch of leaderboard
* remove branch of load data
* fix model meta path
* make mteb importable
* update repo
* Update mteb/benchmarks/benchmarks/benchmarks.py
* Update mteb/leaderboard/benchmark_selector.py
* Update mteb/load_results/load_results.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
---------
Co-authored-by: Adnan El Assadi <aassadi22@ku.edu.tr>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: AdnanElAssadi56 <115242814+AdnanElAssadi56@users.noreply.github.com>
* Update tasks & benchmarks tables
* Remove 'HUME(v1)' from leaderboard benchmark (#3236)
* Remove 'HUME(v1)' from leaderboard benchmark
* lint
* docs: Update adding benchmark documentation (#3229)
* update adding_a_benchmark.md documentation
* fix numbers
* fix: Further specified macro-language code for Norwegian (#3228)
* fix: Further specified macro-language code for Norwegian
"nor" is a macro-language code that covers bokmål and nynorsk (both norwegian), but this means that these datasets will be missed if using "nob" or "nno". Specifying it like this should allow this.
* further specified macro language "nor"
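The idea behind this fix can be sketched as a macro-language expansion step applied during language filtering (a minimal sketch; the mapping and helper shown are illustrative, the real metadata lives in the task definitions):

```python
# Minimal sketch: let tasks tagged with a macro-language code ("nor")
# match filters for its individual member languages ("nob", "nno").
MACRO_LANGUAGES = {"nor": {"nob", "nno"}}  # illustrative mapping

def matches_language(task_langs: set, requested: str) -> bool:
    """True if the task covers the requested language, directly or via a macro code."""
    if requested in task_langs:
        return True
    return any(
        requested in members and macro in task_langs
        for macro, members in MACRO_LANGUAGES.items()
    )

print(matches_language({"nor"}, "nob"))  # True
print(matches_language({"nor"}, "swe"))  # False
```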
* Update tasks & benchmarks tables
* 1.39.2
Automatically generated by python-semantic-release
* fix max tokens (#3243)
* fix python39 transformers compatibility (#3254)
* fix python39 transformers
* fix
* Aggregate by subset for HUMEv1 (#3255)
aggregate by subset for HUMEv1
* Update tasks & benchmarks tables
* Fix AbsTaskTextRegression task (#3257)
Fix AbsTaskTextRegression
* Added Japanese to Retrieval (#3252)
* feat - add Japanese
* feat - use mteb.get_benchmark
* fix - 3.9 test error
* Revert "fix - 3.9 test error"
This reverts commit 6bfee53cff48304cc22d8248aa275dcc9e385475.
* fix - 3.9 test error
* Update tasks & benchmarks tables
* fix bm25 on small datasets (#3261)
* fix: Move zero-shot percentage calculation to the end of summary (#3231)
* Refactor: Move zero-shot percentage calculation to the end of summary table creation which only apply to RTEB table.
* Update RTEB benchmark name from "RTEB(beta)" to "RTEB" for consistency in display.
* feat - RTEB(beta)
* feat - remove Zero-shot
---------
Co-authored-by: ethan <smiletoye@gmail.com>
* model: Add ReasonIR (#3221)
* model: Add ReasonIR
* Update mteb/models/reasonir_model.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/reasonir_model.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* update n_parameters of ReasonIR
Co-authored-by: Niklas <n.muennighoff@gmail.com>
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Niklas <n.muennighoff@gmail.com>
* fix: Only pin model name and rank (#3263)
Currently we pin 3 columns, which makes the table hard or impossible to view on phones. The 3rd column is also no longer guaranteed, as the RTEB leaderboard does not use the zero-shot column
* 1.39.3
Automatically generated by python-semantic-release
* fix: resolve flash-attention dependency issue (#3265)
* fix: Only pin model name and rank
Currently we pin 3 columns, which makes the table hard or impossible to view on phones. The 3rd column is also no longer guaranteed, as the RTEB leaderboard does not use the zero-shot column
* fix: resolve flash-attention dependency issue
This has been tested and works. Resolves the flash-attention dependency issues.
Fixes #3240
* 1.39.4
Automatically generated by python-semantic-release
* fix: Add retry and token counting in Cohere models (#3253)
* Retry and token counting in Cohere models
* Retry and token counting in Cohere models
* Retry and token counting in Cohere models
---------
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
* 1.39.5
Automatically generated by python-semantic-release
* Align MIEB leaderboards with paper (#3272)
* sort by mean task type and use pure rank for MIEB LBs
* lint
* rename task type column for readability
* fix: add prompt for MIRACLRetrievalHardNegatives (#3266)
* add prompt for MIRACLRetrievalHardNegatives
* add `MIRACLRetrievalHardNegatives.v2`
* Update mteb/tasks/Retrieval/multilingual/MIRACLRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* move common metadata to dict
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update tasks & benchmarks tables
* Add Regression task mock (#3271)
* 1.39.6
Automatically generated by python-semantic-release
* fix: Change language for task SlovakMovieReviewSentimentClassification (#3296)
* Update tasks & benchmarks tables
* 1.39.7
Automatically generated by python-semantic-release
* Add english code retriever model (#3302)
* Add en code retriever model
* fix model_name
* Update mteb/models/en_code_retriever.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* correct lint
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* docs: fix typos in `docs/adding_a_benchmark.md` (#3344)
* BREAKING: v2.0.0 (#1433)
* [v2] Merge…
andrejridzik added a commit to slovak-nlp/mteb that referenced this pull request on Dec 15, 2025
* [v2] Merge `MIEB` into v2 (#1973)
* Update tasks table
* 1.31.6
Automatically generated by python-semantic-release
* Update tasks table
* fix: remove SummaryRetrieval as a type (#1915)
* Update tasks table
* fix: revert rename and add to description (#1918)
* Update tasks table
* docs: Add sort to domains for task metadata (#1922)
Tests currently go into an infinite loop. This should prevent that.
* Update tasks table
* 1.31.7
Automatically generated by python-semantic-release
* docs: Updated citation for mteb(scandinavian) (#1914)
fix: Updated citation for mteb(scandinavian)
* fix: Add datasets in CodeRAG-Bench (#1595)
* add three out of four datasets in CodeRAG-Bench
* add verified CodeRAGStackoverflowPostsRetrieval dataset
* clean up code and make some comments
* fixed lint errors
* addressed comments about code-rag datasets: fixed grammar and removed unnecessary code and loops
* roll back files which were not supposed to change
* fixed the comments in split_by_first_newline() and made the methods private by adding an underscore prefix
* refactor to use common args
* update task descriptions
* add entry in benchmarks
* correct the alphanumeric order for the dataset
* add in tasks.md
* add in tasks.md
* update task metadata
* update importing path
* fix lint errors
* correct CodeRAG task metadata description field and id for stackoverflow-posts
* fix error in test
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Update tasks table
* 1.31.8
Automatically generated by python-semantic-release
* Leaderboard: Acks (#1930)
Add acks
* misc: add warning for save_suffix removal from AbsTask (#1940)
add warning for param removal
* misc: add bgev1 models (#1928)
* add bgev1 models
* add bge-*-en
* fix naming
* Updated links in MTEB(eng) and eng,classic (#1948)
* feat: add beir (#1933)
add beir
* 1.32.0
Automatically generated by python-semantic-release
* Fixed join_revisions if results are empty (#1949)
* feat: Merge MIEB into main 🎉 (#1944)
* mieb ZeroshotClassification
* mieb docs
* mieb implementation demo
* model meta; abstask column names; linear probe clf
* model meta; abstask column names; linear probe clf
* fix: update naming as candidate_labels
* Update README.md
* Update README.md
* i2tretrieval
* test load data ignore i2tretrieval
* [MIEB] Add image clustering (#1088)
* make lint
* wip
* add TinyImageNet and run
* type hints
* add accuracy
* lint
* remove unused & fix typos
* T2I Retrieval
* Any2AnyRetrieval
* fix tests from merge
* [MIEB] Add image text pair classification and tests (#1099)
* add ImageTextPairClassification abstask and evaluator
* dataset transform into sequence of images for each sample
* fix processing logic; list of list images compatibility
* lint and docstrings
* make lint
* fix failing tests in TaskMetadata
* add tests for mieb
* skip gated repo
---------
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
* [MIEB] Add image classification and zero shot classification tasks (#1101)
* fix task metadata
* use overrideable column names
* add CIFAR datasets
* add caltech101 dataset
* add FGVC aircraft dataset
* add food 101 dataset
* add OxfordPets dataset
* remove comments
* correct cifar100 path
* update cifar100 classification results
* cifar zero shot results
* add caltech101 zero shot
* matching CLIP paper implementation
* add aircraft and food zero shot
* add oxford pets zero shot
* [MIEB] Add CIFAR clustering (#1104)
add CIFAR clustering
* [MIEB] Add more image classification and zero shot classification datasets (#1103)
* update category to i2t
* add MNIST linear probe and zero shot
* add FER2013 linear probe and zero shot
* add stanford cars linear probe and zero shot
* add birdsnap linear probe and zero shot
* add eurosat linear probe and zero shot
* lint
* correct eurosat zero shot labels
* add abstask for image multilabel and voc2007
* make lint
* [MIEB] Add more image classification and zero shot datasets (#1105)
* add STL10 linear probe and zero shot
* add RESISC45 linear probe and zero shot
* add Describable textures linear probe and zero shot
* fix spacing lint
* add SUN397 linear probe and zero shot
* correct SUN397 zero shot captions
* add baai bge vista
* add e5-v
* linting
* memory issues for image linear probe & zeroshot
* kknn linear probe arguments
* del comments
* Add some classification and ZeroShot classification tasks (#1107)
* Add Country211 classification task
* Add imagenet1k classification task
* Add UCF101 classification task
* Add PatchCamelyon Classification task
* Add GTSRB classification task
* Add GSTRB Zero Shot Classification
* Add country211 zero shot classification
* Add results for classification tasks
* Add zero shot classification tasks
* Add PatchCamelyon tasks and results
* Add linting
* Add results and fix prompts for zero shot
* Add results
* Add results and linting
* fix dependency & clip mock test
* [MIEB] Add jina clip (#1120)
* add jina clip and mscoco i2t and t2i results
* make lint
* [MIEB] Update `mieb` with the `main` branch and some fixes (#1126)
* fix instruction retrieval (#1072)
* fix instruction retrieval
* fix test
* add points
* make nested results
* add test
* skip instruction test
* fix instruction passes
* fix unions
* move do_length_ablation
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
---------
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Update points table
* fix: fix bug-causing spelling error in function name of e5-mistral-instruct (#1106)
found bug
* 1.12.85
Automatically generated by python-semantic-release
* fix: MultilingualSentimentClassification (#1109)
* Update points table
* fix: Avoid spaces in dataset name for CQADupstack and ignore speed tasks
* 1.12.86
Automatically generated by python-semantic-release
* fix: Ensure that MLSUMClusteringP2P.v2 use the fast implementation as was intended (#1112)
* fix: Ensure that MLSUMClusteringP2P.v2 use the fast implementation as was intended
* fix: fixed formatting for cli
* docs: improve searchability in the advanced usage documentation
* 1.12.87
Automatically generated by python-semantic-release
* docs: improve searchability in the advanced usage documentation (#1113)
* docs: improve searchability in the advanced usage documentation
* docs: update based on corrections
* fix: export type for `mteb create_meta` (#1114)
* fix export type
* fix dataset version too
* 1.12.88
Automatically generated by python-semantic-release
* fix: Simplify models implementations (#1085)
* Merge
* Adapt
* Simplify
* Check for rev again
* Rmv cmmnt
* Simplify
* simplify
* Rmv comment
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Use logging; change try except; add info
* Lint
* Rmv results
* Update rev
* format
* Simplify models; Allow instructions
* Jobs
* Fix merge
* Format
* Adapt models
* fix: ensure that e5 ignores the NQ
* format
---------
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* 1.12.89
Automatically generated by python-semantic-release
* fix: nomic models using prefix correctly (#1125)
* fix: nomic models using prefix correctly
* chore: remove comment
* fix: handling in case not torch tensor
* Fix typo
---------
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
* 1.12.90
Automatically generated by python-semantic-release
* refactor vista model wrapper to contain lib import
* python 38 type hints
---------
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: anpalmak2003 <73543260+anpalmak2003@users.noreply.github.com>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
Co-authored-by: Zach Nussbaum <zanussbaum@gmail.com>
Co-authored-by: chenghao xiao <85804993+gowitheflow-1998@users.noreply.github.com>
* image memory issues for all retrieval AbsTasks
* Add CLEVR and SciMMIR Image-Text Understanding tasks (#1127)
* Add CLEVR and SciMMIR
* Update metadata
* remove useless comment
* Add linting
* fix typo and tests
* Add CLEVR count task
* add linting
* add fashion200k & fashionIQ test passed
* clip text max seq truncation
* add WebQA, NIGHTS, OVEN
* any2any retrieval chunk encoding
* add nomic vision model; any2any topk bug
* add cv recall
* add InfoSeek; VisualNews
* [MIEB] Add Stanford Cars i2i Retrieval (#1147)
* wip
* add results
* make lint
* change back the order
* [MIEB] Add CUB200 i2i retrieval (#1154)
* add cub200 and results
* add skip_first_result
* skipped self and rerun results
* consolidate i2t and t2i to any2any
* remove abstask and evaluators
* remove references from test
* add TU-Berlin sketch retrieval
* XM3600; XFlickr30kCO; multilingual
* wit multilingual retrieval t2i
* correct multilingual t2i meta
* meta
* add dinov2 model; 4 sizes
* cls evaluator channel bug fix
* add ALIGN model
* add FORBI2IRetrieval
* forb & tuberlin new revision
* disable tokenization parallelism
* add hateful meme retrieval i2tt2i
* add memotion retrieval t2ii2t
* add SciMMIR Retrieval i2tt2i
* ruff update
* Visual STS Abstask&evaluator
* add visual STS17
* add visual STS 12-16
* [mieb] Add blip and blip2 models, and ImageNetDog15Clustering task (#1226)
* wip: start adding BLIP models
* add other blip variants
* wip: add blip2_models.py
* make lint
* wip: implement blip2 wrapper
* feat: add blip2 models, still mismatched names
* fix: remove projections from image and text embeddings
* make lint
* wip: add coco BLIP2
* fix: BLIP2 better zero-shot classification without text_proj and vision_proj
* tidy blip2
* add imagenet-dog-15 dataset
* tidy and lint
* remove unused import
* add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator
* add imagenet-10 clustering task
* add results for clip on ImageNet10Clustering and ImageNetDog15Clustering
* [mieb] add 3 compositionality evaluation tasks (#1229)
* linting & update unavailable dataset path
* add aro visual relation&attribution; sugarcrepe
* correct reference
* add SOPI2IRetrieval dataset/task (#1232)
* add SOPI2IRetrieval
* add SOPI2IRetrieval results for clip 32
* add results for clip vit 32/SOPI2IRetrieval
* resolve conflict
* change reference
* Image text pair cls (#1233)
* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets
* fix meta data
* fix validate points
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Add RP2kI2IRetrieval and METI2IRetrieval (#1239)
* add RP2kI2IRetrieval dataset
* add RP2kI2IRetrieval results with clip-vit-base-patch32
* update image retrieval __init__.py
* add RP2kI2IRetrieval and METI2IRetrieval
* add METI2IRetrieval
* add SOP results
* make lint
* new revision for METI2IRetrieval
* make lint
* reset corpus chunk size
* remove wrong classification import
* add Flickr30k T2I and I2T
* add Flickr30k T2I retrieval
* reduced-size MET revision
* fix: add Flickr30k T2I
* make lint
* [MIEB] Adding DataComp CLIP models (#1283)
* adding data comp CLIP models
* update model and caltech101 results
* make lint
* [mieb] Any2TextMultipleChoice Abstask&Evaluator & four tasks in CV-bench (#1287)
* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets
* fix meta data
* fix validate points
* CV-Bench
* evaluator args comment
* fix
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* [mieb] adding 10 tasks (#1290)
* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets
* add vidore benchmark 10 tasks
* fix reference
* fix old metadata
* fix meta
* [mieb] Adding MOCOv3 models (#1293)
* add moco models first try
* add as a timm model
* add large model results
* make lint
* [mieb] Add more Any2AnyRetrieval datasets (#1285)
* add two landmark datasets and results
* add Sketchy i2i retrieval
* add task metadata
* add BLINKIT2IRetrieval dataset
* add BLINKIT2TRetrieval
* add ImageCoDeT2IRetrieval
* make lint
* add vizwiz retrieval and results
* fix vizwiz duplicate texts
* add new vizwiz results
* add VQA2 results
* add GLD v2 I2T retrieval
* add gld v2 i2i retrieval
* make lint
* remove GLDv2I2IRetrieval
* [mieb] Add any2any multiple choice evaluator and abstask (and one task) (#1301)
* add AbsTaskAny2AnyMultiChoice
* make lint
* remove GLDv2I2IRetrieval
* exclude AbsTaskAny2AnyMultiChoice from test_load_data
* [mieb] Fix FORB dataset (#1306)
* correct format
* update results
* add more results
* add more results
* [mieb] run tasks fix (#1302)
* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets
* fix e5v&vista
* task type fix for running tasks
* fix wrong meta
* run mieb script
* script
* lint
* align
* [mieb] split RParisI2IRetrieval and ROxfordI2IRetrieval into easy, medium and hard versions (#1305)
* remove duplicate corpus entries from BLINKIT2TRetrieval dataset
* update BLINKIT2T metadata
* split ROxford, RParis into easy, medium and hard
* make lint
---------
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
* [mieb] run tasks small fix (#1310)
* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets
* fix e5v&vista
* task type fix for running tasks
* fix wrong meta
* run mieb script
* script
* lint
* align
* fix
* linting
* [mieb] Add VLM2vec (#1323)
* wip vlm2vec model
* making i2t classification work with Caltech101
* test vlm2vec on other task types
* move peft into class
* feat: Merge main into MIEB (#1329)
* fix: OpenAI BadRequestError by limiting input dimensions to 2048 elem… (#1203)
* fix: OpenAI BadRequestError by limiting input dimensions to 2048 elements (#1201)
Fix OpenAI BadRequestError by limiting input dimensions to 2048 elements
- Ensure the 'sentences' list passed to OpenAI API does not exceed 2048 elements
- Reference: OpenAI's Embedding API documentation on input limits
Co-authored-by: Ali Shiraee <ShiraeA@basfad.basf.net>
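The fix above caps the number of inputs sent per OpenAI embeddings request at 2048 elements. A minimal pure-Python sketch of that batching idea, with a hypothetical `embed_batch` callable standing in for the actual API client:

```python
def embed_in_chunks(sentences, embed_batch, max_inputs=2048):
    """Split `sentences` into chunks of at most `max_inputs` elements,
    embed each chunk separately, and concatenate the results so the
    caller still gets one embedding per input sentence."""
    embeddings = []
    for start in range(0, len(sentences), max_inputs):
        chunk = sentences[start:start + max_inputs]
        embeddings.extend(embed_batch(chunk))
    return embeddings
```

With 5000 sentences this issues three requests (2048, 2048, 904 inputs), which keeps each call under the documented input limit.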
* fix ruff formatting
* Added minor test fixes to ensure reproducibility across systems
* Ensure that tmp.json is not created within repo when running tests
* format
* fixes path issues
* Rerun CI
---------
Co-authored-by: HSILA <a.shiraee@gmail.com>
Co-authored-by: Ali Shiraee <ShiraeA@basfad.basf.net>
* fix: Ensure STS pearson and spearman does not use the p-value only the correlation (#1207)
Fixes #1206
* 1.14.16
Automatically generated by python-semantic-release
* fix: Normalize licenses including casing, uses of "-" etc.
* fix: Normalize licenses including casing, uses of "-" etc. (#1210)
* fix: Normalize licenses including casing, uses of "-" etc.
* fix tests
* 1.14.17
Automatically generated by python-semantic-release
* fix: Normalize benchmarks to only include task objects and added getter for benchmarks (#1208)
* Normalize benchmarks to only include tasks
- Force benchmarks to only include tasks. This fixes a few bugs where benchmarks can reference a task which is not implemented
- implements `mteb.get_benchmark`, which makes it easier to fetch benchmarks
- Added tests + updated docs
A few outstanding issues:
I would like `mteb.MTEB(benchmark)` to always reproduce the benchmark. Currently this is not possible, as MTEB(eng) requires the split to be specified. A solution is to allow `eval_splits` to be specified when initializing a task and then pass it on to `load_data()`. This way we can write the following:
`mteb.get_tasks(tasks=[...], eval_splits=["test"], ...)`
I would also love the aggregation to be a part of the benchmark (such that it is clear how it should be aggregated). This is especially relevant for MTEB(eng), as it averages the CQAD datasets before creating the global average. This way we can also create a result object for the benchmark itself. A complementary solution for this is to allow nested benchmarks.
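The CQAD averaging mentioned above is a two-level aggregation: sub-dataset scores are collapsed into one mean first, and that mean then enters the global average as a single entry. A small illustration with hypothetical score values (not the actual mteb aggregation code):

```python
def benchmark_average(scores):
    """Average task scores, collapsing any nested group (e.g. the CQADupstack
    sub-datasets) into a single mean before taking the global average."""
    flat = []
    for value in scores.values():
        if isinstance(value, dict):  # nested group: average its members first
            value = sum(value.values()) / len(value)
        flat.append(value)
    return sum(flat) / len(flat)

scores = {
    "TaskA": 0.6,
    "TaskB": 0.8,
    "CQADupstack": {"android": 0.4, "gaming": 0.6},  # collapsed to 0.5 first
}
```

Here the nested group contributes 0.5 once, rather than its two sub-scores contributing separately, so the global mean is (0.6 + 0.8 + 0.5) / 3.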
* fix error in tests
* format
* Added corrections based on review
* added example and formatted
* 1.14.18
Automatically generated by python-semantic-release
* docs: Fix broken links in docs (#1212)
* Added fixes for broken links in adding_a_dataset and adding_a_model docs.
* Updated link name
* Mismatch of the category of AmazonPolarityClassification (#1220)
Fixes #1219
* Update tasks table
* fix: Ensure that results are returned even when hitting cache (#1215)
Fixes #1122
* 1.14.19
Automatically generated by python-semantic-release
* fix: Allow benchmark to specify eval_splits (#1217)
* fix: Allow benchmark to specify eval_splits
This PR allow for benchmarks to specify specific eval. splits. This allow us to fully specify a benchmark within the benchmark object.
To do this it add the following:
- added eval_splits to the Abstask object, which default to metadata.eval_splits
- use the task.eval_splits unless overwritten in mteb.MTEB.run
- added eval_splits arg to mteb.get_tasks, which filter the tasks based on splits
- updated documentation
- renamed "Advanced Usage" to "Usage Documentation" to make it more accessible
- added tests where relevant
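The split filtering described above can be sketched as follows, using hypothetical `Task` records; the real `mteb.get_tasks` signature may differ:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    eval_splits: list  # splits the task provides, e.g. ["dev", "test"]

def get_tasks(tasks, eval_splits=None):
    """Keep only tasks that offer at least one of the requested splits;
    with no filter, return every task unchanged."""
    if eval_splits is None:
        return list(tasks)
    wanted = set(eval_splits)
    return [t for t in tasks if wanted & set(t.eval_splits)]

catalog = [Task("NQ", ["test"]), Task("MIRACL", ["dev"])]
```

A call like `get_tasks(catalog, eval_splits=["test"])` then drops tasks that have no matching split, mirroring the filtering behaviour the bullet list describes.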
* Added correction based on feedback
* 1.14.20
Automatically generated by python-semantic-release
* Update points table
* Update points table
* docs: clarify adding a model (#1222)
* fix: Add RepLLaMA style models (#1223)
* init commit
* working and reproducing
* lint
* update hashes
* warning
* add pyproject
* Update points table
* 1.14.21
Automatically generated by python-semantic-release
* docs: Update points (#1228)
* Fix case
* Fix casing
* Fix case
* Fix case
* Create 971.jsonl
* Update contrib
* Add contributors
* Update points table
* docs: Add MTEB(code) dataset (#1237)
* docs: Add MTEB(code) dataset
* Fix linting
* Update points table
* Update of my affiliation (#1242)
Update points.md
* Add contributor (#1243)
* fix: @mrshu's name in `points.md` (#1246)
* Use the diacritic character to be in line with Slovak spelling.
Signed-off-by: mr.Shu <mr@shu.io>
* docs: Create benchmarks overview table (#1245)
* fix get_benchmarks method
* add create benchmark script
* make lint
* 1.14.22
Automatically generated by python-semantic-release
* docs: Update affiliation (#1247)
Update points.md
* Added author-information
* Add final author list
* Update points table
* docs: Added coordination point for Jimmy Lee (#1253)
docs: Added coordination point for Jimmy Lee for his work on the coordination of Crystina and Nandan
* Update points table
* fix: Add multilingual Benchmark (#1252)
* fix: Add multilingual bench
* Update mteb/benchmarks/benchmarks.py
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
* format
---------
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
* 1.14.23
Automatically generated by python-semantic-release
* docs: Small point changes & more contributors (#1254)
* Update points.md
* Fix format
* Fix attribution
* Update points table
* fix: Downsample large retrieval datasets (#1236)
* most tasks
* lint
* fix other issues
* refactor
* lint and docs
* add polish
* keep case sensitive mteb paths
* add potential points
* fix points
* fix test about metadata
* update tasks and stats
* lint
* Update points table
* Update tasks table
* 1.14.24
Automatically generated by python-semantic-release
* fix: Get meta from CrossEncoder (#1255)
* remove indent after return
* handle cross encoders for model meta
* make lint
* update filename since we now have model name
* 1.14.25
Automatically generated by python-semantic-release
* fix: Add listing all available benchmarks CLI option (#1256)
* add benchmarks.md in README
* add cli option
* add benchmark cli test case
* correct typo
* 1.14.26
Automatically generated by python-semantic-release
* docs: Update affiliation (#1248)
* Update points.md
* Update points.md
---------
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* docs: Update mteb(eng) calculation (#1258)
* Update mteb(eng) calculation
* Fixed citations
* Update MTEB(eng) + MTEB(multilingual)
* feat: leverage SentenceTransformers' query/passage specific prompts (#1221)
* feat: leverage SentenceTransformer models' query/passage specific prompts
* refactor: remove E5Wrapper
fix: wrong e5 revisions
* fix: default prompt_type to None
* fix: e4ce987 revision no longer exists for multilingual-e5-small on the Hub
* fix: keep `prompt_name` in kwargs when model doesn't have a `prompts` attr
* feat: use Enum for `prompt_type`
* docs: specify how to use prompts with Sentence Transformers
* feat: readd arctic models due to metadata
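The prompt-handling commits above (#1221) introduce an Enum for `prompt_type` and keep `prompt_name` working when a model defines no prompts. A minimal sketch of what such an enum and lookup might look like — `PromptType` and `resolve_prompt_name` are illustrative names, not mteb's actual API:

```python
from enum import Enum


class PromptType(str, Enum):
    # Subclassing str lets the enum compare equal to plain strings,
    # so call sites that pass "query"/"passage" keep working.
    query = "query"
    passage = "passage"


def resolve_prompt_name(prompts, prompt_type):
    """Return the prompt key if the model defines one for this prompt type."""
    if prompt_type is not None and prompt_type.value in prompts:
        return prompt_type.value
    return None
```

Because `PromptType` subclasses `str`, it can be passed anywhere a plain string prompt name was previously accepted.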
* 1.15.0
Automatically generated by python-semantic-release
* fix: Add Touche2020v3 and JMTEB (#1262)
* add datasets
* fix metrics
* add Touche2020v3
* fix metadata
* Apply suggestions from code review
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* upd name and suppress
* add benchmark class
---------
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Update tasks table
* 1.15.1
Automatically generated by python-semantic-release
* fix: Select benchmarks CLI option (#1261)
* add test case for a list of Benchmarks
* add selecting benchmarks CLI option
* typos
* use a separate attribute for benchmarks
* try fixing tests
* should accept string as well
* revert filename change
* use Benchmark and avoid circular import
* fix: derive `results_directory` path from `results_repo` name (#1275)
fix: don't hardcode repo name when downloading results
* 1.15.2
Automatically generated by python-semantic-release
* fix: sorting benchmark tasks by MTEB, then alphabetical (#1271)
* sorted
* fixed formatting
* efficiency changes
* fix test
* make lint
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* 1.15.3
Automatically generated by python-semantic-release
* ci: Removed 3.8 dependency (#1281)
Changes include:
- remove 3.8 from tests (added 3.11 and 3.12)
- changed other CI to 3.9
- updated lint rules to use 3.8
* Update points table
* fix: Allow Numpy >=2.0 (#1264)
Allow Numpy >=2.0
* 1.15.4
Automatically generated by python-semantic-release
* docs: points for paper writing (#1286)
* Create 1004.jsonl
* Create 1006.jsonl
* Update docs/mmteb/points/1004.jsonl
* Update docs/mmteb/points/1006.jsonl
---------
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Update points table
* Update points table
* Update points table
* docs: Fix a link in the README (#1289)
* Fix a link in the README
And fix some typos.
* Update README.md
* Update points table
* fix: Update benchmarks (#1288)
* make benchmark var name uppercase
* update touche to v3
* add MIRACLRetrievalHardNegatives to multilingual
* add mteb(indic)
* add eu benchmark
* 1.15.5
Automatically generated by python-semantic-release
* fix: Allow numpy<2.0.0 (#1291)
* 1.15.6
Automatically generated by python-semantic-release
* fix: Add metadata dict to QBQTC in C-MTEB (#1292)
* fix QBQTC in C-MTEB
* make lint
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* 1.15.7
Automatically generated by python-semantic-release
* fix: Remove non-existent eval split of CMNLI (#1294)
fix eval_splits of CMNLI
* 1.15.8
Automatically generated by python-semantic-release
* Leaderboard (#1235)
* Add leaderboard dev
* Renamed MTEBResults to TaskResult
* Moved model and model meta loading utilities into overview.py
* Added get_model_metas to retrieve filtered metadata for models
* Restructured results object and made it into a class instead of a dict
* Added utilities for filtering models on BenchmarkResults objects
* Added to_table utility function to BenchmarkResults
* Added serialization utilities to BenchmarkResults
* Attempted fixing tests
* Added get_model_metas to __init__
* Added get_benchmarks to __init__ and made it return all benchmarks by default
* Added get_benchmarks to __init__
* Made tasks hashable
* Added task filtering based on task objects on BenchmarkResults
* Added BenchmarkResults to __init__
* Added additional arguments to get_scores on two classes
* Made get_scores smarter on BenchmarkResult
* Added basic multilingual benchmark
* Modified benchmark to be able to easily access results
* Added useful properties and filtering functions to BenchmarkResults
* Added minimal functioning example
* Added smarter table, task-list updating and tried fixing dropdown scrolling
* Made restrict_results into a private function
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Removed old leaderboard scripts
* Hardcoded max and min model size
* Removed redundant utils file
* Ran linting
* added leaderboard dependencies as optional
* Fixed union type error on Python 3.9
* Removed references to Dict in task aggregation
* Fixed name errors in _restrict_task_results
* Fixed _restrict_task_results
* Made hf_subsets={'default'} when the task is monolingual in _restric_task_results
* Task dropdown now gets filtered based on the other criteria
* Ran linting again
* Introduced hotfix for reranking test
* Added BenchmarkResults to __all__ in __init__
* Fixed validate_and_filter_scores method, and replaced _restric_task_results with it
---------
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* feat: Use prompts instead of encode_corpus and encode_queries (#1278)
* add prompt per task type
* fix prompt
* upd test
* lint
* fix test
* fix DeprecatedSummarizationEvaluator
* fix prompts
* add test
* lint
* logger info
* use task type only in model_encode
* lint
* update interface
* add prompt types to docs
* fix test
* mock tasks
* mock task registry
* remove last task_type
* fix tests
* lint
* fix test
* fix
* use wrapper and new prompts
* fix tests
* lint
* fix test
* remove conftest
* validate task to prompt_name
* override model prompts
* task to prompt name optional
* fix tests
* fix models
* remove task_to_prompt_name
* remove from mteb __init__
* update docs
* load existing model prompts if model_prompts is None
* fix
* lint
* change wrapper loader
* add wrapper class
* lint
* add wrapper file
* update logging
* upd logging
* refactor reranking
* lint
* remove prints
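PR #1278 replaces `encode_corpus`/`encode_queries` with a `model_prompts` mapping resolved from task name, task type, and prompt type. A hedged sketch of that most-specific-first fallback order, using a hypothetical `get_prompt_name` helper (not mteb's exact implementation):

```python
def get_prompt_name(model_prompts, task_name, task_type, prompt_type=None):
    """Resolve the most specific prompt available, falling back gradually:
    task+prompt_type > task > task_type+prompt_type > task_type > prompt_type."""
    candidates = []
    if prompt_type:
        candidates.append(f"{task_name}-{prompt_type}")
    candidates.append(task_name)
    if prompt_type:
        candidates.append(f"{task_type}-{prompt_type}")
    candidates.append(task_type)
    if prompt_type:
        candidates.append(prompt_type)
    for key in candidates:
        if key in model_prompts:
            return key
    return None
```

The gradual fallback lets a model ship a single generic `"query"` prompt while still allowing per-task overrides.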
* 1.16.0
Automatically generated by python-semantic-release
* fix: Add Retrieval SK Quad dataset for Slovak search evaluation (#1276)
* Add Retrieval SK Quad dataset for Slovak search evaluation
This commit introduces the Retrieval SK Quad dataset, designed to assess Slovak search performance. Derived from SK-QuAD, it pairs questions with their best answers as categorized after annotation, providing a substantial resource for Slovak-language search evaluation.
* Add Retrieval SK Quad dataset for Slovak search evaluation 2
Added the requested changes on the SKQuadRetrieval.py file
* add task to init
* add missing task metadata
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Update tasks table
* 1.16.1
Automatically generated by python-semantic-release
* fix: Add Slovak Hate Speech and Offensive Language Dataset (#1274)
* Add Slovak Hate Speech and Offensive Language
Dataset
This commit introduces the Slovak Hate Speech and Offensive Language Database to MTEB. The dataset includes posts from a social network, annotated by humans for hate speech and offensive content. Additionally, the corresponding task has been added to the tasks.md table to reflect this update.
* Add Slovak Hate Speech and Offensive Language Dataset
- Updated __init__.py to include the new SlovakHateSpeechClassification task.
- Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability.
* Did requested changes:
- Updated __init__.py to include the new SlovakHateSpeechClassification task.
- Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability.
* resolve linting issues by running `make lint`
* Update tasks table
* WIP: Leaderboard UI improvements (#1312)
* Fixed typos in task_results
* Fixed typos in task_results
* Added Tailwind, reorganized layout and fixed scrolling
* Ran linting
* 1.16.2
Automatically generated by python-semantic-release
* fix: remove duplicate multilingual
* 1.16.3
Automatically generated by python-semantic-release
* fix: Re-upload dataset to hub to avoid using script upload (#1322)
* fix dataset upload
* add linting
* Update tasks table
* 1.16.4
Automatically generated by python-semantic-release
* fix: Add implementations of common reranker models (#1309)
* init
* revert
* revert
* add metadata
* lint
* add reqs
* change to float16
* benchmark lint fix
* 1.16.5
Automatically generated by python-semantic-release
* Add multilingual mFollowIR dataset (#1308)
* add mFollowIR
* paper name
* edit warning->info
* convert to parquet
* lint
* Update tasks table
* Cache the embeddings when requested (#1307)
* add caching
* update test to use close
* change from json to pkl
* fix for Windows
* cleanup on Windows again
* infer dimension
* move cachewrapper
* add wrapper
* fix
* updates
* fix tests
* fix lint
* lint
* add test
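The embedding-cache PR (#1307) stores embeddings on disk (pickle rather than JSON) so repeated texts are encoded once. A self-contained sketch of the idea — `CachedEncoder` is an illustrative wrapper, not the actual mteb class:

```python
import hashlib
import pickle
from pathlib import Path


class CachedEncoder:
    """Wrap any encoder and memoize embeddings on disk as pickle files.

    `model` only needs an `encode(texts) -> list` method; the cache key is
    a hash of the text, so identical inputs are embedded exactly once.
    """

    def __init__(self, model, cache_dir):
        self.model = model
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.pkl"

    def encode(self, texts):
        embeddings = []
        for text in texts:
            path = self._path(text)
            if path.exists():
                embeddings.append(pickle.loads(path.read_bytes()))
            else:
                vec = self.model.encode([text])[0]
                path.write_bytes(pickle.dumps(vec))
                embeddings.append(vec)
        return embeddings
```

Hashing the text (rather than its list index) makes the cache stable across runs that shuffle or subset the corpus.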
* WIP: Leaderboard UI improvements (#1320)
* Fixed typos in task_results
* Fixed typos in task_results
* Added Tailwind, reorganized layout and fixed scrolling
* Ran linting
* Removed faux benchmark
* Updated layout
* Changed table number format
* Table highlights highest values by making them bold
* Added rank to table, removed organization from model_name
* Added mean rank to table
* Ran linting
* feat: Update metadata for all models (#1316)
* Added model meta
* format
* fixed metadata
* Metadata update for voyage models
* Update mteb/models/cohere_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/cohere_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Added corrections from review
* fix spelling error
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* resolved bugs from pytest --collect-only
* Avoid wrapping all models with the SentenceTransformerWrapper
* Added normalize_embeddings_to_numpy to ensure standard embeddings during evaluations
* fixed moved on correction from @Samoed
* conditionally set .predict method on SentenceTransformerWrapper
---------
Signed-off-by: mr.Shu <mr@shu.io>
Co-authored-by: HSILA <a.shiraee@gmail.com>
Co-authored-by: Ali Shiraee <ShiraeA@basfad.basf.net>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Thomas van Dongen <thomas123@live.nl>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
Co-authored-by: Orion Weller <31665361+orionw@users.noreply.github.com>
Co-authored-by: John Yang <byjohnyang@gmail.com>
Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com>
Co-authored-by: Marek Šuppa <mrshu@users.noreply.github.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Xa9aX ツ <mishradiganta91@gmail.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Daniel Buades Marcos <daniel@buad.es>
Co-authored-by: Daniel Buades Marcos <daniel.buades@clinia.com>
Co-authored-by: Sathvik Nallamalli <sathviknallamalli@gmail.com>
Co-authored-by: Michael Graczyk <michael@mgraczyk.com>
Co-authored-by: Mariya Hendriksen <35101262+mariyahendriksen@users.noreply.github.com>
Co-authored-by: Santiago Castro <bryant1410@gmail.com>
Co-authored-by: Joey Xia <77958037+ZiyiXia@users.noreply.github.com>
Co-authored-by: Márton Kardos <power.up1163@gmail.com>
Co-authored-by: Oliver <oliver.pejic@students.fhnw.ch>
* [mieb] Add OpenCLIP models (#1335)
* add open clip models
* Update __init__.py
* lint
* fix model overview
* update jina clip
---------
Co-authored-by: chenghao xiao <85804993+gowitheflow-1998@users.noreply.github.com>
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
Co-authored-by: gowitheflow-1998 <chenghao.xiao@durham.ac.uk>
* [mieb] new version with downsampled train split to 32 per class (#1327)
* new version with downsampled train split to 32 per class
* force load truncated image file
* make lint
* add open clip models
* Update __init__.py
* lint
* fix model overview
* fix ImageCLS undersample; run birdsnap
* make lint
* make lint
---------
Co-authored-by: chenghao xiao <85804993+gowitheflow-1998@users.noreply.github.com>
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
Co-authored-by: gowitheflow-1998 <chenghao.xiao@durham.ac.uk>
* [mieb] Fix Jina CLIP (#1349)
fix jina clip v1
* fix: Add clevr license (#1356)
* Add BLINK as multi-choice tasks (#1348)
* wip: start adding BLIP models
* add other blip variants
* wip: add blip2_models.py
* make lint
* wip: implement blip2 wrapper
* feat: add blip2 models, still mismatched names
* fix: remove projections from image and text embeddings
* make lint
* wip: add coco BLIP2
* fix: BLIP2 better zero-shot classification without text_proj and vision_proj
* tidy blip2
* add imagenet-dog-15 dataset
* tidy and lint
* remove unused import
* add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator
* add imagenet-10 clustering task
* add SOPI2IRetrieval
* add results for clip on ImageNet10Clustering and ImageNetDog15Clustering
* add SOPI2IRetrieval results for clip 32
* add results for clip vit 32/SOPI2IRetrieval
* resolve conflict
* add RP2kI2IRetrieval dataset
* add RP2kI2IRetrieval results with clip-vit-base-patch32
* update image retrieval __init__.py
* fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets
* add RP2kI2IRetrieval and METI2IRetrieval
* add METI2IRetrieval
* add SOP results
* make lint
* new revision for METI2IRetrieval
* make lint
* reset corpus chunk size
* remove wrong classification import
* add Flickr30k T2I and I2T
* add Flickr30k T2I retrieval
* reduced-size MET revision
* fix: add Flickr30k T2I
* make lint
* add two landmark datasets and results
* add Sketchy i2i retrieval
* add task metadata
* add BLINKIT2IRetrieval dataset
* add BLINKIT2TRetrieval
* add ImageCoDeT2IRetrieval
* make lint
* add vizwiz retrieval and results
* fix vizwiz duplicate texts
* add new vizwiz results
* add VQA2 results
* add GLD v2 I2T retrieval
* add gld v2 i2i retrieval
* make lint
* add AbsTaskAny2AnyMultiChoice
* make lint
* remove GLDv2I2IRetrieval
* exclude AbsTaskAny2AnyMultiChoice from test_load_data
* fix e5v&vista
* remove duplicate corpus entries from BLINKIT2TRetrieval dataset
* task type fix for running tasks
* update BLINKIT2T metadata
* fix wrong meta
* run mieb script
* split ROxford, RParis into easy, medium and hard
* make lint
* add BLINK as multi choice tasks
* fix: license metadata in wrong format
---------
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
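Several commits above add `cluster_accuracy`, ARI, and NMI to `Image.ClusteringEvaluator`. Cluster accuracy requires matching predicted cluster ids to true labels; a brute-force sketch for a handful of clusters (real implementations use the Hungarian algorithm, and `cluster_accuracy` here is an illustrative name):

```python
from itertools import permutations


def cluster_accuracy(labels_true, labels_pred):
    """Best accuracy over all one-to-one mappings of predicted cluster ids
    to true labels. Only practical for small numbers of clusters, and it
    assumes there are at least as many true labels as predicted clusters."""
    true_ids = sorted(set(labels_true))
    pred_ids = sorted(set(labels_pred))
    best = 0.0
    for perm in permutations(true_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        correct = sum(mapping[p] == t for p, t in zip(labels_pred, labels_true))
        best = max(best, correct / len(labels_true))
    return best
```

The permutation search makes the metric invariant to how clusters happen to be numbered, which raw accuracy is not.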
* [mieb] add Eva CLIP models (#1369)
* add Eva CLIP models
* make lint
* [mieb] add siglip, cohere multimodal & some fixes for final run (#1357)
* fix dataset type error
* fix clustering metrics
* add siglip & cohere
* update mieb run script
* cohere-v import
* fix
* api key name
* [mieb] fixes for final run (#1374)
* e5_v device arg
* dataloader num_workers
* vista doc
* vista doc
* run mieb
* fix
* Update run_vista.md
* [mieb] Fix torch no grad (#1378)
Fix torch no grad
* [mieb] Fix vlm2vec (#1380)
* fix vlm2vec return dtype
* make lint
* [mieb] Remove null entries from corpus of ROxford, RParis (#1371)
* remove null examples from corpus of ROxford and RParis
---------
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
* [mieb] fixes (#1390)
* Fix torch no grad
* simplify
* make lint
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* [MIEB] Remove non-existent method for blip (#1394)
remove non-existent method for blip
* [mieb] fix ALIGN; update Winoground revision id; update run script (#1391)
* fix align & winoground
* lint
* Convert task category to i2i for tasks that only call image encode
* update categories should include img cls, clustering, and multi label clf
* no op
* no op
* make lint
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* [mieb] Fix open clip for cv bench count (#1397)
fix shape mismatch
* [mieb] Update subtasks of BLINKIT2TMultiChoice and BLINKIT2IMultiChoice (#1403)
* fix: add/remove subtasks from BLINKIT2IMultiChoice and BLINKIT2TMultiChoice
* update blink metadata
* add updated BLINK results
---------
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
* [mieb] Fix EVA CLIP for CV Bench (#1414)
* unsqueeze after preprocess
* make lint
* [mieb] Add calculate probs for vlm2vec (#1418)
* add method
* make lint
* [mieb] Fix siglip bug & add retrieval datasets (#1424)
* fix siglip
* add edis&gld-v2 i2i
* results
* siglip updated results
* fix siglip non-dataloader tasks
* [mieb] use Logistic Regression classifier for AbsTaskImageMultilabelClassification (#1420)
* use moc-lr classifier
* set n_experiments=5
* run dinov2 and some laion models
* add dinov2-giant results
* [mieb] mieb scripts (siglip rerun & linear probing ablation & params count) (#1429)
* mieb scripts
* lint
* [MIEB] Change Flickr30k to test split (#1449)
* merge upstream mieb
* change Flickr30k to test split
* change flickr to test split
---------
Co-authored-by: gowitheflow-1998 <jsbs54@durham.ac.uk>
* [mieb] Fix VLM2vec dtype (#1462)
* propagate dtype
* fix fuse embeddings using list of PIL images
* [mieb] run script for missing results (#1472)
* task type fix
* scripts
* [mieb] Fix Moco model on CIFAR10Clustering (#1487)
Fix Moco model on CIFAR10Clustering
* [mieb] Fix Flickr30k I2T and T2I (#1505)
* remake flickr30k it2 and t2i
* add openai clip vit-b32 b16 and jina-clip results
* make lint
* [MIEB] add missing siglip models (#1533)
* add updates
* lint errors
* fix typo (#1535)
* add updates
* lint errors
* fix small typo
* [mieb] Fix numbers of CIRR, Fashion200k, FashionIQ, Flickr30k, MSCOCO data statistics (#1544)
fix numbers
* Discussing a standard for ImageEncoders
* Add Voyage's multimodal embedding (#1555)
* add voyage multimodal & ran 17 tasks
* lint
* typo
* clean
* [mieb] update script for final re-run (#1576)
* mieb final runs
* lint
* fix: no longer using same query text for all of BLINKIT2TMultiChoice (#1572)
* fix: no longer using same query text for all of BLINKIT2TMultiChoice
* fix: remove blink subtask
* fix: remove subtask from blink it2i
* fix: align BLINK retrieval to multi choice
* add ROxford and RParis I2I multi choice
* add retrieval metrics to multi choice evaluator
* fix: remove wrong negatives from revisiting multichoice datasets
* fix revisiting datasets
* add new results for revisiting multichoice
* [MIEB] Make multimodal models compatible to `task_name` and `prompt_type` (#1583)
* 1. Make `get_xxx_embeddings` follow `encode`.
2. `ImageDataset.transform` could be `None`.
* Apply suggestions from code review
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Fix arguments
* Try to fix tests
---------
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* fix image encoder (#1596)
* format
* fixed tests
* lint
* [mieb] voyage-v: add exponential backoff and other error handling (#1610)
* add voyage multimodal & ran 17 tasks
* lint
* typo
* clean
* exponential backoff tmp
* downsize large images for voyage api call
* voyage error handling
* lint
* add more results
* make tenacity optional
* lint
* log
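The voyage-v PR (#1610) adds exponential backoff and error handling around API calls (optionally via tenacity). A dependency-free sketch of the same retry pattern — `with_backoff` and its injectable `sleep` hook are illustrative, not the PR's actual code:

```python
import random
import time


def with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `fn`, retrying on exceptions with jittered exponential delays.

    The `sleep` hook is injectable so tests can record delays instead of
    actually waiting.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # delay doubles each attempt; jitter avoids thundering herds
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

Libraries like tenacity package this pattern as a decorator, but the control flow is the same.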
* [MIEB] Fix `get_fused_embeddings` (#1612)
* Fix fused
* fix vlm2vec
* Fix lint
* [MIEB] Add new multimodal retrieval tasks (#1611)
* Add new tasks
* Fix score type
* [MIEB] Switch to ViDoRe BEIR version (#1607)
* Fix ViDoRe corpus
* fix lint
* ViDoRe beir version
* Extend MIEB test coverage (#1629)
* add one task from each image AbsTask to test grid
* add visual sts to test grid
* [mieb] Task filtering by modality supported by models (#1633)
* fix function signature for moco loader
* filter out tasks by model modalities
* correct conditions
* add model meta to relevant models
* use modalities instead and separate out constants
* [MIEB] Fix VISTA model (#1638)
Fix vista
* Warn (#1639)
* [mieb] model task modalities matching logic (#1640)
fixing task & model modalities matching logic
* [mieb] Use mock abstask classes (#1648)
* rename to downsampled_dataset_transform
* add mock tasks for mieb
* wip getting to 57%
* make lint
* update mock classes to improve coverage
* omit mock tasks from some tests
* [MIEB] Add code for GME models (#1635)
* Add GME
* Fix infoseek prompts
* Merge instructions
* fix: add version check e5-v in mieb (#1723)
* add version check for e5v model
* Update e5_v.py
* make lint
* fix: change comparison to bigger than (#1743)
change comparison to bigger than
* docs: Rework MIEB docs (#1802)
* combine mieb docs and move to main docs folder
* make flow more coherent
* tidy up
* skip AfriSentiLID for now #1785
* fix typo: exclude MIEB mock tests
* update vista doc
* Apply suggestions from code review
---------
Co-authored-by: Isaac Chung <isaac.chung@team.wrike.com>
* [mieb] Remove results-mieb folder (#1815)
remove results-mieb folder
* [mieb] fixing lrap computation for multi-label classification (#1834)
multi-label cls lrap computation fix
* [mieb] Merge from main (#1853)
* Update tasks table
* 1.19.0
Automatically generated by python-semantic-release
* fix: Add the_ugly_duckling.txt for speedtask to Python wheel (#1402)
Add the_ugly_duckling.txt for speedtask to Python wheel
* 1.19.1
Automatically generated by python-semantic-release
* fix: Added the necessary trust_remote_code (#1406)
* 1.19.2
Automatically generated by python-semantic-release
* docs: Update recommendation for pushing results (#1401)
fix: Update recommendation for pushing results
* docs: Fix a typo in README (#1430)
Fix typo in readme
* fix: add logging for RetrievalEvaluator NaN values for similarity scores (#1398)
Fixes #1389
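PR #1398 adds logging when the RetrievalEvaluator produces NaN similarity scores. A sketch of the kind of guard involved — `check_similarity_scores` is a hypothetical helper, not the evaluator's actual API:

```python
import logging
import math

logger = logging.getLogger(__name__)


def check_similarity_scores(scores, query_id):
    """Replace NaN similarity scores with 0.0 and log how many were found."""
    nan_count = sum(1 for s in scores if math.isnan(s))
    if nan_count:
        logger.warning(
            "Query %s produced %d NaN similarity scores; replacing with 0.0",
            query_id,
            nan_count,
        )
    return [0.0 if math.isnan(s) else s for s in scores]
```

Replacing NaN with 0.0 keeps ranking metrics computable while the warning surfaces the underlying model problem.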
* 1.19.3
Automatically generated by python-semantic-release
* fix: make samples_per_label a task attribute (#1419)
make samples_per_label a task attr
* fix: Add Korean AutoRAGRetrieval (#1388)
* feat: add AutoRAG Korean embedding retrieval benchmark
* fix: run linters (ruff format + ruff check --fix; all checks passed)
* fix: add metadata for AutoRAGRetrieval
* change link for markers_bm
* add AutoRAGRetrieval to init.py and update metadata
* add precise metadata
* update metadata: description and license
* delete descriptive_stats in AutoRAGRetrieval.py and run calculate_matadata_metrics.py
* fix: Add missing benchmarks in benchmarks.py (#1431)
Fixes #1423
* Update tasks table
* 1.19.4
Automatically generated by python-semantic-release
* Leaderboard 2.0: added performance x n_parameters plot + more benchmark info (#1437)
* Added elementary speed/performance plot
* Refactored table formatting code
* Bumped Gradio version
* Added more general info to benchmark description markdown block
* Adjusted margin an range on plot
* Made hover information easier to read on plot
* Made range scaling dynamic in plot
* Moved citation next to benchmark description
* Made titles in benchmark info bold
* Leaderboard: Fixed code benchmarks (#1441)
* fixed code benchmarks
* fix: Made n_parameters formatting smarter and more robust
* fix: changed jina-embeddings-v3 number of parameters from 572K to 572M
* fix: Fixed use_instuctions typo in model overview
* fix: Fixed sentence-transformer compatibility switch
* Ran linting
* Added all languages, tasks, types and domains to options
* Removed resetting options when a new benchmark is selected
* All results now get displayed, but models that haven't been run on everything get nan values in the table
* fix: Count unique texts, data leaks in calculate metrics (#1438)
* add more stat
* add more stat
* update statistics
* fix: update task metadata to allow for null (#1448)
* Update tasks table
* 1.19.5
Automatically generated by python-semantic-release
* Fix: Made data parsing in the leaderboard figure more robust (#1450)
Bugfixes with data parsing in main figure
* Fixed task loading (#1451)
* Fixed task result loading from disk
* Fixed task result loading from disk
* fix: publish (#1452)
* 1.19.6
Automatically generated by python-semantic-release
* fix: Fix load external results with `None` mteb_version (#1453)
* fix
* lint
* 1.19.7
Automatically generated by python-semantic-release
* WIP: Polishing up leaderboard UI (#1461)
* fix: Removed column wrapping on the table, so that it remains readable
* Added disclaimer to figure
* fix: Added links to task info table, switched out license with metric
* fix: loading pre 1.11.0 (#1460)
* small fix
* fix: fix
* 1.19.8
Automatically generated by python-semantic-release
* fix: swap touche2020 to maintain compatibility (#1469)
swap touche2020 for parity
* 1.19.9
Automatically generated by python-semantic-release
* docs: Add sum per language for task counts (#1468)
* add sum per lang
* add sort by sum option
* make lint
* fix: pinned datasets to <3.0.0 (#1470)
* 1.19.10
Automatically generated by python-semantic-release
* feat: add CUREv1 retrieval dataset (#1459)
* feat: add CUREv1 dataset
---------
Co-authored-by: nadshe <nadia.sheikh@clinia.com>
Co-authored-by: olivierr42 <olivier.rousseau@clinia.com>
Co-authored-by: Daniel Buades Marcos <daniel@buad.es>
* feat: add missing domains to medical tasks
* feat: modify benchmark tasks
* chore: benchmark naming
---------
Co-authored-by: nadshe <nadia.sheikh@clinia.com>
Co-authored-by: olivierr42 <olivier.rousseau@clinia.com>
* Update tasks table
* 1.20.0
Automatically generated by python-semantic-release
* fix: check if `model` attr of model exists (#1499)
* check if model attr of model exists
* lint
* Fix retrieval evaluator
* 1.20.1
Automatically generated by python-semantic-release
* fix: Leaderboard demo data loading (#1507)
* Made get_scores error tolerant
* Added join_revisions, made get_scores failsafe
* Fixed metadata fetching for HF models
* Added failsafe metadata fetching to leaderboard code
* Added revision joining to leaderboard app
* fix
* Only show models that have metadata, when filter_models is called
* Ran linting
* 1.20.2
Automatically generated by python-semantic-release
* fix: leaderboard only shows models that have ModelMeta (#1508)
Filtering for models that have metadata
* 1.20.3
Automatically generated by python-semantic-release
* fix: align readme with current mteb (#1493)
* align readme with current mteb
* align with mieb branch
* fix test
* 1.20.4
Automatically generated by python-semantic-release
* docs: Add lang family mapping and map to task table (#1486)
* add lang family mapping and map to task table
* make lint
* add back some unclassified lang codes
* Update tasks table
* fix: Ensure that models match the names on embedding-benchmarks/results (#1519)
* 1.20.5
Automatically generated by python-semantic-release
* fix: Adding missing metadata on models and matching names up with the results repo (#1528)
* Added Voyage 3 models
* Added correct metadata to Cohere models and matched names with the results repo
* 1.20.6
Automatically generated by python-semantic-release
* feat: Evaluate missing splits (#1525)
* fix: evaluate missing splits (#1268)
* implement partial evaluation for missing splits
* lint
* requested changes done from scratch
* test for missing split evaluation added
* uncomment test
* lint
* avoid circular import
* use TaskResult
* skip tests for now
---------
Co-authored-by: Isaac Chung…
`mteb.get_model(model_name, revision)` and `mteb.get_model_meta(model_name, revision)`