dataset: add ChemRxivRetrieval task to ChemTEB benchmark (#3923)
Samoed merged 5 commits into embeddings-benchmark:main
Conversation
| "PubChemWikiPairClassification", | ||
| "ChemNQRetrieval", | ||
| "ChemHotpotQARetrieval", | ||
| "ChemRxivRetrieval", |
We don't modify existing benchmarks. You can create a new version of ChemTEB instead (e.g. ChemTEB(v1.1)).
@Samoed Thanks for your input. I understand, but can we make it so that running this benchmark triggers the latest version during evaluation?
It's better to create a new version, because adding a new task makes results a bit less reproducible
@Samoed I got that point. What I mean is, when we run this:
`mteb run -m <model_name> -b "ChemTEB"`
it runs the latest version, and for the old version we simply use ChemTEB(v1). I was thinking about changing the old one's name to ChemTEB(v1) and the new one's alias to ChemTEB. Does that work?
An alias would work, but currently your benchmark doesn't have ChemTEB as its only name, so I'm not sure. WDYT @KennethEnevoldsen?
Hmm, this would break backward compatibility. We can make it such that there is a deprecation warning if you run ChemTEB(v1), but to avoid breaking existing code, we can't redirect ChemTEB to the latest version, as this would change the expected behaviour.
I would suggest something like ChemTEB-latest, so that people know it can change after they set it up. I think that could be a good suggestion (but that would be a separate issue).
@KennethEnevoldsen That makes sense. I updated the PR with a new commit that adds a new benchmark named ChemTEB(v1.1) with the alias ChemTEB-latest. I also added the alias ChemTEB(v1) to the old benchmark without changing its name (ChemTEB), for backward compatibility. I think it should be okay now. I didn't quite follow your point about a separate issue, though; what should I open a new one for?
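The deprecation-warning and floating-alias scheme discussed above could be sketched as follows (a pure-Python illustration; `BENCHMARK_ALIASES` and `resolve_benchmark` are hypothetical names, not mteb's actual API):

```python
import warnings

# Hypothetical alias table: "ChemTEB" keeps pointing at v1 for backward
# compatibility, while "ChemTEB-latest" is a floating alias that users
# opt into knowing it may change when new versions land.
BENCHMARK_ALIASES = {
    "ChemTEB(v1)": "ChemTEB",
    "ChemTEB-latest": "ChemTEB(v1.1)",
}
DEPRECATED_ALIASES = {"ChemTEB(v1)"}


def resolve_benchmark(name: str) -> str:
    """Resolve a benchmark name or alias, warning on deprecated aliases."""
    if name in DEPRECATED_ALIASES:
        warnings.warn(f"{name} is deprecated", DeprecationWarning, stacklevel=2)
    return BENCHMARK_ALIASES.get(name, name)
```

With this layout, `resolve_benchmark("ChemTEB")` still returns the v1 benchmark unchanged (no silent redirect), while `ChemTEB-latest` tracks the newest version.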
Commits in this PR:
* dataset: add ChemRxivRetrieval task to ChemTEB benchmark
* fix: add descriptive statistics
* feat: add ChemTEB v1.1 with ChemRxivRetrieval task
* fix: chemteb v1.1 alias

(The remainder of the merged commit log consisted of unrelated upstream commits brought in by syncing the branch with main.)
This PR introduces the ChemRxivRetrieval dataset, a high-quality, domain-specific retrieval task designed to complement and expand the existing ChemTEB benchmark introduced in #1708.
As one of the authors of ChemTEB, I am proposing this addition to address a specific gap in our current retrieval evaluation. While the existing retrieval tasks in ChemTEB (based on chemical subsets of NQ and HotpotQA) are valuable, they are relatively small and primarily encyclopedic in nature. ChemRxivRetrieval provides a more comprehensive and technically demanding evaluation that reflects real-world article retrieval and question-answering workflows in the chemical sciences.
Dataset Overview
This addition does not replace previous tasks but acts as a critical extension for users requiring evaluation on authentic, literature-based chemical data.
https://arxiv.org/abs/2508.01643
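For readers unfamiliar with MTEB retrieval tasks, the data layout is a corpus, a set of queries, and relevance judgements (qrels). A toy illustration of that layout (the documents and query below are invented, not actual ChemRxiv data):

```python
# Toy illustration of the corpus/queries/qrels layout used by
# MTEB-style retrieval tasks; contents are invented examples.
corpus = {
    "doc1": "We report a one-pot synthesis of graphene oxide nanosheets ...",
    "doc2": "Kinetic analysis of enzyme-catalysed ester hydrolysis ...",
}
queries = {"q1": "synthesis routes for graphene oxide"}

# qrels map each query id to {doc_id: relevance score}
qrels = {"q1": {"doc1": 1}}


def relevant_doc_ids(query_id: str) -> set[str]:
    """Return the ids of documents judged relevant for a query."""
    return {d for d, score in qrels.get(query_id, {}).items() if score > 0}
```

An evaluator then embeds queries and documents, ranks the corpus for each query, and scores the ranking against the qrels, with nDCG@10 as the usual main score for MTEB retrieval tasks.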
Checklist:
- I have tested that the dataset runs with the `mteb` package.
- I have run the task with the `mteb run -m {model_name} -t {task_name}` command. Results here: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, intfloat/multilingual-e5-small.