refactor: split BRIGHT benchmark into individual subset tasks #3285

Samoed merged 24 commits into embeddings-benchmark:main
Force-pushed from 4240bdb to 826990a
KennethEnevoldsen left a comment
Hmm, this change will invalidate all previous results on BRIGHT.
You know that you can also simply subselect from a task using:
task = mteb.get_task("BrightRetrieval", eval_splits=..., hf_subsets=...)
For the leaderboard display it is even possible to create custom summary tables (see e.g. #3272)
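For example, a runnable sketch of that subselection (the split and subset names below are assumptions based on the BRIGHT dataset layout, not confirmed in this thread):

```python
import mteb

# Evaluate only the biology domain of the existing BrightRetrieval task.
# "standard" and "biology" are assumed split/subset names.
task = mteb.get_task(
    "BrightRetrieval",
    eval_splits=["standard"],
    hf_subsets=["biology"],
)
evaluation = mteb.MTEB(tasks=[task])
```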
Yes, but
Ohh... Yeah, that is hard to fix. I see that the original BRIGHT (long) only has four models and BRIGHT only has 12, so I guess it is possible to rerun them.
If the scores change, are the new scores more similar to or more different from the official scores? If closer, then I think it is fine, and maybe we can rerun some models. I think that for many models on our BRIGHT leaderboard I just converted the scores from https://brightbenchmark.github.io/ to MTEB format when we originally added them, so they may still be fine if these changes actually bring our implementation closer to that one.
Would it be enough to evaluate the performance of ReasonIR, or is there a list of other models that would be good enough to test?
To check the implementation this will be enough; just don't update the old leaderboard.
Force-pushed from 826990a to 3ed620f
Force-pushed from 3ed620f to 57c757f
After split:

```python
import torch

import mteb

# Prompts from
# https://github.com/facebookresearch/ReasonIR/tree/main/evaluation/bright/configs/reasonir
prompts_dict = {
    "BrightBiologyRetrieval": "Given a Biology post, retrieve relevant passages that help answer the post",
    "BrightEarthScienceRetrieval": "Given a Earth Science post, retrieve relevant passages that help answer the post",
    "BrightEconomicsRetrieval": "Given a Economics post, retrieve relevant passages that help answer the post",
    "BrightPsychologyRetrieval": "Given a Psychology post, retrieve relevant passages that help answer the post",
    "BrightRoboticsRetrieval": "Given a Robotics post, retrieve relevant passages that help answer the post",
    "BrightStackoverflowRetrieval": "Given a Stackoverflow post, retrieve relevant passages that help answer the post",
    "BrightSustainableLivingRetrieval": "Given a Sustainable Living post, retrieve relevant passages that help answer the post",
    "BrightPonyRetrieval": "Given a Pony question, retrieve relevant passages that help answer the question",
    "BrightLeetcodeRetrieval": "Given a coding problem, retrieve relevant examples that help answer the problem",
    "BrightAopsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
    "BrightTheoremQATheoremsRetrieval": "Given a Math problem, retrieve relevant theorems that help answer the problem",
    "BrightTheoremQAQuestionsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
}
tasks = mteb.get_tasks(tasks=list(prompts_dict.keys()), languages=["eng"])
evaluation = mteb.MTEB(tasks=tasks)
model = mteb.get_model(
    "ReasonIR/ReasonIR-8B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    prompts_dict=prompts_dict,
)
evaluation.run(
    model,
    save_predictions=True,
    output_folder="evaluation/results",
    encode_kwargs={"batch_size": 1},
)
```

The results are as follows:
Great results! But I'm a bit unsure whether the prompts are applied correctly when they're passed through:
mteb/mteb/models/instruct_wrapper.py, lines 158 to 171 in d2c704c

After adding code to print the instruction, the following output was produced:
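As a side note, the task-to-prompt mapping itself can be sanity-checked outside the wrapper; a minimal sketch, assuming only the prompts_dict from the script above:

```python
import mteb

# Check that every task name in prompts_dict resolves to a registered task
# and therefore has a prompt configured before running the evaluation.
tasks = mteb.get_tasks(tasks=list(prompts_dict.keys()))
missing = [t.metadata.name for t in tasks if t.metadata.name not in prompts_dict]
assert not missing, f"No prompt configured for: {missing}"
```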
Interesting, thanks! I didn't think that would work since it's a bit unintended, but maybe we should update the code to handle this case. I've checked the code for ReasonIR and found some other places that can help to reproduce:
@Muennighoff Can you help with what we can do to reproduce the results?
I think the IDs filtering is probably the main missing piece to fully reproduce results?
I think points 1 and 2 are a separate issue, as they are related to query expansion. The problem of the performance not being reproducible in the single
# Conflicts:
#   mteb/benchmarks/benchmarks/__init__.py
#   mteb/tasks/Retrieval/__init__.py
#   mteb/tasks/retrieval/eng/BrightSubsetsLongRetrieval.py
#   mteb/tasks/retrieval/eng/BrightSubsetsRetrieval.py
I think it would be better to close this PR and work on it later together with "Excluded IDs missing from BRIGHT dataset" (#2696). We should also revise it to fit the v2 format and include descriptive stats as well. What do you think?
Do you mean that you don't want the tasks in this PR and will add another PR for #2696?
Yes, you need to add statistics to merge. To apply
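If it helps, the statistics can typically be generated per task; a rough sketch (the method name below is an assumption and may differ between mteb versions, so check the contributing docs):

```python
import mteb

# Hypothetical: compute and store descriptive statistics for one subset task;
# the method name calculate_descriptive_statistics is assumed here.
task = mteb.get_task("BrightBiologyRetrieval")
task.calculate_descriptive_statistics()
```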
What tasks need to be redone for this PR? I'm confused about the changes with the v2 format, so I would appreciate your help.
I think we can solve #2696 in this PR, because otherwise we would need to create v2 versions of these tasks, which I think is not a good solution.
```python
domain_corpus_long = datasets.load_dataset(
    path,
    "long_documents",
    split=domain,
    cache_dir=cache_dir,
    revision=revision,
)
examples = datasets.load_dataset(
    path,
    "examples",
    split=domain,
    cache_dir=cache_dir,
    revision=revision,
)
corpus["long"] = {e["id"]: {"text": e["content"]} for e in domain_corpus_long}
queries["long"] = {e["id"]: e["query"] for e in examples}
relevant_docs["long"] = defaultdict(dict)
```
To follow the v2 format, you can remove the conversion of the dataset to a dict and pass the dataset directly:

```python
domain_corpus_long = domain_corpus_long.rename_column("content", "text")
queries = queries.rename_column("query", "text")
...
return domain_corpus_long, queries, relevant_docs
```
```python
if self.data_loaded:
    return

self.corpus, self.queries, self.relevant_docs = load_bright_long_data(
```
And then here it should look like:

```python
self.dataset["default"]["long"]["corpus"], self.dataset["default"]["long"]["queries"], self.dataset["default"]["long"]["relevant_documents"]
```
You can refer to mteb/mteb/abstasks/retrieval_dataset_loaders.py, lines 25 to 38 in 0ead029.
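Putting the two suggestions together, a hedged sketch of what a v2-style loader could look like (the column names come from the diff above; the qrels construction is left out because it depends on the schema of the "examples" split):

```python
import datasets

def load_bright_long_data(path, domain, cache_dir=None, revision=None):
    # Corpus: rename "content" -> "text" and pass the Dataset through directly.
    corpus = datasets.load_dataset(
        path, "long_documents", split=domain,
        cache_dir=cache_dir, revision=revision,
    ).rename_column("content", "text")
    # Queries: rename "query" -> "text".
    queries = datasets.load_dataset(
        path, "examples", split=domain,
        cache_dir=cache_dir, revision=revision,
    ).rename_column("query", "text")
    # Qrels construction omitted; it would come from the gold-id fields
    # of the "examples" split.
    relevant_docs = ...
    return corpus, queries, relevant_docs
```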
I ran with bm25 and got these results. Overall they show the same difference as in #3285 (comment), so the problem is definitely somewhere in the data. I will continue debugging later.
Script to use bm25s in BRIGHT:

```python
def retrieval_bm25(queries, query_ids, documents, doc_ids, excluded_ids, long_context, **kwargs):
    import bm25s
    import Stemmer

    # Tokenize and index the corpus with an English stemmer and stopword list.
    stemmer_language = 'english'
    stopwords = 'en'
    stemmer = Stemmer.Stemmer(stemmer_language)
    encoded_corpus = bm25s.tokenize(documents, stopwords=stopwords, stemmer=stemmer)
    retriever = bm25s.BM25()
    retriever.index(encoded_corpus)

    corpus_idx_to_id = {i: doc_id for i, doc_id in enumerate(doc_ids)}
    query_token_strs = bm25s.tokenize(queries, stopwords=stopwords, stemmer=stemmer)
    queries_results, queries_scores = retriever.retrieve(
        query_token_strs,
        k=min(1000, len(corpus_idx_to_id))
    )

    all_scores = {}
    for qi, query_id in enumerate(query_ids):
        query_results = queries_results[qi]
        scores = queries_scores[qi]
        all_scores[str(query_id)] = {}
        for ri in range(len(query_results)):
            doc_idx = query_results[ri]
            score = scores[ri]
            doc_id = corpus_idx_to_id[doc_idx]
            all_scores[str(query_id)][str(doc_id)] = float(score)
        # Drop documents excluded for this query, as in the official BRIGHT evaluation.
        for did in set(excluded_ids[str(query_id)]):
            if did != "N/A" and did in all_scores[str(query_id)]:
                all_scores[str(query_id)].pop(did)
        cur_scores = sorted(all_scores[str(query_id)].items(), key=lambda x: x[1], reverse=True)[:1000]
        all_scores[str(query_id)] = {pair[0]: pair[1] for pair in cur_scores}
    return all_scores
```
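For reference, a tiny hypothetical invocation of the function above (toy data, not BRIGHT; requires the bm25s and PyStemmer packages):

```python
scores = retrieval_bm25(
    queries=["how do plants fix nitrogen"],
    query_ids=["q1"],
    documents=["Nitrogen fixation is performed by rhizobia.", "Unrelated text."],
    doc_ids=["d1", "d2"],
    excluded_ids={"q1": ["N/A"]},
    long_context=False,
)
print(scores)  # {"q1": {"d1": <score>, "d2": <score>}}
```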
I found the source of the problem: mteb/mteb/abstasks/retrieval.py, lines 344 to 348 in 42dea01. If I remove the filtering, then I get the same metrics for bm25. Should we add an option to disable filtering to reproduce old results, or just leave it as is? WDYT @KennethEnevoldsen @Muennighoff?
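For context, the kind of filtering in question, dropping queries that have no positively labeled documents, looks roughly like this (a sketch, not the exact mteb code):

```python
# Toy qrels: q2 has no positive documents and would be filtered out.
queries = {"q1": "kept query", "q2": "dropped query"}
relevant_docs = {"q1": {"d1": 1}, "q2": {}}

queries = {
    qid: text
    for qid, text in queries.items()
    if any(score > 0 for score in relevant_docs.get(qid, {}).values())
}
assert list(queries) == ["q1"]
```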
Wouldn't we rather reproduce the old results and then potentially version-bump it and add the filtering?
I've run the tasks, and without this filtering I can reproduce the scores from the paper. I'm not sure how a version bump would help us.
So, if I understand correctly, we have a choice between reproducing the paper scores or the previous implementation? Then I would target the paper and relabel it as a bugfix.
I reran BRIGHT and now the results match!
I changed:
Also, I've split BRIGHT v1.1 into short and long subsets, but maybe we can convert them back. I don't know why I thought the problem was in the filtering. I will try to evaluate the bge model to check the instructions.
| """, | ||
| ) | ||
|
|
||
| BRIGHT_V1_1 = Benchmark( |
Feel free to delete the benchmark here and add that in a separate PR.
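For the follow-up PR, a rough sketch of how such a benchmark entry could be declared (the field set and import path are assumptions based on the other Benchmark entries in benchmarks.py, not the final definition):

```python
import mteb
from mteb.benchmarks.benchmarks import Benchmark  # import path assumed

# Hypothetical: a benchmark built from the new per-domain tasks.
BRIGHT_SUBSETS = Benchmark(
    name="BRIGHT (subsets)",
    tasks=mteb.get_tasks(
        tasks=["BrightBiologyRetrieval", "BrightEarthScienceRetrieval"],
    ),
    description="Individual domain subsets of the BRIGHT benchmark.",
)
```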
# Conflicts:
#   mteb/models/model_implementations/bm25.py
Force-pushed from d2bcc4a to 75c9017
Close #3268
This pull request adds new BRIGHT subset benchmarks and their corresponding descriptive statistics to the retrieval benchmark suite. These changes enable more granular, domain-specific evaluation for reasoning-intensive retrieval tasks, both for standard and long document formats.
Benchmark additions
- Added two new benchmarks, BRIGHT_SUBSETS and BRIGHT_SUBSETS_LONG, to the mteb/benchmarks/benchmarks/benchmarks.py file, covering individual domains of the BRIGHT benchmark for both standard and long document retrieval tasks. [1] [2]
- Exposed both benchmarks in the mteb/benchmarks/benchmarks/__init__.py file for import and usage. [1] [2]

Descriptive statistics
- Added descriptive statistics files for each subset task (BrightBiologyRetrieval.json, BrightBiologyLongRetrieval.json, etc.), detailing sample counts, text lengths, and relevant document statistics for each domain. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13]

Minor improvement
- Reformatted the BEIR_NL benchmark description for improved readability.