
refactor: split BRIGHT benchmark into individual subset tasks#3285

Merged
Samoed merged 24 commits into embeddings-benchmark:main from whybe-choi:bright-subset-tasks
Jan 19, 2026
Conversation

@whybe-choi
Contributor

@whybe-choi whybe-choi commented Oct 7, 2025

Closes #3268

This pull request adds new BRIGHT subset benchmarks and their corresponding descriptive statistics to the retrieval benchmark suite. These changes enable more granular, domain-specific evaluation for reasoning-intensive retrieval tasks, both for standard and long document formats.

Benchmark additions

  • Introduced two new benchmarks, BRIGHT_SUBSETS and BRIGHT_SUBSETS_LONG, in mteb/benchmarks/benchmarks/benchmarks.py, covering the individual domains of the BRIGHT benchmark for both standard and long-document retrieval tasks. [1] [2]
  • Registered the new benchmarks in the mteb/benchmarks/benchmarks/__init__.py file for import and usage. [1] [2]

Descriptive statistics

  • Added descriptive statistics JSON files for each new BRIGHT subset retrieval task, including both standard and long formats (e.g., BrightBiologyRetrieval.json, BrightBiologyLongRetrieval.json, etc.), detailing sample counts, text lengths, and relevant document statistics for each domain. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13]

Minor improvement

  • Minor formatting fix in the BEIR_NL benchmark description for improved readability.

@whybe-choi whybe-choi changed the title refactor: split BRIGHT benchmark into individual subset tasks refactor: split BRIGHT benchmark into individual subset tasks Oct 7, 2025
@Samoed Samoed requested a review from Muennighoff October 7, 2025 13:40
@whybe-choi whybe-choi force-pushed the bright-subset-tasks branch from 4240bdb to 826990a Compare October 7, 2025 14:36
@KennethEnevoldsen KennethEnevoldsen left a comment
Contributor

This comment was marked as resolved.

Hmm, this change will invalidate all previous results on BRIGHT.

You know that you can also simply subselect from a task using:

task = mteb.get_task("BrightRetrieval", eval_splits=..., hf_subsets=...)

For the leaderboard display it is even possible to create custom summary tables (see e.g. #3272)
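
For illustration, a minimal sketch of that kind of subselection (the subset names here are assumptions based on the BRIGHT domains, not checked against the task metadata):

import mteb

# hypothetical subset names; adjust to the actual hf_subsets of BrightRetrieval
task = mteb.get_task("BrightRetrieval", hf_subsets=["biology", "economics"])

evaluation = mteb.MTEB(tasks=[task])
# evaluation.run(model) then only evaluates the selected subsets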

@Samoed
Member

Samoed commented Oct 7, 2025

You know that you can also simply subselect from a task using:

Yes, but BRIGHT requires different prompts for different subsets, and because of that we probably need to split it. We can add support for configuring prompts per subset, but I'm not sure if it's a good idea

@KennethEnevoldsen
Contributor

Yes, but BRIGHT requires different prompts for different subsets, and because of that we probably need to split it. We can add support for configuring prompts per subset, but I'm not sure if it's a good idea

Ohh... Yeah that is hard to fix.

I see that the original BRIGHT(long) only has four models and BRIGHT only has 12, so I guess it is possible to rerun them

@Muennighoff
Contributor

If the scores change, are the new scores more similar to or more different from the official scores? If closer, then I think it is fine & maybe we can rerun some models. For many models on our BRIGHT leaderboard I just converted the scores from https://brightbenchmark.github.io/ to MTEB format when we originally added them, so they may still be fine if these changes actually make our implementation closer to that one.

@Samoed Samoed added the new dataset Issues related to adding a new task or dataset label Oct 7, 2025
@whybe-choi
Contributor Author

Would it be enough to evaluate the performance of ReasonIR, or is there a list of other models that it would be good to test?

@Samoed
Member

Samoed commented Oct 8, 2025

To check the implementation, this will be enough; just don't update the old leaderboard

@whybe-choi whybe-choi force-pushed the bright-subset-tasks branch from 826990a to 3ed620f Compare October 8, 2025 11:33
@whybe-choi whybe-choi force-pushed the bright-subset-tasks branch from 3ed620f to 57c757f Compare October 8, 2025 11:53
@whybe-choi
Contributor Author

After splitting BrightRetrieval into multiple tasks, I ran ReasonIR on them with task-specific prompts using the following code:

import torch
import mteb

# https://github.com/facebookresearch/ReasonIR/tree/main/evaluation/bright/configs/reasonir
prompts_dict = {
    "BrightBiologyRetrieval": "Given a Biology post, retrieve relevant passages that help answer the post",
    "BrightEarthScienceRetrieval": "Given a Earth Science post, retrieve relevant passages that help answer the post",
    "BrightEconomicsRetrieval": "Given a Economics post, retrieve relevant passages that help answer the post",
    "BrightPsychologyRetrieval": "Given a Psychology post, retrieve relevant passages that help answer the post",
    "BrightRoboticsRetrieval": "Given a Robotics post, retrieve relevant passages that help answer the post",
    "BrightStackoverflowRetrieval": "Given a Stackoverflow post, retrieve relevant passages that help answer the post",
    "BrightSustainableLivingRetrieval": "Given a Sustainable Living post, retrieve relevant passages that help answer the post",
    "BrightPonyRetrieval": "Given a Pony question, retrieve relevant passages that help answer the question",
    "BrightLeetcodeRetrieval": "Given a coding problem, retrieve relevant examples that help answer the problem",
    "BrightAopsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
    "BrightTheoremQATheoremsRetrieval": "Given a Math problem, retrieve relevant theorems that help answer the problem",
    "BrightTheoremQAQuestionsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
}

tasks = mteb.get_tasks(tasks=list(prompts_dict.keys()), languages=["eng"])
evaluation = mteb.MTEB(tasks=tasks)

model = mteb.get_model(
    "ReasonIR/ReasonIR-8B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    prompts_dict=prompts_dict,
)

evaluation.run(
    model,
    save_predictions=True,
    output_folder="evaluation/results",
    encode_kwargs={"batch_size": 1},
)

The results are as follows:

|  | Bio. | Earth. | Econ. | Psy. | Rob. | Stack. | Sus. | Leet. | Pony | AoPS | TheoQ. | TheoT. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| before split | 24.31 | 30.83 | 24.27 | 28.95 | 18.40 | 21.68 | 20.57 | 18.14 | 9.49 | 4.84 | 18.21 | 26.42 | 20.51 |
| after split | 26.18 | 30.71 | 23.96 | 29.76 | 18.62 | 21.15 | 19.89 | 19.65 | 9.22 | 5.12 | 18.34 | 27.12 | 20.81 |

In the paper: (screenshot of the reported scores omitted)

@Samoed
Member

Samoed commented Oct 9, 2025

Great results! But I'm a bit unsure whether the prompts are applied correctly when they're passed through get_model?

@whybe-choi
Contributor Author

if instruction:
    logger.info(f"Using instruction: '{instruction}' for task: '{task_name}'")
embeddings = self.model.encode(
    sentences,
    prompt=instruction,
    **kwargs,
)
if isinstance(embeddings, torch.Tensor):
    # sometimes in kwargs can be return_tensors=True
    embeddings = embeddings.cpu().detach().float().numpy()
return embeddings

After adding code to print the instruction inside this snippet, the following output was produced:

# Biology
Retrieval
    - BrightBiologyRetrieval, s2p


instruction: <|user|>
Given a Biology post, retrieve relevant passages that help answer the post<|embed|>

Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 103/103 [00:06<00:00, 15.80it/s]
instruction: <|embed|>

Batches:   0%|                                                                                            | 2/50000 [00:02<18:01:38,  1.30s/it
# Psychology
Retrieval
    - BrightPsychologyRetrieval, s2p


instruction: <|user|>
Given a Psychology post, retrieve relevant passages that help answer the post<|embed|>

Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 101/101 [00:07<00:00, 14.12it/s]
instruction: <|embed|>

Batches:   0%|                                                                                                       | 0/50000 [00:01<?, ?it/s]
# Aops
Retrieval
    - BrightAopsRetrieval, s2p


instruction: <|user|>
Given a Math problem, retrieve relevant examples that help answer the problem<|embed|>

Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 111/111 [00:06<00:00, 16.13it/s]
instruction: <|embed|>

Batches:   0%|                                                                                            | 17/50000 [00:09<7:16:33,  1.91it/s]

@Samoed
Member

Samoed commented Oct 9, 2025

Interesting, thanks! I didn’t think that would work since it’s a bit unintended, but maybe we should update the code to handle this case.

I've checked the ReasonIR code and found some other places that may help reproduce the results:

  1. In some cases, the rewritten query is concatenated with the original query https://github.com/facebookresearch/ReasonIR/blob/0aac96269e455965949df16520fab72da68ffc22/evaluation/bright/run.py#L82-L87
  2. Sometimes reasoning traces are added to the query https://github.com/facebookresearch/ReasonIR/blob/0aac96269e455965949df16520fab72da68ffc22/evaluation/bright/run.py#L124
  3. Maybe IDs should be filtered (ref Excluded IDs missing from BRIGHT dataset #2696), but in the ReasonIR code they only check that no IDs intersect https://github.com/facebookresearch/ReasonIR/blob/0aac96269e455965949df16520fab72da68ffc22/evaluation/bright/run.py#L130-L131 (a rough sketch of this post-hoc exclusion is shown below)
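
For illustration, a rough sketch of the post-hoc exclusion described in point 3 (the helper name is hypothetical, and all_scores/excluded_ids are assumed to be plain dicts like in the ReasonIR run script):

def drop_excluded_ids(all_scores, excluded_ids):
    """Remove per-query excluded corpus IDs from the retrieved scores before evaluation."""
    filtered = {}
    for query_id, doc_scores in all_scores.items():
        excluded = {did for did in excluded_ids.get(query_id, []) if did != "N/A"}
        filtered[query_id] = {
            doc_id: score for doc_id, score in doc_scores.items() if doc_id not in excluded
        }
    return filtered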

@Muennighoff Can you help figure out what we can do to reproduce the results?

@Muennighoff
Contributor

I think the ID filtering is probably the main missing piece to fully reproduce the results?

@whybe-choi
Contributor Author

I think points 1 and 2 are a separate issue, as they are related to query expansion. The fact that the performance is not reproducible for the single ReasonIR model seems to be related to the issue mentioned in point 3.

Samoed and others added 4 commits October 20, 2025 21:56
# Conflicts:
#	mteb/benchmarks/benchmarks/__init__.py
#	mteb/tasks/Retrieval/__init__.py
#	mteb/tasks/retrieval/eng/BrightSubsetsLongRetrieval.py
#	mteb/tasks/retrieval/eng/BrightSubsetsRetrieval.py
@whybe-choi
Contributor Author

@Samoed

I think it would be better to close this PR and work on it later together with Excluded IDs missing from BRIGHT dataset #2696. Also, it should be revised to fit the v2 format and include descriptive stats as well. What do you think?

@Samoed
Member

Samoed commented Oct 22, 2025

I think it would be better to close this PR and work on it later together

Do you mean that you don't want the tasks in this PR and will add another PR for #2696?

Also, it should be revised to fit the v2 format and include descriptive stats as well. What do you think?

Yes, you need to add statistics to merge. To apply the v2 format, you can select subsets from https://huggingface.co/datasets/mteb/BrightRetrieval, but the retrieval dataset loader requires the dataset to have strictly corpus, qrels and queries, so maybe we need to reupload them instead

@whybe-choi
Contributor Author

Which tasks need to be redone for this PR? I'm confused about the changes related to the v2 format, so I would appreciate your help.

@Samoed
Member

Samoed commented Oct 22, 2025

I think we can solve #2696 in this PR, because otherwise we would need to create v2 versions of these tasks, which I think is not a good solution

Comment on lines 22 to 38
domain_corpus_long = datasets.load_dataset(
    path,
    "long_documents",
    split=domain,
    cache_dir=cache_dir,
    revision=revision,
)
examples = datasets.load_dataset(
    path,
    "examples",
    split=domain,
    cache_dir=cache_dir,
    revision=revision,
)
corpus["long"] = {e["id"]: {"text": e["content"]} for e in domain_corpus_long}
queries["long"] = {e["id"]: e["query"] for e in examples}
relevant_docs["long"] = defaultdict(dict)
Member

To follow the v2 format, you can remove the conversion of the dataset to dicts and pass the datasets directly.

domain_corpus_long = domain_corpus_long.rename_column("content", "text")
queries = queries.rename_column("query", "text")
...
return domain_corpus_long, queries, relevant_docs

if self.data_loaded:
    return

self.corpus, self.queries, self.relevant_docs = load_bright_long_data(
Member

And then here it should look like

self.dataset["default"]["long"]["corpus"], self.dataset["default"]["long"]["queries"], self.dataset["default"]["long"]["relevant_documents"]

You can refer to

class RetrievalSplitData(TypedDict):
    """A dictionary containing the corpus, queries, relevant documents, instructions, and top-ranked documents for a retrieval task.

    Attributes:
        corpus: The corpus dataset containing documents. Should have columns `id`, `title`, `text` or `image`.
        queries: The queries dataset containing queries. Should have columns `id`, `text`, `instruction` (for instruction retrieval/reranking) or `image`.
        relevant_docs: A mapping of query IDs to relevant document IDs and their relevance scores. Should have columns `query-id`, `corpus-id`, `score`.
        top_ranked: A mapping of query IDs to a list of top-ranked document IDs. Should have columns `query-id`, `corpus-ids` (list[str]). This is optional and used for reranking tasks.
    """

    corpus: CorpusDatasetType
    queries: QueryDatasetType
    relevant_docs: RelevantDocumentsType
    top_ranked: TopRankedDocumentsType | None
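
For illustration, a minimal sketch of a loader in that shape (not the final PR code; the gold_ids_long column name and the qrels construction are assumptions about the upstream BRIGHT data):

import datasets

def load_bright_long_split(path, domain, revision=None, cache_dir=None):
    """Sketch: return Hugging Face datasets directly in the v2 retrieval schema."""
    corpus = datasets.load_dataset(
        path, "long_documents", split=domain, cache_dir=cache_dir, revision=revision
    )
    examples = datasets.load_dataset(
        path, "examples", split=domain, cache_dir=cache_dir, revision=revision
    )

    # corpus and queries need `id` and `text` columns
    corpus = corpus.rename_column("content", "text")
    queries = examples.rename_column("query", "text").select_columns(["id", "text"])

    # qrels as rows of `query-id`, `corpus-id`, `score`
    qrels_rows = [
        {"query-id": example["id"], "corpus-id": gold_id, "score": 1}
        for example in examples
        for gold_id in example["gold_ids_long"]  # assumed column name
    ]
    relevant_docs = datasets.Dataset.from_list(qrels_rows)
    return corpus, queries, relevant_docs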

@Samoed
Member

Samoed commented Dec 28, 2025

I ran with bm25 and got these results. Overall they show the same difference as in #3285 (comment), so the problem is somewhere in the data. I will continue debugging later

| task_name | bm25s (mteb) | bm25 (bright) | diff | matching |
|---|---|---|---|---|
| BrightAopsRetrieval | 0.03155 | 0.0498 | 0.0182 | - |
| BrightBiologyRetrieval | 0.08034 | 0.0774 | 0.0029 | - |
| BrightEarthScienceRetrieval | 0.12803 | 0.1283 | 0.0003 | + |
| BrightEconomicsRetrieval | 0.1033 | 0.1026 | 0.0007 | + |
| BrightLeetcodeRetrieval | 0.14568 | 0.2532 | 0.1075 | - |
| BrightPonyRetrieval | 0.02222 | 0.0226 | 0.0004 | + |
| BrightPsychologyRetrieval | 0.07645 | 0.0765 | 0.0001 | + |
| BrightRoboticsRetrieval | 0.06229 | 0.0623 | 0.0000 | + |
| BrightStackoverflowRetrieval | 0.14359 | 0.1636 | 0.0200 | - |
| BrightSustainableLivingRetrieval | 0.08385 | 0.0851 | 0.0013 | - |
| BrightTheoremQAQuestionsRetrieval | 0.05413 | 0.0735 | 0.0194 | - |
| BrightTheoremQATheoremsRetrieval | 0.01566 | 0.0051 | 0.0106 | - |
| BrightBiologyLongRetrieval | 0.09547 | 0.0955 | 0.0000 | + |
| BrightEarthScienceLongRetrieval | 0.17816 | 0.1782 | 0.0000 | + |
| BrightEconomicsLongRetrieval | 0.15372 | 0.1537 | 0.0000 | + |
| BrightPonyLongRetrieval | 0.02496 | 0.0250 | 0.0000 | + |
| BrightPsychologyLongRetrieval | 0.1703 | 0.1703 | 0.0000 | + |
| BrightRoboticsLongRetrieval | 0.05941 | 0.0594 | 0.0000 | + |
| BrightStackoverflowLongRetrieval | 0.23504 | 0.2350 | 0.0000 | + |
| BrightSustainableLivingLongRetrieval | 0.18519 | 0.1852 | 0.0000 | + |

Script to use bm25s in BRIGHT

def retrieval_bm25(queries, query_ids, documents, doc_ids, excluded_ids, long_context, **kwargs):
    import bm25s
    import Stemmer

    # tokenize the corpus with English stopword removal and stemming
    stemmer_language = 'english'
    stopwords = 'en'
    stemmer = Stemmer.Stemmer(stemmer_language)
    encoded_corpus = bm25s.tokenize(documents, stopwords=stopwords, stemmer=stemmer)

    retriever = bm25s.BM25()
    retriever.index(encoded_corpus)

    corpus_idx_to_id = {i: doc_id for i, doc_id in enumerate(doc_ids)}
    query_token_strs = bm25s.tokenize(queries, stopwords=stopwords, stemmer=stemmer)

    # retrieve up to 1000 documents per query (or the whole corpus if smaller)
    queries_results, queries_scores = retriever.retrieve(
        query_token_strs,
        k=min(1000, len(corpus_idx_to_id))
    )

    all_scores = {}
    for qi, query_id in enumerate(query_ids):
        query_results = queries_results[qi]
        scores = queries_scores[qi]
        all_scores[str(query_id)] = {}
        for ri in range(len(query_results)):
            doc_idx = query_results[ri]
            score = scores[ri]
            doc_id = corpus_idx_to_id[doc_idx]
            all_scores[str(query_id)][str(doc_id)] = float(score)
        # drop documents excluded for this query (BRIGHT provides per-query excluded ids)
        for did in set(excluded_ids[str(query_id)]):
            if did != "N/A" and did in all_scores[str(query_id)]:
                all_scores[str(query_id)].pop(did)
        # keep the top 1000 remaining documents, sorted by score
        cur_scores = sorted(all_scores[str(query_id)].items(), key=lambda x: x[1], reverse=True)[:1000]
        all_scores[str(query_id)] = {pair[0]: pair[1] for pair in cur_scores}
    return all_scores
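
For reference, a toy invocation of the function above (the texts and IDs are made up; in the BRIGHT harness these arguments come from its own data loading):

scores = retrieval_bm25(
    queries=["how do plants fix nitrogen?"],
    query_ids=["q1"],
    documents=["Legumes host rhizobia that fix nitrogen.", "An unrelated document about cooking."],
    doc_ids=["d1", "d2"],
    excluded_ids={"q1": ["N/A"]},
    long_context=False,
)
print(scores["q1"])  # {"d1": <bm25 score>, "d2": <bm25 score>}, highest score first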

@Samoed
Member

Samoed commented Dec 28, 2025

I found the source of the problem:

data_split["relevant_docs"], data_split["queries"] = (
_filter_queries_without_positives(
data_split["relevant_docs"], data_split["queries"]
)
)

If I remove the filtering, I get the same metrics for bm25. Should we add an option to disable filtering to reproduce the old results, or should we just leave it as is? WDYT @KennethEnevoldsen @Muennighoff?
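
For context, a rough reconstruction of what that filtering does (illustrative only, not the actual mteb implementation; it assumes the older dict-based qrels/queries format): queries with no positive qrel are dropped, which shrinks the query set and therefore shifts the averaged metrics.

def filter_queries_without_positives_sketch(relevant_docs, queries):
    """Keep only queries that have at least one relevant document with a positive score."""
    keep = {
        query_id
        for query_id, doc_scores in relevant_docs.items()
        if any(score > 0 for score in doc_scores.values())
    }
    filtered_qrels = {qid: docs for qid, docs in relevant_docs.items() if qid in keep}
    filtered_queries = {qid: text for qid, text in queries.items() if qid in keep}
    return filtered_qrels, filtered_queries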

@Samoed Samoed linked an issue Jan 6, 2026 that may be closed by this pull request
@KennethEnevoldsen
Contributor

If I remove the filtering, I get the same metrics for bm25. Should we add an option to disable filtering to reproduce the old results, or should we just leave it as is? WDYT @KennethEnevoldsen @Muennighoff?

Wouldn't we rather reproduce the old results and then potentially version bump it and add the filtering?

@Samoed
Member

Samoed commented Jan 13, 2026

I've run the tasks, and without this filtering I can reproduce the scores from the paper. I'm not sure how a version bump would help us

@KennethEnevoldsen
Contributor

So, if I understand correctly, we have the choice between reproducing the paper scores or the previous implementation?

Then I would target the paper and relabel it as a bugfix.

@Samoed
Member

Samoed commented Jan 14, 2026

I reran BRIGHT and now the results match!

| task_name | bm25s (mteb) | bm25 (bright) |
|---|---|---|
| BrightAopsRetrieval | 0.0498 | 0.0498 |
| BrightBiologyRetrieval | 0.07736 | 0.0774 |
| BrightEarthScienceRetrieval | 0.12825 | 0.1283 |
| BrightEconomicsRetrieval | 0.10258 | 0.1026 |
| BrightLeetcodeRetrieval | 0.25321 | 0.2532 |
| BrightPonyRetrieval | 0.02264 | 0.0226 |
| BrightPsychologyRetrieval | 0.07645 | 0.0765 |
| BrightRoboticsRetrieval | 0.06229 | 0.0623 |
| BrightStackoverflowRetrieval | 0.16358 | 0.1636 |
| BrightSustainableLivingRetrieval | 0.08507 | 0.0851 |
| BrightTheoremQAQuestionsRetrieval | 0.07349 | 0.0735 |
| BrightTheoremQATheoremsRetrieval | 0.00509 | 0.0051 |
| BrightBiologyLongRetrieval | 0.09547 | 0.0955 |
| BrightEarthScienceLongRetrieval | 0.17816 | 0.1782 |
| BrightEconomicsLongRetrieval | 0.15372 | 0.1537 |
| BrightPonyLongRetrieval | 0.02496 | 0.0250 |
| BrightPsychologyLongRetrieval | 0.1703 | 0.1703 |
| BrightRoboticsLongRetrieval | 0.05941 | 0.0594 |
| BrightStackoverflowLongRetrieval | 0.23504 | 0.2350 |
| BrightSustainableLivingLongRetrieval | 0.18519 | 0.1852 |

I changed:

  • Updated the dataset revisions to the latest. It seems our revision was taken between data uploads, before the dataset had fully finished its transition; theoremqa_theorems was updated after this commit
  • bm25 didn't support reranking tasks (it used all documents for searching)

Also I've split BRIGHT v1.1 into short and long subsets, but maybe we can convert them back. I don't know why I thought the problem was in the filtering. I will try to evaluate a bge model to check the instructions

""",
)

BRIGHT_V1_1 = Benchmark(
Contributor

Shouldn't we just combine this into one table, with both long and short as two different columns? (We can also have different columns for the different domains.)

Contributor

Feel free to delete the benchmark here and add that in a separate PR.

@Samoed
Member

Samoed commented Jan 18, 2026

Short Bright

MTEB

| task_name | BrightBiologyRetrieval | BrightEarthScienceRetrieval | BrightEconomicsRetrieval | BrightPsychologyRetrieval | BrightRoboticsRetrieval | BrightStackoverflowRetrieval | BrightSustainableLivingRetrieval | BrightLeetcodeRetrieval | BrightPonyRetrieval | BrightAopsRetrieval | BrightTheoremQAQuestionsRetrieval | BrightTheoremQATheoremsRetrieval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BAAI/bge-large-en-v1.5 | 0.1167 | 0.24558 | 0.16605 | 0.17464 | 0.11713 | 0.1083 | 0.13326 | 0.26681 | 0.05724 | 0.06 | 0.13057 | 0.069 |
| sentence-transformers/all-mpnet-base-v2 | 0.15098 | 0.20414 | 0.16639 | 0.22664 | 0.08221 | 0.11024 | 0.15343 | 0.26404 | 0.06996 | 0.05325 | 0.20035 | 0.10779 |

Bright

| Model | BrightBiologyRetrieval | BrightEarthScienceRetrieval | BrightEconomicsRetrieval | BrightPsychologyRetrieval | BrightRoboticsRetrieval | BrightStackoverflowRetrieval | BrightSustainableLivingRetrieval | BrightLeetcodeRetrieval | BrightPonyRetrieval | BrightAopsRetrieval | BrightTheoremQAQuestionsRetrieval | BrightTheoremQATheoremsRetrieval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BAAI/bge-large-en-v1.5 | 11.7 | 24.6 | 16.6 | 17.5 | 11.7 | 10.8 | 13.3 | 26.7 | 5.7 | 6.0 | 13.0 | 6.9 |
| sentence-transformers/all-mpnet-base-v2 | 15.1 | 20.4 | 16.6 | 22.7 | 8.2 | 11.0 | 15.3 | 26.4 | 7.0 | 5.3 | 20.0 | 10.8 |

Long Bright

MTEB

| task_name | BrightBiologyLongRetrieval | BrightEarthScienceLongRetrieval | BrightEconomicsLongRetrieval | BrightPsychologyLongRetrieval | BrightRoboticsLongRetrieval | BrightStackoverflowLongRetrieval | BrightSustainableLivingLongRetrieval | BrightPonyLongRetrieval |
|---|---|---|---|---|---|---|---|---|
| BAAI/bge-large-en-v1.5 | 0.16424 | 0.2773 | 0.20874 | 0.11584 | 0.10891 | 0.13248 | 0.16898 | 0.0036 |
| sentence-transformers/all-mpnet-base-v2 | 0.25566 | 0.34052 | 0.18932 | 0.15842 | 0.10891 | 0.14957 | 0.18009 | 0.0119 |

Bright

| Model | BrightBiologyLongRetrieval | BrightEarthScienceLongRetrieval | BrightEconomicsLongRetrieval | BrightPsychologyLongRetrieval | BrightRoboticsLongRetrieval | BrightStackoverflowLongRetrieval | BrightSustainableLivingLongRetrieval | BrightPonyLongRetrieval |
|---|---|---|---|---|---|---|---|---|
| BAAI/bge-large-en-v1.5 | 16.4 | 27.7 | 20.9 | 11.6 | 10.9 | 13.3 | 16.9 | 0.4 |
| sentence-transformers/all-mpnet-base-v2 | 25.6 | 34.1 | 18.9 | 15.8 | 10.9 | 15.0 | 18.0 | 1.2 |

@KennethEnevoldsen KennethEnevoldsen left a comment
Contributor

I believe we are good to merge here! Great job everyone.

# Conflicts:
#	mteb/models/model_implementations/bm25.py
@Samoed Samoed force-pushed the bright-subset-tasks branch from d2bcc4a to 75c9017 Compare January 19, 2026 10:52
@Samoed Samoed enabled auto-merge (squash) January 19, 2026 10:53
@Samoed Samoed merged commit 2c9b9e9 into embeddings-benchmark:main Jan 19, 2026
12 checks passed
@whybe-choi whybe-choi deleted the bright-subset-tasks branch January 19, 2026 12:54
Samoed added a commit that referenced this pull request Jan 31, 2026
Successfully merging this pull request may close these issues: Results on BRIGHT not matching; Excluded IDs missing from BRIGHT dataset.