Adding voyageai/voyage-4-large (embed_dim=2048) results #404

Merged
KennethEnevoldsen merged 5 commits into embeddings-benchmark:main from fzoll:voyage-4-large-2048d on Jan 27, 2026
Conversation

@fzoll (Contributor) commented Jan 19, 2026

Adding voyage-4-large (2048d) results

Checklist

  • My model has a model sheet, report, or similar
  • My model has a reference implementation in mteb/models/model_implementations/; this can be an API-based implementation. Instructions on how to add a model can be found here
  • The results submitted are obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly hosted, e.g., on Huggingface
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset, including training splits. If I have, I have disclosed it clearly.

github-actions bot commented Jan 19, 2026

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: voyageai/voyage-4-large (embed_dim=2048)
Tasks: AILACasedocs, AILAStatutes, AppsRetrieval, CUREv1, ChatDoctorRetrieval, Code1Retrieval, DS1000Retrieval, EnglishFinance1Retrieval, EnglishFinance2Retrieval, EnglishFinance3Retrieval, EnglishFinance4Retrieval, EnglishHealthcare1Retrieval, FinQARetrieval, FinanceBenchRetrieval, French1Retrieval, FrenchLegal1Retrieval, FreshStackRetrieval, German1Retrieval, GermanHealthcare1Retrieval, GermanLegal1Retrieval, HC3FinanceRetrieval, HumanEvalRetrieval, JapaneseCode1Retrieval, JapaneseLegal1Retrieval, LegalQuAD, LegalSummarization, MBPPRetrieval, MIRACLRetrievalHardNegatives, WikiSQLRetrieval

Results for voyageai/voyage-4-large (embed_dim=2048)

| task_name | google/gemini-embedding-001 | voyageai/voyage-4-large (embed_dim=2048) | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| AILACasedocs | 0.4833 | 0.4749 | 0.2643 | 0.6541 | bflhc/Octen-Embedding-8B | False |
| AILAStatutes | 0.4877 | 0.5029 | 0.2084 | 0.9313 | bflhc/Octen-Embedding-8B | False |
| AppsRetrieval | 0.9375 | 0.9729 | 0.3255 | 0.9722 | voyageai/voyage-4-large | False |
| CUREv1 | 0.5957 | 0.6782 | 0.5162 | 0.6694 | voyageai/voyage-4-large | False |
| ChatDoctorRetrieval | 0.7352 | 0.7722 | 0.5687 | 0.7674 | voyageai/voyage-4-large | False |
| Code1Retrieval | 0.9474 | 0.9452 | nan | 0.9474 | google/gemini-embedding-001 | False |
| DS1000Retrieval | 0.6870 | 0.7129 | nan | 0.7117 | voyageai/voyage-4-large | False |
| EnglishFinance1Retrieval | 0.7332 | 0.8428 | nan | 0.8218 | voyageai/voyage-4-large | False |
| EnglishFinance2Retrieval | 0.6740 | 0.9137 | nan | 0.9099 | voyageai/voyage-4-large | False |
| EnglishFinance3Retrieval | 0.8330 | 0.8361 | nan | 0.8509 | nvidia/NV-Embed-v2 | False |
| EnglishFinance4Retrieval | 0.5757 | 0.6241 | nan | 0.6198 | voyageai/voyage-4-large | False |
| EnglishHealthcare1Retrieval | 0.6338 | 0.6828 | nan | 0.6875 | bm25s | False |
| FinQARetrieval | 0.6464 | 0.8897 | nan | 0.8865 | voyageai/voyage-4-large | False |
| FinanceBenchRetrieval | 0.9157 | 0.9315 | nan | 0.9459 | bflhc/Octen-Embedding-8B | False |
| French1Retrieval | 0.8781 | 0.8653 | nan | 0.8884 | Cohere/Cohere-embed-v4.0 | False |
| FrenchLegal1Retrieval | 0.8696 | 0.9426 | nan | 0.9490 | bm25s | False |
| FreshStackRetrieval | 0.3979 | 0.5079 | 0.2519 | 0.5776 | bflhc/Octen-Embedding-8B | False |
| German1Retrieval | 0.9761 | 0.9797 | nan | 0.9771 | voyageai/voyage-3-large | False |
| GermanHealthcare1Retrieval | 0.8742 | 0.9123 | nan | 0.9140 | voyageai/voyage-4-large | False |
| GermanLegal1Retrieval | 0.7149 | 0.7582 | nan | 0.7554 | voyageai/voyage-4-large | False |
| HC3FinanceRetrieval | 0.7758 | 0.7739 | nan | 0.8242 | nvidia/NV-Embed-v2 | False |
| HumanEvalRetrieval | 0.9910 | 0.9936 | nan | 0.9977 | bflhc/Octen-Embedding-8B | False |
| JapaneseCode1Retrieval | 0.8650 | 0.8626 | nan | 0.8650 | google/gemini-embedding-001 | False |
| JapaneseLegal1Retrieval | 0.9228 | 0.8645 | nan | 0.9228 | google/gemini-embedding-001 | False |
| LegalQuAD | 0.6553 | 0.7496 | 0.4317 | 0.7675 | bm25s | False |
| LegalSummarization | 0.7122 | 0.7846 | 0.621 | 0.7921 | voyageai/voyage-3.5 | False |
| MBPPRetrieval | 0.9416 | 0.9608 | nan | 0.9588 | voyageai/voyage-4-large | False |
| MIRACLRetrievalHardNegatives | 0.7042 | 0.6315 | 0.5923 | 0.7305 | nvidia/llama-embed-nemotron-8b | False |
| WikiSQLRetrieval | 0.8814 | 0.9663 | nan | 0.9892 | bflhc/Octen-Embedding-8B | False |
| Average | 0.7602 | 0.8046 | 0.42 | 0.8374 | nan | - |

The model has high performance on these tasks: German1Retrieval, AppsRetrieval, MBPPRetrieval, EnglishFinance2Retrieval, FinQARetrieval, EnglishFinance1Retrieval, ChatDoctorRetrieval, GermanLegal1Retrieval, DS1000Retrieval, CUREv1, EnglishFinance4Retrieval
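
In essence, the comparison above just looks up, per task, which model currently holds the best score. A rough sketch of that logic (hypothetical data layout and variable names, not the bot's actual implementation):

scores = {  # per task: model name -> main score
    "AILAStatutes": {
        "voyageai/voyage-4-large (embed_dim=2048)": 0.5029,
        "google/gemini-embedding-001": 0.4877,
        "intfloat/multilingual-e5-large": 0.2084,
    },
}

for task, by_model in scores.items():
    best = max(by_model, key=by_model.get)  # model with the max result
    print(f"{task}: best={best} ({by_model[best]:.4f})")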


KennethEnevoldsen added the "waiting for review of implementation" label on Jan 19, 2026
KennethEnevoldsen changed the title to "Adding voyage-4-large (embed_dim=2048) results" on Jan 20, 2026
KennethEnevoldsen removed the "waiting for review of implementation" label on Jan 20, 2026
KennethEnevoldsen changed the title to "Adding voyageai/voyage-4-large (embed_dim=2048) results" on Jan 20, 2026
@KennethEnevoldsen (Contributor)

I seem to be hitting a rate limit when I try to validate the model:

# tmp.py
import mteb

model = mteb.get_model("voyageai/voyage-4-large (embed_dim=2048)")
task = mteb.get_task("AILAStatutes")
res = mteb.evaluate(model, task)
print("Results:", res.task_results[0].get_score())
 /Users/au561649/Github/mteb/.venv/bin/python /Users/au561649/Github/mteb/tmp.py
pa-<redacted API key>
Encoding sentences:   0%|          | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/au561649/Github/mteb/tmp.py", line 8, in <module>
    res = mteb.evaluate(model, task)
  File "/Users/au561649/Github/mteb/mteb/evaluate.py", line 487, in evaluate
    result = _evaluate_task(
        model=model,
    ...<6 lines>...
        num_proc=num_proc,
    )
  File "/Users/au561649/Github/mteb/mteb/evaluate.py", line 161, in _evaluate_task
    task_results[split] = task.evaluate(
                          ~~~~~~~~~~~~~^
        model,
        ^^^^^^
    ...<4 lines>...
        num_proc=num_proc,
        ^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/au561649/Github/mteb/mteb/abstasks/retrieval.py", line 327, in evaluate
    return super().evaluate(
           ~~~~~~~~~~~~~~~~^
        model,
        ^^^^^^
    ...<5 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/Users/au561649/Github/mteb/mteb/abstasks/abstask.py", line 198, in evaluate
    scores[hf_subset] = self._evaluate_subset(
                        ~~~~~~~~~~~~~~~~~~~~~^
        model,
        ^^^^^^
    ...<6 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/Users/au561649/Github/mteb/mteb/abstasks/retrieval.py", line 394, in _evaluate_subset
    results = retriever(
        search_model,
        encode_kwargs=encode_kwargs,
        num_proc=num_proc,
    )
  File "/Users/au561649/Github/mteb/mteb/_evaluators/retrieval_evaluator.py", line 70, in __call__
    return search_model.search(
           ~~~~~~~~~~~~~~~~~~~^
        queries=self.queries,
        ^^^^^^^^^^^^^^^^^^^^^
    ...<6 lines>...
        num_proc=num_proc,
        ^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/au561649/Github/mteb/mteb/models/search_wrappers.py", line 133, in search
    query_embeddings = self.model.encode(
        queries_dataloader,
    ...<4 lines>...
        **encode_kwargs,
    )
  File "/Users/au561649/Github/mteb/mteb/models/model_implementations/voyage_models.py", line 146, in encode
    return self._batched_encode(sentences, batch_size, input_type)
           ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/au561649/Github/mteb/mteb/models/model_implementations/voyage_models.py", line 180, in _batched_encode
    self._embed_func(
    ~~~~~~~~~~~~~~~~^
        texts=batch,
        ^^^^^^^^^^^^
    ...<3 lines>...
        output_dimension=self.mteb_model_meta.embed_dim,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ).embeddings
    ^
  File "/Users/au561649/Github/mteb/mteb/models/model_implementations/voyage_models.py", line 96, in wrapper
    result = func(*args, **kwargs)
  File "/Users/au561649/Github/mteb/mteb/models/model_implementations/voyage_models.py", line 63, in wrapper
    result = func(*args, **kwargs)
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/client.py", line 69, in embed
    for attempt in self.retry_controller:
                   ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 445, in __iter__
    do = self.iter(retry_state=retry_state)
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 378, in iter
    result = action(retry_state)
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 420, in exc_check
    raise retry_exc.reraise()
          ~~~~~~~~~~~~~~~~~^^
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 187, in reraise
    raise self.last_attempt.result()
          ~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/au561649/.local/share/uv/python/cpython-3.13.0-macos-aarch64-none/lib/python3.13/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ~~~~~~~~~~~~~~~~~^^
  File "/Users/au561649/.local/share/uv/python/cpython-3.13.0-macos-aarch64-none/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/client.py", line 71, in embed
    response = voyageai.Embedding.create(
        input=texts,
    ...<5 lines>...
        **self._params,
    )
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/api_resources/embedding.py", line 20, in create
    response = super().create(*args, **kwargs)
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/api_resources/api_resource.py", line 47, in create
    response = requestor.request(
        "post",
    ...<4 lines>...
        request_timeout=request_timeout,
    )
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/api_resources/api_requestor.py", line 147, in request
    resp = self._interpret_response(result)
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/api_resources/api_requestor.py", line 408, in _interpret_response
    return self._interpret_response_line(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        result.content.decode("utf-8"),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        result.status_code,
        ^^^^^^^^^^^^^^^^^^^
        result.headers,
        ^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/api_resources/api_requestor.py", line 463, in _interpret_response_line
    raise self.handle_error_response(rbody, rcode, resp.data, rheaders)
voyageai.error.RateLimitError: You have not yet added your payment method in the billing page and will have reduced rate limits of 3 RPM and 10K TPM. To unlock our standard rate limits, please add a payment method in the billing page for the appropriate organization in the user dashboard (https://dashboard.voyageai.com/). Even with payment methods entered, the free tokens (200M tokens for Voyage series 3) will still apply. After adding a payment method, you should see your rate limits increase after several minutes. See our pricing docs (https://docs.voyageai.com/docs/pricing) for the free tokens for your model.
Encoding sentences:   0%|                                                                                                                                        | 0/50 [00:19<?, ?it/s]
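
Until a payment method is added, the free tier is capped at 3 RPM, so one stopgap is to throttle calls client-side. A minimal sketch (a hypothetical helper, not part of mteb or the voyageai client):

import functools
import time

def rate_limited(calls_per_minute: int):
    """Space out calls to stay under a requests-per-minute cap."""
    min_interval = 60.0 / calls_per_minute
    last_call = [0.0]  # mutable cell so the wrapper can update it

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.monotonic()
            return func(*args, **kwargs)
        return wrapper
    return decorator

@rate_limited(calls_per_minute=3)
def embed_batch(texts):
    ...  # call the Voyage embed endpoint here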

@KennethEnevoldsen (Contributor)

@fzoll I got a few discrepancies when running checks; notably, AILAStatutes came out slightly higher than the reported score.

import mteb

model = mteb.get_model("voyageai/voyage-4-large (embed_dim=2048)")
task = mteb.get_task("AILAStatutes")
res = mteb.evaluate(model, task)
print("Results:", res.task_results[0].get_score())
# Results: 0.50293 (double-checked, not within tolerance)
# ----
# reported score: 0.4978

task = mteb.get_task("JapaneseCode1Retrieval")
res = mteb.evaluate(model, task)
print("Results:", res.task_results[0].get_score())
# Results: 0.86255 (within tolerance)
# ----
# reported score: 0.8629

task = mteb.get_task("AILACasedocs")
res = mteb.evaluate(model, task)
print("Results:", res.task_results[0].get_score())
# Results: 0.47487 (within tolerance)
# ----
# reported score: 0.4749

Can I ask you to double check the results?

@fzoll (Contributor, Author) commented Jan 26, 2026

@KennethEnevoldsen I just re-ran the evaluation and updated the results.
Previously, I ran with batch-size=1000; this time, I used the default batch size.

@KennethEnevoldsen (Contributor)

@fzoll

I reran it with batch_size=1000 and still got the same result:

import mteb

model = mteb.get_model("voyageai/voyage-4-large (embed_dim=2048)", encode_kwargs={"batch_size": 1000})
task = mteb.get_task("AILAStatutes")
res = mteb.evaluate(model, task)
print("Results:", res.task_results[0].get_score())

# Results: 0.50293 (batch_size=1000)
# ----
# reported score: 0.4978
# Results: 0.50293 (default batch size)

@KennethEnevoldsen (Contributor) commented Jan 27, 2026

Same for the other runs as well, so it seems it was something other than the batch size. Is it possible there was some sort of internal change on your side? It could also be an issue with mteb, but in that case I would like to chase down the discrepancy.

(cc @Samoed)

import mteb

model = mteb.get_model(
    "voyageai/voyage-4-large (embed_dim=2048)", encode_kwargs={"batch_size": 1000}
)
task = mteb.get_task("AILAStatutes")
res = mteb.evaluate(model, task, overwrite_strategy="always")
print("Results:", res.task_results[0].get_score())
# Results: 0.50293
# Results: 0.50293 (double-checked with default batch size)
# Results: 0.50293 (batch size 64)
# ----
# reported score: 0.50293
# originally reported score: 0.4978

task = mteb.get_task("JapaneseCode1Retrieval")
res = mteb.evaluate(model, task, overwrite_strategy="always")
print("Results:", res.task_results[0].get_score())
# Results: 0.86255
# Results: 0.86255 (default batch size)
# ----
# reported score: 0.86255
# originally reported score: 0.8629

task = mteb.get_task("AILACasedocs")
res = mteb.evaluate(model, task, overwrite_strategy="always")
print("Results:", res.task_results[0].get_score())
# Results: 0.47487 (default batch size)
# ----
# reported score: 0.47487
# originally reported score: 0.4749

@Samoed (Member) commented Jan 27, 2026

Hm, I don't think batch_size plays any role here, because you should pass it to evaluate rather than to get_model.
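
That is, the batch size only takes effect when it reaches evaluate, e.g. (mirroring the snippets above):

import mteb

model = mteb.get_model("voyageai/voyage-4-large (embed_dim=2048)")  # no encode_kwargs here
task = mteb.get_task("AILAStatutes")
res = mteb.evaluate(model, task, encode_kwargs={"batch_size": 1000})  # batch size goes here
print("Results:", res.task_results[0].get_score())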

@Samoed (Member) commented Jan 27, 2026

import mteb

for batch_size in [1, 4, 32, 128, 1000]:
    model = mteb.get_model(
        "baseline/random-encoder-baseline"
    )
    task = mteb.get_task("AILAStatutes")
    res = mteb.evaluate(model, task, overwrite_strategy="always", encode_kwargs={"batch_size": batch_size})
    print("Batch size", batch_size, "Results: ", res.task_results[0].get_score())
Batch size 1 Results:  0.09799                                                                                                                                                                                                                                                    
Batch size 4 Results:  0.09799
Batch size 32 Results:  0.09799
Batch size 128 Results:  0.09799
Batch size 1000 Results:  0.09799

Same picture with minishlab/potion-multilingual-128M

Batch size 1 Results:  0.16789
Batch size 4 Results:  0.16789
Batch size 32 Results:  0.16789
Batch size 128 Results:  0.16789
Batch size 1000 Results:  0.16789

@KennethEnevoldsen (Contributor)

Ahh yes of course - that is me being stupid, let me rerun it!

@KennethEnevoldsen (Contributor)

updated results:


import mteb

model = mteb.get_model(
    "voyageai/voyage-4-large (embed_dim=2048)",
)
task = mteb.get_task("AILAStatutes")
res = mteb.evaluate(model, task, overwrite_strategy="always", encode_kwargs={"batch_size": 1000})
print("Results:", res.task_results[0].get_score())
# Results: 0.50254
# Results: 0.50293 (double-checked with default batch size)
# ----
# reported score: 0.50293
# originally reported score: 0.4978

task = mteb.get_task("JapaneseCode1Retrieval")
res = mteb.evaluate(model, task, overwrite_strategy="always", encode_kwargs={"batch_size": 1000})
print("Results:", res.task_results[0].get_score())
# Results: 0.8629
# Results: 0.86255 (default batch size)
# ----
# reported score: 0.86255
# originally reported score: 0.8629

task = mteb.get_task("AILACasedocs")
res = mteb.evaluate(model, task, overwrite_strategy="always", encode_kwargs={"batch_size": 1000})
print("Results:", res.task_results[0].get_score())
# Results: 0.47487
# Results: 0.47487 (default batch size)
# ----
# reported score: 0.47487
# originally reported score: 0.4749

So there is some variation, but all of it is within the tolerance.

@KennethEnevoldsen (Contributor)

Yeah, so all differences at the moment are below 0.001, so I think we are good. I will just open an issue to discuss what we do in case they are not:
embeddings-benchmark/mteb#4009
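
For reference, the check applied here amounts to an absolute-difference comparison. A sketch (scores taken from the reruns above; the 0.001 cutoff is the tolerance discussed in this thread, not an mteb constant):

import math

reproduced = {"AILAStatutes": 0.50254, "JapaneseCode1Retrieval": 0.8629, "AILACasedocs": 0.47487}
reported = {"AILAStatutes": 0.50293, "JapaneseCode1Retrieval": 0.86255, "AILACasedocs": 0.47487}

for name in reported:
    diff = abs(reproduced[name] - reported[name])
    within = math.isclose(reproduced[name], reported[name], abs_tol=1e-3)
    print(f"{name}: |diff| = {diff:.5f}, within tolerance: {within}")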

Otherwise I think this is good to merge - thanks for taking the time @fzoll

KennethEnevoldsen merged commit 6332c3d into embeddings-benchmark:main on Jan 27, 2026
3 checks passed