Adding voyageai/voyage-4-large (embed_dim=2048) results #404

Merged
KennethEnevoldsen merged 5 commits into embeddings-benchmark:main from fzoll:voyage-4-large-2048d on Jan 27, 2026
Conversation

@fzoll (Contributor) commented Jan 19, 2026

Adding voyage-4-large (2048d) results

Checklist

  • My model has a model sheet, report, or similar
  • My model has a reference implementation in mteb/models/model_implementations/; this can be an API-based implementation. Instructions on how to add a model can be found here
  • The results submitted are obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly hosted, e.g., on Huggingface
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset, including training splits. If I have, I have disclosed it clearly.

github-actions bot commented Jan 19, 2026

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: voyageai/voyage-4-large (embed_dim=2048)
Tasks: AILACasedocs, AILAStatutes, AppsRetrieval, CUREv1, ChatDoctorRetrieval, Code1Retrieval, DS1000Retrieval, EnglishFinance1Retrieval, EnglishFinance2Retrieval, EnglishFinance3Retrieval, EnglishFinance4Retrieval, EnglishHealthcare1Retrieval, FinQARetrieval, FinanceBenchRetrieval, French1Retrieval, FrenchLegal1Retrieval, FreshStackRetrieval, German1Retrieval, GermanHealthcare1Retrieval, GermanLegal1Retrieval, HC3FinanceRetrieval, HumanEvalRetrieval, JapaneseCode1Retrieval, JapaneseLegal1Retrieval, LegalQuAD, LegalSummarization, MBPPRetrieval, MIRACLRetrievalHardNegatives, WikiSQLRetrieval

Results for voyageai/voyage-4-large (embed_dim=2048)

| task_name | google/gemini-embedding-001 | voyageai/voyage-4-large (embed_dim=2048) | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| AILACasedocs | 0.4833 | 0.4749 | 0.2643 | 0.6541 | bflhc/Octen-Embedding-8B | False |
| AILAStatutes | 0.4877 | 0.5029 | 0.2084 | 0.9313 | bflhc/Octen-Embedding-8B | False |
| AppsRetrieval | 0.9375 | 0.9729 | 0.3255 | 0.9722 | voyageai/voyage-4-large | False |
| CUREv1 | 0.5957 | 0.6782 | 0.5162 | 0.6694 | voyageai/voyage-4-large | False |
| ChatDoctorRetrieval | 0.7352 | 0.7722 | 0.5687 | 0.7674 | voyageai/voyage-4-large | False |
| Code1Retrieval | 0.9474 | 0.9452 | nan | 0.9474 | google/gemini-embedding-001 | False |
| DS1000Retrieval | 0.6870 | 0.7129 | nan | 0.7117 | voyageai/voyage-4-large | False |
| EnglishFinance1Retrieval | 0.7332 | 0.8428 | nan | 0.8218 | voyageai/voyage-4-large | False |
| EnglishFinance2Retrieval | 0.6740 | 0.9137 | nan | 0.9099 | voyageai/voyage-4-large | False |
| EnglishFinance3Retrieval | 0.8330 | 0.8361 | nan | 0.8509 | nvidia/NV-Embed-v2 | False |
| EnglishFinance4Retrieval | 0.5757 | 0.6241 | nan | 0.6198 | voyageai/voyage-4-large | False |
| EnglishHealthcare1Retrieval | 0.6338 | 0.6828 | nan | 0.6875 | bm25s | False |
| FinQARetrieval | 0.6464 | 0.8897 | nan | 0.8865 | voyageai/voyage-4-large | False |
| FinanceBenchRetrieval | 0.9157 | 0.9315 | nan | 0.9459 | bflhc/Octen-Embedding-8B | False |
| French1Retrieval | 0.8781 | 0.8653 | nan | 0.8884 | Cohere/Cohere-embed-v4.0 | False |
| FrenchLegal1Retrieval | 0.8696 | 0.9426 | nan | 0.9490 | bm25s | False |
| FreshStackRetrieval | 0.3979 | 0.5079 | 0.2519 | 0.5776 | bflhc/Octen-Embedding-8B | False |
| German1Retrieval | 0.9761 | 0.9797 | nan | 0.9771 | voyageai/voyage-3-large | False |
| GermanHealthcare1Retrieval | 0.8742 | 0.9123 | nan | 0.9140 | voyageai/voyage-4-large | False |
| GermanLegal1Retrieval | 0.7149 | 0.7582 | nan | 0.7554 | voyageai/voyage-4-large | False |
| HC3FinanceRetrieval | 0.7758 | 0.7739 | nan | 0.8242 | nvidia/NV-Embed-v2 | False |
| HumanEvalRetrieval | 0.9910 | 0.9936 | nan | 0.9977 | bflhc/Octen-Embedding-8B | False |
| JapaneseCode1Retrieval | 0.8650 | 0.8626 | nan | 0.8650 | google/gemini-embedding-001 | False |
| JapaneseLegal1Retrieval | 0.9228 | 0.8645 | nan | 0.9228 | google/gemini-embedding-001 | False |
| LegalQuAD | 0.6553 | 0.7496 | 0.4317 | 0.7675 | bm25s | False |
| LegalSummarization | 0.7122 | 0.7846 | 0.621 | 0.7921 | voyageai/voyage-3.5 | False |
| MBPPRetrieval | 0.9416 | 0.9608 | nan | 0.9588 | voyageai/voyage-4-large | False |
| MIRACLRetrievalHardNegatives | 0.7042 | 0.6315 | 0.5923 | 0.7305 | nvidia/llama-embed-nemotron-8b | False |
| WikiSQLRetrieval | 0.8814 | 0.9663 | nan | 0.9892 | bflhc/Octen-Embedding-8B | False |
| Average | 0.7602 | 0.8046 | 0.42 | 0.8374 | nan | - |

The model has high performance on these tasks: German1Retrieval, AppsRetrieval, MBPPRetrieval, EnglishFinance2Retrieval, FinQARetrieval, EnglishFinance1Retrieval, ChatDoctorRetrieval, GermanLegal1Retrieval, DS1000Retrieval, CUREv1, EnglishFinance4Retrieval
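
In essence, the comparison above just looks up, per task, which model currently holds the best score. A rough sketch of that logic (hypothetical data layout and variable names, not the bot's actual implementation):

scores = {  # per task: model name -> main score
    "AILAStatutes": {
        "voyageai/voyage-4-large (embed_dim=2048)": 0.5029,
        "google/gemini-embedding-001": 0.4877,
        "intfloat/multilingual-e5-large": 0.2084,
    },
}

for task, by_model in scores.items():
    best = max(by_model, key=by_model.get)  # model with the max result
    print(f"{task}: best={best} ({by_model[best]:.4f})")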


KennethEnevoldsen added the "waiting for review of implementation" label on Jan 19, 2026
KennethEnevoldsen changed the title to "Adding voyage-4-large (embed_dim=2048) results" on Jan 20, 2026
KennethEnevoldsen removed the "waiting for review of implementation" label on Jan 20, 2026
KennethEnevoldsen changed the title to "Adding voyageai/voyage-4-large (embed_dim=2048) results" on Jan 20, 2026
@KennethEnevoldsen (Contributor)

I seem to be hitting a rate limit when I try to validate the model:

# tmp.py
import mteb

model = mteb.get_model("voyageai/voyage-4-large (embed_dim=2048)")
task = mteb.get_task("AILAStatutes")
res = mteb.evaluate(model, task)
print("Results:", res.task_results[0].get_score())
 /Users/au561649/Github/mteb/.venv/bin/python /Users/au561649/Github/mteb/tmp.py
pa-<redacted API key>
Encoding sentences:   0%|          | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/au561649/Github/mteb/tmp.py", line 8, in <module>
    res = mteb.evaluate(model, task)
  File "/Users/au561649/Github/mteb/mteb/evaluate.py", line 487, in evaluate
    result = _evaluate_task(
        model=model,
    ...<6 lines>...
        num_proc=num_proc,
    )
  File "/Users/au561649/Github/mteb/mteb/evaluate.py", line 161, in _evaluate_task
    task_results[split] = task.evaluate(
                          ~~~~~~~~~~~~~^
        model,
        ^^^^^^
    ...<4 lines>...
        num_proc=num_proc,
        ^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/au561649/Github/mteb/mteb/abstasks/retrieval.py", line 327, in evaluate
    return super().evaluate(
           ~~~~~~~~~~~~~~~~^
        model,
        ^^^^^^
    ...<5 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/Users/au561649/Github/mteb/mteb/abstasks/abstask.py", line 198, in evaluate
    scores[hf_subset] = self._evaluate_subset(
                        ~~~~~~~~~~~~~~~~~~~~~^
        model,
        ^^^^^^
    ...<6 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/Users/au561649/Github/mteb/mteb/abstasks/retrieval.py", line 394, in _evaluate_subset
    results = retriever(
        search_model,
        encode_kwargs=encode_kwargs,
        num_proc=num_proc,
    )
  File "/Users/au561649/Github/mteb/mteb/_evaluators/retrieval_evaluator.py", line 70, in __call__
    return search_model.search(
           ~~~~~~~~~~~~~~~~~~~^
        queries=self.queries,
        ^^^^^^^^^^^^^^^^^^^^^
    ...<6 lines>...
        num_proc=num_proc,
        ^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/au561649/Github/mteb/mteb/models/search_wrappers.py", line 133, in search
    query_embeddings = self.model.encode(
        queries_dataloader,
    ...<4 lines>...
        **encode_kwargs,
    )
  File "/Users/au561649/Github/mteb/mteb/models/model_implementations/voyage_models.py", line 146, in encode
    return self._batched_encode(sentences, batch_size, input_type)
           ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/au561649/Github/mteb/mteb/models/model_implementations/voyage_models.py", line 180, in _batched_encode
    self._embed_func(
    ~~~~~~~~~~~~~~~~^
        texts=batch,
        ^^^^^^^^^^^^
    ...<3 lines>...
        output_dimension=self.mteb_model_meta.embed_dim,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ).embeddings
    ^
  File "/Users/au561649/Github/mteb/mteb/models/model_implementations/voyage_models.py", line 96, in wrapper
    result = func(*args, **kwargs)
  File "/Users/au561649/Github/mteb/mteb/models/model_implementations/voyage_models.py", line 63, in wrapper
    result = func(*args, **kwargs)
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/client.py", line 69, in embed
    for attempt in self.retry_controller:
                   ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 445, in __iter__
    do = self.iter(retry_state=retry_state)
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 378, in iter
    result = action(retry_state)
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 420, in exc_check
    raise retry_exc.reraise()
          ~~~~~~~~~~~~~~~~~^^
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 187, in reraise
    raise self.last_attempt.result()
          ~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/au561649/.local/share/uv/python/cpython-3.13.0-macos-aarch64-none/lib/python3.13/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ~~~~~~~~~~~~~~~~~^^
  File "/Users/au561649/.local/share/uv/python/cpython-3.13.0-macos-aarch64-none/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/client.py", line 71, in embed
    response = voyageai.Embedding.create(
        input=texts,
    ...<5 lines>...
        **self._params,
    )
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/api_resources/embedding.py", line 20, in create
    response = super().create(*args, **kwargs)
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/api_resources/api_resource.py", line 47, in create
    response = requestor.request(
        "post",
    ...<4 lines>...
        request_timeout=request_timeout,
    )
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/api_resources/api_requestor.py", line 147, in request
    resp = self._interpret_response(result)
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/api_resources/api_requestor.py", line 408, in _interpret_response
    return self._interpret_response_line(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        result.content.decode("utf-8"),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        result.status_code,
        ^^^^^^^^^^^^^^^^^^^
        result.headers,
        ^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/api_resources/api_requestor.py", line 463, in _interpret_response_line
    raise self.handle_error_response(rbody, rcode, resp.data, rheaders)
voyageai.error.RateLimitError: You have not yet added your payment method in the billing page and will have reduced rate limits of 3 RPM and 10K TPM. To unlock our standard rate limits, please add a payment method in the billing page for the appropriate organization in the user dashboard (https://dashboard.voyageai.com/). Even with payment methods entered, the free tokens (200M tokens for Voyage series 3) will still apply. After adding a payment method, you should see your rate limits increase after several minutes. See our pricing docs (https://docs.voyageai.com/docs/pricing) for the free tokens for your model.
Encoding sentences:   0%|                                                                                                                                        | 0/50 [00:19<?, ?it/s]
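
Until a payment method is added, the free tier is capped at 3 RPM, so one stopgap is to throttle calls client-side. A minimal sketch (a hypothetical helper, not part of mteb or the voyageai client):

import functools
import time

def rate_limited(calls_per_minute: int):
    """Space out calls to stay under a requests-per-minute cap."""
    min_interval = 60.0 / calls_per_minute
    last_call = [0.0]  # mutable cell so the wrapper can update it

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.monotonic()
            return func(*args, **kwargs)
        return wrapper
    return decorator

@rate_limited(calls_per_minute=3)
def embed_batch(texts):
    ...  # call the Voyage embed endpoint here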

@KennethEnevoldsen (Contributor)

@fzoll I got a few discrepancies when running checks; notably, AILAStatutes came out slightly higher than the reported score.

import mteb

model = mteb.get_model("voyageai/voyage-4-large (embed_dim=2048)")
task = mteb.get_task("AILAStatutes")
res = mteb.evaluate(model, task)
print("Results:", res.task_results[0].get_score())
# Results: 0.50293 (double-checked, not within tolerance)
# ----
# reported score: 0.4978

task = mteb.get_task("JapaneseCode1Retrieval")
res = mteb.evaluate(model, task)
print("Results:", res.task_results[0].get_score())
# Results: 0.86255 (within tolerance)
# ----
# reported score: 0.8629

task = mteb.get_task("AILACasedocs")
res = mteb.evaluate(model, task)
print("Results:", res.task_results[0].get_score())
# Results: 0.47487 (within tolerance)
# ----
# reported score: 0.4749

Can I ask you to double check the results?

@fzoll (Contributor, Author) commented Jan 26, 2026

@KennethEnevoldsen I just re-ran the evaluation and updated the results.
Previously, I ran with batch-size=1000; this time, I used the default batch size.

@KennethEnevoldsen (Contributor)

@fzoll

I reran it with batch_size=1000 and still got the same result:

import mteb

model = mteb.get_model("voyageai/voyage-4-large (embed_dim=2048)", encode_kwargs={"batch_size": 1000})
task = mteb.get_task("AILAStatutes")
res = mteb.evaluate(model, task)
print("Results:", res.task_results[0].get_score())

# Results: 0.50293 (batch_size=1000)
# ----
# reported score: 0.4978
# Results: 0.50293 (default batch size)

@KennethEnevoldsen (Contributor) commented Jan 27, 2026

Same for the other runs as well, so it seems it was something other than the batch size. Is it possible there was some sort of internal change on your side? It could also be an issue with mteb, but in that case I would like to chase down the discrepancy.

(cc @Samoed)

import mteb

model = mteb.get_model(
    "voyageai/voyage-4-large (embed_dim=2048)", encode_kwargs={"batch_size": 1000}
)
task = mteb.get_task("AILAStatutes")
res = mteb.evaluate(model, task, overwrite_strategy="always")
print("Results:", res.task_results[0].get_score())
# Results: 0.50293
# Results: 0.50293 (double-checked with default batch size)
# Results: 0.50293 (batch size 64)
# ----
# reported score: 0.50293
# originally reported score: 0.4978

task = mteb.get_task("JapaneseCode1Retrieval")
res = mteb.evaluate(model, task, overwrite_strategy="always")
print("Results:", res.task_results[0].get_score())
# Results: 0.86255
# Results: 0.86255 (default batch size)
# ----
# reported score: 0.86255
# originally reported score: 0.8629

task = mteb.get_task("AILACasedocs")
res = mteb.evaluate(model, task, overwrite_strategy="always")
print("Results:", res.task_results[0].get_score())
# Results: 0.47487 (default batch size)
# ----
# reported score: 0.47487
# originally reported score: 0.4749

@Samoed (Member) commented Jan 27, 2026

Hm, I don't think batch_size plays any role here, because you should pass it to evaluate rather than to get_model.
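
That is, the batch size only takes effect when it reaches evaluate, e.g. (mirroring the snippets above):

import mteb

model = mteb.get_model("voyageai/voyage-4-large (embed_dim=2048)")  # no encode_kwargs here
task = mteb.get_task("AILAStatutes")
res = mteb.evaluate(model, task, encode_kwargs={"batch_size": 1000})  # batch size goes here
print("Results:", res.task_results[0].get_score())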

@Samoed (Member) commented Jan 27, 2026

import mteb

for batch_size in [1, 4, 32, 128, 1000]:
    model = mteb.get_model(
        "baseline/random-encoder-baseline"
    )
    task = mteb.get_task("AILAStatutes")
    res = mteb.evaluate(model, task, overwrite_strategy="always", encode_kwargs={"batch_size": batch_size})
    print("Batch size", batch_size, "Results: ", res.task_results[0].get_score())
Batch size 1 Results:  0.09799                                                                                                                                                                                                                                                    
Batch size 4 Results:  0.09799
Batch size 32 Results:  0.09799
Batch size 128 Results:  0.09799
Batch size 1000 Results:  0.09799

Same picture with minishlab/potion-multilingual-128M

Batch size 1 Results:  0.16789
Batch size 4 Results:  0.16789
Batch size 32 Results:  0.16789
Batch size 128 Results:  0.16789
Batch size 1000 Results:  0.16789

@KennethEnevoldsen (Contributor)

Ahh yes of course - that is me being stupid, let me rerun it!

@KennethEnevoldsen (Contributor)

updated results:


import mteb

model = mteb.get_model(
    "voyageai/voyage-4-large (embed_dim=2048)",
)
task = mteb.get_task("AILAStatutes")
res = mteb.evaluate(model, task, overwrite_strategy="always", encode_kwargs={"batch_size": 1000})
print("Results:", res.task_results[0].get_score())
# Results: 0.50254
# Results: 0.50293 (double-checked with default batch size)
# ----
# reported score: 0.50293
# originally reported score: 0.4978

task = mteb.get_task("JapaneseCode1Retrieval")
res = mteb.evaluate(model, task, overwrite_strategy="always", encode_kwargs={"batch_size": 1000})
print("Results:", res.task_results[0].get_score())
# Results: 0.8629
# Results: 0.86255 (default batch size)
# ----
# reported score: 0.86255
# originally reported score: 0.8629

task = mteb.get_task("AILACasedocs")
res = mteb.evaluate(model, task, overwrite_strategy="always", encode_kwargs={"batch_size": 1000})
print("Results:", res.task_results[0].get_score())
# Results: 0.47487
# Results: 0.47487 (default batch size)
# ----
# reported score: 0.47487
# originally reported score: 0.4749

So there is some variation, but all of it is within the tolerance.

@KennethEnevoldsen (Contributor)

Yeah, so all differences at the moment are below 0.001, so I think we are good. I will just open an issue to discuss what we do in case they are not:
embeddings-benchmark/mteb#4009
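
For reference, the check applied here amounts to an absolute-difference comparison. A sketch (scores taken from the reruns above; the 0.001 cutoff is the tolerance discussed in this thread, not an mteb constant):

import math

reproduced = {"AILAStatutes": 0.50254, "JapaneseCode1Retrieval": 0.8629, "AILACasedocs": 0.47487}
reported = {"AILAStatutes": 0.50293, "JapaneseCode1Retrieval": 0.86255, "AILACasedocs": 0.47487}

for name in reported:
    diff = abs(reproduced[name] - reported[name])
    within = math.isclose(reproduced[name], reported[name], abs_tol=1e-3)
    print(f"{name}: |diff| = {diff:.5f}, within tolerance: {within}")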

Otherwise I think this is good to merge - thanks for taking the time @fzoll

KennethEnevoldsen merged commit 6332c3d into embeddings-benchmark:main on Jan 27, 2026
3 checks passed