Adding voyageai/voyage-4-large (embed_dim=2048) results #404

KennethEnevoldsen merged 5 commits into embeddings-benchmark:main

Conversation
Model Results Comparison

Reference models: google/gemini-embedding-001, intfloat/multilingual-e5-large
| task_name | google/gemini-embedding-001 | voyageai/voyage-4-large (embed_dim=2048) | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| AILACasedocs | 0.4833 | 0.4749 | 0.2643 | 0.6541 | bflhc/Octen-Embedding-8B | False |
| AILAStatutes | 0.4877 | 0.5029 | 0.2084 | 0.9313 | bflhc/Octen-Embedding-8B | False |
| AppsRetrieval | 0.9375 | 0.9729 | 0.3255 | 0.9722 | voyageai/voyage-4-large | False |
| CUREv1 | 0.5957 | 0.6782 | 0.5162 | 0.6694 | voyageai/voyage-4-large | False |
| ChatDoctorRetrieval | 0.7352 | 0.7722 | 0.5687 | 0.7674 | voyageai/voyage-4-large | False |
| Code1Retrieval | 0.9474 | 0.9452 | nan | 0.9474 | google/gemini-embedding-001 | False |
| DS1000Retrieval | 0.6870 | 0.7129 | nan | 0.7117 | voyageai/voyage-4-large | False |
| EnglishFinance1Retrieval | 0.7332 | 0.8428 | nan | 0.8218 | voyageai/voyage-4-large | False |
| EnglishFinance2Retrieval | 0.6740 | 0.9137 | nan | 0.9099 | voyageai/voyage-4-large | False |
| EnglishFinance3Retrieval | 0.8330 | 0.8361 | nan | 0.8509 | nvidia/NV-Embed-v2 | False |
| EnglishFinance4Retrieval | 0.5757 | 0.6241 | nan | 0.6198 | voyageai/voyage-4-large | False |
| EnglishHealthcare1Retrieval | 0.6338 | 0.6828 | nan | 0.6875 | bm25s | False |
| FinQARetrieval | 0.6464 | 0.8897 | nan | 0.8865 | voyageai/voyage-4-large | False |
| FinanceBenchRetrieval | 0.9157 | 0.9315 | nan | 0.9459 | bflhc/Octen-Embedding-8B | False |
| French1Retrieval | 0.8781 | 0.8653 | nan | 0.8884 | Cohere/Cohere-embed-v4.0 | False |
| FrenchLegal1Retrieval | 0.8696 | 0.9426 | nan | 0.9490 | bm25s | False |
| FreshStackRetrieval | 0.3979 | 0.5079 | 0.2519 | 0.5776 | bflhc/Octen-Embedding-8B | False |
| German1Retrieval | 0.9761 | 0.9797 | nan | 0.9771 | voyageai/voyage-3-large | False |
| GermanHealthcare1Retrieval | 0.8742 | 0.9123 | nan | 0.9140 | voyageai/voyage-4-large | False |
| GermanLegal1Retrieval | 0.7149 | 0.7582 | nan | 0.7554 | voyageai/voyage-4-large | False |
| HC3FinanceRetrieval | 0.7758 | 0.7739 | nan | 0.8242 | nvidia/NV-Embed-v2 | False |
| HumanEvalRetrieval | 0.9910 | 0.9936 | nan | 0.9977 | bflhc/Octen-Embedding-8B | False |
| JapaneseCode1Retrieval | 0.8650 | 0.8626 | nan | 0.8650 | google/gemini-embedding-001 | False |
| JapaneseLegal1Retrieval | 0.9228 | 0.8645 | nan | 0.9228 | google/gemini-embedding-001 | False |
| LegalQuAD | 0.6553 | 0.7496 | 0.4317 | 0.7675 | bm25s | False |
| LegalSummarization | 0.7122 | 0.7846 | 0.6210 | 0.7921 | voyageai/voyage-3.5 | False |
| MBPPRetrieval | 0.9416 | 0.9608 | nan | 0.9588 | voyageai/voyage-4-large | False |
| MIRACLRetrievalHardNegatives | 0.7042 | 0.6315 | 0.5923 | 0.7305 | nvidia/llama-embed-nemotron-8b | False |
| WikiSQLRetrieval | 0.8814 | 0.9663 | nan | 0.9892 | bflhc/Octen-Embedding-8B | False |
| Average | 0.7602 | 0.8046 | 0.4200 | 0.8374 | - | - |
The model shows high performance on these tasks: German1Retrieval, AppsRetrieval, MBPPRetrieval, EnglishFinance2Retrieval, FinQARetrieval, EnglishFinance1Retrieval, ChatDoctorRetrieval, GermanLegal1Retrieval, DS1000Retrieval, CUREv1, EnglishFinance4Retrieval.
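For reference, a single row of the table can be re-checked with the same pattern used in the validation snippets later in this thread. A minimal sketch (requires `VOYAGE_API_KEY`; API-side scores may drift slightly between runs):

```python
import mteb

# The registered model name encodes the output dimension used for these results.
meta = mteb.get_model_meta("voyageai/voyage-4-large (embed_dim=2048)")
print(meta.embed_dim)  # 2048

# Re-run one task from the table above.
model = mteb.get_model("voyageai/voyage-4-large (embed_dim=2048)")
task = mteb.get_task("AILAStatutes")
res = mteb.evaluate(model, task)
print("Results:", res.task_results[0].get_score())
```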
I seem to be getting a rate limit when I try to validate the model:

```python
# tmp.py
import mteb

model = mteb.get_model("voyageai/voyage-4-large (embed_dim=2048)")
task = mteb.get_task("AILAStatutes")
res = mteb.evaluate(model, task)
print("Results:", res.task_results[0].get_score())
```

```
/Users/au561649/Github/mteb/.venv/bin/python /Users/au561649/Github/mteb/tmp.py
pa-AzW3IkGcU3rxPOjjUrrdIvx-VA6X94pRsKE7n3e-9Ho
Encoding sentences: 0%| | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/Users/au561649/Github/mteb/tmp.py", line 8, in <module>
# ----
File "/Users/au561649/Github/mteb/mteb/evaluate.py", line 487, in evaluate
result = _evaluate_task(
model=model,
...<6 lines>...
num_proc=num_proc,
)
File "/Users/au561649/Github/mteb/mteb/evaluate.py", line 161, in _evaluate_task
task_results[split] = task.evaluate(
~~~~~~~~~~~~~^
model,
^^^^^^
...<4 lines>...
num_proc=num_proc,
^^^^^^^^^^^^^^^^^^
)
^
File "/Users/au561649/Github/mteb/mteb/abstasks/retrieval.py", line 327, in evaluate
return super().evaluate(
~~~~~~~~~~~~~~~~^
model,
^^^^^^
...<5 lines>...
**kwargs,
^^^^^^^^^
)
^
File "/Users/au561649/Github/mteb/mteb/abstasks/abstask.py", line 198, in evaluate
scores[hf_subset] = self._evaluate_subset(
~~~~~~~~~~~~~~~~~~~~~^
model,
^^^^^^
...<6 lines>...
**kwargs,
^^^^^^^^^
)
^
File "/Users/au561649/Github/mteb/mteb/abstasks/retrieval.py", line 394, in _evaluate_subset
results = retriever(
search_model,
encode_kwargs=encode_kwargs,
num_proc=num_proc,
)
File "/Users/au561649/Github/mteb/mteb/_evaluators/retrieval_evaluator.py", line 70, in __call__
return search_model.search(
~~~~~~~~~~~~~~~~~~~^
queries=self.queries,
^^^^^^^^^^^^^^^^^^^^^
...<6 lines>...
num_proc=num_proc,
^^^^^^^^^^^^^^^^^^
)
^
File "/Users/au561649/Github/mteb/mteb/models/search_wrappers.py", line 133, in search
query_embeddings = self.model.encode(
queries_dataloader,
...<4 lines>...
**encode_kwargs,
)
File "/Users/au561649/Github/mteb/mteb/models/model_implementations/voyage_models.py", line 146, in encode
return self._batched_encode(sentences, batch_size, input_type)
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/au561649/Github/mteb/mteb/models/model_implementations/voyage_models.py", line 180, in _batched_encode
self._embed_func(
~~~~~~~~~~~~~~~~^
texts=batch,
^^^^^^^^^^^^
...<3 lines>...
output_dimension=self.mteb_model_meta.embed_dim,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
).embeddings
^
File "/Users/au561649/Github/mteb/mteb/models/model_implementations/voyage_models.py", line 96, in wrapper
result = func(*args, **kwargs)
File "/Users/au561649/Github/mteb/mteb/models/model_implementations/voyage_models.py", line 63, in wrapper
result = func(*args, **kwargs)
File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/client.py", line 69, in embed
for attempt in self.retry_controller:
^^^^^^^^^^^^^^^^^^^^^
File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 445, in __iter__
do = self.iter(retry_state=retry_state)
File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 378, in iter
result = action(retry_state)
File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 420, in exc_check
raise retry_exc.reraise()
~~~~~~~~~~~~~~~~~^^
File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/tenacity/__init__.py", line 187, in reraise
raise self.last_attempt.result()
~~~~~~~~~~~~~~~~~~~~~~~~^^
File "/Users/au561649/.local/share/uv/python/cpython-3.13.0-macos-aarch64-none/lib/python3.13/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
~~~~~~~~~~~~~~~~~^^
File "/Users/au561649/.local/share/uv/python/cpython-3.13.0-macos-aarch64-none/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/client.py", line 71, in embed
response = voyageai.Embedding.create(
input=texts,
...<5 lines>...
**self._params,
)
File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/api_resources/embedding.py", line 20, in create
response = super().create(*args, **kwargs)
File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/api_resources/api_resource.py", line 47, in create
response = requestor.request(
"post",
...<4 lines>...
request_timeout=request_timeout,
)
File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/api_resources/api_requestor.py", line 147, in request
resp = self._interpret_response(result)
File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/api_resources/api_requestor.py", line 408, in _interpret_response
return self._interpret_response_line(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
result.content.decode("utf-8"),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
result.status_code,
^^^^^^^^^^^^^^^^^^^
result.headers,
^^^^^^^^^^^^^^^
)
^
File "/Users/au561649/Github/mteb/.venv/lib/python3.13/site-packages/voyageai/api_resources/api_requestor.py", line 463, in _interpret_response_line
raise self.handle_error_response(rbody, rcode, resp.data, rheaders)
voyageai.error.RateLimitError: You have not yet added your payment method in the billing page and will have reduced rate limits of 3 RPM and 10K TPM. To unlock our standard rate limits, please add a payment method in the billing page for the appropriate organization in the user dashboard (https://dashboard.voyageai.com/). Even with payment methods entered, the free tokens (200M tokens for Voyage series 3) will still apply. After adding a payment method, you should see your rate limits increase after several minutes. See our pricing docs (https://docs.voyageai.com/docs/pricing) for the free tokens for your model.
Encoding sentences: 0%| | 0/50 [00:19<?, ?it/s]
```
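Not part of the original run, but for anyone hitting the same wall: the free tier is capped at 3 RPM / 10K TPM until a payment method is added, so a client-side throttle along these lines can keep validation runs alive. This is a hedged sketch, assuming a recent `voyageai` client that supports `output_dimension`; the function name and batch size are illustrative.

```python
# Illustrative workaround (not used in this PR): stay under the free-tier
# limit of 3 requests/minute reported in the error above.
import time

import voyageai

client = voyageai.Client()  # reads VOYAGE_API_KEY from the environment


def embed_throttled(texts, model="voyage-4-large", output_dimension=2048,
                    requests_per_minute=3, batch_size=8):
    """Embed `texts` in small batches, sleeping between API calls."""
    delay = 60.0 / requests_per_minute
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        result = client.embed(batch, model=model, output_dimension=output_dimension)
        embeddings.extend(result.embeddings)
        time.sleep(delay)  # crude rate limiting between requests
    return embeddings
```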
@fzoll I got a few discrepancies when running checks. Notably:

```python
import mteb

model = mteb.get_model("voyageai/voyage-4-large (embed_dim=2048)")
task = mteb.get_task("AILAStatutes")
res = mteb.evaluate(model, task)
print("Results:", res.task_results[0].get_score())
# Results: 0.50293 (double-checked, not within tolerance)
# ----
# reported score: 0.4978
task = mteb.get_task("JapaneseCode1Retrieval")
res = mteb.evaluate(model, task)
print("Results:", res.task_results[0].get_score())
# Results: 0.86255 (within tolerance)
# ----
# reported score: 0.8629
task = mteb.get_task("AILACasedocs")
res = mteb.evaluate(model, task)
print("Results:", res.task_results[0].get_score())
# Results: 0.47487 (within tolerance)
# ----
# reported score: 0.4749
```

Can I ask you to double-check the results?
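For context, the check being applied here is essentially an absolute-difference comparison against the reported scores. A minimal sketch (the helper is illustrative; the 1e-3 threshold is taken from the "below 0.001" remark further down):

```python
# Illustrative tolerance check over the three tasks re-run above.
def within_tolerance(reproduced: float, reported: float, tol: float = 1e-3) -> bool:
    return abs(reproduced - reported) <= tol

checks = {
    "AILAStatutes": (0.50293, 0.4978),
    "JapaneseCode1Retrieval": (0.86255, 0.8629),
    "AILACasedocs": (0.47487, 0.4749),
}
for task_name, (reproduced, reported) in checks.items():
    status = "ok" if within_tolerance(reproduced, reported) else "OUT OF TOLERANCE"
    print(f"{task_name}: diff={abs(reproduced - reported):.5f} -> {status}")
```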
@KennethEnevoldsen I just re-ran the evaluation and updated the results.
Reran it with `batch_size=1000` and still get the same result:
Same for the other runs as well. Seems like it might have been something other than the batch size. Is it possible that there was some sort of internal change? It could also have been an issue with something else (cc @Samoed):

```python
import mteb

model = mteb.get_model(
"voyageai/voyage-4-large (embed_dim=2048)", encode_kwargs={"batch_size": 1000}
)
task = mteb.get_task("AILAStatutes")
res = mteb.evaluate(model, task, overwrite_strategy="always")
print("Results:", res.task_results[0].get_score())
# Results: 0.50293
# Results: 0.50293 (double-checked with default batch size)
# Results: 0.50293 (batch size 64)
# ----
# reported score: 0.50293
# originally reported score: 0.4978
task = mteb.get_task("JapaneseCode1Retrieval")
res = mteb.evaluate(model, task, overwrite_strategy="always")
print("Results:", res.task_results[0].get_score())
# Results: 0.86255
# Results: 0.86255 (default batch size)
# ----
# reported score: 0.86255
# originally reported score: 0.8629
task = mteb.get_task("AILACasedocs")
res = mteb.evaluate(model, task, overwrite_strategy="always")
print("Results:", res.task_results[0].get_score())
# Results: 0.47487 (default batch size)
# ----
# reported score: 0.47487
# originally reported score: 0.4749
```
Hm, I don't think that `batch_size` plays any role, because you should pass it in `encode_kwargs` to `mteb.evaluate`:

```python
import mteb

for batch_size in [1, 4, 32, 128, 1000]:
    model = mteb.get_model("baseline/random-encoder-baseline")
    task = mteb.get_task("AILAStatutes")
    res = mteb.evaluate(
        model, task, overwrite_strategy="always", encode_kwargs={"batch_size": batch_size}
    )
    print("Batch size", batch_size, "Results:", res.task_results[0].get_score())
```

Same picture with
Ahh yes of course - that is me being stupid, let me rerun it!
Updated results: so some variation, but now within the tolerance.
Yeah, so all differences at the moment are below 0.001, so I think we are good. I will just make an issue to discuss what we do in case they are not. Otherwise I think this is good to merge. Thanks for taking the time, @fzoll!
Adding voyage-4-large (2048d) results

Checklist

- The model is added to `mteb/models/model_implementations/`; this can be as an API. Instructions on how to add a model can be found here.