add RTEB & MTEB results of Octen-Embedding-8B #374

Merged
Samoed merged 2 commits into embeddings-benchmark:main from bflhc:feature/add_octen_results
Dec 24, 2025
@bflhc bflhc commented Dec 23, 2025

Checklist

  • My model has a model sheet, report, or similar
  • My model has a reference implementation in mteb/models/model_implementations/, this can be as an API. Instruction on how to add a model can be found here
  • The results submitted are obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly on e.g., Huggingface
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset including training splits. If I have, I have disclosed it clearly.

@github-actions

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: bflhc/Octen-Embedding-8B
Tasks: AILACasedocs, AILAStatutes, AfriSentiClassification, AlloProfClusteringS2S.v2, AlloprofReranking, AmazonCounterfactualClassification, AppsRetrieval, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArmenianParaphrasePC, BUCC.v2, BelebeleRetrieval, BibleNLPBitextMining, BigPatentClustering.v2, BiorxivClusteringP2P.v2, BornholmBitextMining, BrazilianToxicTweetsClassification, BulgarianStoreReviewSentimentClassfication, CEDRClassification, CLSClusteringP2P.v2, CSFDSKMovieReviewSentimentClassification, CTKFactsNLI, CUREv1, CataloniaTweetClassification, ChatDoctorRetrieval, Core17InstructionRetrieval, CovidRetrieval, CyrillicTurkicLangClassification, CzechProductReviewSentimentClassification, DBpediaClassification, DS1000Retrieval, DalajClassification, DiaBlaBitextMining, EstonianValenceClassification, FaroeseSTS, FilipinoShopeeReviewsClassification, FinParaSTS, FinQARetrieval, FinanceBenchRetrieval, FinancialPhrasebankClassification, FloresBitextMining, FreshStackRetrieval, GermanSTSBenchmark, GreekLegalCodeClassification, GujaratiNewsClassification, HALClusteringS2S.v2, HC3FinanceRetrieval, HagridRetrieval, HumanEvalRetrieval, IN22GenBitextMining, IndicCrosslingualSTS, IndicGenBenchFloresBitextMining, IndicLangClassification, IndonesianIdClickbaitClassification, IsiZuluNewsClassification, ItaCaseholdClassification, JSICK, KorHateSpeechMLClassification, KorSarcasmClassification, KurdishSentimentClassification, LEMBPasskeyRetrieval, LegalBenchCorporateLobbying, LegalQuAD, LegalSummarization, MBPPRetrieval, MIRACLRetrievalHardNegatives, MLQARetrieval, MacedonianTweetSentimentClassification, MalteseNewsClassification, MasakhaNEWSClassification, MasakhaNEWSClusteringS2S, MassiveIntentClassification, MedrxivClusteringP2P.v2, MultiEURLEXMultilabelClassification, MultiHateClassification, NTREXBitextMining, NepaliNewsClassification, News21InstructionRetrieval, NollySentiBitextMining, NordicLangClassification, NorwegianCourtsBitextMining, 
NusaParagraphEmotionClassification, NusaTranslationBitextMining, NusaX-senti, NusaXBitextMining, OdiaNewsClassification, OpusparcusPC, PAC, PawsXPairClassification, PlscClusteringP2P.v2, PoemSentimentClassification, PolEmo2.0-OUT, PpcPC, PunjabiNewsClassification, RTE3, Robust04InstructionRetrieval, RomaniBibleClustering, RuBQReranking, SCIDOCS, SIB200ClusteringS2S, SICK-R, STS12, STS13, STS14, STS15, STS17, STS22.v2, STSB, STSBenchmark, STSES, ScalaClassification, SemRel24STS, SentimentAnalysisHindi, SinhalaNewsClassification, SiswatiNewsClassification, SlovakMovieReviewSentimentClassification, SpartQA, SprintDuplicateQuestions, StackExchangeClustering.v2, StackOverflowQA, StatcanDialogueDatasetRetrieval, SwahiliNewsClassification, SwednClusteringP2P, SwissJudgementClassification, T2Reranking, TERRa, TRECCOVID, Tatoeba, TempReasonL1, ToxicConversationsClassification, TswanaNewsClassification, TweetTopicSingleClassification, TwitterHjerneRetrieval, TwitterURLCorpus, VoyageMMarcoReranking, WebLINXCandidatesReranking, WikiCitiesClustering, WikiClusteringP2P.v2, WikiSQLRetrieval, WikipediaRerankingMultilingual, WikipediaRetrievalMultilingual, WinoGrande, XNLI, indonli

Results for bflhc/Octen-Embedding-8B

task_name bflhc/Octen-Embedding-8B google/gemini-embedding-001 intfloat/multilingual-e5-large Max result Model with max result
AILACasedocs 0.6109 0.4833 0.2643 0.4833 google/gemini-embedding-001
AILAStatutes 0.9085 0.4877 0.2084 0.9003 Mira190/Euler-Legal-Embedding-V1
AfriSentiClassification 0.4599 0.5356 0.455 0.5688 tencent/KaLM-Embedding-Gemma3-12B-2511
AlloProfClusteringS2S.v2 0.5860 0.5636 0.3515 0.5965 Qwen/Qwen3-Embedding-8B
AlloprofReranking 0.8540 0.8177 0.6944 0.8513 Qwen/Qwen3-Embedding-4B
AmazonCounterfactualClassification 0.9249 0.8820 0.7935 0.9696 GeoGPT-Research-Project/GeoEmbedding
AppsRetrieval 0.9206 0.9375 0.3255 0.9463 voyageai/voyage-3-large
ArXivHierarchicalClusteringP2P 0.6472 0.6492 0.5569 0.6869 NovaSearch/jasper_en_vision_language_v1
ArXivHierarchicalClusteringS2S 0.6455 0.6384 0.5621 0.6548 Qwen/Qwen3-Embedding-8B
ArguAna 0.7831 0.8644 0.5438 0.8979 voyageai/voyage-3-m-exp
ArmenianParaphrasePC 0.9680 0.9689 0.9493 0.9703 tencent/KaLM-Embedding-Gemma3-12B-2511
BUCC.v2 0.9893 0.9899 0.9878 0.9902 GritLM/GritLM-7B
BelebeleRetrieval 0.8899 0.9073 0.7791 0.9380 clips/e5-base-trm-nl
BibleNLPBitextMining 0.2633 0.2072 0.1665 0.9899 deepvk/USER-bge-m3
BigPatentClustering.v2 0.3146 0.3806 0.3466 0.4453 Salesforce/SFR-Embedding-2_R
BiorxivClusteringP2P.v2 0.5088 0.5386 0.3778 0.8417 codefuse-ai/F2LLM-4B
BornholmBitextMining 0.7603 0.5169 0.4416 0.7633 Qwen/Qwen3-Embedding-8B
BrazilianToxicTweetsClassification 0.2100 0.2802 0.2123 0.3157 tencent/KaLM-Embedding-Gemma3-12B-2511
BulgarianStoreReviewSentimentClassfication 0.6901 0.7813 0.7093 0.8044 Linq-AI-Research/Linq-Embed-Mistral
CEDRClassification 0.5256 0.5742 0.4484 0.7301 sergeyzh/BERTA
CLSClusteringP2P.v2 0.7489 0.4268 0.4037 0.7572 Qwen/Qwen3-Embedding-8B
CSFDSKMovieReviewSentimentClassification 0.4966 0.4938 0.3664 0.6456 tencent/KaLM-Embedding-Gemma3-12B-2511
CTKFactsNLI 0.8667 0.8759 0.8096 0.8993 omarelshehy/arabic-english-sts-matryoshka
CUREv1 0.5949 0.5957 0.5162 0.6289 nvidia/NV-Embed-v2
CataloniaTweetClassification 0.5265 0.5451 0.504 0.7790 Bytedance/Seed1.6-embedding-1215
ChatDoctorRetrieval 0.7339 0.7352 0.5687 0.7390 voyageai/voyage-3-large
Core17InstructionRetrieval 0.1180 0.0769 -0.0162 0.1648 jhu-clsp/FollowIR-7B
CovidRetrieval 0.8623 0.7913 0.7561 0.9606 TencentBAC/Conan-embedding-v2
CyrillicTurkicLangClassification 0.6366 0.9530 0.4085 0.9905 tencent/KaLM-Embedding-Gemma3-12B-2511
CzechProductReviewSentimentClassification 0.5786 0.6816 0.5742 0.7667 Bytedance/Seed1.6-embedding-1215
DBpediaClassification 0.9664 0.9476 0.8828 0.9926 Qwen/Qwen3-Embedding-8B
DS1000Retrieval 0.6988 0.6870 nan 0.6897 voyageai/voyage-3-large
DalajClassification 0.5105 0.5047 0.5001 0.6213 tencent/KaLM-Embedding-Gemma3-12B-2511
DiaBlaBitextMining 0.8686 0.8723 0.8483 0.8865 nvidia/llama-embed-nemotron-8b
EstonianValenceClassification 0.4342 0.5352 0.4358 0.6456 tencent/KaLM-Embedding-Gemma3-12B-2511
FaroeseSTS 0.8671 0.8612 0.7239 0.9739 Gameselo/STS-multilingual-mpnet-base-v2
FilipinoShopeeReviewsClassification 0.3990 0.4845 0.3527 0.5159 tencent/KaLM-Embedding-Gemma3-12B-2511
FinParaSTS 0.3456 0.2860 0.2666 0.3399 Qwen/Qwen3-Embedding-4B
FinQARetrieval 0.7842 0.6464 nan 0.8552 voyageai/voyage-3.5 (output_dtype=int8)
FinanceBenchRetrieval 0.9306 0.9157 nan 0.9298 voyageai/voyage-3-large
FinancialPhrasebankClassification 0.8493 0.8864 0.8404 0.9515 Qwen/Qwen3-Embedding-8B
FloresBitextMining 0.7639 0.8371 0.8108 0.8596 intfloat/multilingual-e5-large-instruct
FreshStackRetrieval 0.5126 0.3979 0.2519 0.4438 voyageai/voyage-3-large
GermanSTSBenchmark 0.9003 0.8809 0.8527 0.9541 Gameselo/STS-multilingual-mpnet-base-v2
GreekLegalCodeClassification 0.5433 0.4376 0.3713 0.8052 Bytedance/Seed1.6-embedding-1215
GujaratiNewsClassification 0.9079 0.9205 0.7674 0.9343 Bytedance/Seed1.6-embedding-1215
HALClusteringS2S.v2 0.3089 0.3200 0.2261 0.3228 Qwen/Qwen3-Embedding-8B
HC3FinanceRetrieval 0.7395 0.7758 nan 0.8242 nvidia/NV-Embed-v2
HagridRetrieval 0.9875 0.9931 0.9891 0.9931 google/gemini-embedding-001
HumanEvalRetrieval 0.9977 0.9910 nan 0.9945 voyageai/voyage-3-large
IN22GenBitextMining 0.8159 0.9375 0.7675 0.9375 google/gemini-embedding-001
IndicCrosslingualSTS 0.6188 0.6287 0.4387 0.8477 Gameselo/STS-multilingual-mpnet-base-v2
IndicGenBenchFloresBitextMining 0.9413 0.9677 0.8875 0.9881 Sailesh97/Hinvec
IndicLangClassification 0.3076 0.8769 0.2025 0.9930 Bytedance/Seed1.6-embedding-1215
IndonesianIdClickbaitClassification 0.6003 0.6700 0.6122 0.7560 nvidia/llama-embed-nemotron-8b
IsiZuluNewsClassification 0.2771 0.4053 0.3241 0.4053 google/gemini-embedding-001
ItaCaseholdClassification 0.7163 0.7330 0.6679 0.9439 bigscience/sgpt-bloom-7b1-msmarco
JSICK 0.8963 0.8499 0.7983 0.8938 Qwen/Qwen3-Embedding-8B
KorHateSpeechMLClassification 0.1162 0.1769 0.1049 0.7625 Bytedance/Seed1.6-embedding-1215
KorSarcasmClassification 0.5968 0.6051 0.5679 0.6479 tencent/KaLM-Embedding-Gemma3-12B-2511
KurdishSentimentClassification 0.7860 0.8639 0.7708 0.9403 Bytedance/Seed1.6-embedding-1215
LEMBPasskeyRetrieval 0.8900 0.3850 0.3825 1.0000 tencent/KaLM-Embedding-Gemma3-12B-2511
LegalBenchCorporateLobbying 0.9549 0.9598 0.8972 0.9696 voyageai/voyage-3-large
LegalQuAD 0.7174 0.6553 0.4317 0.7675 bm25s
LegalSummarization 0.7653 0.7122 0.621 0.7921 voyageai/voyage-3.5
MBPPRetrieval 0.9243 0.9416 nan 0.9416 google/gemini-embedding-001
MIRACLRetrievalHardNegatives 0.6702 0.7042 0.6675 0.7305 nvidia/llama-embed-nemotron-8b
MLQARetrieval 0.8127 0.8416 0.7566 0.8416 google/gemini-embedding-001
MacedonianTweetSentimentClassification 0.6850 0.7183 0.6192 0.7547 Qwen/Qwen3-Embedding-4B
MalteseNewsClassification 0.3646 0.3738 0.2533 0.6938 Bytedance/Seed1.6-embedding-1215
MasakhaNEWSClassification 0.8316 0.8355 0.7754 0.9009 Bytedance/Seed1.6-embedding-1215
MasakhaNEWSClusteringS2S 0.5670 0.5745 0.3804 0.7365 Bytedance/Seed1.6-embedding-1215
MassiveIntentClassification 0.7889 0.8192 0.674 0.9194 voyageai/voyage-3-m-exp
MedrxivClusteringP2P.v2 0.4564 0.4716 0.3515 0.7199 codefuse-ai/F2LLM-4B
MultiEURLEXMultilabelClassification 0.0449 0.0528 0.0516 0.0968 Bytedance/Seed1.6-embedding-1215
MultiHateClassification 0.7533 0.7247 0.6357 0.8374 tencent/KaLM-Embedding-Gemma3-12B-2511
NTREXBitextMining 0.8889 0.9364 0.914 0.9456 tencent/KaLM-Embedding-Gemma3-12B-2511
NepaliNewsClassification 0.9727 0.9814 0.8847 0.9817 tencent/KaLM-Embedding-Gemma3-12B-2511
News21InstructionRetrieval 0.0566 0.1026 -0.0006 0.1145 google/embeddinggemma-300m
NollySentiBitextMining 0.6380 0.6871 0.675 0.8083 nvidia/llama-embed-nemotron-8b
NordicLangClassification 0.9192 0.8597 0.8015 0.9384 tencent/KaLM-Embedding-Gemma3-12B-2511
NorwegianCourtsBitextMining 0.9357 0.9342 0.9404 0.9447 OrdalieTech/Solon-embeddings-large-0.1
NusaParagraphEmotionClassification 0.5584 0.5638 0.4166 0.8374 Bytedance/Seed1.6-embedding-1215
NusaTranslationBitextMining 0.9133 0.7752 0.672 0.9222 Qwen/Qwen3-Embedding-8B
NusaX-senti 0.7068 0.8031 0.7055 0.8482 Bytedance/Seed1.6-embedding-1215
NusaXBitextMining 0.8780 0.8252 0.7267 0.9056 Bytedance/Seed1.6-embedding-1215
OdiaNewsClassification 0.9179 0.9184 0.8001 0.9715 Bytedance/Seed1.6-embedding-1215
OpusparcusPC 0.9617 0.9662 0.948 1.0000 BAAI/bge-multilingual-gemma2
PAC 0.6391 0.7168 0.7033 0.8811 Bytedance/Seed1.6-embedding-1215
PawsXPairClassification 0.7281 0.5999 0.5514 0.7557 Bytedance/Seed1.6-embedding-1215
PlscClusteringP2P.v2 0.7416 0.7431 0.7161 0.7542 tencent/KaLM-Embedding-Gemma3-12B-2511
PoemSentimentClassification 0.5492 0.5966 0.5067 0.8642 Bytedance/Seed1.6-embedding-1215
PolEmo2.0-OUT 0.5510 0.7753 0.5348 0.8006 nvidia/llama-embed-nemotron-8b
PpcPC 0.9463 0.9550 0.9218 0.9554 tencent/KaLM-Embedding-Gemma3-12B-2511
PunjabiNewsClassification 0.8261 0.8261 0.807 0.8879 Bytedance/Seed1.6-embedding-1215
RTE3 0.9053 0.8955 0.8752 0.9173 Bytedance/Seed1.6-embedding-1215
Robust04InstructionRetrieval 0.0924 -0.0241 -0.0748 0.1372 jhu-clsp/FollowIR-7B
RomaniBibleClustering 0.4238 0.4322 0.4092 0.4589 tencent/KaLM-Embedding-Gemma3-12B-2511
RuBQReranking 0.7688 0.7384 0.756 0.8051 ai-sage/Giga-Embeddings-instruct
SCIDOCS 0.3181 0.2515 0.1747 0.5986 IEITYuan/Yuan-embedding-2.0-en
SIB200ClusteringS2S 0.4641 0.4174 0.4115 0.5126 sbintuitions/sarashina-embedding-v2-1b
SICK-R 0.8816 0.8275 0.8023 0.9465 Gameselo/STS-multilingual-mpnet-base-v2
STS12 0.8624 0.8155 0.8002 0.9546 Gameselo/STS-multilingual-mpnet-base-v2
STS13 0.9400 0.8989 0.8155 0.9776 Gameselo/STS-multilingual-mpnet-base-v2
STS14 0.9052 0.8541 0.7772 0.9753 Gameselo/STS-multilingual-mpnet-base-v2
STS15 0.9382 0.9044 0.8931 0.9811 Gameselo/STS-multilingual-mpnet-base-v2
STS17 0.9179 0.8858 0.8215 0.9342 infgrad/Jasper-Token-Compression-600M
STS22.v2 0.7395 0.7169 0.643 0.7718 Kingsoft-LLM/QZhou-Embedding
STSB 0.8630 0.8550 0.8236 0.9199 Gameselo/STS-multilingual-mpnet-base-v2
STSBenchmark 0.9361 0.8908 0.8729 0.9504 Kingsoft-LLM/QZhou-Embedding
STSES 0.7481 0.8175 0.8021 0.8231 google/embeddinggemma-300m
ScalaClassification 0.5645 0.5185 0.5157 0.8626 tencent/KaLM-Embedding-Gemma3-12B-2511
SemRel24STS 0.6437 0.7314 0.6266 0.8112 VPLabs/SearchMap_Preview
SentimentAnalysisHindi 0.6230 0.7606 0.642 0.8001 Qwen/Qwen3-Embedding-8B
SinhalaNewsClassification 0.7599 0.8229 0.6682 0.8547 tencent/KaLM-Embedding-Gemma3-12B-2511
SiswatiNewsClassification 0.5913 0.6238 0.535 0.7837 Lajavaness/bilingual-embedding-small
SlovakMovieReviewSentimentClassification 0.8994 0.9035 0.7441 0.9539 Bytedance/Seed1.6-embedding-1215
SpartQA 0.1590 0.1030 0.0565 0.8483 tencent/KaLM-Embedding-Gemma3-12B-2511
SprintDuplicateQuestions 0.9562 0.9690 0.9318 0.9838 Kingsoft-LLM/QZhou-Embedding
StackExchangeClustering.v2 0.7855 0.9207 0.4643 0.9207 google/gemini-embedding-001
StackOverflowQA 0.9450 0.9671 0.8889 0.9720 Bytedance/Seed1.6-embedding-1215
StatcanDialogueDatasetRetrieval 0.4856 0.5111 0.1063 0.5807 jinaai/jina-embeddings-v4
SwahiliNewsClassification 0.6562 0.6605 0.5969 0.6753 Qwen/Qwen3-Embedding-8B
SwednClusteringP2P 0.5779 0.4584 0.3691 0.6213 Qwen/Qwen3-Embedding-4B
SwissJudgementClassification 0.5952 0.5786 0.5362 0.7791 Bytedance/Seed1.6-embedding-1215
T2Reranking 0.6714 0.6795 0.6632 0.7315 tencent/Youtu-Embedding
TERRa 0.6641 0.6392 0.5842 0.7957 ai-sage/Giga-Embeddings-instruct
TRECCOVID 0.9125 0.8631 0.7133 0.9833 IEITYuan/Yuan-embedding-2.0-en
Tatoeba 0.7888 0.8197 0.7574 0.9394 OrlikB/KartonBERT-USE-base-v1
TempReasonL1 0.0186 0.0296 0.0114 0.0805 nvidia/llama-embed-nemotron-8b
ToxicConversationsClassification 0.9148 0.8875 0.7132 0.9759 voyageai/voyage-3-m-exp
TswanaNewsClassification 0.3979 0.5337 0.47 0.6417 Bytedance/Seed1.6-embedding-1215
TweetTopicSingleClassification 0.7643 0.7111 0.6532 0.8561 Bytedance/Seed1.6-embedding-1215
TwitterHjerneRetrieval 0.8103 0.9802 0.3522 0.9802 google/gemini-embedding-001
TwitterURLCorpus 0.8655 0.8705 0.8589 0.9571 TencentBAC/Conan-embedding-v2
VoyageMMarcoReranking 0.6800 0.6673 0.6821 0.7351 jinaai/jina-reranker-v3
WebLINXCandidatesReranking 0.1741 0.1097 0.0778 0.1792 Bytedance/Seed1.6-embedding-1215
WikiCitiesClustering 0.8110 0.9163 0.755 0.9357 Qwen/Qwen3-Embedding-4B
WikiClusteringP2P.v2 0.3223 0.2823 0.256 0.3295 tencent/KaLM-Embedding-Gemma3-12B-2511
WikiSQLRetrieval 0.9885 0.8814 nan 0.9608 jinaai/jina-embeddings-v4
WikipediaRerankingMultilingual 0.9100 0.9224 0.8981 0.9308 jinaai/jina-reranker-v3
WikipediaRetrievalMultilingual 0.9225 0.9420 0.9111 0.9420 google/gemini-embedding-001
WinoGrande 0.5709 0.6052 0.5498 0.8989 tencent/KaLM-Embedding-Gemma3-12B-2511
XNLI 0.8595 0.8526 0.7477 0.9291 Bytedance/Seed1.6-embedding-1215
indonli 0.6424 0.6069 0.5174 0.6722 Bytedance/Seed1.6-embedding-1215
Average 0.6883 0.6891 0.5834 0.7911 nan

The model performs well on these tasks: HumanEvalRetrieval, WikiSQLRetrieval, FinanceBenchRetrieval, AILAStatutes, JSICK, AlloprofReranking, DS1000Retrieval, AILACasedocs, FreshStackRetrieval, FinParaSTS
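
The ten tasks listed above appear to be the ones where the Octen-Embedding-8B score exceeds the "Max result" column, that is, the best score previously recorded for the task. This reading is inferred from the table, not stated by the bot. A minimal Python sketch of that check, using a handful of rows copied from the table (the function name `tasks_above_previous_max` is hypothetical):

```python
# (task, new model score, previous max result), copied from the table above.
ROWS = [
    ("AILACasedocs", 0.6109, 0.4833),
    ("AILAStatutes", 0.9085, 0.9003),
    ("ArguAna", 0.7831, 0.8979),
    ("HumanEvalRetrieval", 0.9977, 0.9945),
    ("STS17", 0.9179, 0.9342),
    ("WikiSQLRetrieval", 0.9885, 0.9608),
]


def tasks_above_previous_max(rows):
    """Return the tasks where the new score is strictly above the previous max."""
    return [task for task, new_score, prev_max in rows if new_score > prev_max]


print(tasks_above_previous_max(ROWS))
```

On this subset, ArguAna and STS17 are filtered out because the new score is below the previous best.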


bflhc commented Dec 23, 2025

Octen-Embedding-8B is optimized for retrieval tasks. It is fine-tuned from Qwen3-Embedding-8B on a large amount of real-world industry search data combined with high-quality synthetic data.

From the results, we can see significant improvements over the base model on both RTEB and MTEB reranking/retrieval tasks. We believe this model could bring substantial value to the open-source community.

We would also appreciate it if you could help run the RTEB private evaluation, so that we can more comprehensively assess the model’s performance.

Regarding configuration, a batch_size of 4 or 2 should work well. Setting corpus_chunk_size to 5000 can also be helpful for datasets with a very large number of queries.
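
For readers unfamiliar with the idea behind a corpus chunk size, here is a toy Python sketch of scoring a query against a corpus one chunk at a time while keeping only the running top-k. It is an illustration under assumptions, not the mteb implementation: the function names are hypothetical, vectors are plain lists, and similarity is a dot product.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_chunked(query_vec, corpus_vecs, k, corpus_chunk_size):
    """Return the k best (score, doc_index) pairs, scanning the corpus in chunks.

    Only one chunk of documents is scored at a time, and only k candidates
    are retained, so peak memory depends on the chunk size, not corpus size.
    """
    best = []  # min-heap holding the current top-k (score, doc_index) pairs
    for chunk_start in range(0, len(corpus_vecs), corpus_chunk_size):
        chunk = corpus_vecs[chunk_start:chunk_start + corpus_chunk_size]
        for offset, doc_vec in enumerate(chunk):
            item = (dot(query_vec, doc_vec), chunk_start + offset)
            if len(best) < k:
                heapq.heappush(best, item)
            else:
                # Push the new item and drop the smallest in one step.
                heapq.heappushpop(best, item)
    return sorted(best, reverse=True)


corpus = [[1, 1], [0, 1], [3, 0], [2, 2], [5, 1]]
print(top_k_chunked([1, 0], corpus, k=2, corpus_chunk_size=2))
```

The chunk size changes only how much is processed at once; the ranking is the same for any positive chunk size.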

@Samoed Samoed left a comment

I ran the AILA tasks and could reproduce the results.

@Samoed Samoed merged commit 8f20d89 into embeddings-benchmark:main Dec 24, 2025
3 checks passed