feat: Integrating ChemTEB #1708
Conversation
Great! Can you submit your results to https://github.com/embeddings-benchmark/results and compare them with the results from the paper?
mteb/tasks/PairClassification/eng/PubChemAISentenceParaphrasePC.py
Hi, thank you for the review. I haven't worked with the mentioned repository before. Could you clarify how I should proceed? What do you mean by "paper"?
You should provide the results of your run to this repo (a directory with JSONs). By "paper" I mean that you should check whether your results match the results from https://arxiv.org/abs/2412.00532v1
The detailed score for each task-model pair is not reported in our paper (I have it locally, though); instead, an average per category is provided. Additionally, the current PR does not include exactly the same tasks: three tasks have been removed, one new multilingual task has been added, and eight tasks have been merged into two. Therefore, the average scores may not align perfectly. However, I can run the benchmark and push the scores to the results repository.
- Text should be truncated for Amazon text embedding models.
- `text-embedding-ada-002` returns null embeddings for some inputs with 8192 tokens.
- Two datasets are updated, dropping very long samples (length > 99th percentile); a filtering sketch follows this list.
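For the last point, a minimal sketch of how such a length filter might look, assuming a Hugging Face `datasets.Dataset` with a `text` column (the column name and cutoff handling are assumptions, not the exact code in this PR):

```python
import numpy as np
from datasets import Dataset


def drop_long_samples(ds: Dataset, text_column: str = "text") -> Dataset:
    # Compute the 99th percentile of text lengths and drop everything above it.
    lengths = [len(t) for t in ds[text_column]]
    cutoff = np.percentile(lengths, 99)
    return ds.filter(lambda x: len(x[text_column]) <= cutoff)
```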
I updated the Amazon embedding models. Error example:

```
ValidationException: An error occurred (ValidationException) when calling the InvokeModel operation:
400 Bad Request: Too many input tokens. Max input tokens: 8192, request input token count: 11530
```

To handle long text, there are two approaches for truncation: a static approach and a dynamic approach. The dynamic one looks like this:

```python
try:
    all_embeddings.append(self._embed_amazon(sentence))
except ValidationError as e:
    # Pull the actual token count out of the error message
    pattern = r"request input token count:\s*(\d+)"
    match = re.search(pattern, str(e))
    if match:
        num_tokens = int(match.group(1))
        # Shrink the input proportionally, with a 10% safety margin, and retry
        ratio = 0.9 * (self._max_tokens / num_tokens)
        max_sequence_length = int(len(sentence) * ratio)
        all_embeddings.append(self._embed_amazon(sentence[:max_sequence_length]))
    else:
        raise e
```

I used the first method, but if you think the second one is better, let me know.
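For reference, a minimal sketch of what the static approach could look like, reusing the `self._max_tokens` attribute from above; the 4 characters-per-token ratio is a heuristic assumption, not the model's actual tokenizer:

```python
CHARS_PER_TOKEN = 4  # heuristic assumption; the Bedrock models do not expose their tokenizer


def truncate_statically(sentence: str, max_tokens: int) -> str:
    # Cut the text to an approximate character budget before calling the API.
    return sentence[: max_tokens * CHARS_PER_TOKEN]


# Usage inside the encode loop (sketch):
# all_embeddings.append(self._embed_amazon(truncate_statically(sentence, self._max_tokens)))
```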
While running the benchmark again for all tasks, I ran into an error while evaluating a classification task. Digging deeper, I noticed something weird. OpenAI text embedding models generally have a context length of 8192 (if you go over that, the API throws an error). But it looks like `text-embedding-ada-002` returns null (NaN) embeddings for some inputs even after truncating them to 8191 tokens. Minimal reproduction:

```python
from openai import OpenAI
import tiktoken
import numpy as np

client = OpenAI()
encoding = tiktoken.get_encoding("cl100k_base")

# Build an input well over the 8192-token limit, then truncate to 8191 tokens.
sn = "Hello World" * 5000
print(f"Num tokens: {len(encoding.encode(sn))}")
truncated_sentence = encoding.encode(sn)[:8191]
truncated_sentence = encoding.decode(truncated_sentence)

response = client.embeddings.create(
    input=truncated_sentence,
    model="text-embedding-ada-002",
    encoding_format="float",
)
em = np.array(response.data[0].embedding)
print(f"Null values: {np.isnan(em.astype(np.float32)).sum()}")
```
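One possible workaround, shown only as a sketch and not part of this PR: reuse the `client` and `encoding` objects from the snippet above, check the returned vector for NaNs, and retry with fewer tokens (the 10% shrink factor is an arbitrary assumption):

```python
def embed_with_nan_guard(text: str, max_tokens: int = 8191) -> np.ndarray:
    # Truncate to the token limit, embed, and shrink further if the API returns NaNs.
    tokens = encoding.encode(text)[:max_tokens]
    while tokens:
        resp = client.embeddings.create(
            input=encoding.decode(tokens),
            model="text-embedding-ada-002",
            encoding_format="float",
        )
        vec = np.array(resp.data[0].embedding, dtype=np.float32)
        if not np.isnan(vec).any():
            return vec
        tokens = tokens[: int(len(tokens) * 0.9)]  # retry with 10% fewer tokens
    raise ValueError("Could not obtain a finite embedding")
```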
Yes, we have some inconsistency there. For the Bedrock token count, I suggest adding a static calculation of the number of tokens along with the dynamic approach to ensure the models work correctly. Additionally, could you create a table here comparing the results from the paper with your results?
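To illustrate the suggestion, a minimal sketch of a static token estimate; the proxy tokenizer is an assumption, since the Bedrock models do not ship a public tokenizer, and the 10% safety margin mirrors the dynamic retry above:

```python
from transformers import AutoTokenizer

# Assumption: a generic Hugging Face tokenizer as a rough proxy for the Bedrock tokenizer.
_proxy_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


def truncate_by_token_estimate(sentence: str, max_tokens: int) -> str:
    ids = _proxy_tokenizer.encode(sentence, add_special_tokens=False)
    if len(ids) <= max_tokens:
        return sentence
    # Keep roughly the first max_tokens worth of text, with a small safety margin.
    ratio = 0.9 * max_tokens / len(ids)
    return sentence[: int(len(sentence) * ratio)]
```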
Originally posted by @HSILA in embeddings-benchmark/results#89 (comment)
@HSILA thanks again for your contributions! Merging now. |
As discussed in issue #1585, I have integrated ChemTEB into the MTEB codebase and submitted this PR for merging. I have added 28 tasks and two model families: `amazon_models.py` and `cohere_bedrock_models.py`. These models were included to meet our internal requirements, as they are not currently available in MTEB. I thought it might be good to have them, but I can remove them if necessary. I have tested the mentioned models with all the tasks; you can see the results here: chemteb_results_pr.csv

Closes #1585
Checklist
- Run tests locally to make sure nothing is broken, using `make test`.
- Run the formatter to format the code, using `make lint`.
- Updated `docs/benchmarks.md` and `docs/tasks.md`.
- Added `ChemTEB` as a benchmark in `benchmark.py` (a registration sketch follows this list).
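For the last item, a rough sketch of what the benchmark registration could look like; the `Benchmark` fields mirror the existing entries in the benchmarks module, the task list is truncated to a single example from this PR, and the description text is a placeholder:

```python
from mteb import get_tasks
from mteb.benchmarks import Benchmark

CHEMTEB = Benchmark(
    name="ChemTEB",
    tasks=get_tasks(tasks=["PubChemAISentenceParaphrasePC"]),  # plus the remaining ChemTEB tasks
    description="Benchmark of text embedding models on chemistry-domain corpora.",
    reference="https://arxiv.org/abs/2412.00532v1",
    citation=None,
)
```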
Adding datasets checklist

Reason for dataset addition: To address the lack of benchmarks tailored to chemical sciences, where text involves complex entity names, expressions, and specialized representations like SMILES codes, not typically found in general-domain datasets.
- Run the following models on the tasks using the `mteb -m {model_name} -t {task_name}` command:
  - `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
  - `intfloat/multilingual-e5-small`
- Used `self.stratified_subsampling()` under `dataset_transform()` where needed (a usage sketch follows this list).
- Run tests locally to make sure nothing is broken, using `make test`.
- Run the formatter to format the code, using `make lint`.
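For the subsampling item, the usual pattern inside a task class looks roughly like this; the split name and the reliance on the default label column are assumptions:

```python
# Sketch: inside an AbsTask subclass, such as one of the ChemTEB classification tasks.
def dataset_transform(self):
    self.dataset = self.stratified_subsampling(
        self.dataset, seed=self.seed, splits=["test"]
    )
```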
Adding a model checklist

- The model can be loaded with `mteb.get_model(model_name, revision)` and `mteb.get_model_meta(model_name, revision)`.
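For completeness, a small usage sketch tying the two checks together; the model name is just an example and the task name is one of the tasks added in this PR:

```python
import mteb

# Verify the model loads, then run it on one of the new ChemTEB tasks.
model = mteb.get_model("intfloat/multilingual-e5-small")
meta = mteb.get_model_meta("intfloat/multilingual-e5-small")

tasks = mteb.get_tasks(tasks=["PubChemAISentenceParaphrasePC"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")
```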