feat: Integrating ChemTEB #1708
Conversation
Great! Can you submit your results to https://github.com/embeddings-benchmark/results and compare them with the results from the paper?
mteb/tasks/PairClassification/eng/PubChemAISentenceParaphrasePC.py
Hi, thank you for the review. I haven't worked with the mentioned repository before. Could you clarify how I should proceed? What do you mean by "paper"?
You should provide the results of your run to this repo (a directory with JSONs). By "paper" I mean that you should check whether your results match the results from https://arxiv.org/abs/2412.00532v1
The detailed score for each task-model pair is not reported in our paper (I have it locally, though); instead, an average per category is provided. Additionally, the current PR does not include exactly the same tasks: three tasks have been removed, one new multilingual task has been added, and eight tasks have been merged into two. Therefore, the average scores may not align perfectly. However, I can run the benchmark and push the scores to the results repository.
- Text should be truncated for Amazon text embedding models.
- `text-embedding-ada-002` returns null embeddings for some inputs with 8192 tokens.
- Two datasets are updated, dropping very long samples (length > 99th percentile); a filtering sketch follows this list.
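For the last point, a minimal sketch of how such a length filter might look, assuming a Hugging Face `datasets.Dataset` with a `text` column (the column name and cutoff handling are assumptions, not the exact code in this PR):

```python
import numpy as np
from datasets import Dataset


def drop_long_samples(ds: Dataset, text_column: str = "text") -> Dataset:
    # Compute the 99th percentile of text lengths and drop everything above it.
    lengths = [len(t) for t in ds[text_column]]
    cutoff = np.percentile(lengths, 99)
    return ds.filter(lambda x: len(x[text_column]) <= cutoff)
```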
I updated the Amazon embedding models. Error example:

```
ValidationException: An error occurred (ValidationException) when calling the InvokeModel operation:
400 Bad Request: Too many input tokens. Max input tokens: 8192, request input token count: 11530
```

To handle long text, there are two approaches for truncation: a static approach and a dynamic approach. The dynamic one looks like this:

```python
try:
    all_embeddings.append(self._embed_amazon(sentence))
except ValidationError as e:
    # Pull the actual token count out of the error message
    pattern = r"request input token count:\s*(\d+)"
    match = re.search(pattern, str(e))
    if match:
        num_tokens = int(match.group(1))
        # Shrink the input proportionally, with a 10% safety margin, and retry
        ratio = 0.9 * (self._max_tokens / num_tokens)
        max_sequence_length = int(len(sentence) * ratio)
        all_embeddings.append(self._embed_amazon(sentence[:max_sequence_length]))
    else:
        raise e
```

I used the first method, but if you think the second one is better, let me know.
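For reference, a minimal sketch of what the static approach could look like, reusing the `self._max_tokens` attribute from above; the 4 characters-per-token ratio is a heuristic assumption, not the model's actual tokenizer:

```python
CHARS_PER_TOKEN = 4  # heuristic assumption; the Bedrock models do not expose their tokenizer


def truncate_statically(sentence: str, max_tokens: int) -> str:
    # Cut the text to an approximate character budget before calling the API.
    return sentence[: max_tokens * CHARS_PER_TOKEN]


# Usage inside the encode loop (sketch):
# all_embeddings.append(self._embed_amazon(truncate_statically(sentence, self._max_tokens)))
```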
While running the benchmark again for all tasks, I ran into an error while evaluating a classification task. Digging deeper, I noticed something weird. OpenAI text embedding models generally have a context length of 8192 (if you go over that, the API throws an error). But it looks like `text-embedding-ada-002` returns null (NaN) embeddings for some inputs even after truncating them to 8191 tokens. Minimal reproduction:

```python
from openai import OpenAI
import tiktoken
import numpy as np

client = OpenAI()
encoding = tiktoken.get_encoding("cl100k_base")

# Build an input well over the 8192-token limit, then truncate to 8191 tokens.
sn = "Hello World" * 5000
print(f"Num tokens: {len(encoding.encode(sn))}")
truncated_sentence = encoding.encode(sn)[:8191]
truncated_sentence = encoding.decode(truncated_sentence)

response = client.embeddings.create(
    input=truncated_sentence,
    model="text-embedding-ada-002",
    encoding_format="float",
)
em = np.array(response.data[0].embedding)
print(f"Null values: {np.isnan(em.astype(np.float32)).sum()}")
```
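One possible workaround, shown only as a sketch and not part of this PR: reuse the `client` and `encoding` objects from the snippet above, check the returned vector for NaNs, and retry with fewer tokens (the 10% shrink factor is an arbitrary assumption):

```python
def embed_with_nan_guard(text: str, max_tokens: int = 8191) -> np.ndarray:
    # Truncate to the token limit, embed, and shrink further if the API returns NaNs.
    tokens = encoding.encode(text)[:max_tokens]
    while tokens:
        resp = client.embeddings.create(
            input=encoding.decode(tokens),
            model="text-embedding-ada-002",
            encoding_format="float",
        )
        vec = np.array(resp.data[0].embedding, dtype=np.float32)
        if not np.isnan(vec).any():
            return vec
        tokens = tokens[: int(len(tokens) * 0.9)]  # retry with 10% fewer tokens
    raise ValueError("Could not obtain a finite embedding")
```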
Yes, we have some inconsistency there. For the Bedrock token count, I suggest adding a static calculation of the number of tokens along with the dynamic approach to ensure the models work correctly. Additionally, could you create a table here comparing the results from the paper with your results?
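To illustrate the suggestion, a minimal sketch of a static token estimate; the proxy tokenizer is an assumption, since the Bedrock models do not ship a public tokenizer, and the 10% safety margin mirrors the dynamic retry above:

```python
from transformers import AutoTokenizer

# Assumption: a generic Hugging Face tokenizer as a rough proxy for the Bedrock tokenizer.
_proxy_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


def truncate_by_token_estimate(sentence: str, max_tokens: int) -> str:
    ids = _proxy_tokenizer.encode(sentence, add_special_tokens=False)
    if len(ids) <= max_tokens:
        return sentence
    # Keep roughly the first max_tokens worth of text, with a small safety margin.
    ratio = 0.9 * max_tokens / len(ids)
    return sentence[: int(len(sentence) * ratio)]
```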
Originally posted by @HSILA in embeddings-benchmark/results#89 (comment)
@HSILA thanks again for your contributions! Merging now. |
As discussed in issue #1585, I have integrated ChemTEB into the MTEB codebase and submitted this PR for merging. I have added 28 tasks and two model families: `amazon_models.py` and `cohere_bedrock_models.py`. These models were included to meet our internal requirements, as they are not currently available in MTEB. I thought it might be good to have them, but I can remove them if necessary. I have tested the mentioned models with all the tasks; you can see the results here: chemteb_results_pr.csv

Closes #1585
Checklist
- Run tests locally to make sure nothing is broken, using `make test`.
- Run the formatter to format the code, using `make lint`.
- Updated `docs/benchmarks.md` and `docs/tasks.md`.
- Added `ChemTEB` as a benchmark in `benchmark.py` (a registration sketch follows this list).
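For the last item, a rough sketch of what the benchmark registration could look like; the `Benchmark` fields mirror the existing entries in the benchmarks module, the task list is truncated to a single example from this PR, and the description text is a placeholder:

```python
from mteb import get_tasks
from mteb.benchmarks import Benchmark

CHEMTEB = Benchmark(
    name="ChemTEB",
    tasks=get_tasks(tasks=["PubChemAISentenceParaphrasePC"]),  # plus the remaining ChemTEB tasks
    description="Benchmark of text embedding models on chemistry-domain corpora.",
    reference="https://arxiv.org/abs/2412.00532v1",
    citation=None,
)
```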
Adding datasets checklist

Reason for dataset addition: To address the lack of benchmarks tailored to chemical sciences, where text involves complex entity names, expressions, and specialized representations like SMILES codes, not typically found in general-domain datasets.
- Run the following models on the tasks using the `mteb -m {model_name} -t {task_name}` command:
  - `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
  - `intfloat/multilingual-e5-small`
- Used `self.stratified_subsampling()` under `dataset_transform()` where needed (a usage sketch follows this list).
- Run tests locally to make sure nothing is broken, using `make test`.
- Run the formatter to format the code, using `make lint`.
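For the subsampling item, the usual pattern inside a task class looks roughly like this; the split name and the reliance on the default label column are assumptions:

```python
# Sketch: inside an AbsTask subclass, such as one of the ChemTEB classification tasks.
def dataset_transform(self):
    self.dataset = self.stratified_subsampling(
        self.dataset, seed=self.seed, splits=["test"]
    )
```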
Adding a model checklist

- The model can be loaded with `mteb.get_model(model_name, revision)` and `mteb.get_model_meta(model_name, revision)`.
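For completeness, a small usage sketch tying the two checks together; the model name is just an example and the task name is one of the tasks added in this PR:

```python
import mteb

# Verify the model loads, then run it on one of the new ChemTEB tasks.
model = mteb.get_model("intfloat/multilingual-e5-small")
meta = mteb.get_model_meta("intfloat/multilingual-e5-small")

tasks = mteb.get_tasks(tasks=["PubChemAISentenceParaphrasePC"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")
```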