feat: evaluate openai models on the remaining MTEB(Medical) tasks #71
Conversation
KennethEnevoldsen left a comment
Looks good!
I just noted that we track co2 for APIs, which seems a bit misleading. Not sure what the best solution is here (my guess would be that we just filter it out in the leaderboard)
@Muennighoff I like your suggestion of also running both models with reduced dimensions. I was thinking 768 dimensions would be a good number, since it would put them in line with medium-sized models and test the compression accuracy of their Matryoshka training. What do you think?
Feel free to run all of them if you want :)
That was fast!
Thanks! I'll run all the common dimensions then (256, 512, 768, 1024)
Caught me at a boring conference talk ;)
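For reference, the reduced dimensions can be requested directly from the OpenAI embeddings endpoint via its dimensions parameter, since the text-embedding-3 models truncate their Matryoshka embeddings server-side. A minimal sketch, assuming the official openai Python client and an OPENAI_API_KEY in the environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The text-embedding-3 models accept a `dimensions` argument that truncates
# the Matryoshka embedding to the requested size.
for dim in (256, 512, 768, 1024):
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=["A short example sentence."],
        dimensions=dim,
    )
    print(dim, len(response.data[0].embedding))
```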
Re: running the dimensions, I just want to refer to embeddings-benchmark/mteb#1211
By the way, regarding CodeCarbon, in addition to filtering it out in the leaderboard, we could also modify mteb.run() to check if the evaluated model is an instance of one of the API provider wrappers. If so, we could override the co2_tracker parameter to False. Somewhere around here. |
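As a rough illustration of that idea (the wrapper class name and import path below are assumptions, not the actual mteb internals), the check could look something like this:

```python
# Hypothetical sketch; class names and module paths are assumptions.
from mteb.models.openai_models import OpenAIWrapper  # assumed location of the API wrapper

API_WRAPPER_CLASSES = (OpenAIWrapper,)  # extend with other API provider wrappers


def resolve_co2_tracker(model, co2_tracker: bool) -> bool:
    """Force CO2 tracking off when the model is served behind a remote API."""
    if isinstance(model, API_WRAPPER_CLASSES):
        return False
    return co2_tracker
```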
But I am not sure we know at that time (the encoder does not contain the metadata) |
I hadn't seen this, thanks. I intended to simply create a PR in but the experiments approach is more flexible. |
I think this is perfectly fine for running them, but I don't think we would accept the PR, as we have been working on removing duplicates on the new leaderboard and this would add them again. (Would love to have them run though, then we can start experimenting with how to best display it.)
I agree that the encoder does not contain the metadata, but I still believe the check is possible. Maybe I'm not explaining myself clearly; I can open a small PR to illustrate it in the coming days.
Fair enough! I'll run the evaluations for now and open a draft PR here (they would still show up in different folders) but not on
Would love a PR - should make it clear |
Perfect, I'll open one. |
Following up on that PR, this PR adds results from evaluating text-embedding-3-small and text-embedding-3-large on the remaining tasks in the MTEB(Medical) benchmark. As discussed here, results from revision 1 are equivalent to those from revision 2. Therefore, we only evaluated tasks that were not previously run.
Thank you @Muennighoff for providing an API key with credits!
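For context, a minimal sketch of how such a run might look with mteb's model registry and benchmark loader; the exact model and benchmark identifiers here are assumptions based on this thread, not a record of the actual commands used:

```python
import mteb

# Assumed registry names; see the mteb docs for the exact identifiers.
model = mteb.get_model("openai/text-embedding-3-small")
benchmark = mteb.get_benchmark("MTEB(Medical)")

evaluation = mteb.MTEB(tasks=benchmark)
evaluation.run(
    model,
    output_folder="results",
    co2_tracker=False,  # avoid the misleading CO2 numbers discussed above
)
```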
Checklist
make test.
make pre-push.