feat: evaluate openai models on the remaining MTEB(Medical) tasks #71
Conversation
KennethEnevoldsen left a comment
Looks good!
I just noted that we track co2 for APIs, which seems a bit misleading. Not sure what the best solution is here (my guess would be that we just filter it out in the leaderboard)
@Muennighoff I like your suggestion of also running both models with reduced dimensions. I was thinking 768 dimensions would be a good number, since it would put them in line with medium-sized models and test the compression accuracy of their Matryoshka training. What do you think?
Feel free to run all of them if you want :)
That was fast!
Thanks! I'll run all the common dimensions then (256, 512, 768, 1024)
Caught me at a boring conference talk ;)
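For reference, the reduced dimensions can be requested directly from the OpenAI embeddings endpoint via its dimensions parameter, since the text-embedding-3 models truncate their Matryoshka embeddings server-side. A minimal sketch, assuming the official openai Python client and an OPENAI_API_KEY in the environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The text-embedding-3 models accept a `dimensions` argument that truncates
# the Matryoshka embedding to the requested size.
for dim in (256, 512, 768, 1024):
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=["A short example sentence."],
        dimensions=dim,
    )
    print(dim, len(response.data[0].embedding))
```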
Re: running the dimensions, I just want to refer to embeddings-benchmark/mteb#1211
By the way, regarding CodeCarbon, in addition to filtering it out in the leaderboard, we could also modify mteb.run() to check if the evaluated model is an instance of one of the API provider wrappers. If so, we could override the co2_tracker parameter to False. Somewhere around here. |
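As a rough illustration of that idea (the wrapper class name and import path below are assumptions, not the actual mteb internals), the check could look something like this:

```python
# Hypothetical sketch; class names and module paths are assumptions.
from mteb.models.openai_models import OpenAIWrapper  # assumed location of the API wrapper

API_WRAPPER_CLASSES = (OpenAIWrapper,)  # extend with other API provider wrappers


def resolve_co2_tracker(model, co2_tracker: bool) -> bool:
    """Force CO2 tracking off when the model is served behind a remote API."""
    if isinstance(model, API_WRAPPER_CLASSES):
        return False
    return co2_tracker
```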
But I am not sure we know at that time (the encoder does not contain the metadata) |
I hadn't seen this, thanks. I intended to simply create a PR in but the experiments approach is more flexible. |
I think this is perfectly fine for running them, but I don't think we would accept the PR, as we have been working on removing duplicates on the new leaderboard and this would add them again. (Would love to have them run though, then we can start experimenting with how to best display it.)
I agree that the encoder does not contain the metadata, but I still believe the check is possible. Maybe I'm not explaining myself clearly; I can open a small PR to illustrate it in the coming days.
Fair enough! I'll run the evaluations for now and open a draft PR here (they would still show up in different folders) but not on
Would love a PR - should make it clear |
Perfect, I'll open one. |
Following up on that PR, this PR adds results from evaluating text-embedding-3-small and text-embedding-3-large on the remaining tasks in the MTEB(Medical) benchmark. As discussed here, results from revision 1 are equivalent to those from revision 2. Therefore, we only evaluated tasks that were not previously run.
Thank you @Muennighoff for providing an API key with credits!
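For context, a minimal sketch of how such a run might look with mteb's model registry and benchmark loader; the exact model and benchmark identifiers here are assumptions based on this thread, not a record of the actual commands used:

```python
import mteb

# Assumed registry names; see the mteb docs for the exact identifiers.
model = mteb.get_model("openai/text-embedding-3-small")
benchmark = mteb.get_benchmark("MTEB(Medical)")

evaluation = mteb.MTEB(tasks=benchmark)
evaluation.run(
    model,
    output_folder="results",
    co2_tracker=False,  # avoid the misleading CO2 numbers discussed above
)
```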
Checklist
make test.
make pre-push.