Hi ArenaHard Team 👋,

We have updated a series of fine-tuned judge models:

opencompass/CompassJudger-1-32B-Instruct
opencompass/CompassJudger-1-14B-Instruct
opencompass/CompassJudger-1-7B-Instruct
opencompass/CompassJudger-1-1.5B-Instruct

In our experiments we tested the reliability of these models as judges. Among them, CompassJudger-1-14B and -32B achieve a judge correlation of over 95% with GPT-4o-0806 on the ArenaHard dataset, making them a low-cost alternative to GPT-4o. For more details, please refer to our paper: https://arxiv.org/pdf/2410.16256

Thank you!
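For reference, below is a minimal sketch of how one of these checkpoints might be loaded as a pairwise judge with Hugging Face transformers. The checkpoint name comes from the list above; the judging prompt, verdict format, and generation settings are illustrative assumptions, not the official CompassJudger or ArenaHard judging template.

```python
# Minimal sketch: using a CompassJudger-1 checkpoint as a pairwise judge.
# Assumption: the prompt/verdict format below is illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "opencompass/CompassJudger-1-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

question = "Explain the difference between a process and a thread."
answer_a = "..."  # response from model A (placeholder)
answer_b = "..."  # response from model B (placeholder)

# Hypothetical judging prompt asking for an A/B/Tie verdict.
prompt = (
    "Please act as an impartial judge and compare the two answers below.\n\n"
    f"[Question]\n{question}\n\n[Answer A]\n{answer_a}\n\n[Answer B]\n{answer_b}\n\n"
    "State which answer is better: [[A]], [[B]], or [[Tie]]."
)

messages = [{"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512, do_sample=False)
verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(verdict)
```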
bittersweet1999 changed the title from "The Leaderboard of the Open-source JudgeModel/Evaluator" to "The Replacement of Open-source JudgeModel/Evaluator" on Oct 25, 2024
@bittersweet1999 Great work, guys! Would love to improve the judge. I couldn't find which set of models was used to calculate the correlation. Could you share those details? The correlation is very model-dependent.
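To make the question concrete, here is a small sketch of two ways a judge-correlation figure could be computed: per-verdict agreement with a reference judge, and rank correlation of per-model scores. The data are dummy placeholders, and the actual metric and evaluated-model set used in the paper are exactly what is being asked about here, so treat both as assumptions.

```python
# Illustrative sketch only: dummy data, assumed metrics.
from scipy.stats import spearmanr

# (1) Per-verdict agreement: fraction of pairwise comparisons where the candidate
# judge gives the same verdict as the reference judge (e.g., GPT-4o-0806).
ref_verdicts = ["A", "B", "A", "Tie", "B"]  # reference judge verdicts (dummy)
new_verdicts = ["A", "B", "A", "B", "B"]    # candidate judge verdicts (dummy)
agreement = sum(r == n for r, n in zip(ref_verdicts, new_verdicts)) / len(ref_verdicts)
print(f"verdict agreement: {agreement:.1%}")

# (2) Rank correlation of per-model scores under the two judges. Because this is
# computed over a specific set of evaluated models, the number depends heavily on
# which models are included.
scores_ref = {"model_x": 82.1, "model_y": 55.3, "model_z": 34.0}  # dummy scores
scores_new = {"model_x": 80.5, "model_y": 58.2, "model_z": 31.7}  # dummy scores
models = sorted(scores_ref)
rho, _ = spearmanr([scores_ref[m] for m in models], [scores_new[m] for m in models])
print(f"Spearman correlation across models: {rho:.3f}")
```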