
The Replacement of Open-source JudgeModel/Evaluator #49

Open
bittersweet1999 opened this issue Oct 25, 2024 · 1 comment

Comments

@bittersweet1999

Hi ArenaHard Team 👋,

We have released a series of fine-tuned judge models (a quick usage sketch follows the list):
opencompass/CompassJudger-1-32B-Instruct
opencompass/CompassJudger-1-14B-Instruct
opencompass/CompassJudger-1-7B-Instruct
opencompass/CompassJudger-1-1.5B-Instruct
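
For anyone who wants to try one quickly, here is a minimal, unofficial sketch that loads CompassJudger-1-7B-Instruct with Hugging Face transformers and asks it to compare two answers. The judge prompt and the example question/answers below are illustrative placeholders, not the template from the paper.

```python
# Minimal sketch (not the official recipe): use a CompassJudger model as a
# pairwise judge via Hugging Face transformers. Prompt format is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "opencompass/CompassJudger-1-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

question = "Explain the difference between a process and a thread."
answer_a = "A process has its own memory space; threads share memory within a process."
answer_b = "They are the same thing."

# Hypothetical judge prompt; the paper's actual template may differ.
prompt = (
    "You are an impartial judge. Compare the two answers to the question "
    "and state which is better ([[A]] or [[B]]) with a brief justification.\n\n"
    f"Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
)

messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```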

In our experiments, we tested the reliability of these models as judges. Among them, CompassJudger-1-14B and 32B achieve a judge correlation of over 95% with GPT-4o-0806 on the ArenaHard dataset, so they can serve as a low-cost alternative to GPT-4o. For more details, please refer to our paper: https://arxiv.org/pdf/2410.16256

Thank you!

@bittersweet1999 bittersweet1999 changed the title The Leaderboard of the Open-source JudgeModel/Evaluator The Replacement of Open-source JudgeModel/Evaluator Oct 25, 2024
@CodingWithTim
Collaborator

CodingWithTim commented Nov 6, 2024

Hey OpenCompass,

@bittersweet1999 Great work! We'd love to improve the judge. I couldn't find which set of models was used to calculate the correlation. Could you share the details on that? The correlation is very model dependent (see the toy illustration below).
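
To make that concrete, here is a toy sketch of one way a ranking correlation between two judges over a set of candidate models could be computed; the model names and scores are made up and this is not a claim about your methodology. Swapping models in or out of the set can move the number substantially, which is why the model set matters.

```python
# Toy illustration: Spearman correlation between the rankings two judges
# assign to the same set of models. All names and scores are invented.
from scipy.stats import spearmanr

gpt4o_judge_scores = {"model_a": 82.1, "model_b": 74.5, "model_c": 61.0, "model_d": 43.2}
cj_judge_scores    = {"model_a": 80.7, "model_b": 75.9, "model_c": 58.3, "model_d": 45.0}

models = sorted(gpt4o_judge_scores)
rho, p_value = spearmanr(
    [gpt4o_judge_scores[m] for m in models],
    [cj_judge_scores[m] for m in models],
)
# The resulting correlation depends heavily on which models are in the set.
print(f"Spearman correlation: {rho:.3f} (p={p_value:.3f})")
```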

Thanks.
