Hi ArenaHard Team 👋,

We have updated a series of fine-tuned judge models:

opencompass/CompassJudger-1-32B-Instruct
opencompass/CompassJudger-1-14B-Instruct
opencompass/CompassJudger-1-7B-Instruct
opencompass/CompassJudger-1-1.5B-Instruct

In our experiments we tested the reliability of these models as judges. Among them, CompassJudger-1-14B and -32B achieve a judge correlation of over 95% with GPT-4o-0806 on the ArenaHard dataset, making them a low-cost alternative to GPT-4o. For more details, please refer to our paper: https://arxiv.org/pdf/2410.16256

Thank you!
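For reference, below is a minimal sketch of how one of these checkpoints might be loaded as a pairwise judge with Hugging Face transformers. The checkpoint name comes from the list above; the judging prompt, verdict format, and generation settings are illustrative assumptions, not the official CompassJudger or ArenaHard judging template.

```python
# Minimal sketch: using a CompassJudger-1 checkpoint as a pairwise judge.
# Assumption: the prompt/verdict format below is illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "opencompass/CompassJudger-1-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

question = "Explain the difference between a process and a thread."
answer_a = "..."  # response from model A (placeholder)
answer_b = "..."  # response from model B (placeholder)

# Hypothetical judging prompt asking for an A/B/Tie verdict.
prompt = (
    "Please act as an impartial judge and compare the two answers below.\n\n"
    f"[Question]\n{question}\n\n[Answer A]\n{answer_a}\n\n[Answer B]\n{answer_b}\n\n"
    "State which answer is better: [[A]], [[B]], or [[Tie]]."
)

messages = [{"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512, do_sample=False)
verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(verdict)
```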
bittersweet1999 changed the title from "The Leaderboard of the Open-source JudgeModel/Evaluator" to "The Replacement of Open-source JudgeModel/Evaluator" on Oct 25, 2024
@bittersweet1999 Great work, guys! Would love to improve the judge. I couldn't find which set of models was used to calculate the correlation. Could you share those details? The correlation is very model-dependent.
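To make the question concrete, here is a small sketch of two ways a judge-correlation figure could be computed: per-verdict agreement with a reference judge, and rank correlation of per-model scores. The data are dummy placeholders, and the actual metric and evaluated-model set used in the paper are exactly what is being asked about here, so treat both as assumptions.

```python
# Illustrative sketch only: dummy data, assumed metrics.
from scipy.stats import spearmanr

# (1) Per-verdict agreement: fraction of pairwise comparisons where the candidate
# judge gives the same verdict as the reference judge (e.g., GPT-4o-0806).
ref_verdicts = ["A", "B", "A", "Tie", "B"]  # reference judge verdicts (dummy)
new_verdicts = ["A", "B", "A", "B", "B"]    # candidate judge verdicts (dummy)
agreement = sum(r == n for r, n in zip(ref_verdicts, new_verdicts)) / len(ref_verdicts)
print(f"verdict agreement: {agreement:.1%}")

# (2) Rank correlation of per-model scores under the two judges. Because this is
# computed over a specific set of evaluated models, the number depends heavily on
# which models are included.
scores_ref = {"model_x": 82.1, "model_y": 55.3, "model_z": 34.0}  # dummy scores
scores_new = {"model_x": 80.5, "model_y": 58.2, "model_z": 31.7}  # dummy scores
models = sorted(scores_ref)
rho, _ = spearmanr([scores_ref[m] for m in models], [scores_new[m] for m in models])
print(f"Spearman correlation across models: {rho:.3f}")
```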