Evaluation metric (acc vs. acc_norm) for lm-evaluation-harness tasks #25
Comments
Hi. I used these keys:

Besides, I noticed the difference in the performance of the original LLaMA-7B. I double-checked the code and re-evaluated LLaMA-7B with the evaluation code in my repo, and I got a very similar performance (67.38 vs. 67.45 on ARC-easy). Were the LLaMA-7B results you listed above obtained with the lm-evaluation-harness included in my repo (a previous commit of lm-evaluation-harness)? Since lm-evaluation-harness has changed a lot over the past few months, some results are not consistent.
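For reference, here is a minimal sketch (not from either repo) of how both metric keys can be read back from the harness's JSON output. The results.json file name is hypothetical, and the flat acc/acc_norm keys assume the older output layout of the lm-evaluation-harness version bundled with this repo:

```python
# Sketch: print both metric keys for every task in an lm-evaluation-harness
# results file. Assumes the older output layout
# {"results": {task: {"acc": ..., "acc_norm": ...}}}; "results.json" is a
# hypothetical file name.
import json

with open("results.json") as f:
    results = json.load(f)["results"]

for task, metrics in results.items():
    acc = metrics.get("acc")
    acc_norm = metrics.get("acc_norm")  # some tasks report only acc
    print(f"{task}: acc={acc}, acc_norm={acc_norm}")
```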
Thank you for your response and clarification! I apologize for omitting the commit hash for the LLaMA-7B results. Following your hint, I realized that I used a different lm-evaluation-harness commit than the one in your code.
Thanks for pointing this out. To ensure clarity, I've updated the table above and will soon provide results using your evaluation code. Thanks again for your assistance.
Hi, I've added the results using your repo, which are fully reproducible. I also made an explicit note to avoid any confusion. Big thanks for your time and help!
Hi 😄. Thank you very much for the detailed notes and the experimental results you contributed based on the new version of lm-evaluation-harness!
Hi, thank you very much for generously open-sourcing your excellent work.
I've run the evaluation code you kindly shared and obtained the results below. I have a question regarding the metric for each task: could you please clarify whether acc or acc_norm [ref] was used for the PIQA, HellaSwag, ARC-e, ARC-c, and OBQA tasks? Thanks for taking the time to check this inquiry.

(Results table: acc and acc_norm for the 20%-pruned -> post-trained LLaMA from scripts/llama_prune.sh and for the original LLaMA-7B.)
Note: the results in the table above were reproduced with the lm-evaluation-harness version included in this repo; numbers obtained with newer commits of the harness may differ.
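For anyone comparing the two metrics, here is a small illustrative sketch (not taken from either repo) of how lm-evaluation-harness typically scores multiple-choice tasks: acc takes the answer choice with the highest total log-likelihood, while acc_norm divides each log-likelihood by the byte length of the choice text before taking the argmax, which reduces the bias toward short answers. The scores and answer strings below are made up.

```python
# Illustrative sketch of the two multiple-choice metrics:
#   acc      -> argmax of the raw summed log-likelihood of each answer choice
#   acc_norm -> argmax of the log-likelihood divided by the choice's byte length

def pick_acc(loglikelihoods):
    """Index of the choice with the highest raw log-likelihood."""
    return max(range(len(loglikelihoods)), key=lambda i: loglikelihoods[i])

def pick_acc_norm(loglikelihoods, choice_texts):
    """Index of the choice with the highest byte-length-normalized log-likelihood."""
    return max(
        range(len(loglikelihoods)),
        key=lambda i: loglikelihoods[i] / len(choice_texts[i].encode("utf-8")),
    )

# Made-up per-choice scores for a single question (not real model output):
lls = [-9.5, -10.8, -15.1, -11.0]
choices = ["the Sun", "gravity from the Sun", "wind", "the Moon"]
print("acc prediction:     ", pick_acc(lls))                # 0: highest raw log-likelihood
print("acc_norm prediction:", pick_acc_norm(lls, choices))  # 1: highest per-byte log-likelihood
```

Because the two keys can rank the answer choices differently, it matters for PIQA, HellaSwag, ARC, and OBQA which of the two is reported.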