Add new Arabic benchmarks (5) and enhance existing tasks #372
Conversation
Add new Arabic benchmarks and update existing tasks

- Renamed `arabic_mmlu` to `arabic_mmlu_mt` to highlight its machine-translated origin.
- Added new benchmarks: `arabic_mmlu` ArabicMMLU (https://arxiv.org/abs/2402.12840), `arabic_mmlu_ht` (human-translated), and `MadinahQA` from MBZUAI, as well as `arabic_mmmlu` (OpenAI MMMLU) and `AraTrust`, a trustworthiness benchmark for Arabic LLMs (https://arxiv.org/abs/2403.09017).
- Enhanced prompt functions for better flexibility in answer options (see the sketch just below).
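For illustration, a minimal sketch of what such a flexible prompt function can look like. This is not the code added in this PR: the `Doc` field names, the dataset columns (`question`, `options`, `answer_index`) and the Arabic option labels are assumptions modelled on lighteval community tasks.

```python
# Sketch only: dataset columns and letter labels below are hypothetical.
from lighteval.tasks.requests import Doc

LETTER_INDICES_AR = ["أ", "ب", "ج", "د", "هـ"]  # assumed Arabic option labels


def arabic_mcq_prompt(line: dict, task_name: str | None = None) -> Doc:
    """Build a multiple-choice query from a variable number of answer options."""
    options = [opt for opt in line["options"] if opt]  # keep only non-empty options
    query = line["question"].strip() + "\n"
    for letter, option in zip(LETTER_INDICES_AR, options):
        query += f"{letter}. {option}\n"
    query += "الإجابة:"  # "Answer:"

    return Doc(
        task_name=task_name,
        query=query,
        choices=LETTER_INDICES_AR[: len(options)],  # as many letters as options
        gold_index=line["answer_index"],            # hypothetical 0-based gold index
    )
```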
Rename file to reflect that it is v1 leaderboard tasks
Tasks for v2 of OALL
add new and renamed tasks
|
Hi, first of all, thanks for adding the benches. Secondly, we have already added both arabic_mmmlu and openai_mmlu. I would prefer not to add duplicates, but I am open to discussing adding them with/without the instruction modifications |
|
Hey @hynky1999, thanks for your input!
I went ahead and added them just to test how the different implementations might affect the scores (hopefully they don’t!). I can run this test on my fork and compare the results with your version, and then we can decide whether to keep them or not. What do you think?
I haven’t fully wrapped my head around the templates yet; it might take me a few days. If you’re able to help with this integration in the meantime, feel free to contribute! Otherwise, I’ll try to get to it by next week at the latest. Also, I’m unsure why the format check is failing; I ran … Thanks! |
|
Hi! Thanks for the PR. For the formatting issues, you should use the pre-commit hooks. |
Fix formatting issues for
|
Thanks for fixing the formatting! You can find the doc on adding a new task using the prompt templates here; don't hesitate to reach out if you need any help :) |
|
Hey @NathanHB, thanks for pointing me to the docs on adding prompt templates. I'm planning to add that in a separate PR in the near future. For now I believe we can move on with this, unless it conflicts with the team's plan for future versions of LightEval. |
|
Hi! I think we can indeed move forward with this. One last thing before that: did you check the difference in results between your implementation and the current implementations of Arabic MMLU and OpenAI MMLU? |
|
Hey @clefourrier, unfortunately I was planning to, but it slipped through as I forgot to add it to my to-do list! I can prioritize this over the weekend and get back to you with an update by Monday. Does this timeline work? I assume there will be no difference (or at least that's what we hope for). Are `lighteval|mmlu_ara_mcf|0|0` and `lighteval|openai_mmlu_ara_mcf|0|0` the correct evals to compare? Let me know if there’s anything else to consider. |
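For readers unfamiliar with these pipe-separated task specs, a small sketch of how they decompose; the field meanings in the comments are assumptions based on lighteval's usual CLI conventions.

```python
# Sketch: decompose a lighteval task spec (field meanings assumed, not verified here).
spec = "lighteval|mmlu_ara_mcf|0|0"

suite, task, num_fewshot, truncate_fewshots = spec.split("|")
print(suite)              # evaluation suite the task is registered under
print(task)               # task name within that suite
print(num_fewshot)        # number of few-shot examples
print(truncate_fewshots)  # whether few-shot examples may be auto-truncated (0/1)
```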
|
Yep, the timeline works on our side! I think these are the correct evals. |
|
Yes these are the correct evals |
|
I'm sorry, but I don't understand why I keep hitting this error: …
My main command:
# Run the evaluation command
srun -N $SLURM_NNODES --ntasks=$SLURM_NTASKS --cpus-per-task=$SLURM_CPUS_PER_TASK --gres=gpu:$SLURM_GPUS_PER_NODE \
yes 'y' | lighteval accelerate \
--model_args "pretrained=$model,trust_remote_code=$TRUST_REMOTE_CODE" \
--tasks "lighteval|openai_mmlu_ara_mcf|0|0, lighteval|mmlu_ara_mcf|0|0" \
--override_batch_size 1 \
--output_dir=$RESULTS_DIR
I don't understand where I'm passing the … cc: @hynky1999, @NathanHB, @clefourrier |
|
You wanna run with … |
|
Oooh I thought … |
Add missing task: OpenAI's MMMLU arabic subset
Correct order
|
Hey @clefourrier, following up on your previous question:
Please find the results here, which I find pretty interesting!
I don't really see any correlation here: sometimes the score is higher with the community suite implementation, sometimes it is higher with the lighteval (@hynky1999) implementation. |
|
I think there is one consistency: …
Re chat models, did you run with templates?
In any case: for the MBZUAI MMLU, I am open to going with your implementation (since you are the creator), but would maybe prefer a middle ground, because it's still very hard-coded and doesn't allow switching to CF (cloze formulation): …
The task name in this case is: …
I think it's a good middle ground between the two: … |
|
I noticed this as well: …
But I agree that, for the sake of consistency, OpenAI MMMLU should be run through the multilingual lighteval suite. About this: …
I tried it and was still running into errors, so I just copy-pasted all the tasks into a txt file and ran that. |
|
Hey @NathanHB, is it already defined or should I define it? Anyway, I will try to get it done tomorrow. |
|
It is not yet defined using the templated prompts; @hynky1999 provided some code to use the templates!
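For readers without access to that snippet, here is a rough sketch of what a templated definition of this kind might look like. It is not the code from this PR: the import paths, the `get_mcq_prompt_function` adapter keys, and the `MBZUAI/ArabicMMLU` column names are assumptions based on lighteval's multilingual suite around the time of this discussion, so check the current docs before reusing it.

```python
# Sketch only: import paths, adapter keys and dataset columns are assumptions.
from lighteval.metrics.dynamic_metrics import loglikelihood_acc_metric
from lighteval.metrics.normalizations import LogProbTokenNorm
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.templates.multichoice import get_mcq_prompt_function
from lighteval.tasks.templates.utils.formulation import MCFFormulation
from lighteval.utils.language import Language

templated_arabic_mmlu_task = LightevalTaskConfig(
    name="templated_mmlu_ara_mcf:accounting_university",
    prompt_function=get_mcq_prompt_function(
        Language.ARABIC,
        # Adapter from a raw dataset row to the template inputs
        # ("question", "choices", "gold_idx" key names are assumed).
        lambda line: {
            "question": line["Question"],
            "choices": line["Options"],
            "gold_idx": line["Answer Index"],
        },
        formulation=MCFFormulation(),
    ),
    suite=["lighteval"],
    hf_repo="MBZUAI/ArabicMMLU",          # assumed dataset id
    hf_subset="Accounting (University)",  # assumed subset name
    evaluation_splits=["test"],
    metric=[loglikelihood_acc_metric(normalization=LogProbTokenNorm())],
)
```

The appeal of this "middle ground" is that switching between MCF and CF would then only mean changing the formulation argument, rather than rewriting the prompt function.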
|
|
Hey @hynky1999, I'm wondering which normalization is used exactly in:

metric=get_metrics_for_formulation(
    formulation,
    [
        loglikelihood_acc_metric(normalization=LogProbTokenNorm()),
        loglikelihood_acc_metric(normalization=LogProbCharNorm()),
        loglikelihood_acc_metric(normalization=LogProbPMINorm()),
    ],
),

For example, in our … |
|
Update on running the templated task:

[rank0]: raise ValueError(f"Cannot find tasks {task_name} in task list or in custom task registry)")
[rank0]: ValueError: Cannot find tasks lighteval|templated_mmlu_ara_mcf:accounting_university in task list or in custom task registry)

The task is defined in … |
|
Because it's not there! You have to create it, or give me rights to push to your branch |
Yes, the explanation is simple: …
Lastly, why bother with changing normalizations? If you normalize with chars/tokens for MCF you will get the same results, because the targets are just a single token, right? Well yes, but we also use PMI normalization for some tasks, which makes evals 2x more expensive (you have to run two logprob calls for a single sample), so that's why we bother with changing the norms for MCF.
PS: I noticed a bug yesterday, and the produced metric is called … |
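To make the cost argument concrete, here is a small self-contained toy (hypothetical numbers, not lighteval internals) showing why token and character normalization coincide for single-letter MCF targets, and why PMI normalization needs a second log-prob call.

```python
# Toy illustration only: the log-probabilities below are made up.
cond_logprobs = {"A": -1.2, "B": -0.7, "C": -2.3, "D": -1.9}  # logP(letter | prompt)

# Token norm divides by the number of target tokens, char norm by the number of
# characters; a single-letter target makes both divisors 1, so the argmax
# (and hence the accuracy) is identical.
token_norm = {k: v / 1 for k, v in cond_logprobs.items()}
char_norm = {k: v / len(k) for k, v in cond_logprobs.items()}
assert max(token_norm, key=token_norm.get) == max(char_norm, key=char_norm.get)

# PMI norm subtracts the *unconditional* log-prob of the same target, which
# requires a second log-prob request per sample, roughly doubling the eval cost.
uncond_logprobs = {"A": -1.0, "B": -1.1, "C": -1.3, "D": -1.2}  # also made up
pmi_norm = {k: cond_logprobs[k] - uncond_logprobs[k] for k in cond_logprobs}

print(max(token_norm, key=token_norm.get))  # "B"
print(max(pmi_norm, key=pmi_norm.get))      # "B" with these numbers, but it can differ
```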
Indeed, I did define it in my local … |
|
🤔 it's tough to tell, can you push it ? |
|
cc @alielfilali01 could you push the changes so I can check what's wrong? |
Adding a templated version of arabic mmlu based on @hynky1999 request in the #372 PR
|
Hey @hynky1999, so sorry for the delay! I was planning to push after seeing your first comment, but I got carried away with other things and it totally slipped my mind! Thanks for the reminder too. |
|
Hey @hynky1999, have you managed to take a look yet? |
|
@alielfilali01 I was a bit busy. The issue is that you didn't add it to TASK_TABLE. |
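For anyone following along, a minimal sketch of the registration step being referred to, under the assumption that lighteval's registry only picks up task configs listed in the module-level `TASKS_TABLE` of the tasks file (the `existing_tasks` name below is a placeholder):

```python
# Sketch with placeholder names: defining the templated configs in their own
# list is not enough; they also have to end up in the module-level TASKS_TABLE
# that the task registry scans.

existing_tasks: list = []               # placeholder for configs already in the file
arabic_mmlu_templated_tasks: list = []  # the new templated LightevalTaskConfig objects

# Without this, the CLI fails with
# "Cannot find tasks lighteval|templated_mmlu_ara_mcf:... in task list or in custom task registry".
TASKS_TABLE = existing_tasks + arabic_mmlu_templated_tasks
```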
|
Ooooh 😶 alright will do tomorrow and get back to you on status |
remove arabic_mmlu_templated_tasks
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
Doing one last run of the tests and we should be good to go |
* Update arabic_evals.py: add new Arabic benchmarks and update existing tasks. Renamed `arabic_mmlu` to `arabic_mmlu_mt` to highlight its machine-translated origin. Added new benchmarks: `arabic_mmlu` ArabicMMLU (https://arxiv.org/abs/2402.12840), `arabic_mmlu_ht` (human-translated), and `MadinahQA` from MBZUAI, as well as `arabic_mmmlu` (OpenAI MMMLU) and `AraTrust`, a trustworthiness benchmark for Arabic LLMs (https://arxiv.org/abs/2403.09017). Enhanced prompt functions for better flexibility in answer options.
* Update and rename OALL_tasks.txt to OALL_v1_tasks.txt: rename the file to reflect that it holds the v1 leaderboard tasks.
* Create OALL_v2_tasks.txt: tasks for v2 of OALL.
* Update all_arabic_tasks.txt: add new and renamed tasks.
* Update arabic_evals.py: fix formatting issues.
* Update all_arabic_tasks.txt: add missing task (OpenAI's MMMLU Arabic subset).
* Update all_arabic_tasks.txt: correct order.
* Update arabic_evals.py: remove the openai mmmlu task following the discussion here: #372.
* Update all_arabic_tasks.txt: remove the openai mmmlu task following the discussion here: #372.
* Update tasks.py: add a templated version of arabic mmlu based on @hynky1999's request in the #372 PR.
* Update tasks.py: remove arabic_mmlu_templated_tasks.

---------

Co-authored-by: Clémentine Fourrier <[email protected]>
Co-authored-by: Nathan Habib <[email protected]>
Renamed `arabic_mmlu` to `arabic_mmlu_mt` to highlight its machine-translated origin.

Introduced three new MMLU-style benchmarks:

- `arabic_mmlu`: native Arabic MMLU benchmark introduced by MBZUAI, based on the official "ArabicMMLU" paper (https://arxiv.org/abs/2402.12840).
- `arabic_mmlu_ht`: human-translated version from MBZUAI, providing a more accurate, high-quality translation of the original work by Hendrycks et al. (2021) on Measuring Massive Multitask Language Understanding (MMLU).
- `arabic_mmmlu`: Arabic subset of OpenAI's Multilingual MMLU (MMMLU), which is human-annotated and targets similar subjects.

Added the AraTrust benchmark: a trustworthiness benchmark for Arabic LLMs (https://arxiv.org/abs/2403.09017).

Added the MadinahQA benchmark from MBZUAI.

Comparative study across the different versions of Arabic MMLU:

- `arabic_mmlu_mt` (machine-translated using an NMT engine) shows competitive results compared to the human-translated versions, indicating the efficacy of the translation engine.
- `arabic_mmlu_okapi`, which was translated using the GPT-3.5 API (ChatGPT), shows lower correlation and performance, reflecting potential flaws and lower translation quality.

The attached table below shows the comparative analysis of model performances across the different Arabic MMLU datasets.

cc: @clefourrier, @NathanHB, @hynky1999