- lazy load the openai and transformers libraries
- allow using the openai client to prompt transformers models
- add Llama 3.1 405B as an LLM-as-a-judge in the default metrics
- update the docs
- make sure we are not rate limited
- add an option to use a local transformers model as judge
---------
Co-authored-by: anilaltuner <[email protected]>
Co-authored-by: Clémentine Fourrier <[email protected]>
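
As a rough sketch of the lazy loading and judge-backend selection described above (the helper names `_lazy_import` and `get_judge_backend` are illustrative, not the actual lighteval code):

```python
# Illustrative sketch only; `_lazy_import` and `get_judge_backend` are
# hypothetical names, not the actual lighteval implementation.
import importlib


def _lazy_import(module_name: str):
    """Import a heavy optional dependency (openai, transformers) only when first needed."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(f"`{module_name}` is required for this judge backend") from err


def get_judge_backend(use_openai_api: bool, judge_model: str):
    if use_openai_api:
        openai = _lazy_import("openai")
        # The same client can also point at an OpenAI-compatible endpoint serving
        # a transformers model, which is how a non-OpenAI judge can be prompted.
        return openai.OpenAI()
    transformers = _lazy_import("transformers")
    # Local judge: load the model with transformers instead of calling an API.
    return transformers.pipeline("text-generation", model=judge_model)
```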
README.md: +5 lines (5 additions, 0 deletions)
@@ -415,6 +415,11 @@ These metrics need the model to generate an output. They are therefore slower.
- `maj_at_4_math` (Lighteval): Majority choice evaluation, using the math normalisation for the predictions and gold
- `quasi_exact_match_gsm8k` (Harness): Fraction of instances where the normalized prediction matches the normalized gold (normalization done for gsm8k, where latex symbols, units, etc. are removed)
- `maj_at_8_gsm8k` (Lighteval): Majority choice evaluation, using the gsm8k normalisation for the predictions and gold
+ - LLM-as-Judge:
+ - `llm_judge_gpt3p5`: Can be used for any generative task; the model will be scored by a GPT-3.5 model using the OpenAI API
+ - `llm_judge_llama_3_405b`: Can be used for any generative task; the model will be scored by a Llama 3 405B model using the OpenAI API
+ - `llm_judge_multi_turn_gpt3p5`: Can be used for any generative task; the model will be scored by a GPT-3.5 model using the OpenAI API. It is used for multi-turn tasks like mt-bench.
+ - `llm_judge_multi_turn_llama_3_405b`: Can be used for any generative task; the model will be scored by a Llama 3 405B model using the OpenAI API. It is used for multi-turn tasks like mt-bench.
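
For context on how these LLM-as-Judge metrics work, here is a minimal sketch of a judge call through the OpenAI client; the prompt wording, model name, and 1-10 scale below are assumptions for illustration, not lighteval's actual judge template:

```python
# Illustrative only: a bare-bones LLM-as-judge call through the OpenAI client.
# The judge prompt and 1-10 scale are assumptions, not lighteval's actual template.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY; can also target an OpenAI-compatible endpoint via base_url


def judge_score(question: str, answer: str, judge_model: str = "gpt-3.5-turbo") -> float:
    prompt = (
        "Rate the following answer from 1 (worst) to 10 (best). "
        "Reply with the number only.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())
```

The multi-turn variants apply the same idea to each turn of a conversation, as in mt-bench.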
### Metrics for specific tasks
To keep compatibility with the Harness for some specific tasks, we ported their evaluations more or less as-is. They include `drop` (for the DROP dataset) and `truthfulqa_mc_metrics` (for TruthfulQA). In general, except for tasks where the dataset has very different formatting than usual (another language, programming language, math, ...), we want to use standard implementations of the above metrics; it makes little sense to have 10 different versions of an exact match depending on the task. However, most of the above metrics are parametrizable, so you can easily change the normalization applied for experimental purposes.
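
As an illustration of that parametrization (the names below are hypothetical, not lighteval's actual API), an exact-match metric with a swappable normalization function could look like:

```python
# Hypothetical illustration of a parametrizable exact-match metric; the names
# below are not lighteval's API, only a sketch of the idea described above.
from typing import Callable


def make_exact_match(normalize: Callable[[str], str] = str.strip):
    """Build an exact-match metric whose normalization can be swapped out."""
    def exact_match(prediction: str, gold: str) -> float:
        return float(normalize(prediction) == normalize(gold))
    return exact_match


# Usage: the same metric with two different normalizations.
default_em = make_exact_match()
lowercase_em = make_exact_match(normalize=lambda s: s.strip().lower())
```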