From d0060084dda3ed5592a2ea625d0817181e57efd8 Mon Sep 17 00:00:00 2001
From: Chuan Du
Date: Wed, 18 Sep 2024 00:23:26 -0700
Subject: [PATCH 1/3] Clarify API required for running llm-as-judge models.

---
 README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 9f47d218c..a5d4ec357 100644
--- a/README.md
+++ b/README.md
@@ -429,10 +429,10 @@ These metrics need the model to generate an output. They are therefore slower.
 - `quasi_exact_match_gsm8k` (Harness): Fraction of instances where the normalized prediction matches the normalized gold (normalization done for gsm8k, where latex symbols, units, etc are removed)
 - `maj_at_8_gsm8k` (Lighteval): Majority choice evaluation, using the gsm8k normalisation for the predictions and gold
 - LLM-as-Judge:
-  - `llm_judge_gpt3p5`: Can be used for any generative task, the model will be scored by a GPT3.5 model using the openai API
-  - `llm_judge_llama_3_405b`: Can be used for any generative task, the model will be scored by a Llama 3.405B model using the openai API
-  - `llm_judge_multi_turn_gpt3p5`: Can be used for any generative task, the model will be scored by a GPT3.5 model using the openai API. It is used for multiturn tasks like mt-bench.
-  - `llm_judge_multi_turn_llama_3_405b`: Can be used for any generative task, the model will be scored by a Llama 3.405B model using the openai API. It is used for multiturn tasks like mt-bench.
+  - `llm_judge_gpt3p5`: Can be used for any generative task, the model will be scored by a GPT3.5 model using the OpenAI API
+  - `llm_judge_llama_3_405b`: Can be used for any generative task, the model will be scored by a Llama 3.405B model using the HuggingFace API
+  - `llm_judge_multi_turn_gpt3p5`: Can be used for any generative task, the model will be scored by a GPT3.5 model using the HuggingFace API. It is used for multiturn tasks like mt-bench.
+  - `llm_judge_multi_turn_llama_3_405b`: Can be used for any generative task, the model will be scored by a Llama 3.405B model using the OpenAI API. It is used for multiturn tasks like mt-bench.
 
 ### Metrics for specific tasks
 To keep compatibility with the Harness for some specific tasks, we ported their evaluations more or less as such. They include `drop` (for the DROP dataset) and `truthfulqa_mc_metrics` (for TruthfulQA). In general, except for tasks where the dataset has very different formatting than usual (another language, programming language, math, ...), we want to use standard implementations of the above metrics. It makes little sense to have 10 different versions of an exact match depending on the task. However, most of the above metrics are parametrizable so that you can change the normalization applied easily for experimental purposes.

From 2f7f469110a7bda5362abed9cf32c83565a8e703 Mon Sep 17 00:00:00 2001
From: Chuan Du
Date: Wed, 18 Sep 2024 00:26:36 -0700
Subject: [PATCH 2/3] Fix APIs names that were reversed in the previous commit.

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index a5d4ec357..d1ecb7050 100644
--- a/README.md
+++ b/README.md
@@ -431,8 +431,8 @@ These metrics need the model to generate an output. They are therefore slower.
 - LLM-as-Judge:
   - `llm_judge_gpt3p5`: Can be used for any generative task, the model will be scored by a GPT3.5 model using the OpenAI API
   - `llm_judge_llama_3_405b`: Can be used for any generative task, the model will be scored by a Llama 3.405B model using the HuggingFace API
-  - `llm_judge_multi_turn_gpt3p5`: Can be used for any generative task, the model will be scored by a GPT3.5 model using the HuggingFace API. It is used for multiturn tasks like mt-bench.
-  - `llm_judge_multi_turn_llama_3_405b`: Can be used for any generative task, the model will be scored by a Llama 3.405B model using the OpenAI API. It is used for multiturn tasks like mt-bench.
+  - `llm_judge_multi_turn_gpt3p5`: Can be used for any generative task, the model will be scored by a GPT3.5 model using the OpenAI API. It is used for multiturn tasks like mt-bench.
+  - `llm_judge_multi_turn_llama_3_405b`: Can be used for any generative task, the model will be scored by a Llama 3.405B model using the HuggingFace API. It is used for multiturn tasks like mt-bench.
 
 ### Metrics for specific tasks
 To keep compatibility with the Harness for some specific tasks, we ported their evaluations more or less as such. They include `drop` (for the DROP dataset) and `truthfulqa_mc_metrics` (for TruthfulQA). In general, except for tasks where the dataset has very different formatting than usual (another language, programming language, math, ...), we want to use standard implementations of the above metrics. It makes little sense to have 10 different versions of an exact match depending on the task. However, most of the above metrics are parametrizable so that you can change the normalization applied easily for experimental purposes.

From 79f778e9223c0fb42e6a86b6b0c669ff9e3cb7d7 Mon Sep 17 00:00:00 2001
From: Chuan Du
Date: Mon, 23 Sep 2024 18:42:47 -0700
Subject: [PATCH 3/3] Fix precommit hook installation command.

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index d1ecb7050..cafe23d5f 100644
--- a/README.md
+++ b/README.md
@@ -70,7 +70,7 @@ and pasting your access token.
 Lastly, if you intend to push to the code base, you'll need to install the precommit hook for styling tests:
 
 ```bash
-pip install .[dev]
+pip install '.[dev]'
 pre-commit install
 ```
 
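
A practical note on the judge metrics in the first two patches: each judge is called through an external API, so the matching credentials must be in the environment before the evaluation runs. The sketch below is an assumption based on the default environment variables of the `openai` and `huggingface_hub` Python clients (`OPENAI_API_KEY` and `HF_TOKEN`); the patches themselves do not name the variables.

```bash
# Hedged sketch: variable names assumed from the usual defaults of the
# openai and huggingface_hub clients, not taken from the patches above.

# For `llm_judge_gpt3p5` and `llm_judge_multi_turn_gpt3p5`,
# which score outputs through the OpenAI API:
export OPENAI_API_KEY="sk-..."   # placeholder key

# For `llm_judge_llama_3_405b` and `llm_judge_multi_turn_llama_3_405b`,
# which score outputs through the HuggingFace API:
export HF_TOKEN="hf_..."         # placeholder token
```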
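The quoting fix in the final patch matters because some shells expand unquoted square brackets. In zsh, `[dev]` is parsed as a character-class glob, and since no file matches `.[dev]` the command aborts before pip ever sees the argument; bash normally passes an unmatched pattern through unchanged, which is why the unquoted form appears to work there. Quoting is the portable form:

```bash
# Unquoted: works in bash (unmatched globs are passed through by default),
# but fails in zsh with a "no matches found: .[dev]" error.
pip install .[dev]

# Quoted: the brackets reach pip literally, so the dev extras are installed
# regardless of the shell's globbing behavior.
pip install '.[dev]'
```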