diff --git a/docs/evaluation/robustness.md b/docs/evaluation/robustness.md
index a06de02842..c4c6a9fee1 100644
--- a/docs/evaluation/robustness.md
+++ b/docs/evaluation/robustness.md
@@ -70,12 +70,10 @@ The following metrics are calculated:
 * **Aggregated Benchmark Statistics**: For each benchmark across all prompts and seeds, the script calculates:
     - `min`, `max`, `avg`, `std`: Statistical metrics across all runs per benchmark.
-    - `CR` (Consistency Rate): The average rate of agreement of model predictions on the same datapoint across different runs.
     - `prompt_sensitivity`: The standard deviation of the average scores across different prompts, which measures how sensitive the model's accuracy is to prompt variations.
 * **Per-Prompt Statistics**: For each prompt across all random seeds, the script calculates:
     - `min`, `max`, `avg`, `std`: Statistical metrics for a single prompt across seeds.
-    - `CR` (Consistency Rate): The average rate of agreement of model predictions on the same question across different runs.
     - `no_answer`: The proportion of questions for which no answer was extracted from the generation, either due to a wrong answer format or no answer at all (can be used to find prompts that break the model predictions).
@@ -84,34 +82,29 @@ First, for each benchmark, metrics are aggregated across all prompts and seeds.
 All calculated metrics are also saved to `output_dir/metrics.json`.
 ```
-dataset | min | max | avg | std | CR | prompt_sensitivity
--------------------------------------------------------------------------------------------
-comp-math-24-25@80 | 48.05 | 53.91 | 51.10 | 1.60 | 55.25 | 0.34
-gpqa@80 | 50.51 | 60.61 | 55.51 | 2.44 | 65.15 | 0.77
+dataset | min | max | avg | std | prompt_sensitivity
+----------------------------------------------------------------------------------
+comp-math-24-25@80 | 48.05 | 53.91 | 51.10 | 1.60 | 0.34
+gpqa@80 | 50.51 | 60.61 | 55.51 | 2.44 | 0.77
-------------------------------------- comp-math-24-25 -------------------------------------
-prompt@8 | min | max | avg | std | CR | no_answer
+------------------------------------- comp-math-24-25 ----------------------------
+prompt@8 | min | max | avg | std | no_answer
 ----------------------------------------------------------------------------------
-prompt_1 | 48.05 | 53.91 | 50.76 | 1.61 | 55.48 | 1.56
+prompt_1 | 48.05 | 53.91 | 50.76 | 1.61 | 1.56
 ...
-prompt_10 | 48.44 | 53.91 | 51.44 | 1.52 | 55.13 | 1.66
+prompt_10 | 48.44 | 53.91 | 51.44 | 1.52 | 1.66
 -------------------------------------- gpqa --------------------------------------
-prompt@8 | min | max | avg | std | CR | no_answer
+prompt@8 | min | max | avg | std | no_answer
 ----------------------------------------------------------------------------------
-prompt_1 | 50.51 | 60.61 | 54.73 | 2.68 | 64.20 | 3.03
+prompt_1 | 50.51 | 60.61 | 54.73 | 2.68 | 3.03
 ...
-prompt_10 | 53.54 | 60.10 | 56.28 | 1.88 | 66.59 | 2.78
+prompt_10 | 53.54 | 60.10 | 56.28 | 1.88 | 2.78
 ```
 
-### Consistency Rate
-For each datapoint, collect all predictions and calculate the similarity between all possible pairs of predictions.
-The consistency rate is the number of pairs of equivalent prediction pairs divided by the total number of prediction pairs (N choose 2).
-Example: For a datapoint with predictions [A, A, C] across 3 files, it will compare pairs (A, A), (A, C), and (A, C), and the consistency rate will be 1/3 = 33.33%.
-Consistency rate is proposed in [Improving the Robustness of Large Language Models via Consistency Alignment](https://arxiv.org/abs/2403.14221).
 
 ## Notes on Usage
 
-- There are 10 Math and 10 MCQ prompts in the `prompt/config/robustness` folder, along with the prompt_set_config.yaml. Those prompts vary by prompt wording and problem placement. MCQ prompts also vary by answer formatting instruction, while Math prompts use only \boxed{} format. These prompts can be used with any Math (AIME, comp-math-24-25, etc) and MCQ (GPQA, MMLU-Pro, etc) benchmarks.
-- robust_eval can be used with any dataset that Nemo-Skills supports, but summarize_robustness works on Math and MCQ datasets (for now). If you need evaluations on multiple prompts, you can still use robust_eval. However, the `summarize_robustness` part won't work.
+- There are 10 Math, 10 MCQ, and 7 LiveCodeBench prompts in the `prompt/config/robustness` folder, along with the prompt_set_config.yaml. These prompts vary by prompt wording and problem placement. MCQ prompts also vary by answer formatting instruction, while Math prompts use only the \boxed{} format. `prompt/config/robustness/math_prompts` can be used with any Math benchmark (AIME, comp-math-24-25, etc.) and `prompt/config/robustness/mcq_prompts` with any MCQ benchmark (GPQA, MMLU-Pro, etc.).
+- robust_eval can be used with any dataset that Nemo-Skills supports, but summarize_robustness currently works only on Math, MCQ, and LiveCodeBench datasets, plus any dataset with judge evaluation. For other datasets you can still run robust_eval with multiple prompts; only the `summarize_robustness` step won't work.
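Reviewer note on the `prompt_sensitivity` metric described in the docs above: it is the population standard deviation of the per-prompt average scores (matching `np.std`'s default `ddof=0`). A minimal self-contained sketch with made-up numbers (not from a real run):

```python
import statistics

def prompt_sensitivity(per_prompt_scores: dict[str, list[float]]) -> float:
    """Population std of per-prompt average scores (same as np.std with ddof=0)."""
    prompt_avgs = [sum(scores) / len(scores) for scores in per_prompt_scores.values()]
    return statistics.pstdev(prompt_avgs)

# Illustrative scores for two prompts across three seeds
scores = {
    "prompt_1": [50.0, 52.0, 51.0],  # avg 51.0
    "prompt_2": [48.0, 50.0, 49.0],  # avg 49.0
}
print(prompt_sensitivity(scores))  # 1.0
```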
diff --git a/nemo_skills/evaluation/evaluator/mcq.py b/nemo_skills/evaluation/evaluator/mcq.py
index 2b0eeea10c..821f1a47f8 100644
--- a/nemo_skills/evaluation/evaluator/mcq.py
+++ b/nemo_skills/evaluation/evaluator/mcq.py
@@ -31,7 +31,7 @@ class MCQEvaluatorConfig(BaseEvaluatorConfig):
     # only used if extract_from_boxed is False
     extract_regex: str = r"The final answer is (.+)$"
     # if relaxed is True:
-    # extract from boxed FIRST, if not found, extract from regex
+    # extract from regex FIRST, if not found, extract from boxed
     # if relaxed is False:
     # if extract_from_boxed is True -> extract from boxed{} ONLY
     # else extract from regex ONLY
diff --git a/nemo_skills/evaluation/math_grader.py b/nemo_skills/evaluation/math_grader.py
index dd75529c4b..4000265374 100644
--- a/nemo_skills/evaluation/math_grader.py
+++ b/nemo_skills/evaluation/math_grader.py
@@ -103,11 +103,11 @@ def extract_answer(
     string: str, extract_from_boxed: bool = True, extract_regex: str = r"The final answer is (.+)$", relaxed=False
 ):
     """Extract Answer String from \\boxed expression or based on regex
-    If relaxed=True: try both methods, boxed first.
+    If relaxed=True: try both methods, regex first.
     If relaxed=False: use only one method based on extract_from_boxed flag.
     """
     if relaxed:
-        return search_boxed(string) or search_regex(string, extract_regex)
+        return search_regex(string, extract_regex) or search_boxed(string)
 
     if extract_from_boxed:
         return search_boxed(string)
diff --git a/nemo_skills/pipeline/robust_eval.py b/nemo_skills/pipeline/robust_eval.py
index 0ea2f820b1..1959f12ae7 100644
--- a/nemo_skills/pipeline/robust_eval.py
+++ b/nemo_skills/pipeline/robust_eval.py
@@ -13,6 +13,7 @@
 # limitations under the License.
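For clarity, here is a standalone sketch of the new `relaxed` extraction order introduced above. `search_boxed`/`search_regex` are simplified stand-ins for the library's helpers (the real `search_boxed` handles nested braces), not the actual implementations:

```python
import re

def search_boxed(string):
    # Simplified stand-in: the real helper also handles nested braces
    m = re.search(r"\\boxed\{([^{}]*)\}", string)
    return m.group(1) if m else None

def search_regex(string, extract_regex):
    m = re.search(extract_regex, string, re.MULTILINE)
    return m.group(1) if m else None

def extract_answer(string, extract_from_boxed=True,
                   extract_regex=r"The final answer is (.+)$", relaxed=False):
    if relaxed:
        # New behavior: regex takes priority, \boxed{} is the fallback
        return search_regex(string, extract_regex) or search_boxed(string)
    if extract_from_boxed:
        return search_boxed(string)
    return search_regex(string, extract_regex)

text = "Reasoning gives \\boxed{B}. The final answer is C"
print(extract_answer(text, relaxed=True))   # 'C' -- before this change it was 'B'
print(extract_answer(text, relaxed=False))  # 'B' (boxed only)
```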
 import inspect
 import logging
+import shlex
 from copy import deepcopy
 from dataclasses import dataclass
 from pathlib import Path
@@ -115,7 +116,11 @@ def robust_eval(
         prompt_context = deepcopy(ctx)
         prompt = PromptConfig(**prompt)
         if prompt.extract_regex:
-            prompt_context.args.append(f"++eval_config.extract_regex='\"{prompt.extract_regex}\"'")
+            hydra_arg = f"++eval_config.extract_regex='{prompt.extract_regex}'"
+            # Quote properly so the argument is passed correctly to ns eval in the terminal
+            shell_safe_arg = shlex.quote(shlex.quote(hydra_arg))
+            prompt_context.args.append(shell_safe_arg)
+
         prompt_context.args.append(f"++prompt_config={prompt.prompt_config}")
 
         prompt_kwargs = deepcopy(ns_eval_kwargs)
diff --git a/nemo_skills/pipeline/summarize_robustness.py b/nemo_skills/pipeline/summarize_robustness.py
index bc7828b7d1..d36d4bc184 100644
--- a/nemo_skills/pipeline/summarize_robustness.py
+++ b/nemo_skills/pipeline/summarize_robustness.py
@@ -17,8 +17,6 @@
 import logging
 import os
 import tempfile
-from collections import defaultdict
-from itertools import combinations
 from pathlib import Path
 from typing import List, Optional
@@ -58,7 +56,7 @@ def get_metrics(prediction_files: List[str]) -> List[float] | List[float]:
     per_file_metrics = []
     no_answer = []
     for pred_file in prediction_files:
-        metrics_calculator = ComputeMetrics(benchmark="custom", metric_type="math", max_samples=-1)
+        metrics_calculator = ComputeMetrics(benchmark=Path(pred_file).parent.name, max_samples=-1)
         metrics_calculator.calculator = metrics_calculator.get_metrics_calculator()
 
         with open(pred_file, "rt", encoding="utf-8") as f:
@@ -66,60 +64,18 @@ def get_metrics(prediction_files: List[str]) -> List[float] | List[float]:
                 data = read_predictions([line], idx, [f])
                 metrics_calculator.calculator.update(data)
         metrics = metrics_calculator.calculator.get_metrics()
-        per_file_metrics.append(metrics["pass@1"]["symbolic_correct"])
-        no_answer.append(metrics["pass@1"]["no_answer"])
+        for acc_key in ["judge_correct", "symbolic_correct", "accuracy"]:
+            if acc_key in metrics["pass@1"]:
+                per_file_metrics.append(metrics["pass@1"][acc_key])
+                break
+        else:
+            LOG.warning(f"Could not find accuracy metric in {pred_file}, setting to -1.")
+            per_file_metrics.append(-1)
+        no_answer.append(metrics["pass@1"].get("no_answer", -1))
 
     return per_file_metrics, no_answer
 
 
-def calculate_similarity(answer1: str | None, answer2: str | None) -> float:
-    if answer1 is None and answer2 is None:
-        return 0
-    return 1 if answer1 == answer2 else 0
-
-
-def calculate_consistency_rate(input_files: List[str]) -> float:
-    """Calculate the consistency rate across multiple input files.
-    Metric proposed in https://arxiv.org/abs/2403.14221
-
-    Args:
-        input_files: List of file paths containing predictions
-
-    Returns:
-        float: Average consistency rate as a percentage (0-100)
-
-    For each datapoint, collect all predictions, and
-    calculate similarity between all possible pairs of predictions.
-    The consistency rate is the number of pairs of equivalent prediction pairs
-    divided by the total number of prediction pairs (N choose 2).
-
-    Example:
-        If datapoint i has predictions [A, A, C] across 3 files, it will
-        compare pairs (A,A), (A, C) and (A, C) and consistency rate will be 1/3 = 33.33%.
-
-    """
-    per_idx_preds = defaultdict(list)
-    for inp_f in input_files:
-        with open(inp_f, "rt", encoding="utf-8") as f:
-            for idx, line in enumerate(f):
-                data = read_predictions([line], idx, [f])
-                per_idx_preds[idx].append(data[0]["predicted_answer"])
-    responses = per_idx_preds.values()
-    total_similarity = 0
-    total_combinations = 0
-
-    for response_set in responses:
-        if len(response_set) < 2:
-            continue
-        for answer1, answer2 in combinations(response_set, 2):
-            total_similarity += calculate_similarity(answer1, answer2)
-            total_combinations += 1
-
-    if total_combinations == 0:
-        return 100.0
-    return round(total_similarity / total_combinations * 100, 2)
-
-
 @app.command()
 @typer_unpacker
 def summarize_robustness(
@@ -168,7 +124,6 @@ def summarize_robustness(
     Calculates the following both per benchmark across prompts and per prompt across random seeds:
     - Statistical metrics: min, max, average, standard deviation
-    - Consistency Rate (CR): Agreement between different model runs
     - No-answer rate: Proportion of questions without answers
     - Cross-prompt standard deviation of averages
@@ -216,8 +171,10 @@ def summarize_robustness(
     if benchmarks_paths:
         # Ascertain that the benchmarks_paths are valid
         for benchmark_path in benchmarks_paths:
-            # Valid benchmark_path should contain output*jsonl files
-            if len(glob.glob(f"{benchmark_path}/**/output*jsonl", recursive=True)) == 0:
+            # Valid benchmark_path should contain output*jsonl files excluding output.jsonl and chunked files
+            pred_files = glob.glob(f"{benchmark_path}/**/eval-results/*/output*jsonl", recursive=True)
+            pred_files = [f for f in pred_files if Path(f).name != "output.jsonl" and "_chunk_" not in Path(f).name]
+            if len(pred_files) == 0:
                 raise ValueError(f"The benchmark directory {benchmark_path} lacks output*jsonl files.")
     else:
         print(f"No benchmarks found in {results_dir}")
@@ -227,15 +184,13 @@ def summarize_robustness(
     print("Calculating robustness metrics for benchmarks found:", benchmarks_paths)
     for benchmark_path in sorted(benchmarks_paths):  # sorting to ensure consistent order
         benchmark = str(Path(benchmark_path).name)
-        if not Path(benchmark_path).is_dir():
-            continue
-
         metrics_to_print[benchmark] = dict()
         # calculate metrics per prompt
         all_eval_metrics = []
         for prompt_dir in sorted(glob.glob(f"{benchmark_path}/*")):
             prompt_name = str(Path(prompt_dir).name)
-            input_files = glob.glob(f"{prompt_dir}/**/output-rs*.jsonl", recursive=True)
+            input_files = glob.glob(f"{prompt_dir}/**/eval-results/*/output-rs*.jsonl", recursive=True)
+            input_files = [f for f in input_files if Path(f).name != "output.jsonl" and "_chunk_" not in Path(f).name]
             if not input_files:
                 print("No input files found for prompt", prompt_dir)
                 continue
@@ -249,9 +204,6 @@ def summarize_robustness(
                 "num_seeds": len(per_file_metrics),
             }
             all_eval_metrics.extend(per_file_metrics)
-            # calculate consistency rate per prompt
-            consistency_rate = calculate_consistency_rate(input_files)
-            metrics_to_print[benchmark][prompt_name]["CR"] = consistency_rate
 
         # calculate metrics across all prompts and seeds
         metrics_to_print[benchmark]["aggregated"] = {
@@ -262,16 +214,13 @@ def summarize_robustness(
             "num_runs": len(all_eval_metrics),
         }
 
-        input_files = glob.glob(f"{benchmark_path}/**/output-rs*.jsonl", recursive=True)
-        consistency_rate = calculate_consistency_rate(input_files)
-        metrics_to_print[benchmark]["aggregated"]["CR"] = consistency_rate
-
     # calculate the std of prompt averages
     for benchmark, metrics in metrics_to_print.items():
         prompt_avgs = [m["avg"] for k, m in metrics.items() if k != "aggregated"]
-        metrics_to_print[benchmark]["aggregated"]["prompt_sensitivity"] = np.std(prompt_avgs)
+        prompt_std = np.std(prompt_avgs)
+        metrics_to_print[benchmark]["aggregated"]["prompt_sensitivity"] = prompt_std
 
-    header_fields = ["min", "max", "avg", "std", "CR", "prompt_sensitivity"]
+    header_fields = ["min", "max", "avg", "std", "prompt_sensitivity"]
     header = f"{'dataset':<20} | "
     header += " | ".join(f"{stat}".center(7) for stat in header_fields)
     print(header)
@@ -279,14 +228,15 @@ def summarize_robustness(
     # Print aggregated stats
     for benchmark in metrics_to_print.keys():
         bench_runs = f"{benchmark}@{metrics_to_print[benchmark]['aggregated']['num_runs']}"
-        row = f"{bench_runs:<20} | " + " | ".join(
-            f"{metrics_to_print[benchmark]['aggregated'][stat]:.2f}".center(7) for stat in header_fields
-        )
-        print(row)
+        row = f"{bench_runs:<20} | "
+        for stat in header_fields:
+            value = metrics_to_print[benchmark]["aggregated"][stat]
+            row += f"{value:.2f}".center(max(len(stat), 7)) + " | "
+        print(row[:-3])
     print("\n")
 
     # Print stats per prompt
-    header_fields = ["min", "max", "avg", "std", "CR", "no_answer"]
+    header_fields = ["min", "max", "avg", "std", "no_answer"]
     for benchmark, metrics in metrics_to_print.items():
         print(f" {benchmark} ".center(len(header), "-"))
         num_seeds = metrics["aggregated"]["num_runs"] // (len(metrics) - 1)  # excluding aggregated
diff --git a/nemo_skills/prompt/config/robustness/code_prompts/aai_prompt.yaml b/nemo_skills/prompt/config/robustness/code_prompts/aai_prompt.yaml
new file mode 100644
index 0000000000..51b474eee9
--- /dev/null
+++ b/nemo_skills/prompt/config/robustness/code_prompts/aai_prompt.yaml
@@ -0,0 +1,10 @@
+# https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/aai/livecodebench.yaml
+# almost identical to https://artificialanalysis.ai/methodology/intelligence-benchmarking#intelligence-index-evaluation-suite-overview
+# except we don't add ### before Format.
+# The starter code and formatting instructions are already included inside question by prepare.py
+
+user: |-
+    ### Question:
+    {question}
+
+    ### Answer: (use the provided format with backticks)
diff --git a/nemo_skills/prompt/config/robustness/code_prompts/code_1.yaml b/nemo_skills/prompt/config/robustness/code_prompts/code_1.yaml
new file mode 100644
index 0000000000..006f8d1bda
--- /dev/null
+++ b/nemo_skills/prompt/config/robustness/code_prompts/code_1.yaml
@@ -0,0 +1,3 @@
+user: |-
+    {question}
+    Generate a correct Python program that matches the specification and passes all tests for this question.
diff --git a/nemo_skills/prompt/config/robustness/code_prompts/code_2.yaml b/nemo_skills/prompt/config/robustness/code_prompts/code_2.yaml
new file mode 100644
index 0000000000..bd4cea64b3
--- /dev/null
+++ b/nemo_skills/prompt/config/robustness/code_prompts/code_2.yaml
@@ -0,0 +1,6 @@
+user: |-
+    {question}
+    Write an executable Python solution. The output should look like:
+    ```python
+    [Insert the Python code here.]
+    ```
diff --git a/nemo_skills/prompt/config/robustness/code_prompts/code_3.yaml b/nemo_skills/prompt/config/robustness/code_prompts/code_3.yaml
new file mode 100644
index 0000000000..70cd6953a7
--- /dev/null
+++ b/nemo_skills/prompt/config/robustness/code_prompts/code_3.yaml
@@ -0,0 +1,8 @@
+user: |-
+    You are a helpful and harmless assistant. Analyze the following problem, think step-by-step before solving the problem below.
+    {question}
+    Please use python programming language only.
+    You must use ```python for just the final solution code block with the following format:
+    ```python
+    # Your code here
+    ```
diff --git a/nemo_skills/prompt/config/robustness/code_prompts/code_4.yaml b/nemo_skills/prompt/config/robustness/code_prompts/code_4.yaml
new file mode 100644
index 0000000000..930127019e
--- /dev/null
+++ b/nemo_skills/prompt/config/robustness/code_prompts/code_4.yaml
@@ -0,0 +1,19 @@
+user: |-
+    PROBLEM DESCRIPTION:
+    You will be provided with the description of a python coding problem. Your task is to solve the problem step by step.
+
+    RESPONSE GUIDELINES:
+    1. Start with the knowledge required for the solution and the plan on how you will solve the problem.
+    2. Then write the complete and executable Python program.
+    3. Your response should be correct and pass all tests.
+    4. DO NOT include example usage or test code in your response.
+    5. Ensure your response is in the format of ```python``` and includes the necessary background as a comment at the top.
+
+    Example:
+    ```python
+    # Background: [Here, insert the necessary knowledge required for the solution and the plan on how you will solve the problem.]
+
+    [Insert the Python code here.]
+    ```
+
+    {question}
diff --git a/nemo_skills/prompt/config/robustness/code_prompts/ns_gen_codegen.yaml b/nemo_skills/prompt/config/robustness/code_prompts/ns_gen_codegen.yaml
new file mode 100644
index 0000000000..dba1c7a91d
--- /dev/null
+++ b/nemo_skills/prompt/config/robustness/code_prompts/ns_gen_codegen.yaml
@@ -0,0 +1,12 @@
+# default prompt for all python based code benchmark evaluations
+# https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/generic/codegen.yaml
+user: |-
+    Here is a problem for which you need to generate/complete code:
+    {question}
+
+    Please continue to complete the function with python programming language. You are not allowed to modify the given code and do the completion only.
+
+    The solution should be in the following format:
+    ```python
+    # Your code here
+    ```
diff --git a/nemo_skills/prompt/config/robustness/code_prompts/ns_python_codegen.yaml b/nemo_skills/prompt/config/robustness/code_prompts/ns_python_codegen.yaml
new file mode 100644
index 0000000000..c7337420a3
--- /dev/null
+++ b/nemo_skills/prompt/config/robustness/code_prompts/ns_python_codegen.yaml
@@ -0,0 +1,7 @@
+# https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/livecodebench/python_codegen.yaml
+# default prompt for livecodebench Python
+
+user: |-
+    Here is a problem for which you need to generate an executable code in python programming language.
+
+    {question}
diff --git a/nemo_skills/prompt/config/robustness/prompt_set_config.yaml b/nemo_skills/prompt/config/robustness/prompt_set_config.yaml
index 9bc260d059..f00486bd2f 100644
--- a/nemo_skills/prompt/config/robustness/prompt_set_config.yaml
+++ b/nemo_skills/prompt/config/robustness/prompt_set_config.yaml
@@ -28,3 +28,13 @@ comp-math-24-25:
   - prompt_config: robustness/math_prompts/boxed_8
   - prompt_config: robustness/math_prompts/boxed_aai
   - prompt_config: robustness/math_prompts/boxed_general
+
+
+livecodebench:
+  - prompt_config: robustness/code_prompts/aai_prompt
+  - prompt_config: robustness/code_prompts/ns_gen_codegen
+  - prompt_config: robustness/code_prompts/ns_python_codegen
+  - prompt_config: robustness/code_prompts/code_1
+  - prompt_config: robustness/code_prompts/code_2
+  - prompt_config: robustness/code_prompts/code_3
+  - prompt_config: robustness/code_prompts/code_4
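A note on the double `shlex.quote` in robust_eval.py earlier in this patch: quoting twice produces a string that survives two rounds of shell parsing, which is presumably why the comment says the argument must be quoted to reach `ns eval` intact (the two-layer assumption is mine, inferred from the code). A quick self-contained check:

```python
import shlex

hydra_arg = "++eval_config.extract_regex='The final answer is (.+)$'"

once = shlex.quote(hydra_arg)    # survives one shell parse
twice = shlex.quote(once)        # survives two shell parses

# Each shlex.split simulates one shell parsing the argument (POSIX rules)
after_first_shell = shlex.split(twice)[0]
after_second_shell = shlex.split(after_first_shell)[0]

print(after_second_shell == hydra_arg)  # True
print(shlex.split(once)[0] == hydra_arg)  # True: one quote survives one parse
```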