Merged
33 changes: 13 additions & 20 deletions docs/evaluation/robustness.md
Original file line number Diff line number Diff line change
@@ -70,12 +70,10 @@ The following metrics are calculated:

* **Aggregated Benchmark Statistics**: For each benchmark across all prompts and seeds, the script calculates:
- `min`, `max`, `avg`, `std`: Statistical metrics across all runs per benchmark.
- `CR` (Consistency Rate): The average rate of agreement of model predictions on the same datapoint across different runs.
- `prompt_sensitivity`: The standard deviation of the average scores across different prompts, which measures how sensitive the model's accuracy is to prompt variations.

* **Per-Prompt Statistics**: For each prompt across all random seeds, the script calculates:
- `min`, `max`, `avg`, `std`: Statistical metrics for a single prompt across seeds.
- `CR` (Consistency Rate): The average rate of agreement of model predictions on the same question across different runs.
- `no_answer`: The proportion of questions for which no answer was extracted from the generation, either due to a wrong answer format or no answer at all (can be used to find prompts that break the model predictions).
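To make `prompt_sensitivity` concrete, here is a minimal sketch with made-up per-prompt averages (the numbers are illustrative, not from any benchmark):

```python
from statistics import pstdev  # population std, matching np.std's default

# Hypothetical average scores of three prompts on one benchmark
prompt_avgs = [50.0, 52.0, 51.0]

# prompt_sensitivity = standard deviation of the per-prompt averages
prompt_sensitivity = pstdev(prompt_avgs)
print(round(prompt_sensitivity, 2))  # → 0.82
```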


@@ -84,34 +82,29 @@ First, for each benchmark, metrics are aggregated across all prompts and seeds.
All calculated metrics are also saved to `output_dir/metrics.json`. <br/>

```
dataset | min | max | avg | std | CR | prompt_sensitivity
-------------------------------------------------------------------------------------------
comp-math-24-25@80 | 48.05 | 53.91 | 51.10 | 1.60 | 55.25 | 0.34
gpqa@80 | 50.51 | 60.61 | 55.51 | 2.44 | 65.15 | 0.77
dataset | min | max | avg | std | prompt_sensitivity
----------------------------------------------------------------------------------
comp-math-24-25@80 | 48.05 | 53.91 | 51.10 | 1.60 | 0.34
gpqa@80 | 50.51 | 60.61 | 55.51 | 2.44 | 0.77


------------------------------------- comp-math-24-25 -------------------------------------
prompt@8 | min | max | avg | std | CR | no_answer
------------------------------------- comp-math-24-25 ----------------------------
prompt@8 | min | max | avg | std | no_answer
----------------------------------------------------------------------------------
prompt_1 | 48.05 | 53.91 | 50.76 | 1.61 | 55.48 | 1.56
prompt_1 | 48.05 | 53.91 | 50.76 | 1.61 | 1.56
...
prompt_10 | 48.44 | 53.91 | 51.44 | 1.52 | 55.13 | 1.66
prompt_10 | 48.44 | 53.91 | 51.44 | 1.52 | 1.66


-------------------------------------- gpqa --------------------------------------
prompt@8 | min | max | avg | std | CR | no_answer
prompt@8 | min | max | avg | std | no_answer
----------------------------------------------------------------------------------
prompt_1 | 50.51 | 60.61 | 54.73 | 2.68 | 64.20 | 3.03
prompt_1 | 50.51 | 60.61 | 54.73 | 2.68 | 3.03
...
prompt_10 | 53.54 | 60.10 | 56.28 | 1.88 | 66.59 | 2.78
prompt_10 | 53.54 | 60.10 | 56.28 | 1.88 | 2.78
```

### Consistency Rate
For each datapoint, collect all predictions and calculate the similarity between all possible pairs of predictions.
The consistency rate is the number of equivalent prediction pairs divided by the total number of prediction pairs (N choose 2). <br/>
Example: For a datapoint with predictions [A, A, C] across 3 files, it will compare pairs (A, A), (A, C), and (A, C), and the consistency rate will be 1/3 = 33.33%. <br/>
Consistency rate is proposed in [Improving the Robustness of Large Language Models via Consistency Alignment](https://arxiv.org/abs/2403.14221).
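A minimal sketch of the metric as described above (a toy re-implementation, not the project's actual code; `None` stands for a run where no answer was extracted):

```python
from itertools import combinations

def consistency_rate(predictions):
    """Share of equivalent prediction pairs among all N-choose-2 pairs, in percent."""
    pairs = list(combinations(predictions, 2))
    if not pairs:
        return 100.0
    # Two missing (None) predictions count as a disagreement
    equal = sum(1 for a, b in pairs if a is not None and a == b)
    return round(equal / len(pairs) * 100, 2)

print(consistency_rate(["A", "A", "C"]))  # → 33.33
```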

## Notes on Usage
- There are 10 Math and 10 MCQ prompts in the `prompt/config/robustness` folder, along with the prompt_set_config.yaml. Those prompts vary by prompt wording and problem placement. MCQ prompts also vary by answer formatting instruction, while Math prompts use only \boxed{} format. These prompts can be used with any Math (AIME, comp-math-24-25, etc) and MCQ (GPQA, MMLU-Pro, etc) benchmarks.
- robust_eval can be used with any dataset that Nemo-Skills supports, but summarize_robustness works on Math and MCQ datasets (for now). If you need evaluations on multiple prompts, you can still use robust_eval. However, the `summarize_robustness` part won't work.
- There are 10 Math, 10 MCQ and 7 LiveCodeBench prompts in the `prompt/config/robustness` folder, along with the prompt_set_config.yaml. Those prompts vary by prompt wording and problem placement. MCQ prompts also vary by answer formatting instruction, while Math prompts use only \boxed{} format. `prompt/config/robustness/math_prompts` can be used for any Math (AIME, comp-math-24-25, etc) benchmarks, `prompt/config/robustness/mcq_prompts` for any MCQ (GPQA, MMLU-Pro, etc) benchmarks.
- robust_eval can be used with any dataset that Nemo-Skills supports, but summarize_robustness works on Math, MCQ, LiveCodeBench datasets and any dataset with judge evaluation (for now). If you need evaluations on multiple prompts, you can still use robust_eval. However, the `summarize_robustness` part won't work.
Comment on lines +109 to +110
Contributor

⚠️ Potential issue | 🟡 Minor

Add periods after abbreviations in American English.

The abbreviation "etc" appears twice without trailing periods at line 109, which is required in American English style. Additionally, line 109 could be clearer about the prompt folder organization.

Apply this diff to fix the style issues:

- There are 10 Math, 10 MCQ and 7 LiveCodeBench prompts in the `prompt/config/robustness` folder, along with the prompt_set_config.yaml. Those prompts vary by prompt wording and problem placement. MCQ prompts also vary by answer formatting instruction, while Math prompts use only \boxed{} format. `prompt/config/robustness/math_prompts` can be used for any Math (AIME, comp-math-24-25, etc) benchmarks, `prompt/config/robustness/mcq_prompts` for any MCQ (GPQA, MMLU-Pro, etc) benchmarks.
- robust_eval can be used with any dataset that Nemo-Skills supports, but summarize_robustness works on Math, MCQ, LiveCodeBench datasets and any dataset with judge evaluation (for now). If you need evaluations on multiple prompts, you can still use robust_eval. However, the `summarize_robustness` part won't work.
+ There are 10 Math, 10 MCQ, and 7 LiveCodeBench prompts in the `prompt/config/robustness` folder, along with the prompt_set_config.yaml. Those prompts vary by prompt wording and problem placement. MCQ prompts also vary by answer formatting instruction, while Math prompts use only \boxed{} format. `prompt/config/robustness/math_prompts` can be used for any Math (AIME, comp-math-24-25, etc.) benchmarks, `prompt/config/robustness/mcq_prompts` for any MCQ (GPQA, MMLU-Pro, etc.) benchmarks.

Note: Changed "and 7" to ", and 7" (Oxford comma), and added periods after both instances of "etc."

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 LanguageTool

[style] ~109-~109: In American English, abbreviations like “etc.” require a period.
Context: ...ed for any Math (AIME, comp-math-24-25, etc) benchmarks, `prompt/config/robustness/...

(ETC_PERIOD)


[style] ~109-~109: In American English, abbreviations like “etc.” require a period.
Context: ...q_prompts` for any MCQ (GPQA, MMLU-Pro, etc) benchmarks. - robust_eval can be used ...

(ETC_PERIOD)

🤖 Prompt for AI Agents
In docs/evaluation/robustness.md around lines 109 to 110, update the sentence
describing the prompt counts and folder usage to use American English
punctuation and clearer folder organization: add periods after both instances of
"etc." and insert the Oxford comma (change "and 7" to ", and 7"); reword the
sentence slightly to make clear which folders map to which benchmark types
(e.g., state that prompt/config/robustness/math_prompts is for Math benchmarks
and prompt/config/robustness/mcq_prompts is for MCQ benchmarks) so the
folder-to-benchmark mapping is explicit.

2 changes: 1 addition & 1 deletion nemo_skills/evaluation/evaluator/mcq.py
@@ -31,7 +31,7 @@ class MCQEvaluatorConfig(BaseEvaluatorConfig):
# only used if extract_from_boxed is False
extract_regex: str = r"The final answer is (.+)$"
# if relaxed is True:
# extract from boxed FIRST, if not found, extract from regex
# extract from regex FIRST, if not found, extract from boxed
# if relaxed is False:
# if extract_from_boxed is True -> extract from boxed{} ONLY
# else extract from regex ONLY
4 changes: 2 additions & 2 deletions nemo_skills/evaluation/math_grader.py
@@ -103,11 +103,11 @@ def extract_answer(
string: str, extract_from_boxed: bool = True, extract_regex: str = r"The final answer is (.+)$", relaxed=False
):
"""Extract Answer String from \\boxed expression or based on regex
If relaxed=True: try both methods, boxed first.
If relaxed=True: try both methods, regex first.
If relaxed=False: use only one method based on extract_from_boxed flag.
"""
if relaxed:
return search_boxed(string) or search_regex(string, extract_regex)
return search_regex(string, extract_regex) or search_boxed(string)

if extract_from_boxed:
return search_boxed(string)
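The new relaxed-mode order can be illustrated with simplified stand-ins for `search_boxed` and `search_regex` (the real helpers live in `math_grader.py`; these are toy versions):

```python
import re

def search_boxed(s):
    m = re.search(r"\\boxed\{([^}]*)\}", s)
    return m.group(1) if m else None

def search_regex(s, pattern):
    m = re.search(pattern, s, re.MULTILINE)
    return m.group(1).strip() if m else None

def extract_answer(s, extract_regex=r"The final answer is (.+)$", relaxed=False):
    if relaxed:
        # After this change: regex first, boxed as fallback
        return search_regex(s, extract_regex) or search_boxed(s)
    return search_boxed(s)

# Regex wins when both are present
assert extract_answer("\\boxed{12} ... The final answer is 7", relaxed=True) == "7"
# Boxed is still used as a fallback
assert extract_answer("The result is \\boxed{42}", relaxed=True) == "42"
```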
7 changes: 6 additions & 1 deletion nemo_skills/pipeline/robust_eval.py
@@ -13,6 +13,7 @@
# limitations under the License.
import inspect
import logging
import shlex
from copy import deepcopy
from dataclasses import dataclass
from pathlib import Path
@@ -115,7 +116,11 @@ def robust_eval(
prompt_context = deepcopy(ctx)
prompt = PromptConfig(**prompt)
if prompt.extract_regex:
prompt_context.args.append(f"++eval_config.extract_regex='\"{prompt.extract_regex}\"'")
hydra_arg = f"++eval_config.extract_regex='{prompt.extract_regex}'"
# Quote properly so the argument is passed correctly to ns eval in the terminal
shell_safe_arg = shlex.quote(shlex.quote(hydra_arg))
prompt_context.args.append(shell_safe_arg)

prompt_context.args.append(f"++prompt_config={prompt.prompt_config}")

prompt_kwargs = deepcopy(ns_eval_kwargs)
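The double `shlex.quote` can be sanity-checked in isolation. The sketch below assumes the argument passes through two shell evaluations (e.g., a launcher shell wrapping the `ns eval` command line); each evaluation strips one quoting layer:

```python
import shlex
import subprocess

hydra_arg = "++eval_config.extract_regex='The final answer is (.+)$'"
once = shlex.quote(hydra_arg)   # survives one shell evaluation
twice = shlex.quote(once)       # survives two shell evaluations

# The first shell evaluation strips the outer quoting layer...
after_first = subprocess.run(f"printf '%s' {twice}", shell=True,
                             capture_output=True, text=True).stdout
assert after_first == once
# ...and the second recovers the original Hydra override
after_second = subprocess.run(f"printf '%s' {after_first}", shell=True,
                              capture_output=True, text=True).stdout
assert after_second == hydra_arg
```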
98 changes: 24 additions & 74 deletions nemo_skills/pipeline/summarize_robustness.py
@@ -17,8 +17,6 @@
import logging
import os
import tempfile
from collections import defaultdict
from itertools import combinations
from pathlib import Path
from typing import List, Optional

@@ -58,68 +56,26 @@ def get_metrics(prediction_files: List[str]) -> List[float] | List[float]:
per_file_metrics = []
no_answer = []
for pred_file in prediction_files:
metrics_calculator = ComputeMetrics(benchmark="custom", metric_type="math", max_samples=-1)
metrics_calculator = ComputeMetrics(benchmark=Path(pred_file).parent.name, max_samples=-1)
metrics_calculator.calculator = metrics_calculator.get_metrics_calculator()

with open(pred_file, "rt", encoding="utf-8") as f:
for idx, line in enumerate(f):
data = read_predictions([line], idx, [f])
metrics_calculator.calculator.update(data)
metrics = metrics_calculator.calculator.get_metrics()
per_file_metrics.append(metrics["pass@1"]["symbolic_correct"])
no_answer.append(metrics["pass@1"]["no_answer"])
for acc_key in ["judge_correct", "symbolic_correct", "accuracy"]:
if acc_key in metrics["pass@1"]:
per_file_metrics.append(metrics["pass@1"][acc_key])
break
else:
LOG.warning(f"Could not find accuracy metric in {pred_file}, setting to -1.")
per_file_metrics.append(-1)
no_answer.append(metrics["pass@1"].get("no_answer", -1))

return per_file_metrics, no_answer
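The accuracy-key fallback in the loop above can be sketched on its own (the metric dicts below are made up for illustration):

```python
def pick_accuracy(pass1_metrics):
    """Return the first recognized accuracy metric, or -1 if none is present."""
    for acc_key in ("judge_correct", "symbolic_correct", "accuracy"):
        if acc_key in pass1_metrics:
            return pass1_metrics[acc_key]
    return -1  # mirrors the warning + sentinel in get_metrics

assert pick_accuracy({"symbolic_correct": 51.1, "no_answer": 1.6}) == 51.1
assert pick_accuracy({"judge_correct": 60.0, "accuracy": 59.0}) == 60.0
assert pick_accuracy({"num_entries": 80}) == -1
```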


def calculate_similarity(answer1: str | None, answer2: str | None) -> float:
if answer1 is None and answer2 is None:
return 0
return 1 if answer1 == answer2 else 0


def calculate_consistency_rate(input_files: List[str]) -> float:
"""Calculate the consistency rate across multiple input files.
Metric proposed in https://arxiv.org/abs/2403.14221

Args:
input_files: List of file paths containing predictions

Returns:
float: Average consistency rate as a percentage (0-100)

For each datapoint, collect all predictions, and
calculate similarity between all possible pairs of predictions.
The consistency rate is the number of pairs of equivalent prediction pairs
divided by the total number of prediction pairs (N choose 2).

Example:
If datapoint i has predictions [A, A, C] across 3 files, it will
compare pairs (A,A), (A, C) and (A, C) and consistency rate will be 1/3 = 33.33%.

"""
per_idx_preds = defaultdict(list)
for inp_f in input_files:
with open(inp_f, "rt", encoding="utf-8") as f:
for idx, line in enumerate(f):
data = read_predictions([line], idx, [f])
per_idx_preds[idx].append(data[0]["predicted_answer"])
responses = per_idx_preds.values()
total_similarity = 0
total_combinations = 0

for response_set in responses:
if len(response_set) < 2:
continue
for answer1, answer2 in combinations(response_set, 2):
total_similarity += calculate_similarity(answer1, answer2)
total_combinations += 1

if total_combinations == 0:
return 100.0
return round(total_similarity / total_combinations * 100, 2)


@app.command()
@typer_unpacker
def summarize_robustness(
@@ -168,7 +124,6 @@ def summarize_robustness(

Calculates the following both per benchmark across prompts and per prompt across random seeds:
- Statistical metrics: min, max, average, standard deviation
- Consistency Rate (CR): Agreement between different model runs
- No-answer rate: Proportion of questions without answers
- Cross-prompt standard deviation of averages

@@ -216,8 +171,10 @@
if benchmarks_paths:
# Ascertain that the benchmarks_paths are valid
for benchmark_path in benchmarks_paths:
# Valid benchmark_path should contain output*jsonl files
if len(glob.glob(f"{benchmark_path}/**/output*jsonl", recursive=True)) == 0:
# Valid benchmark_path should contain output*jsonl files excluding output.jsonl and chunked files
pred_files = glob.glob(f"{benchmark_path}/**/eval-results/*/output*jsonl", recursive=True)
pred_files = [f for f in pred_files if Path(f).name != "output.jsonl" and "_chunk_" not in Path(f).name]
if len(pred_files) == 0:
raise ValueError(f"The benchmark directory {benchmark_path} lacks output*jsonl files.")
else:
print(f"No benchmarks found in {results_dir}")
@@ -227,15 +184,13 @@
print("Calculating robustness metrics for benchmarks found:", benchmarks_paths)
for benchmark_path in sorted(benchmarks_paths): # sorting to ensure consistent order
benchmark = str(Path(benchmark_path).name)
if not Path(benchmark_path).is_dir():
continue

metrics_to_print[benchmark] = dict()
# calculate metrics per prompt
all_eval_metrics = []
for prompt_dir in sorted(glob.glob(f"{benchmark_path}/*")):
prompt_name = str(Path(prompt_dir).name)
input_files = glob.glob(f"{prompt_dir}/**/output-rs*.jsonl", recursive=True)
input_files = glob.glob(f"{prompt_dir}/**/eval-results/*/output-rs*.jsonl", recursive=True)
input_files = [f for f in input_files if Path(f).name != "output.jsonl" and "_chunk_" not in Path(f).name]
if not input_files:
print("No input files found for prompt", prompt_dir)
continue
@@ -249,9 +204,6 @@
"num_seeds": len(per_file_metrics),
}
all_eval_metrics.extend(per_file_metrics)
# calculate consistency rate per prompt
consistency_rate = calculate_consistency_rate(input_files)
metrics_to_print[benchmark][prompt_name]["CR"] = consistency_rate

# calculate metrics across all prompts and seeds
metrics_to_print[benchmark]["aggregated"] = {
@@ -262,31 +214,29 @@
"num_runs": len(all_eval_metrics),
}

input_files = glob.glob(f"{benchmark_path}/**/output-rs*.jsonl", recursive=True)
consistency_rate = calculate_consistency_rate(input_files)
metrics_to_print[benchmark]["aggregated"]["CR"] = consistency_rate

# calculate the std of prompt averages
for benchmark, metrics in metrics_to_print.items():
prompt_avgs = [m["avg"] for k, m in metrics.items() if k != "aggregated"]
metrics_to_print[benchmark]["aggregated"]["prompt_sensitivity"] = np.std(prompt_avgs)
prompt_std = np.std(prompt_avgs)
metrics_to_print[benchmark]["aggregated"]["prompt_sensitivity"] = prompt_std

header_fields = ["min", "max", "avg", "std", "CR", "prompt_sensitivity"]
header_fields = ["min", "max", "avg", "std", "prompt_sensitivity"]
header = f"{'dataset':<20} | "
header += " | ".join(f"{stat}".center(7) for stat in header_fields)
print(header)
print("-" * len(header))
# Print aggregated stats
for benchmark in metrics_to_print.keys():
bench_runs = f"{benchmark}@{metrics_to_print[benchmark]['aggregated']['num_runs']}"
row = f"{bench_runs:<20} | " + " | ".join(
f"{metrics_to_print[benchmark]['aggregated'][stat]:.2f}".center(7) for stat in header_fields
)
print(row)
row = f"{bench_runs:<20} | "
for stat in header_fields:
value = metrics_to_print[benchmark]["aggregated"][stat]
row += f"{value:.2f}".center(max(len(stat), 7)) + " | "
print(row[:-3])
print("\n")

# Print stats per prompt
header_fields = ["min", "max", "avg", "std", "CR", "no_answer"]
header_fields = ["min", "max", "avg", "std", "no_answer"]
for benchmark, metrics in metrics_to_print.items():
print(f" {benchmark} ".center(len(header), "-"))
num_seeds = metrics["aggregated"]["num_runs"] // (len(metrics) - 1) # excluding aggregated
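The per-column width logic in the updated row printing can be sketched with dummy values (the benchmark name and numbers below are hypothetical):

```python
header_fields = ["min", "max", "avg", "std", "prompt_sensitivity"]
stats = {"min": 48.05, "max": 53.91, "avg": 51.10, "std": 1.60, "prompt_sensitivity": 0.34}

row = f"{'comp-math-24-25@80':<20} | "
for stat in header_fields:
    # Each column is as wide as its header word, but never narrower than 7 chars
    row += f"{stats[stat]:.2f}".center(max(len(stat), 7)) + " | "
print(row[:-3])  # drop the trailing " | "
```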
10 changes: 10 additions & 0 deletions nemo_skills/prompt/config/robustness/code_prompts/aai_prompt.yaml
@@ -0,0 +1,10 @@
# https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/aai/livecodebench.yaml
# almost identical to https://artificialanalysis.ai/methodology/intelligence-benchmarking#intelligence-index-evaluation-suite-overview
# except we don't add ### before Format.
# The starter code and formatting instructions are already included inside the question by prepare.py

user: |-
### Question:
{question}

### Answer: (use the provided format with backticks)
3 changes: 3 additions & 0 deletions nemo_skills/prompt/config/robustness/code_prompts/code_1.yaml
@@ -0,0 +1,3 @@
user: |-
{question}
Generate a correct Python program that matches the specification and passes all tests for this question.
6 changes: 6 additions & 0 deletions nemo_skills/prompt/config/robustness/code_prompts/code_2.yaml
@@ -0,0 +1,6 @@
user: |-
{question}
Write an executable Python solution. The output should look like:
```python
[Insert the Python code here.]
```
8 changes: 8 additions & 0 deletions nemo_skills/prompt/config/robustness/code_prompts/code_3.yaml
@@ -0,0 +1,8 @@
user: |-
You are a helpful and harmless assistant. Analyze the following problem, think step-by-step before solving the problem below.
{question}
Please use python programming language only.
You must use ```python for just the final solution code block with the following format:
```python
# Your code here
```
19 changes: 19 additions & 0 deletions nemo_skills/prompt/config/robustness/code_prompts/code_4.yaml
@@ -0,0 +1,19 @@
user: |-
PROBLEM DESCRIPTION:
You will be provided with the description of a python coding problem. Your task is to solve the problem step by step.

RESPONSE GUIDELINES:
1. Start with the knowledge required for the solution and the plan on how you will solve the problem.
2. Then write the complete and executable Python program.
3. Your response should be correct and pass all tests.
4. DO NOT include example usage or test code in your response.
5. Ensure your response is in the format of ```python``` and includes the necessary background as a comment at the top.

Example:
```python
# Background: [Here, insert the necessary knowledge required for the solution and the plan on how you will solve the problem.]

[Insert the Python code here.]
```

{question}
@@ -0,0 +1,12 @@
# default prompt for all python based code benchmark evaluations
# https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/generic/codegen.yaml
user: |-
Here is a problem for which you need to generate/complete code:
{question}

Please continue to complete the function with python programming language. You are not allowed to modify the given code and do the completion only.

The solution should be in the following format:
```python
# Your code here
```
@@ -0,0 +1,7 @@
# nemo_skills/prompt/config/eval/livecodebench/python_codegen.yaml
# default prompt for livecodebench Python

user: |-
Here is a problem for which you need to generate an executable code in python programming language.

{question}
10 changes: 10 additions & 0 deletions nemo_skills/prompt/config/robustness/prompt_set_config.yaml
@@ -28,3 +28,13 @@ comp-math-24-25:
- prompt_config: robustness/math_prompts/boxed_8
- prompt_config: robustness/math_prompts/boxed_aai
- prompt_config: robustness/math_prompts/boxed_general


livecodebench:
- prompt_config: robustness/code_prompts/aai_prompt
- prompt_config: robustness/code_prompts/ns_gen_codegen
- prompt_config: robustness/code_prompts/ns_python_codegen
- prompt_config: robustness/code_prompts/code_1
- prompt_config: robustness/code_prompts/code_2
- prompt_config: robustness/code_prompts/code_3
- prompt_config: robustness/code_prompts/code_4