Merged
33 changes: 13 additions & 20 deletions docs/evaluation/robustness.md
Original file line number Diff line number Diff line change
@@ -70,12 +70,10 @@ The following metrics are calculated:

* **Aggregated Benchmark Statistics**: For each benchmark across all prompts and seeds, the script calculates:
- `min`, `max`, `avg`, `std`: Statistical metrics across all runs per benchmark.
- `CR` (Consistency Rate): The average rate of agreement of model predictions on the same datapoint across different runs.
- `prompt_sensitivity`: The standard deviation of the average scores across different prompts, which measures how sensitive the model's accuracy is to prompt variations.

* **Per-Prompt Statistics**: For each prompt across all random seeds, the script calculates:
- `min`, `max`, `avg`, `std`: Statistical metrics for a single prompt across seeds.
- `CR` (Consistency Rate): The average rate of agreement of model predictions on the same question across different runs.
- `no_answer`: The proportion of questions for which no answer was extracted from the generation, either due to a wrong answer format or no answer at all (can be used to find prompts that break the model predictions).
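To make `prompt_sensitivity` concrete, here is a minimal sketch with made-up per-prompt averages (the numbers are illustrative, not from any benchmark):

```python
from statistics import pstdev  # population std, matching np.std's default

# Hypothetical average scores of three prompts on one benchmark
prompt_avgs = [50.0, 52.0, 51.0]

# prompt_sensitivity = standard deviation of the per-prompt averages
prompt_sensitivity = pstdev(prompt_avgs)
print(round(prompt_sensitivity, 2))  # → 0.82
```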


@@ -84,34 +82,29 @@ First, for each benchmark, metrics are aggregated across all prompts and seeds.
All calculated metrics are also saved to `output_dir/metrics.json`. <br/>

```
dataset | min | max | avg | std | CR | prompt_sensitivity
-------------------------------------------------------------------------------------------
comp-math-24-25@80 | 48.05 | 53.91 | 51.10 | 1.60 | 55.25 | 0.34
gpqa@80 | 50.51 | 60.61 | 55.51 | 2.44 | 65.15 | 0.77
dataset | min | max | avg | std | prompt_sensitivity
----------------------------------------------------------------------------------
comp-math-24-25@80 | 48.05 | 53.91 | 51.10 | 1.60 | 0.34
gpqa@80 | 50.51 | 60.61 | 55.51 | 2.44 | 0.77


------------------------------------- comp-math-24-25 -------------------------------------
prompt@8 | min | max | avg | std | CR | no_answer
------------------------------------- comp-math-24-25 ----------------------------
prompt@8 | min | max | avg | std | no_answer
----------------------------------------------------------------------------------
prompt_1 | 48.05 | 53.91 | 50.76 | 1.61 | 55.48 | 1.56
prompt_1 | 48.05 | 53.91 | 50.76 | 1.61 | 1.56
...
prompt_10 | 48.44 | 53.91 | 51.44 | 1.52 | 55.13 | 1.66
prompt_10 | 48.44 | 53.91 | 51.44 | 1.52 | 1.66


-------------------------------------- gpqa --------------------------------------
prompt@8 | min | max | avg | std | CR | no_answer
prompt@8 | min | max | avg | std | no_answer
----------------------------------------------------------------------------------
prompt_1 | 50.51 | 60.61 | 54.73 | 2.68 | 64.20 | 3.03
prompt_1 | 50.51 | 60.61 | 54.73 | 2.68 | 3.03
...
prompt_10 | 53.54 | 60.10 | 56.28 | 1.88 | 66.59 | 2.78
prompt_10 | 53.54 | 60.10 | 56.28 | 1.88 | 2.78
```

### Consistency Rate
For each datapoint, collect all predictions and calculate the similarity between all possible pairs of predictions.
The consistency rate is the number of equivalent prediction pairs divided by the total number of prediction pairs (N choose 2). <br/>
Example: For a datapoint with predictions [A, A, C] across 3 files, it will compare pairs (A, A), (A, C), and (A, C), and the consistency rate will be 1/3 = 33.33%. <br/>
Consistency rate is proposed in [Improving the Robustness of Large Language Models via Consistency Alignment](https://arxiv.org/abs/2403.14221).
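A minimal sketch of the metric as described above (a toy re-implementation, not the project's actual code; `None` stands for a run where no answer was extracted):

```python
from itertools import combinations

def consistency_rate(predictions):
    """Share of equivalent prediction pairs among all N-choose-2 pairs, in percent."""
    pairs = list(combinations(predictions, 2))
    if not pairs:
        return 100.0
    # Two missing (None) predictions count as a disagreement
    equal = sum(1 for a, b in pairs if a is not None and a == b)
    return round(equal / len(pairs) * 100, 2)

print(consistency_rate(["A", "A", "C"]))  # → 33.33
```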

## Notes on Usage
- There are 10 Math and 10 MCQ prompts in the `prompt/config/robustness` folder, along with the prompt_set_config.yaml. Those prompts vary by prompt wording and problem placement. MCQ prompts also vary by answer formatting instruction, while Math prompts use only \boxed{} format. These prompts can be used with any Math (AIME, comp-math-24-25, etc) and MCQ (GPQA, MMLU-Pro, etc) benchmarks.
- robust_eval can be used with any dataset that Nemo-Skills supports, but summarize_robustness works on Math and MCQ datasets (for now). If you need evaluations on multiple prompts, you can still use robust_eval. However, the `summarize_robustness` part won't work.
- There are 10 Math, 10 MCQ and 7 LiveCodeBench prompts in the `prompt/config/robustness` folder, along with the prompt_set_config.yaml. Those prompts vary by prompt wording and problem placement. MCQ prompts also vary by answer formatting instruction, while Math prompts use only \boxed{} format. `prompt/config/robustness/math_prompts` can be used for any Math (AIME, comp-math-24-25, etc) benchmarks, `prompt/config/robustness/mcq_prompts` for any MCQ (GPQA, MMLU-Pro, etc) benchmarks.
- robust_eval can be used with any dataset that Nemo-Skills supports, but summarize_robustness works on Math, MCQ, LiveCodeBench datasets and any dataset with judge evaluation (for now). If you need evaluations on multiple prompts, you can still use robust_eval. However, the `summarize_robustness` part won't work.
Comment on lines +109 to +110
Contributor

⚠️ Potential issue | 🟡 Minor

Add periods after abbreviations in American English.

The abbreviation "etc" appears twice without trailing periods at line 109, which is required in American English style. Additionally, line 109 could be clearer about the prompt folder organization.

Apply this diff to fix the style issues:

- There are 10 Math, 10 MCQ and 7 LiveCodeBench prompts in the `prompt/config/robustness` folder, along with the prompt_set_config.yaml. Those prompts vary by prompt wording and problem placement. MCQ prompts also vary by answer formatting instruction, while Math prompts use only \boxed{} format. `prompt/config/robustness/math_prompts` can be used for any Math (AIME, comp-math-24-25, etc) benchmarks, `prompt/config/robustness/mcq_prompts` for any MCQ (GPQA, MMLU-Pro, etc) benchmarks.
- robust_eval can be used with any dataset that Nemo-Skills supports, but summarize_robustness works on Math, MCQ, LiveCodeBench datasets and any dataset with judge evaluation (for now). If you need evaluations on multiple prompts, you can still use robust_eval. However, the `summarize_robustness` part won't work.
+ There are 10 Math, 10 MCQ, and 7 LiveCodeBench prompts in the `prompt/config/robustness` folder, along with the prompt_set_config.yaml. Those prompts vary by prompt wording and problem placement. MCQ prompts also vary by answer formatting instruction, while Math prompts use only \boxed{} format. `prompt/config/robustness/math_prompts` can be used for any Math (AIME, comp-math-24-25, etc.) benchmarks, `prompt/config/robustness/mcq_prompts` for any MCQ (GPQA, MMLU-Pro, etc.) benchmarks.

Note: Changed "and 7" to ", and 7" (Oxford comma), and added periods after both instances of "etc."

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 LanguageTool

[style] ~109-~109: In American English, abbreviations like “etc.” require a period.
Context: ...ed for any Math (AIME, comp-math-24-25, etc) benchmarks, `prompt/config/robustness/...

(ETC_PERIOD)


[style] ~109-~109: In American English, abbreviations like “etc.” require a period.
Context: ...q_prompts` for any MCQ (GPQA, MMLU-Pro, etc) benchmarks. - robust_eval can be used ...

(ETC_PERIOD)

🤖 Prompt for AI Agents
In docs/evaluation/robustness.md around lines 109 to 110, update the sentence
describing the prompt counts and folder usage to use American English
punctuation and clearer folder organization: add periods after both instances of
"etc." and insert the Oxford comma (change "and 7" to ", and 7"); reword the
sentence slightly to make clear which folders map to which benchmark types
(e.g., state that prompt/config/robustness/math_prompts is for Math benchmarks
and prompt/config/robustness/mcq_prompts is for MCQ benchmarks) so the
folder-to-benchmark mapping is explicit.

2 changes: 1 addition & 1 deletion nemo_skills/evaluation/evaluator/mcq.py
@@ -31,7 +31,7 @@ class MCQEvaluatorConfig(BaseEvaluatorConfig):
# only used if extract_from_boxed is False
extract_regex: str = r"The final answer is (.+)$"
# if relaxed is True:
# extract from boxed FIRST, if not found, extract from regex
# extract from regex FIRST, if not found, extract from boxed
# if relaxed is False:
# if extract_from_boxed is True -> extract from boxed{} ONLY
# else extract from regex ONLY
4 changes: 2 additions & 2 deletions nemo_skills/evaluation/math_grader.py
@@ -103,11 +103,11 @@ def extract_answer(
string: str, extract_from_boxed: bool = True, extract_regex: str = r"The final answer is (.+)$", relaxed=False
):
"""Extract Answer String from \\boxed expression or based on regex
If relaxed=True: try both methods, boxed first.
If relaxed=True: try both methods, regex first.
If relaxed=False: use only one method based on extract_from_boxed flag.
"""
if relaxed:
return search_boxed(string) or search_regex(string, extract_regex)
return search_regex(string, extract_regex) or search_boxed(string)

if extract_from_boxed:
return search_boxed(string)
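The new relaxed-mode order can be illustrated with simplified stand-ins for `search_boxed` and `search_regex` (the real helpers live in `math_grader.py`; these are toy versions):

```python
import re

def search_boxed(s):
    m = re.search(r"\\boxed\{([^}]*)\}", s)
    return m.group(1) if m else None

def search_regex(s, pattern):
    m = re.search(pattern, s, re.MULTILINE)
    return m.group(1).strip() if m else None

def extract_answer(s, extract_regex=r"The final answer is (.+)$", relaxed=False):
    if relaxed:
        # After this change: regex first, boxed as fallback
        return search_regex(s, extract_regex) or search_boxed(s)
    return search_boxed(s)

# Regex wins when both are present
assert extract_answer("\\boxed{12} ... The final answer is 7", relaxed=True) == "7"
# Boxed is still used as a fallback
assert extract_answer("The result is \\boxed{42}", relaxed=True) == "42"
```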
7 changes: 6 additions & 1 deletion nemo_skills/pipeline/robust_eval.py
@@ -13,6 +13,7 @@
# limitations under the License.
import inspect
import logging
import shlex
from copy import deepcopy
from dataclasses import dataclass
from pathlib import Path
@@ -115,7 +116,11 @@ def robust_eval(
prompt_context = deepcopy(ctx)
prompt = PromptConfig(**prompt)
if prompt.extract_regex:
prompt_context.args.append(f"++eval_config.extract_regex='\"{prompt.extract_regex}\"'")
hydra_arg = f"++eval_config.extract_regex='{prompt.extract_regex}'"
# Quote properly so the argument is passed correctly to ns eval in the terminal
shell_safe_arg = shlex.quote(shlex.quote(hydra_arg))
prompt_context.args.append(shell_safe_arg)

prompt_context.args.append(f"++prompt_config={prompt.prompt_config}")

prompt_kwargs = deepcopy(ns_eval_kwargs)
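The double `shlex.quote` can be sanity-checked in isolation. The sketch below assumes the argument passes through two shell evaluations (e.g., a launcher shell wrapping the `ns eval` command line); each evaluation strips one quoting layer:

```python
import shlex
import subprocess

hydra_arg = "++eval_config.extract_regex='The final answer is (.+)$'"
once = shlex.quote(hydra_arg)   # survives one shell evaluation
twice = shlex.quote(once)       # survives two shell evaluations

# The first shell evaluation strips the outer quoting layer...
after_first = subprocess.run(f"printf '%s' {twice}", shell=True,
                             capture_output=True, text=True).stdout
assert after_first == once
# ...and the second recovers the original Hydra override
after_second = subprocess.run(f"printf '%s' {after_first}", shell=True,
                              capture_output=True, text=True).stdout
assert after_second == hydra_arg
```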
98 changes: 24 additions & 74 deletions nemo_skills/pipeline/summarize_robustness.py
@@ -17,8 +17,6 @@
import logging
import os
import tempfile
from collections import defaultdict
from itertools import combinations
from pathlib import Path
from typing import List, Optional

@@ -58,68 +56,26 @@ def get_metrics(prediction_files: List[str]) -> List[float] | List[float]:
per_file_metrics = []
no_answer = []
for pred_file in prediction_files:
metrics_calculator = ComputeMetrics(benchmark="custom", metric_type="math", max_samples=-1)
metrics_calculator = ComputeMetrics(benchmark=Path(pred_file).parent.name, max_samples=-1)
metrics_calculator.calculator = metrics_calculator.get_metrics_calculator()

with open(pred_file, "rt", encoding="utf-8") as f:
for idx, line in enumerate(f):
data = read_predictions([line], idx, [f])
metrics_calculator.calculator.update(data)
metrics = metrics_calculator.calculator.get_metrics()
per_file_metrics.append(metrics["pass@1"]["symbolic_correct"])
no_answer.append(metrics["pass@1"]["no_answer"])
for acc_key in ["judge_correct", "symbolic_correct", "accuracy"]:
if acc_key in metrics["pass@1"]:
per_file_metrics.append(metrics["pass@1"][acc_key])
break
else:
LOG.warning(f"Could not find accuracy metric in {pred_file}, setting to -1.")
per_file_metrics.append(-1)
no_answer.append(metrics["pass@1"].get("no_answer", -1))

return per_file_metrics, no_answer
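The accuracy-key fallback in the loop above can be sketched on its own (the metric dicts below are made up for illustration):

```python
def pick_accuracy(pass1_metrics):
    """Return the first recognized accuracy metric, or -1 if none is present."""
    for acc_key in ("judge_correct", "symbolic_correct", "accuracy"):
        if acc_key in pass1_metrics:
            return pass1_metrics[acc_key]
    return -1  # mirrors the warning + sentinel in get_metrics

assert pick_accuracy({"symbolic_correct": 51.1, "no_answer": 1.6}) == 51.1
assert pick_accuracy({"judge_correct": 60.0, "accuracy": 59.0}) == 60.0
assert pick_accuracy({"num_entries": 80}) == -1
```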


def calculate_similarity(answer1: str | None, answer2: str | None) -> float:
if answer1 is None and answer2 is None:
return 0
return 1 if answer1 == answer2 else 0


def calculate_consistency_rate(input_files: List[str]) -> float:
"""Calculate the consistency rate across multiple input files.
Metric proposed in https://arxiv.org/abs/2403.14221

Args:
input_files: List of file paths containing predictions

Returns:
float: Average consistency rate as a percentage (0-100)

For each datapoint, collect all predictions, and
calculate similarity between all possible pairs of predictions.
The consistency rate is the number of pairs of equivalent prediction pairs
divided by the total number of prediction pairs (N choose 2).

Example:
If datapoint i has predictions [A, A, C] across 3 files, it will
compare pairs (A,A), (A, C) and (A, C) and consistency rate will be 1/3 = 33.33%.

"""
per_idx_preds = defaultdict(list)
for inp_f in input_files:
with open(inp_f, "rt", encoding="utf-8") as f:
for idx, line in enumerate(f):
data = read_predictions([line], idx, [f])
per_idx_preds[idx].append(data[0]["predicted_answer"])
responses = per_idx_preds.values()
total_similarity = 0
total_combinations = 0

for response_set in responses:
if len(response_set) < 2:
continue
for answer1, answer2 in combinations(response_set, 2):
total_similarity += calculate_similarity(answer1, answer2)
total_combinations += 1

if total_combinations == 0:
return 100.0
return round(total_similarity / total_combinations * 100, 2)


@app.command()
@typer_unpacker
def summarize_robustness(
@@ -168,7 +124,6 @@ def summarize_robustness(

Calculates the following both per benchmark across prompts and per prompt across random seeds:
- Statistical metrics: min, max, average, standard deviation
- Consistency Rate (CR): Agreement between different model runs
- No-answer rate: Proportion of questions without answers
- Cross-prompt standard deviation of averages

@@ -216,8 +171,10 @@
if benchmarks_paths:
# Ascertain that the benchmarks_paths are valid
for benchmark_path in benchmarks_paths:
# Valid benchmark_path should contain output*jsonl files
if len(glob.glob(f"{benchmark_path}/**/output*jsonl", recursive=True)) == 0:
# Valid benchmark_path should contain output*jsonl files excluding output.jsonl and chunked files
pred_files = glob.glob(f"{benchmark_path}/**/eval-results/*/output*jsonl", recursive=True)
pred_files = [f for f in pred_files if Path(f).name != "output.jsonl" and "_chunk_" not in Path(f).name]
if len(pred_files) == 0:
raise ValueError(f"The benchmark directory {benchmark_path} lacks output*jsonl files.")
else:
print(f"No benchmarks found in {results_dir}")
@@ -227,15 +184,13 @@
print("Calculating robustness metrics for benchmarks found:", benchmarks_paths)
for benchmark_path in sorted(benchmarks_paths): # sorting to ensure consistent order
benchmark = str(Path(benchmark_path).name)
if not Path(benchmark_path).is_dir():
continue

metrics_to_print[benchmark] = dict()
# calculate metrics per prompt
all_eval_metrics = []
for prompt_dir in sorted(glob.glob(f"{benchmark_path}/*")):
prompt_name = str(Path(prompt_dir).name)
input_files = glob.glob(f"{prompt_dir}/**/output-rs*.jsonl", recursive=True)
input_files = glob.glob(f"{prompt_dir}/**/eval-results/*/output-rs*.jsonl", recursive=True)
input_files = [f for f in input_files if Path(f).name != "output.jsonl" and "_chunk_" not in Path(f).name]
if not input_files:
print("No input files found for prompt", prompt_dir)
continue
@@ -249,9 +204,6 @@
"num_seeds": len(per_file_metrics),
}
all_eval_metrics.extend(per_file_metrics)
# calculate consistency rate per prompt
consistency_rate = calculate_consistency_rate(input_files)
metrics_to_print[benchmark][prompt_name]["CR"] = consistency_rate

# calculate metrics across all prompts and seeds
metrics_to_print[benchmark]["aggregated"] = {
@@ -262,31 +214,29 @@
"num_runs": len(all_eval_metrics),
}

input_files = glob.glob(f"{benchmark_path}/**/output-rs*.jsonl", recursive=True)
consistency_rate = calculate_consistency_rate(input_files)
metrics_to_print[benchmark]["aggregated"]["CR"] = consistency_rate

# calculate the std of prompt averages
for benchmark, metrics in metrics_to_print.items():
prompt_avgs = [m["avg"] for k, m in metrics.items() if k != "aggregated"]
metrics_to_print[benchmark]["aggregated"]["prompt_sensitivity"] = np.std(prompt_avgs)
prompt_std = np.std(prompt_avgs)
metrics_to_print[benchmark]["aggregated"]["prompt_sensitivity"] = prompt_std

header_fields = ["min", "max", "avg", "std", "CR", "prompt_sensitivity"]
header_fields = ["min", "max", "avg", "std", "prompt_sensitivity"]
header = f"{'dataset':<20} | "
header += " | ".join(f"{stat}".center(7) for stat in header_fields)
print(header)
print("-" * len(header))
# Print aggregated stats
for benchmark in metrics_to_print.keys():
bench_runs = f"{benchmark}@{metrics_to_print[benchmark]['aggregated']['num_runs']}"
row = f"{bench_runs:<20} | " + " | ".join(
f"{metrics_to_print[benchmark]['aggregated'][stat]:.2f}".center(7) for stat in header_fields
)
print(row)
row = f"{bench_runs:<20} | "
for stat in header_fields:
value = metrics_to_print[benchmark]["aggregated"][stat]
row += f"{value:.2f}".center(max(len(stat), 7)) + " | "
print(row[:-3])
print("\n")

# Print stats per prompt
header_fields = ["min", "max", "avg", "std", "CR", "no_answer"]
header_fields = ["min", "max", "avg", "std", "no_answer"]
for benchmark, metrics in metrics_to_print.items():
print(f" {benchmark} ".center(len(header), "-"))
num_seeds = metrics["aggregated"]["num_runs"] // (len(metrics) - 1) # excluding aggregated
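The per-column width logic in the updated row printing can be sketched with dummy values (the benchmark name and numbers below are hypothetical):

```python
header_fields = ["min", "max", "avg", "std", "prompt_sensitivity"]
stats = {"min": 48.05, "max": 53.91, "avg": 51.10, "std": 1.60, "prompt_sensitivity": 0.34}

row = f"{'comp-math-24-25@80':<20} | "
for stat in header_fields:
    # Each column is as wide as its header word, but never narrower than 7 chars
    row += f"{stats[stat]:.2f}".center(max(len(stat), 7)) + " | "
print(row[:-3])  # drop the trailing " | "
```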
10 changes: 10 additions & 0 deletions nemo_skills/prompt/config/robustness/code_prompts/aai_prompt.yaml
@@ -0,0 +1,10 @@
# https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/aai/livecodebench.yaml
# almost identical to https://artificialanalysis.ai/methodology/intelligence-benchmarking#intelligence-index-evaluation-suite-overview
# except we don't add ### before Format.
# The starter code and formatting instructions are already included inside the question by prepare.py

user: |-
### Question:
{question}

### Answer: (use the provided format with backticks)
3 changes: 3 additions & 0 deletions nemo_skills/prompt/config/robustness/code_prompts/code_1.yaml
@@ -0,0 +1,3 @@
user: |-
{question}
Generate a correct Python program that matches the specification and passes all tests for this question.
6 changes: 6 additions & 0 deletions nemo_skills/prompt/config/robustness/code_prompts/code_2.yaml
@@ -0,0 +1,6 @@
user: |-
{question}
Write an executable Python solution. The output should look like:
```python
[Insert the Python code here.]
```
8 changes: 8 additions & 0 deletions nemo_skills/prompt/config/robustness/code_prompts/code_3.yaml
@@ -0,0 +1,8 @@
user: |-
You are a helpful and harmless assistant. Analyze the following problem, think step-by-step before solving the problem below.
{question}
Please use python programming language only.
You must use ```python for just the final solution code block with the following format:
```python
# Your code here
```
19 changes: 19 additions & 0 deletions nemo_skills/prompt/config/robustness/code_prompts/code_4.yaml
@@ -0,0 +1,19 @@
user: |-
PROBLEM DESCRIPTION:
You will be provided with the description of a python coding problem. Your task is to solve the problem step by step.

RESPONSE GUIDELINES:
1. Start with the knowledge required for the solution and the plan on how you will solve the problem.
2. Then write the complete and executable Python program.
3. Your response should be correct and pass all tests.
4. DO NOT include example usage or test code in your response.
5. Ensure your response is in the format of ```python``` and includes the necessary background as a comment at the top.

Example:
```python
# Background: [Here, insert the necessary knowledge required for the solution and the plan on how you will solve the problem.]

[Insert the Python code here.]
```

{question}
@@ -0,0 +1,12 @@
# default prompt for all python based code benchmark evaluations
# https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/generic/codegen.yaml
user: |-
Here is a problem for which you need to generate/complete code:
{question}

Please continue to complete the function with python programming language. You are not allowed to modify the given code and do the completion only.

The solution should be in the following format:
```python
# Your code here
```
@@ -0,0 +1,7 @@
# nemo_skills/prompt/config/eval/livecodebench/python_codegen.yaml
# default prompt for livecodebench Python

user: |-
Here is a problem for which you need to generate an executable code in python programming language.

{question}
10 changes: 10 additions & 0 deletions nemo_skills/prompt/config/robustness/prompt_set_config.yaml
@@ -28,3 +28,13 @@ comp-math-24-25:
- prompt_config: robustness/math_prompts/boxed_8
- prompt_config: robustness/math_prompts/boxed_aai
- prompt_config: robustness/math_prompts/boxed_general


livecodebench:
- prompt_config: robustness/code_prompts/aai_prompt
- prompt_config: robustness/code_prompts/ns_gen_codegen
- prompt_config: robustness/code_prompts/ns_python_codegen
- prompt_config: robustness/code_prompts/code_1
- prompt_config: robustness/code_prompts/code_2
- prompt_config: robustness/code_prompts/code_3
- prompt_config: robustness/code_prompts/code_4