Add LCB Prompts, fix regex bug in robust_eval, remove CR, make summarize_robustness generic for more benchmarks, update docstrings. #1079

Merged
gnalbandyan merged 4 commits into main from gnalbandyan/lcb
Dec 9, 2025
Conversation


@gnalbandyan gnalbandyan commented Dec 5, 2025

Add LCB Prompts, fix regex bug in robust_eval, remove CR, make summarize_robustness generic for more benchmarks, update docstrings.
Change the order of MCQ answer extraction so that regex runs first (no change is expected for the default MCQ evals; the new order is needed for prompts with custom answer-formatting instructions).

Summary by CodeRabbit

  • New Features

    • Added LiveCodeBench dataset support to robustness evaluation.
    • Introduced 7 new code generation prompts for Python tasks.
  • Documentation

    • Removed Consistency Rate metrics from robustness evaluation documentation.
    • Updated coverage information for Math, MCQ, and LiveCodeBench datasets.
  • Bug Fixes

    • Improved answer extraction method prioritization in evaluation graders.
    • Enhanced metric handling in robustness summaries to support multiple metric types.


…ss generic

Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
@gnalbandyan gnalbandyan requested a review from ekmb December 5, 2025 11:31

coderabbitai bot commented Dec 5, 2025

📝 Walkthrough

This PR removes Consistency Rate (CR) metrics from robustness evaluation, adds LiveCodeBench support with new prompt configurations, updates answer extraction precedence logic in math/MCQ evaluators, modifies the robustness pipeline's metric selection and file discovery, and updates documentation accordingly.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Answer extraction order | `nemo_skills/evaluation/evaluator/mcq.py`, `nemo_skills/evaluation/math_grader.py` | Updated extraction precedence when relaxed mode is enabled: regex-first followed by boxed (previously boxed-first). The MCQ file updates documentation only; math_grader makes the corresponding functional change. |
| Robustness evaluation infrastructure | `nemo_skills/pipeline/robust_eval.py`, `nemo_skills/pipeline/summarize_robustness.py` | Modified robust_eval to use shell-safe escaping for extract_regex arguments via shlex; updated the signature to include a prompt_set_config parameter. Refactored summarize_robustness to: remove CR metrics entirely, change metric selection to prioritize judge_correct/symbolic_correct/accuracy, update file discovery patterns (eval-results/*/output*.jsonl), derive the benchmark from the parent directory, and remove the consistency calculation logic. |
| Documentation | `docs/evaluation/robustness.md` | Removed CR metrics and explanatory subsections; expanded dataset coverage (10 Math, 10 MCQ, 7 LiveCodeBench prompts); specified prompt paths (math_prompts, mcq_prompts subdirectories); updated output examples and capability descriptions. |
| LiveCodeBench prompt configurations | `nemo_skills/prompt/config/robustness/code_prompts/aai_prompt.yaml`, `code_1.yaml`, `code_2.yaml`, `code_3.yaml`, `code_4.yaml`, `ns_gen_codegen.yaml`, `ns_python_codegen.yaml` | Added 7 new YAML prompt templates for LiveCodeBench code generation tasks, each with distinct instruction sets and a placeholder {question} variable. |
| Prompt set configuration | `nemo_skills/prompt/config/robustness/prompt_set_config.yaml` | Added a livecodebench section under both the gpqa and comp-math-24-25 blocks, referencing all 7 new code prompt configurations. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Areas requiring attention:
    • nemo_skills/pipeline/summarize_robustness.py: Metric selection logic change (prioritizing judge_correct/symbolic_correct/accuracy), file discovery pattern modifications, and removal of CR/consistency calculations—verify correctness of fallback behavior and default values (-1 for missing metrics)
    • nemo_skills/pipeline/robust_eval.py: Shell escaping logic with shlex for extract_regex arguments; confirm proper quoting for complex regex patterns
    • Answer extraction precedence reversal: Ensure regex-first order is intentional and tested across both math and MCQ evaluation paths


Suggested reviewers

  • ekmb
  • Kipok

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Title check | ✅ Passed | The title accurately summarizes all major changes in the PR: adding LCB prompts, fixing the regex bug in robust_eval, removing CR metrics, making summarize_robustness generic, and updating docstrings. |


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
nemo_skills/pipeline/robust_eval.py (1)

16-17: Re-check double shlex.quote and quoting strategy for extract_regex

The new logic

hydra_arg = f"++eval_config.extract_regex='{prompt.extract_regex}'"
shell_safe_arg = shlex.quote(shlex.quote(hydra_arg))
prompt_context.args.append(shell_safe_arg)

adds two layers of shell quoting around an argument that already contains single quotes. Depending on how _eval builds the actual command, this can result in literal quotes reaching Hydra or break overrides when the regex itself contains quotes.

If you truly have two shell layers, the double quoting may be required, but it’s easy to get wrong. It might be safer to:

  • encode the value with json.dumps(prompt.extract_regex) and
  • apply only the minimal number of shlex.quote calls actually needed by the command-construction path.

At minimum, please verify with a regex containing spaces, parentheses, backslashes, and quotes that eval_config.extract_regex seen by Hydra matches the raw pattern you expect.

Also applies to: 119-123
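
To see how stacked quoting behaves with a single shell layer, here is a minimal sketch (the pattern value and variable names are hypothetical; only `shlex` behavior is demonstrated):

```python
import shlex

# Hypothetical regex override, built the same way as in robust_eval
pattern = r"Answer:\s*(.+)"
hydra_arg = f"++eval_config.extract_regex='{pattern}'"

once = shlex.quote(hydra_arg)  # safe for exactly one shell evaluation
twice = shlex.quote(once)      # second layer: only correct if two shells run

# With a single shell layer, the double-quoted form keeps literal
# quote characters that would reach Hydra verbatim.
assert shlex.split(once)[0] == hydra_arg
assert shlex.split(twice)[0] != hydra_arg  # still shell-quoted after one split
```

If `_eval` really evaluates the command through two shells (e.g. `ssh` or `bash -c` inside a container launcher), the double quoting is needed; otherwise one `shlex.quote` is sufficient.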

nemo_skills/pipeline/summarize_robustness.py (1)

59-75: Metric extraction per file is more general; consider guarding missing pass@1

Using ComputeMetrics(benchmark=Path(pred_file).parent.name, max_samples=-1) and then preferring judge_correct, symbolic_correct, or accuracy out of metrics["pass@1"] makes get_metrics work across math, MCQ, and judge-based benchmarks while falling back to -1 when no suitable accuracy metric is present. That’s a solid generalization.

If there is any chance a calculator omits "pass@1", you might want to use metrics.get("pass@1", {}) to avoid a KeyError, but if all relevant calculators guarantee it, the current code is fine.
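
A sketch of the suggested guard (the metric keys match this PR's selection order; the helper name is hypothetical):

```python
# Prefer judge_correct, then symbolic_correct, then accuracy; -1 if absent.
def pick_accuracy(metrics: dict) -> float:
    pass_at_1 = metrics.get("pass@1", {})  # .get avoids a KeyError if "pass@1" is omitted
    for key in ("judge_correct", "symbolic_correct", "accuracy"):
        if key in pass_at_1:
            return pass_at_1[key]
    return -1
```

With this guard, a calculator that omits `pass@1` simply yields the `-1` sentinel instead of raising.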

nemo_skills/prompt/config/robustness/code_prompts/code_2.yaml (1)

1-6: Consider tightening the example code block to avoid literal placeholders

The example:

```python
[Insert the Python code here.]
```

may encourage some models to emit the placeholder literally. You might instead show a minimal but real-looking stub (e.g., `# Your code here`) or omit the inner line entirely to emphasize that the block should contain executable code.

</blockquote></details>
<details>
<summary>nemo_skills/prompt/config/robustness/code_prompts/code_3.yaml (1)</summary><blockquote>

`1-8`: **Minor wording tweak to clarify final code-block requirement**

The line:

> You must use ```python for just the final solution code block with the following format:

is slightly awkward. Consider something like:

> Format the final solution as a single ```python code block with the following structure:

to make it clear that only the final answer should be in that block, without changing the intended behavior.

</blockquote></details>

</blockquote></details>

<details>
<summary>📜 Review details</summary>

**Configuration used**: Path: .coderabbit.yaml

**Review profile**: CHILL

**Plan**: Pro

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between c1bf75578ba1d1ecd65315949b22419556f218ef and 4a735ec3e1fa1b4c0fcc4ceb16f01dc696c6c234.

</details>

<details>
<summary>📒 Files selected for processing (13)</summary>

* `docs/evaluation/robustness.md` (1 hunks)
* `nemo_skills/evaluation/evaluator/mcq.py` (1 hunks)
* `nemo_skills/evaluation/math_grader.py` (1 hunks)
* `nemo_skills/pipeline/robust_eval.py` (2 hunks)
* `nemo_skills/pipeline/summarize_robustness.py` (4 hunks)
* `nemo_skills/prompt/config/robustness/code_prompts/aai_prompt.yaml` (1 hunks)
* `nemo_skills/prompt/config/robustness/code_prompts/code_1.yaml` (1 hunks)
* `nemo_skills/prompt/config/robustness/code_prompts/code_2.yaml` (1 hunks)
* `nemo_skills/prompt/config/robustness/code_prompts/code_3.yaml` (1 hunks)
* `nemo_skills/prompt/config/robustness/code_prompts/code_4.yaml` (1 hunks)
* `nemo_skills/prompt/config/robustness/code_prompts/ns_gen_codegen.yaml` (1 hunks)
* `nemo_skills/prompt/config/robustness/code_prompts/ns_python_codegen.yaml` (1 hunks)
* `nemo_skills/prompt/config/robustness/prompt_set_config.yaml` (1 hunks)

</details>

<details>
<summary>🧰 Additional context used</summary>

<details>
<summary>🧬 Code graph analysis (1)</summary>

<details>
<summary>nemo_skills/pipeline/summarize_robustness.py (2)</summary><blockquote>

<details>
<summary>nemo_skills/evaluation/metrics/compute_metrics.py (2)</summary>

* `ComputeMetrics` (24-103)
* `get_metrics_calculator` (54-57)

</details>
<details>
<summary>nemo_skills/evaluation/metrics/utils.py (1)</summary>

* `read_predictions` (23-33)

</details>

</blockquote></details>

</details><details>
<summary>🪛 LanguageTool</summary>

<details>
<summary>docs/evaluation/robustness.md</summary>

[style] ~109-~109: In American English, abbreviations like “etc.” require a period.
Context: ...ed for any Math (AIME, comp-math-24-25, etc) benchmarks, `prompt/config/robustness/...

(ETC_PERIOD)

---

[style] ~109-~109: In American English, abbreviations like “etc.” require a period.
Context: ...q_prompts` for any MCQ (GPQA, MMLU-Pro, etc) benchmarks. - robust_eval can be used ...

(ETC_PERIOD)

</details>

</details>

</details>

<details>
<summary>⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)</summary>

* GitHub Check: unit-tests
* GitHub Check: pre-commit

</details>

<details>
<summary>🔇 Additional comments (10)</summary><blockquote>

<details>
<summary>nemo_skills/evaluation/evaluator/mcq.py (1)</summary><blockquote>

`34-37`: **Relaxed-mode comment now matches actual regex-first behavior**

The updated comment correctly documents that relaxed extraction tries regex first and falls back to boxed, which matches `extract_answer` in `math_grader.py`. No further changes needed.

</blockquote></details>
<details>
<summary>nemo_skills/evaluation/math_grader.py (1)</summary><blockquote>

`106-111`: **Regex-first relaxed extraction is consistent and well-documented**

When `relaxed=True`, `extract_answer` now prefers regex and falls back to boxed, and the docstring states this explicitly. This keeps behavior aligned with MCQ evaluation and should make regex-based answer formatting more robust.
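
The precedence change can be illustrated with a simplified sketch (the real `extract_answer` has a different signature; this only mirrors the new ordering):

```python
import re

def extract_relaxed(text: str, extract_regex: str):
    """Simplified illustration: custom regex first, \\boxed{} as the fallback."""
    match = re.search(extract_regex, text)
    if match:  # 1) regex-first (new behavior)
        return match.group(1).strip()
    boxed = re.search(r"\\boxed\{([^{}]*)\}", text)
    if boxed:  # 2) fall back to boxed extraction
        return boxed.group(1)
    return None
```

A prompt with a custom answer-formatting instruction (e.g. `Final answer: ...`) is now caught by the regex before any boxed expression earlier in the reasoning trace is considered.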

</blockquote></details>
<details>
<summary>nemo_skills/pipeline/summarize_robustness.py (1)</summary><blockquote>

`171-178`: **Updated result discovery and `prompt_sensitivity` aggregation align with new layouts**

The new glob patterns for benchmarks and prompts (looking under `**/eval-results/*/output*jsonl` but excluding `output.jsonl` and chunked files) match the expected robust-eval layout and avoid double-counting aggregated outputs.

Computing `prompt_sensitivity` as the std of per-prompt averages and reflecting that in the top-level header (`["min", "max", "avg", "std", "prompt_sensitivity"]`) produces a concise, more interpretable robustness summary, while keeping per-prompt rows focused on `["min", "max", "avg", "std", "no_answer"]`. The tabular formatting logic looks consistent with these changes.



Also applies to: 192-194, 220-223, 231-235, 239-253
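
As a sanity check on the discovery pattern, a sketch of the glob plus exclusions (the chunk-file naming convention is an assumption, not taken from the pipeline code):

```python
from pathlib import Path

def discover(root: Path) -> list[str]:
    """Collect per-prompt result files, skipping aggregated and chunked outputs."""
    files = []
    for p in root.glob("**/eval-results/*/output*.jsonl"):
        if p.name == "output.jsonl" or "chunk" in p.name:
            continue  # aggregated or chunked file: would double-count results
        files.append(p.name)
    return sorted(files)
```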

</blockquote></details>
<details>
<summary>nemo_skills/prompt/config/robustness/code_prompts/ns_gen_codegen.yaml (1)</summary><blockquote>

`1-12`: **Clear default prompt for Python codegen completion**

This prompt cleanly constrains the task (complete the given function in Python without modifying existing code) and specifies an unambiguous output format in a ` ```python ` block. Looks good for reuse across Python-based code benchmarks.

</blockquote></details>
<details>
<summary>nemo_skills/prompt/config/robustness/code_prompts/ns_python_codegen.yaml (1)</summary><blockquote>

`1-7`: **Concise default LiveCodeBench Python prompt**

The prompt succinctly asks for executable Python code and leaves room for the benchmark to provide detailed instructions via `{question}`. The header comment documenting its origin is also helpful. No changes needed.

</blockquote></details>
<details>
<summary>nemo_skills/prompt/config/robustness/code_prompts/code_1.yaml (1)</summary><blockquote>

`1-3`: **LGTM!**

Simple and clear prompt structure with proper YAML syntax and placeholder usage.

</blockquote></details>
<details>
<summary>nemo_skills/prompt/config/robustness/code_prompts/aai_prompt.yaml (1)</summary><blockquote>

`1-10`: **LGTM!**

Well-documented prompt configuration with helpful context comments explaining design choices and external methodology alignment. The comment on line 4 provides useful integration information with `prepare.py`.

</blockquote></details>
<details>
<summary>nemo_skills/prompt/config/robustness/code_prompts/code_4.yaml (1)</summary><blockquote>

`1-19`: **LGTM!**

Well-structured prompt with clear guidelines, helpful example format, and comprehensive instructions for code generation tasks. The multi-section organization (problem description, response guidelines, example) is pedagogically sound.

</blockquote></details>
<details>
<summary>docs/evaluation/robustness.md (1)</summary><blockquote>

`85-105`: **Documentation correctly reflects metric changes.**

Tables properly remove CR column and add `no_answer` metric, aligning with the robustness evaluation changes. Updated structure matches the PR's removal of Consistency Rate metrics.

</blockquote></details>
<details>
<summary>nemo_skills/prompt/config/robustness/prompt_set_config.yaml (1)</summary><blockquote>

`33-40`: **LGTM!**

Properly structured addition of the livecodebench prompt configuration section with 7 prompts, following established YAML conventions and matching the count referenced in documentation.


Ensure that all referenced code prompt files exist: `ns_gen_codegen.yaml`, `ns_python_codegen.yaml`, `code_2.yaml`, and `code_3.yaml` are not shown in this review but are referenced in lines 35-39.

</blockquote></details>

</blockquote></details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Comment on lines +109 to +110
- There are 10 Math, 10 MCQ and 7 LiveCodeBench prompts in the `prompt/config/robustness` folder, along with the prompt_set_config.yaml. Those prompts vary by prompt wording and problem placement. MCQ prompts also vary by answer formatting instruction, while Math prompts use only \boxed{} format. `prompt/config/robustness/math_prompts` can be used for any Math (AIME, comp-math-24-25, etc) benchmarks, `prompt/config/robustness/mcq_prompts` for any MCQ (GPQA, MMLU-Pro, etc) benchmarks.
- robust_eval can be used with any dataset that Nemo-Skills supports, but summarize_robustness works on Math, MCQ, LiveCodeBench datasets and any dataset with judge evaluation (for now). If you need evaluations on multiple prompts, you can still use robust_eval. However, the `summarize_robustness` part won't work.

⚠️ Potential issue | 🟡 Minor

Add periods after abbreviations in American English.

The abbreviation "etc" appears twice without trailing periods at line 109, which is required in American English style. Additionally, line 109 could be clearer about the prompt folder organization.

Apply this diff to fix the style issues:

- There are 10 Math, 10 MCQ and 7 LiveCodeBench prompts in the `prompt/config/robustness` folder, along with the prompt_set_config.yaml. Those prompts vary by prompt wording and problem placement. MCQ prompts also vary by answer formatting instruction, while Math prompts use only \boxed{} format. `prompt/config/robustness/math_prompts` can be used for any Math (AIME, comp-math-24-25, etc) benchmarks, `prompt/config/robustness/mcq_prompts` for any MCQ (GPQA, MMLU-Pro, etc) benchmarks.
- robust_eval can be used with any dataset that Nemo-Skills supports, but summarize_robustness works on Math, MCQ, LiveCodeBench datasets and any dataset with judge evaluation (for now). If you need evaluations on multiple prompts, you can still use robust_eval. However, the `summarize_robustness` part won't work.
+ There are 10 Math, 10 MCQ, and 7 LiveCodeBench prompts in the `prompt/config/robustness` folder, along with the prompt_set_config.yaml. Those prompts vary by prompt wording and problem placement. MCQ prompts also vary by answer formatting instruction, while Math prompts use only \boxed{} format. `prompt/config/robustness/math_prompts` can be used for any Math (AIME, comp-math-24-25, etc.) benchmarks, `prompt/config/robustness/mcq_prompts` for any MCQ (GPQA, MMLU-Pro, etc.) benchmarks.

Note: Changed "and 7" to ", and 7" (Oxford comma), and added periods after both instances of "etc."

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 LanguageTool

[style] ~109-~109: In American English, abbreviations like “etc.” require a period.
Context: ...ed for any Math (AIME, comp-math-24-25, etc) benchmarks, `prompt/config/robustness/...

(ETC_PERIOD)


[style] ~109-~109: In American English, abbreviations like “etc.” require a period.
Context: ...q_prompts` for any MCQ (GPQA, MMLU-Pro, etc) benchmarks. - robust_eval can be used ...

(ETC_PERIOD)

🤖 Prompt for AI Agents
In docs/evaluation/robustness.md around lines 109 to 110, update the sentence
describing the prompt counts and folder usage to use American English
punctuation and clearer folder organization: add periods after both instances of
"etc." and insert the Oxford comma (change "and 7" to ", and 7"); reword the
sentence slightly to make clear which folders map to which benchmark types
(e.g., state that prompt/config/robustness/math_prompts is for Math benchmarks
and prompt/config/robustness/mcq_prompts is for MCQ benchmarks) so the
folder-to-benchmark mapping is explicit.

@ekmb ekmb requested a review from titu1994 December 8, 2025 13:59

ekmb commented Dec 8, 2025

@titu1994 could you please review LCB robustness prompts?

ekmb changed the title from "Gnalbandyan/lcb" to "Add LCB Prompts, fix regex bug in robust_eval, remove CR, make summarize_robustness generic for more benchmarks, update docstrings." Dec 9, 2025
@gnalbandyan gnalbandyan merged commit 774cef6 into main Dec 9, 2025
5 checks passed
@gnalbandyan gnalbandyan deleted the gnalbandyan/lcb branch December 9, 2025 18:47
Jorjeous pushed a commit that referenced this pull request Dec 11, 2025
…ize_robustness generic for more benchmarks, update docstrings. (#1079)

Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Dec 12, 2025
…ize_robustness generic for more benchmarks, update docstrings. (#1079)

Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
wasiahmad pushed a commit that referenced this pull request Dec 19, 2025
…ize_robustness generic for more benchmarks, update docstrings. (#1079)

Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Dec 19, 2025
…ize_robustness generic for more benchmarks, update docstrings. (#1079)

Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
hsiehjackson pushed a commit that referenced this pull request Jan 13, 2026
…ize_robustness generic for more benchmarks, update docstrings. (#1079)

Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Feb 4, 2026
…ize_robustness generic for more benchmarks, update docstrings. (#1079)

Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
@coderabbitai coderabbitai bot mentioned this pull request Feb 8, 2026
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
…ize_robustness generic for more benchmarks, update docstrings. (#1079)

Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
…ize_robustness generic for more benchmarks, update docstrings. (#1079)

Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
