Add LCB Prompts, fix regex bug in robust_eval, remove CR, make summarize_robustness generic for more benchmarks, update docstrings. (#1079)
Conversation
…ss generic Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
📝 Walkthrough

This PR removes Consistency Rate (CR) metrics from robustness evaluation, adds LiveCodeBench support with new prompt configurations, updates answer extraction precedence logic in math/MCQ evaluators, modifies the robustness pipeline's metric selection and file discovery, and updates documentation accordingly.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (4)
nemo_skills/pipeline/robust_eval.py (1)
16-17: Re-check the double `shlex.quote` and quoting strategy for `extract_regex`

The new logic

```python
hydra_arg = f"++eval_config.extract_regex='{prompt.extract_regex}'"
shell_safe_arg = shlex.quote(shlex.quote(hydra_arg))
prompt_context.args.append(shell_safe_arg)
```

adds two layers of shell quoting around an argument that already contains single quotes. Depending on how `_eval` builds the actual command, this can result in literal quotes reaching Hydra or break overrides when the regex itself contains quotes.

If you truly have two shell layers, the double quoting may be required, but it's easy to get wrong. It might be safer to:

- encode the value with `json.dumps(prompt.extract_regex)` and
- apply only the minimal number of `shlex.quote` calls actually needed by the command-construction path.

At minimum, please verify with a regex containing spaces, parentheses, backslashes, and quotes that the `eval_config.extract_regex` seen by Hydra matches the raw pattern you expect.

Also applies to: 119-123
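The suggestion above might look like the following sketch. This is hypothetical: `_eval`'s actual command construction is not shown in this diff, so the assumption here is a single shell layer between Python and Hydra, and the consumer would need to JSON-decode the value.

```python
import json
import shlex
import subprocess

# A regex with spaces, parentheses, and backslashes -- the hard cases.
extract_regex = r"Answer:\s*\((A|B|C|D)\)"

# json.dumps produces a quote-safe literal; one shlex.quote protects one shell layer.
hydra_arg = f"++eval_config.extract_regex={json.dumps(extract_regex)}"
shell_safe_arg = shlex.quote(hydra_arg)

# Simulate a single shell layer and verify the argument survives intact.
out = subprocess.run(
    f"printf '%s' {shell_safe_arg}", shell=True, capture_output=True, text=True
).stdout
assert out == hydra_arg
```

The round-trip assertion is the key point of the review comment: whatever quoting scheme is used, the value Hydra sees should match the raw pattern byte for byte.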
nemo_skills/pipeline/summarize_robustness.py (1)
59-75: Metric extraction per file is more general; consider guarding a missing `pass@1`

Using `ComputeMetrics(benchmark=Path(pred_file).parent.name, max_samples=-1)` and then preferring `judge_correct`, `symbolic_correct`, or `accuracy` out of `metrics["pass@1"]` makes `get_metrics` work across math, MCQ, and judge-based benchmarks while falling back to `-1` when no suitable accuracy metric is present. That's a solid generalization.

If there is any chance a calculator omits `"pass@1"`, you might want to use `metrics.get("pass@1", {})` to avoid a `KeyError`, but if all relevant calculators guarantee it, the current code is fine.

nemo_skills/prompt/config/robustness/code_prompts/code_2.yaml (1)
1-6: Consider tightening the example code block to avoid literal placeholders

The example:

```python
[Insert the Python code here.]
```

may encourage some models to emit the placeholder literally. You might instead show a minimal but real-looking stub (e.g., `# Your code here`) or omit the inner line entirely to emphasize that the block should contain executable code.

nemo_skills/prompt/config/robustness/code_prompts/code_3.yaml (1)

1-8: Minor wording tweak to clarify final code-block requirement

The line:

> You must use ``` ```python ``` for just the final solution code block with the following format:

is slightly awkward. Consider something like:

> Format the final solution as a single ``` ```python ``` code block with the following structure:

to make it clear that only the final answer should be in that block, without changing the intended behavior.

📜 Review details

**Configuration used**: Path: .coderabbit.yaml
**Review profile**: CHILL
**Plan**: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c1bf75578ba1d1ecd65315949b22419556f218ef and 4a735ec3e1fa1b4c0fcc4ceb16f01dc696c6c234.

📒 Files selected for processing (13)

* `docs/evaluation/robustness.md` (1 hunks)
* `nemo_skills/evaluation/evaluator/mcq.py` (1 hunks)
* `nemo_skills/evaluation/math_grader.py` (1 hunks)
* `nemo_skills/pipeline/robust_eval.py` (2 hunks)
* `nemo_skills/pipeline/summarize_robustness.py` (4 hunks)
* `nemo_skills/prompt/config/robustness/code_prompts/aai_prompt.yaml` (1 hunks)
* `nemo_skills/prompt/config/robustness/code_prompts/code_1.yaml` (1 hunks)
* `nemo_skills/prompt/config/robustness/code_prompts/code_2.yaml` (1 hunks)
* `nemo_skills/prompt/config/robustness/code_prompts/code_3.yaml` (1 hunks)
* `nemo_skills/prompt/config/robustness/code_prompts/code_4.yaml` (1 hunks)
* `nemo_skills/prompt/config/robustness/code_prompts/ns_gen_codegen.yaml` (1 hunks)
* `nemo_skills/prompt/config/robustness/code_prompts/ns_python_codegen.yaml` (1 hunks)
* `nemo_skills/prompt/config/robustness/prompt_set_config.yaml` (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1): nemo_skills/pipeline/summarize_robustness.py (2)

* nemo_skills/evaluation/metrics/compute_metrics.py: `ComputeMetrics` (24-103), `get_metrics_calculator` (54-57)
* nemo_skills/evaluation/metrics/utils.py: `read_predictions` (23-33)

🪛 LanguageTool: docs/evaluation/robustness.md

* [style] ~109: In American English, abbreviations like "etc." require a period. Context: "...ed for any Math (AIME, comp-math-24-25, etc) benchmarks, `prompt/config/robustness/..." (ETC_PERIOD)
* [style] ~109: In American English, abbreviations like "etc." require a period. Context: "...q_prompts` for any MCQ (GPQA, MMLU-Pro, etc) benchmarks. - robust_eval can be used ..." (ETC_PERIOD)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

* GitHub Check: unit-tests
* GitHub Check: pre-commit

🔇 Additional comments (10)

nemo_skills/evaluation/evaluator/mcq.py (1)

34-37: Relaxed-mode comment now matches actual regex-first behavior. The updated comment correctly documents that relaxed extraction tries regex first and falls back to boxed, which matches `extract_answer` in `math_grader.py`. No further changes needed.

nemo_skills/evaluation/math_grader.py (1)

106-111: Regex-first relaxed extraction is consistent and well-documented. When `relaxed=True`, `extract_answer` now prefers regex and falls back to boxed, and the docstring states this explicitly. This keeps behavior aligned with MCQ evaluation and should make regex-based answer formatting more robust.

nemo_skills/pipeline/summarize_robustness.py (1)

171-178: Updated result discovery and `prompt_sensitivity` aggregation align with new layouts. The new glob patterns for benchmarks and prompts (looking under `**/eval-results/*/output*jsonl` but excluding `output.jsonl` and chunked files) match the expected robust-eval layout and avoid double-counting aggregated outputs. Computing `prompt_sensitivity` as the std of per-prompt averages and reflecting that in the top-level header (`["min", "max", "avg", "std", "prompt_sensitivity"]`) produces a concise, more interpretable robustness summary, while keeping per-prompt rows focused on `["min", "max", "avg", "std", "no_answer"]`. The tabular formatting logic looks consistent with these changes.

Also applies to: 192-194, 220-223, 231-235, 239-253

nemo_skills/prompt/config/robustness/code_prompts/ns_gen_codegen.yaml (1)

1-12: Clear default prompt for Python codegen completion. This prompt cleanly constrains the task (complete the given function in Python without modifying existing code) and specifies an unambiguous output format in a ``` ```python ``` block. Looks good for reuse across Python-based code benchmarks.

nemo_skills/prompt/config/robustness/code_prompts/ns_python_codegen.yaml (1)

1-7: Concise default LiveCodeBench Python prompt. The prompt succinctly asks for executable Python code and leaves room for the benchmark to provide detailed instructions via `{question}`. The header comment documenting its origin is also helpful. No changes needed.

nemo_skills/prompt/config/robustness/code_prompts/code_1.yaml (1)

1-3: LGTM! Simple and clear prompt structure with proper YAML syntax and placeholder usage.

nemo_skills/prompt/config/robustness/code_prompts/aai_prompt.yaml (1)

1-10: LGTM! Well-documented prompt configuration with helpful context comments explaining design choices and external methodology alignment. The comment on line 4 provides useful integration information with `prepare.py`.

nemo_skills/prompt/config/robustness/code_prompts/code_4.yaml (1)

1-19: LGTM! Well-structured prompt with clear guidelines, helpful example format, and comprehensive instructions for code generation tasks. The multi-section organization (problem description, response guidelines, example) is pedagogically sound.

docs/evaluation/robustness.md (1)

85-105: Documentation correctly reflects metric changes. Tables properly remove the CR column and add the `no_answer` metric, aligning with the robustness evaluation changes. The updated structure matches the PR's removal of Consistency Rate metrics.

nemo_skills/prompt/config/robustness/prompt_set_config.yaml (1)

33-40: LGTM! Properly structured addition of the livecodebench prompt configuration section with 7 prompts, following established YAML conventions and matching the count referenced in documentation. Ensure that all referenced code prompt files exist: `ns_gen_codegen.yaml`, `ns_python_codegen.yaml`, `code_2.yaml`, and `code_3.yaml` are referenced in lines 35-39 but are not shown in this review.
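The `prompt_sensitivity` aggregation noted in the summarize_robustness comment (std of per-prompt averages) can be sketched as follows. This is illustrative only: the variable names and numbers are made up, and whether the pipeline uses sample or population std is not shown in this diff; this sketch assumes sample std.

```python
import statistics

# Toy per-prompt accuracies, each averaged over all questions for that prompt.
per_prompt_avg = {"prompt_1": 0.82, "prompt_2": 0.78, "prompt_3": 0.90}

# prompt_sensitivity = spread of average accuracy across prompt variants;
# a lower value means the model is more robust to prompt wording.
prompt_sensitivity = statistics.stdev(per_prompt_avg.values())

summary = {
    "min": min(per_prompt_avg.values()),
    "max": max(per_prompt_avg.values()),
    "avg": statistics.mean(per_prompt_avg.values()),
    "prompt_sensitivity": prompt_sensitivity,
}
```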
> - There are 10 Math, 10 MCQ and 7 LiveCodeBench prompts in the `prompt/config/robustness` folder, along with the prompt_set_config.yaml. Those prompts vary by prompt wording and problem placement. MCQ prompts also vary by answer formatting instruction, while Math prompts use only \boxed{} format. `prompt/config/robustness/math_prompts` can be used for any Math (AIME, comp-math-24-25, etc) benchmarks, `prompt/config/robustness/mcq_prompts` for any MCQ (GPQA, MMLU-Pro, etc) benchmarks.
> - robust_eval can be used with any dataset that Nemo-Skills supports, but summarize_robustness works on Math, MCQ, LiveCodeBench datasets and any dataset with judge evaluation (for now). If you need evaluations on multiple prompts, you can still use robust_eval. However, the `summarize_robustness` part won't work.
Add periods after abbreviations in American English.
The abbreviation "etc" appears twice without trailing periods at line 109, which is required in American English style. Additionally, line 109 could be clearer about the prompt folder organization.
Apply this diff to fix the style issues:

```diff
-- There are 10 Math, 10 MCQ and 7 LiveCodeBench prompts in the `prompt/config/robustness` folder, along with the prompt_set_config.yaml. Those prompts vary by prompt wording and problem placement. MCQ prompts also vary by answer formatting instruction, while Math prompts use only \boxed{} format. `prompt/config/robustness/math_prompts` can be used for any Math (AIME, comp-math-24-25, etc) benchmarks, `prompt/config/robustness/mcq_prompts` for any MCQ (GPQA, MMLU-Pro, etc) benchmarks.
+- There are 10 Math, 10 MCQ, and 7 LiveCodeBench prompts in the `prompt/config/robustness` folder, along with the prompt_set_config.yaml. Those prompts vary by prompt wording and problem placement. MCQ prompts also vary by answer formatting instruction, while Math prompts use only \boxed{} format. `prompt/config/robustness/math_prompts` can be used for any Math (AIME, comp-math-24-25, etc.) benchmarks, `prompt/config/robustness/mcq_prompts` for any MCQ (GPQA, MMLU-Pro, etc.) benchmarks.
```

Note: Changed "and 7" to ", and 7" (Oxford comma), and added periods after both instances of "etc."
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In docs/evaluation/robustness.md around lines 109 to 110, update the sentence
describing the prompt counts and folder usage to use American English
punctuation and clearer folder organization: add periods after both instances of
"etc." and insert the Oxford comma (change "and 7" to ", and 7"); reword the
sentence slightly to make clear which folders map to which benchmark types
(e.g., state that prompt/config/robustness/math_prompts is for Math benchmarks
and prompt/config/robustness/mcq_prompts is for MCQ benchmarks) so the
folder-to-benchmark mapping is explicit.
@titu1994 could you please review LCB robustness prompts?
…ize_robustness generic for more benchmarks, update docstrings. (#1079) Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
…ize_robustness generic for more benchmarks, update docstrings. (#1079) Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
…ize_robustness generic for more benchmarks, update docstrings. (#1079) Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
…ize_robustness generic for more benchmarks, update docstrings. (#1079) Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
…ize_robustness generic for more benchmarks, update docstrings. (#1079) Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Add LCB Prompts, fix regex bug in robust_eval, remove CR, make summarize_robustness generic for more benchmarks, update docstrings.
Change order of MCQ extraction: regex is now tried first (no change is expected for the default MCQ evals; this is needed for prompts with a custom answer formatting instruction).
Summary by CodeRabbit
New Features
Documentation
Bug Fixes