
Moving evaluation inside generation class and enforcing empty generations when remove_thinking=True#958

Merged
Kipok merged 50 commits into main from igitman/reasoning-on-arg
Oct 24, 2025
Conversation

@Kipok (Collaborator) commented Oct 17, 2025

  1. evaluate_results.py is removed; its logic is moved inside the generation class
  2. remove_thinking=True no longer checks for the beginning tag (many models don't set it, so it's not reliable). This means that if the end-of-thinking tag isn't present, the generation is set to empty. Setting remove_thinking=True for non-reasoning models would therefore be disastrous, so the default is changed to False. It now has to be set explicitly (or a reasoning parser used on the server side) for all reasoning models whose thinking shouldn't be considered in evaluation
  3. EVAL_ARGS are removed from all inits and moved into generation args
  4. extra_eval_args / eval_args parameters are removed (now specified as normal generation args)
  5. remove unused batch_evaluate_results and the Lean statement evaluator (addresses Remove the Lean4 Statement Check Logic and Lean Proof logic from batch_evaluate_results #840)
  6. remove ++eval_type=no-op (the same behavior is now achieved by simply not specifying eval_type)
  7. the top-level evaluation config is fused with the inner "eval_config", so all parameters are available inside evaluator classes, including input_files, data_dir and split (fyi @SeanNaren)
  8. input_files is replaced with a single input_file everywhere in evaluation. Evaluation is now assumed to run on a single input only (which had effectively been the case for a while anyway)
  9. rename apply_evaluation_hook -> evaluate_single_datapoint
  10. add a new postprocess_single_output that holds the logic previously in dump_outputs for renaming/removing certain fields. It is now called before interleaved evaluation (resolves eval_single / eval_full should support a few more features like pre/postprocessing and remove-thinking #841)
  11. parse_eval_args logic is fully removed as it's no longer needed (batch and interleaved evals are handled in the same place and with the same parameters)
  12. remove_thinking now saves the full generation into _full_{generation_key}, which avoids overriding _full_generation in judge jobs
  13. remove_thinking is renamed to parse_reasoning and thinking_end is renamed to end_reasoning_string to make the usage clearer
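Taken together, items 2, 12, and 13 describe roughly the following behavior for the renamed utility. This is a minimal sketch with simplified names and signatures; the actual implementation in nemo_skills/utils.py differs (e.g., it also logs a warning when the end tag is missing):

```python
def parse_reasoning(data_point, generation_key="generation", end_reasoning_string="</think>"):
    """Strip the reasoning prefix from a generation.

    The full text is preserved under _full_{generation_key} (item 12) and,
    since there is no begin-tag check anymore (item 2), a missing end marker
    is treated as unfinished reasoning and the generation is set to empty.
    """
    full = data_point[generation_key]
    # Preserve the original generation so judge jobs can still access it
    data_point[f"_full_{generation_key}"] = full
    marker_pos = full.find(end_reasoning_string)
    if marker_pos == -1:
        # No end-of-reasoning marker: enforce an empty generation
        data_point[generation_key] = ""
    else:
        data_point[generation_key] = full[marker_pos + len(end_reasoning_string):].lstrip()
    return data_point

print(parse_reasoning({"generation": "<think>steps</think> final answer"})["generation"])  # → final answer
```

This is also why the default changed: a non-reasoning model never emits the end marker, so every generation would be blanked out.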

Summary by CodeRabbit

  • New Features

    • Per-datapoint evaluation can run during generation.
  • Documentation

    • CLI examples modernized to use unified ++-prefixed args and many examples add ++remove_thinking=True; guidance warns about reasoning-enabled evaluation.
  • Refactor

    • Evaluation/config flags consolidated into generation arguments (e.g., ++eval_type moved into generation args); command syntax simplified.
  • Behavior change

    • Thinking-token handling standardized and defaults adjusted (remove_thinking behavior changed).

Kipok added 14 commits October 16, 2025 15:11
Signed-off-by: Igor Gitman <igitman@nvidia.com>
coderabbitai bot (Contributor) commented Oct 17, 2025

Walkthrough

Consolidates evaluation/config flags into generation arguments, removes the standalone evaluate_results module, introduces a BaseEvaluatorConfig and single-file evaluator workflows, centralizes remove_thinking behavior (dropping thinking_begin), updates generation to support per-datapoint evaluation hooks, and updates many docs and examples to use ++remove_thinking and new ++eval_config/++eval_type CLI syntax.

Changes

Cohort / File(s) / Summary

  • Docs: examples & flags
    Files: docs/... (multiple files, e.g., docs/evaluation/*, docs/basics/index.md, docs/agentic_inference/parallel_thinking.md, docs/pipelines/generation.md, docs/releases/*, docs/tutorials/posts/*, recipes/*)
    Added ++remove_thinking=True in many examples; replaced --eval_args="..." / --extra_eval_args="..." wrappers with direct ++eval_type / ++eval_config.* flags; adjusted CLI examples and explanatory text.
  • Datasets: merge EVAL_ARGS → GENERATION_ARGS
    Files: nemo_skills/dataset/*/__init__.py, .../prepare.py (many modules)
    Removed module-level EVAL_ARGS (and some PROMPT_CONFIG) and appended equivalent ++eval_type=... / eval-config fragments into GENERATION_ARGS; special-cased swe-bench changed to GENERATION_MODULE.
  • Evaluation: remove module & refactor evaluators
    Files: nemo_skills/evaluation/evaluate_results.py, nemo_skills/evaluation/evaluator/*, nemo_skills/evaluation/math_grader.py
    Deleted evaluate_results.py; removed helpers (unroll_files, evaluate_result); added BaseEvaluatorConfig; refactored many evaluators to single-input-file processing and new function/class signatures; changed the top-level evaluate API to evaluate(eval_type, eval_config).
  • Evaluator configs & APIs
    Files: nemo_skills/evaluation/evaluator/*.py (arena, base, bfcl, code, ifbench, ifeval, ioi, livecodebench, math, mcq, mrcr, ojbench, ruler, scicode, ...)
    Introduced/used BaseEvaluatorConfig; evaluator config classes now inherit from it; eval_full / async signatures simplified to operate on input_file/single-file workflows; small config fields added/changed (e.g., num_retries, overwrite).
  • Sandbox / code-exec
    Files: nemo_skills/code_execution/sandbox.py
    Removed batch_evaluate_results, unroll_files usage, and related proof-eval helpers; dropped tqdm/progress and some imports.
  • Generation & inference: thinking & eval hooks
    Files: nemo_skills/inference/*, nemo_skills/inference/generate.py, nemo_skills/inference/model/parallel_thinking.py
    Removed thinking_begin from configs and call sites; changed the ParallelThinkingConfig default for remove_thinking; added per-datapoint hooks (postprocess_single_output, evaluate_single_datapoint) and a batch fallback (run_batch_evaluation); added wiring to initialize and call evaluators.
  • Pipeline & CLI: arg consolidation
    Files: nemo_skills/pipeline/*.py, nemo_skills/pipeline/utils/*
    Removed eval_args / extra_eval_args options and parsing helpers; consolidated evaluation config into generation_args; simplified get_generation_cmd to drop the eval-specific branch and append extra arguments only when present.
  • Utils: remove_thinking & logging
    Files: nemo_skills/utils.py
    Rewrote remove_thinking to drop thinking_begin, use a dynamic _full_{generation_key}, and warn when the end tag is missing; added get_logger_name() and a module logger.
  • Inference eval modules: config changes
    Files: nemo_skills/inference/eval/{bfcl,scicode,swebench}.py
    Removed dataclass remove_thinking/thinking_begin fields in places; adjusted calls to remove_thinking; added eval-related config fields and renamed hooks (e.g., evaluate_single_datapoint).
  • Tests & scripts
    Files: tests/*, tests/slurm-tests/*, tests/gpu-tests/*, tests/test_configs.py, tests/slurm-tests/run_all.sh, recipes/*
    Updated tests and slurm scripts to the new CLI syntax (++eval_type, ++remove_thinking=True); removed a test referencing evaluate_results; added RUN_NAME and sequential execution in run_all.sh; minor test adjustments.
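The evaluator-side consolidation described above can be sketched as follows. Class, field, and registry names here are assumptions inferred from the summary, not the actual nemo_skills implementation:

```python
from dataclasses import dataclass


@dataclass
class BaseEvaluatorConfig:
    # A single input_file replaces the old input_files list everywhere
    input_file: str = ""


class BaseEvaluator:
    CONFIG_CLASS = BaseEvaluatorConfig

    def __init__(self, eval_config: dict):
        # The top-level evaluation config is fused with the inner eval_config,
        # so fields like input_file are available directly on the evaluator
        self.config = self.CONFIG_CLASS(**eval_config)

    def supports_single_eval(self) -> bool:
        # Overridden by evaluators that can score one datapoint at a time,
        # enabling evaluation interleaved with generation
        return False

    def eval_full(self):
        raise NotImplementedError


class MathEvaluator(BaseEvaluator):
    def eval_full(self):
        return f"evaluated {self.config.input_file}"


EVALUATOR_MAP = {"math": MathEvaluator}  # eval_type -> evaluator class


def evaluate(eval_type, eval_config):
    # ++eval_type=no-op is gone: omitting eval_type simply skips evaluation
    if eval_type is None:
        return None
    return EVALUATOR_MAP[eval_type](eval_config).eval_full()


print(evaluate("math", {"input_file": "output.jsonl"}))  # → evaluated output.jsonl
```

The point of the shared base config is that every evaluator sees the same fused parameters, so per-evaluator option plumbing disappears.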

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant Pipeline
    participant Generator
    participant Evaluator
    participant Sandbox

    User->>CLI: run generate with ++eval_type / ++eval_config / ++remove_thinking
    CLI->>Pipeline: parse generation_args (includes eval config)
    Pipeline->>Generator: start generation task (per-datapoint)
    alt per-datapoint evaluation enabled
        Generator->>Sandbox: generate datapoint
        Generator->>Generator: postprocess_single_output(datapoint)
        Generator->>Evaluator: evaluate_single_datapoint(datapoint)
        Evaluator->>Sandbox: run evaluation (single-file or per-datapoint)
        Evaluator-->>Generator: metrics
        Generator->>Pipeline: write augmented output
    else batch-only evaluation
        loop generate datapoints
            Generator->>Sandbox: generate
            Generator->>Generator: postprocess_single_output()
            Generator->>Pipeline: append output
        end
        Pipeline->>Evaluator: run_batch_evaluation(output_file)
        Evaluator->>Sandbox: run evaluation on file
        Evaluator-->>Pipeline: results
    end
    Pipeline-->>User: finished
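The branching in the diagram above can be simulated with stub components. Everything here, including StubEvaluator, is hypothetical and only illustrates the control flow:

```python
import asyncio


class StubEvaluator:
    # Flip to False to exercise the batch-only branch of the diagram
    single_eval = True

    async def eval_single(self, dp):
        return {**dp, "is_correct": dp["generation"] == dp["expected"]}


async def generate_with_eval(data, evaluator):
    # Per-datapoint path: generate, postprocess, then evaluate immediately;
    # otherwise fall back to one batch evaluation over the finished outputs.
    outputs = []
    for dp in data:
        dp = {**dp, "generation": dp["prompt"].upper()}  # stand-in for the LLM call
        if evaluator.single_eval:
            dp = await evaluator.eval_single(dp)
        outputs.append(dp)
    if not evaluator.single_eval:
        outputs = [await evaluator.eval_single(dp) for dp in outputs]
    return outputs


data = [{"prompt": "ok", "expected": "OK"}, {"prompt": "no", "expected": "NO!"}]
results = asyncio.run(generate_with_eval(data, StubEvaluator()))
print([dp["is_correct"] for dp in results])  # → [True, False]
```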
sequenceDiagram
    participant Old as "Old evaluate_results"
    participant New as "New evaluator path"
    participant Eval as "Evaluator"

    Old->>Eval: evaluate(cfg)  -- (removed module)
    Note right of New: New flow embeds eval into generation args\nand uses BaseEvaluatorConfig per-file
    New->>Eval: evaluate(eval_type, eval_config)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I hopped through flags and stitched them well,
Merged evals to gen where quiet configs dwell,
Trimmed thinking-tags, one-file leaps ahead,
Docs and scripts refreshed, the pathways led,
A tidy warren of flags — I humbly tread.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title Check: ✅ Passed. The pull request title "Moving evaluation inside generation class and enforcing empty generations when remove_thinking=True" directly addresses the two primary changes in this changeset. The first part, moving evaluation inside the generation class, is clearly demonstrated by the removal of evaluate_results.py and the integration of evaluation logic into the generation class with new methods like evaluate_single_datapoint and run_batch_evaluation. The second part, enforcing empty generations when remove_thinking=True, is reflected in the updated remove_thinking behavior in utils.py that sets the generation to empty when the thinking_end marker is not found. The title is concise, specific, and clearly conveys the main architectural and behavioral changes that a teammate scanning the PR history would understand.
  • Description Check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.

Signed-off-by: Igor Gitman <igitman@nvidia.com>
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
nemo_skills/inference/model/parallel_thinking.py (1)

221-237: Critical: Accessing removed thinking_begin field causes AttributeError.

Lines 227-228 reference self.cfg.thinking_begin, but this field was removed from ParallelThinkingConfig per the PR objectives. This will cause a runtime AttributeError when filter_incomplete_solutions is True (the default).

The filtering logic needs to be updated to align with the new approach that only checks for thinking_end. Based on the PR objectives and the remove_thinking utility behavior, incomplete solutions should be identified by the absence of thinking_end when reasoning is expected.

Apply this diff to fix the logic:

         # Filter out incomplete solutions if specified
         if self.cfg.filter_incomplete_solutions:
             # Remove unfinished solutions
             filtered_solutions = []
             for solution in solutions:
-                # Check if thinking_begin is in the solution and thinking_end is not in the solution
-                if (
-                    self.cfg.thinking_begin in solution[self.cfg.solution_key]
-                    and self.cfg.thinking_end not in solution[self.cfg.solution_key]
-                ):
-                    continue
-                else:
-                    filtered_solutions.append(solution)
+                # Check if thinking_end is not in the solution when remove_thinking is enabled
+                if self.cfg.remove_thinking and self.cfg.thinking_end not in solution[self.cfg.solution_key]:
+                    # Incomplete solution - missing thinking_end marker
+                    continue
+                filtered_solutions.append(solution)
 
             if len(filtered_solutions) < len(solutions):
                 LOG.info(f"Filtered out {len(solutions) - len(filtered_solutions)} incomplete solutions")

Note: This fix assumes incomplete solutions should only be filtered when remove_thinking=True, since that's when the thinking_end marker is expected. You may want to verify this logic aligns with your intended behavior.

nemo_skills/inference/eval/bfcl.py (1)

386-391: Critical: Reference to removed field will cause runtime error.

According to the AI summary, the remove_thinking field was removed from BFCLGenerationConfig, but Line 386 still references self.cfg.remove_thinking. This will raise an AttributeError at runtime when this code path is executed.

The TODO comment on Line 387 acknowledges this needs refactoring. Please either:

  1. Keep the remove_thinking field in BFCLGenerationConfig (revert its removal), or
  2. Remove this entire conditional block and the _remove_thinking_from_message_content call, or
  3. Access remove_thinking from a different source (e.g., parent class or generation config)

Apply one of these fixes before merging:

Option 1: Keep the field (if needed for BFCL)

 @nested_dataclass(kw_only=True)
 class BFCLGenerationConfig(GenerateSolutionsConfig):
     """BFCL benchmark generation."""
+    remove_thinking: bool = False

Option 2: Remove the conditional (if not needed)

-                if self.cfg.remove_thinking:
-                    # TODO: replace with main remove_thinking method
-                    trimmed_response_text = self._remove_thinking_from_message_content(
-                        self.message_parser.get_response_text(model_response["message"])
-                    )
-                    self.message_parser.set_response_text(model_response["message"], trimmed_response_text)
nemo_skills/inference/generate.py (1)

642-646: Prefilled outputs bypass postprocess/single-eval and lose _async_position → restore_async_order will break

You dump prefilled outputs directly without running postprocess_single_output. Since dump_outputs no longer merges original fields, lines written to -async won’t include the async position key and schema normalization. This will cause a KeyError in restore_async_order and also skip interleaved eval for prefilled rows.

Apply postprocess (and single-eval when enabled) before writing prefilled outputs:

-            # Dump prefilled data first
-            if len(prefilled_data_points) > 0:
-                async with self.output_lock:
-                    self.dump_outputs(prefilled_outputs, prefilled_data_points, fout)
+            # Dump prefilled data first (normalize schema and optionally interleave eval)
+            if len(prefilled_data_points) > 0:
+                prepared_outputs = []
+                for dp, out in zip(prefilled_data_points, prefilled_outputs):
+                    await self.postprocess_single_output(out, dp)
+                    if self.should_run_evaluation and self.evaluator:
+                        out = await self.evaluate_single_datapoint({**dp, **out})
+                    prepared_outputs.append(out)
+                async with self.output_lock:
+                    self.dump_outputs(prepared_outputs, prefilled_data_points, fout)
🧹 Nitpick comments (9)
nemo_skills/evaluation/evaluator/base.py (1)

55-60: Consider streaming task creation for large files.

All tasks are created upfront (lines 57-60) before any awaiting, which could consume significant memory for large evaluation files. While the semaphore limits concurrent execution, all task objects are instantiated immediately. A streaming approach (creating tasks in batches) could reduce memory pressure.

However, the comment on Line 46 explicitly acknowledges this tradeoff, and the current approach is simpler and likely sufficient for typical evaluation file sizes.

If needed, here's a batch-based approach:

# In the reading section:
lines = []
with open(self.config.input_file, "rt", encoding="utf-8") as fin:
    lines = [json.loads(line) for line in fin]

# Create and process in batches
temp_file = self.config.input_file + "-tmp"
batch_size = 1000  # Adjust based on needs
with open(temp_file, "wt", encoding="utf-8") as f:
    for i in tqdm.tqdm(range(0, len(lines), batch_size), desc=f"Completed Evaluation for {os.path.basename(self.config.input_file)}"):
        batch = lines[i:i+batch_size]
        tasks = [asyncio.create_task(process_line(line)) for line in batch]
        for task in await asyncio.gather(*tasks):
            f.write(json.dumps(task) + "\n")
nemo_skills/evaluation/evaluator/scicode.py (1)

132-139: Consider using atomic write pattern for data integrity.

The current implementation overwrites the input file directly during evaluation. If the process crashes or is interrupted while writing, the original data could be corrupted or partially lost.

Consider adopting the atomic write pattern used in base.py (lines 60-67 in the relevant snippets):

-    jsonl_file = eval_config.input_file
-    with open(jsonl_file, "rt", encoding="utf-8") as fin:
-        data = [json.loads(line) for line in fin]
-    status_lists = test_code(eval_config, data)
-    with open(jsonl_file, "wt", encoding="utf-8") as fout:
-        for idx, elem in enumerate(data):
-            elem["eval_status"] = status_lists[idx]
-            fout.write(json.dumps(elem) + "\n")
+    jsonl_file = eval_config.input_file
+    with open(jsonl_file, "rt", encoding="utf-8") as fin:
+        data = [json.loads(line) for line in fin]
+    status_lists = test_code(eval_config, data)
+    
+    temp_file = jsonl_file + "-tmp"
+    with open(temp_file, "wt", encoding="utf-8") as fout:
+        for idx, elem in enumerate(data):
+            elem["eval_status"] = status_lists[idx]
+            fout.write(json.dumps(elem) + "\n")
+    
+    os.replace(temp_file, jsonl_file)

Don't forget to add import os at the top of the file if not already present.

nemo_skills/dataset/swe-bench/__init__.py (1)

15-20: Clarify the intent of removing GENERATION_ARGS from swe-bench.

The concern about breaking changes is overstated: EVAL_ARGS is unused throughout the codebase (zero references), and GENERATION_ARGS is fetched at line 126 in nemo_skills/pipeline/utils/eval.py with a default value of "", so missing the constant won't cause errors—it will simply default to an empty string.

However, removing GENERATION_ARGS creates an inconsistency: all 60+ other datasets define it in their __init__.py files. Either add GENERATION_ARGS to swe-bench to maintain consistency or document why swe-bench is an exception.

nemo_skills/evaluation/evaluator/ifbench.py (1)

21-59: Verify that sample lengths match before zipping.

The refactor to single-file workflow using BaseEvaluatorConfig is well-structured. However, the zip() calls at Lines 46 and 51 should use strict=True (Python 3.10+) to catch length mismatches between samples and eval_results.

Apply this diff to add strict validation:

-    for sample, eval_result in zip(samples, eval_results):
+    for sample, eval_result in zip(samples, eval_results, strict=True):
         sample["loose_eval"] = eval_result

     with open(output_dir / "eval_results_strict.jsonl", "rt", encoding="utf-8") as f:
         eval_results = [json.loads(line) for line in f]
-    for sample, eval_result in zip(samples, eval_results):
+    for sample, eval_result in zip(samples, eval_results, strict=True):
         sample["strict_eval"] = eval_result
nemo_skills/inference/generate.py (5)

336-344: Avoid double instantiation of evaluator (perf/side effects)

supports_single_eval() constructs an evaluator instance, and then you construct it again with get_evaluator_class(). Instantiate once and query supports_single_eval() on that instance.

@@
-from nemo_skills.evaluation.evaluator import (
-    evaluate,
-    get_evaluator_class,
-    supports_single_eval,
-)
+from nemo_skills.evaluation.evaluator import (
+    evaluate,
+    get_evaluator_class,
+)
@@
-        if self.should_run_evaluation:
-            self.cfg.eval_config = dict(self.cfg.eval_config)
-            if supports_single_eval(self.cfg.eval_type, self.cfg.eval_config):
-                LOG.info("Evaluator supports per-datapoint evals, will interleave evaluation with generation.")
-                self.evaluator = get_evaluator_class(self.cfg.eval_type, self.cfg.eval_config)
+        if self.should_run_evaluation:
+            self.cfg.eval_config = dict(self.cfg.eval_config)
+            eval_inst = get_evaluator_class(self.cfg.eval_type, self.cfg.eval_config)
+            if eval_inst.supports_single_eval():
+                LOG.info("Evaluator supports per-datapoint evals, will interleave evaluation with generation.")
+                self.evaluator = eval_inst

Also applies to: 35-39


537-540: Don’t mutate original_data_point during postprocess

Popping keys from the original data point introduces surprising side effects. Compute a filtered copy instead.

-        for key in output:
-            original_data_point.pop(key, None)
-        output.update(original_data_point)
+        # Avoid mutating the input datapoint; only merge non-conflicting keys
+        filtered_original = {k: v for k, v in original_data_point.items() if k not in output}
+        output.update(filtered_original)

589-595: Throttle per-datapoint evaluation to avoid unbounded concurrency

Interleaved eval runs concurrently across all tasks, independent of LLM request throttling. If evaluators are heavy (e.g., spawn subprocesses), this can exhaust resources.

Option A (simple): reuse the existing semaphore.

-        eval_results = await self.evaluator.eval_single(data_point)
+        async with self.semaphore:
+            eval_results = await self.evaluator.eval_single(data_point)

Option B (preferred): add a dedicated eval semaphore (e.g., cfg.eval_max_concurrent_requests) and guard with it here.


522-524: dump_outputs no longer uses data_points parameter

Now that merging is handled in postprocess_single_output, the data_points arg is unused. Consider removing it from the signature and call sites to reduce confusion.


729-731: Guard batch evaluation when output file is missing

If there’s no data and no existing output file, batch evaluation will likely fail. Add a simple existence check.

-        if self.should_run_evaluation and self.evaluator is None:
-            self.run_batch_evaluation()
+        if self.should_run_evaluation and self.evaluator is None:
+            if Path(self.cfg.output_file).exists():
+                self.run_batch_evaluation()
+            else:
+                LOG.info("Skipping batch evaluation: output file not found: %s", self.cfg.output_file)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cd862d0 and 9bc6920.

📒 Files selected for processing (107)
  • docs/agentic_inference/parallel_thinking.md (3 hunks)
  • docs/basics/index.md (1 hunks)
  • docs/evaluation/code.md (2 hunks)
  • docs/evaluation/index.md (3 hunks)
  • docs/evaluation/multilingual.md (1 hunks)
  • docs/evaluation/natural-math.md (1 hunks)
  • docs/evaluation/robustness.md (1 hunks)
  • docs/evaluation/tool-calling.md (3 hunks)
  • docs/pipelines/generation.md (2 hunks)
  • docs/releases/opencodereasoning/evaluation.md (1 hunks)
  • docs/releases/openmathinstruct2/dataset.md (2 hunks)
  • docs/releases/openmathreasoning/evaluation.md (4 hunks)
  • docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md (8 hunks)
  • docs/tutorials/posts/nemotron-nano-v2-evals.md (6 hunks)
  • docs/tutorials/posts/omr-simple-recipe.md (1 hunks)
  • nemo_skills/code_execution/sandbox.py (1 hunks)
  • nemo_skills/dataset/aai/__init__.py (1 hunks)
  • nemo_skills/dataset/aalcr/__init__.py (1 hunks)
  • nemo_skills/dataset/aime24/__init__.py (1 hunks)
  • nemo_skills/dataset/aime25/__init__.py (1 hunks)
  • nemo_skills/dataset/algebra222/__init__.py (1 hunks)
  • nemo_skills/dataset/amc23/__init__.py (1 hunks)
  • nemo_skills/dataset/answer-judge/__init__.py (0 hunks)
  • nemo_skills/dataset/arena-hard/__init__.py (1 hunks)
  • nemo_skills/dataset/asdiv/__init__.py (1 hunks)
  • nemo_skills/dataset/beyond-aime/__init__.py (1 hunks)
  • nemo_skills/dataset/bfcl_v3/prepare.py (1 hunks)
  • nemo_skills/dataset/bigcodebench/__init__.py (1 hunks)
  • nemo_skills/dataset/brumo25/__init__.py (1 hunks)
  • nemo_skills/dataset/college_math/__init__.py (1 hunks)
  • nemo_skills/dataset/comp-math-24-25/__init__.py (1 hunks)
  • nemo_skills/dataset/flores200/__init__.py (1 hunks)
  • nemo_skills/dataset/gaokao2023en/__init__.py (1 hunks)
  • nemo_skills/dataset/gpqa/__init__.py (1 hunks)
  • nemo_skills/dataset/gsm-plus/__init__.py (1 hunks)
  • nemo_skills/dataset/gsm8k/__init__.py (1 hunks)
  • nemo_skills/dataset/hle/__init__.py (1 hunks)
  • nemo_skills/dataset/hmmt_feb25/__init__.py (1 hunks)
  • nemo_skills/dataset/human-eval-infilling/__init__.py (1 hunks)
  • nemo_skills/dataset/human-eval/__init__.py (1 hunks)
  • nemo_skills/dataset/ifbench/__init__.py (1 hunks)
  • nemo_skills/dataset/ifeval/__init__.py (1 hunks)
  • nemo_skills/dataset/ioi24/__init__.py (1 hunks)
  • nemo_skills/dataset/ioi25/__init__.py (1 hunks)
  • nemo_skills/dataset/livebench-coding/__init__.py (1 hunks)
  • nemo_skills/dataset/livecodebench-cpp/__init__.py (1 hunks)
  • nemo_skills/dataset/livecodebench-pro/__init__.py (1 hunks)
  • nemo_skills/dataset/livecodebench/__init__.py (1 hunks)
  • nemo_skills/dataset/math-500/__init__.py (1 hunks)
  • nemo_skills/dataset/math-odyssey/__init__.py (1 hunks)
  • nemo_skills/dataset/math/__init__.py (1 hunks)
  • nemo_skills/dataset/mawps/__init__.py (1 hunks)
  • nemo_skills/dataset/mbpp/__init__.py (1 hunks)
  • nemo_skills/dataset/minerva_math/__init__.py (1 hunks)
  • nemo_skills/dataset/minif2f/__init__.py (1 hunks)
  • nemo_skills/dataset/mmlu-pro/__init__.py (1 hunks)
  • nemo_skills/dataset/mmlu-prox/__init__.py (1 hunks)
  • nemo_skills/dataset/mmlu-redux/__init__.py (1 hunks)
  • nemo_skills/dataset/mmlu/__init__.py (1 hunks)
  • nemo_skills/dataset/mobench/__init__.py (1 hunks)
  • nemo_skills/dataset/mrcr/__init__.py (1 hunks)
  • nemo_skills/dataset/ojbench/__init__.py (1 hunks)
  • nemo_skills/dataset/olympiadbench/__init__.py (1 hunks)
  • nemo_skills/dataset/omni-math/__init__.py (1 hunks)
  • nemo_skills/dataset/proofnet/__init__.py (1 hunks)
  • nemo_skills/dataset/putnam-bench/__init__.py (1 hunks)
  • nemo_skills/dataset/ruler/prepare.py (1 hunks)
  • nemo_skills/dataset/scicode/__init__.py (1 hunks)
  • nemo_skills/dataset/simpleqa/__init__.py (1 hunks)
  • nemo_skills/dataset/supergpqa/__init__.py (1 hunks)
  • nemo_skills/dataset/svamp/__init__.py (1 hunks)
  • nemo_skills/dataset/swe-bench/__init__.py (1 hunks)
  • nemo_skills/dataset/wmt24pp/__init__.py (1 hunks)
  • nemo_skills/evaluation/evaluate_results.py (0 hunks)
  • nemo_skills/evaluation/evaluator/__init__.py (1 hunks)
  • nemo_skills/evaluation/evaluator/arena.py (1 hunks)
  • nemo_skills/evaluation/evaluator/base.py (2 hunks)
  • nemo_skills/evaluation/evaluator/bfcl.py (2 hunks)
  • nemo_skills/evaluation/evaluator/code.py (7 hunks)
  • nemo_skills/evaluation/evaluator/ifbench.py (1 hunks)
  • nemo_skills/evaluation/evaluator/ifeval.py (1 hunks)
  • nemo_skills/evaluation/evaluator/ioi.py (2 hunks)
  • nemo_skills/evaluation/evaluator/livecodebench.py (4 hunks)
  • nemo_skills/evaluation/evaluator/math.py (4 hunks)
  • nemo_skills/evaluation/evaluator/mcq.py (2 hunks)
  • nemo_skills/evaluation/evaluator/mrcr.py (2 hunks)
  • nemo_skills/evaluation/evaluator/ojbench.py (2 hunks)
  • nemo_skills/evaluation/evaluator/ruler.py (3 hunks)
  • nemo_skills/evaluation/evaluator/scicode.py (3 hunks)
  • nemo_skills/evaluation/math_grader.py (0 hunks)
  • nemo_skills/inference/eval/bfcl.py (1 hunks)
  • nemo_skills/inference/eval/scicode.py (1 hunks)
  • nemo_skills/inference/eval/swebench.py (2 hunks)
  • nemo_skills/inference/generate.py (10 hunks)
  • nemo_skills/inference/model/parallel_thinking.py (1 hunks)
  • nemo_skills/pipeline/eval.py (0 hunks)
  • nemo_skills/pipeline/generate.py (0 hunks)
  • nemo_skills/pipeline/utils/eval.py (5 hunks)
  • nemo_skills/pipeline/utils/generation.py (1 hunks)
  • nemo_skills/utils.py (1 hunks)
  • recipes/openmathreasoning/scripts/simplified_recipe.py (1 hunks)
  • recipes/openreasoning/eval.py (7 hunks)
  • tests/gpu-tests/test_generate.py (1 hunks)
  • tests/slurm-tests/qwen3_4b_evals/run_test.py (1 hunks)
  • tests/slurm-tests/super_49b_evals/run_test.py (1 hunks)
  • tests/test_configs.py (0 hunks)
  • tests/test_generation.py (1 hunks)
💤 Files with no reviewable changes (6)
  • nemo_skills/pipeline/eval.py
  • nemo_skills/dataset/answer-judge/__init__.py
  • nemo_skills/pipeline/generate.py
  • nemo_skills/evaluation/math_grader.py
  • tests/test_configs.py
  • nemo_skills/evaluation/evaluate_results.py
🧰 Additional context used
🧬 Code graph analysis (22)
recipes/openmathreasoning/scripts/simplified_recipe.py (1)
nemo_skills/pipeline/cli.py (1)
  • wrap_arguments (42-51)
nemo_skills/evaluation/evaluator/scicode.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/utils.py (1)
  • nested_dataclass (63-96)
recipes/openreasoning/eval.py (1)
nemo_skills/pipeline/cli.py (1)
  • wrap_arguments (42-51)
nemo_skills/evaluation/evaluator/arena.py (1)
nemo_skills/utils.py (1)
  • get_logger_name (37-41)
nemo_skills/evaluation/evaluator/ifbench.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/utils.py (1)
  • get_logger_name (37-41)
nemo_skills/inference/eval/scicode.py (1)
nemo_skills/utils.py (1)
  • remove_thinking (47-60)
nemo_skills/inference/model/parallel_thinking.py (1)
nemo_skills/utils.py (1)
  • remove_thinking (47-60)
nemo_skills/evaluation/evaluator/ruler.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/utils.py (2)
  • get_logger_name (37-41)
  • nested_dataclass (63-96)
nemo_skills/evaluation/evaluator/code.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/utils.py (2)
  • get_logger_name (37-41)
  • nested_dataclass (63-96)
nemo_skills/evaluation/evaluator/ifeval.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/utils.py (1)
  • get_logger_name (37-41)
nemo_skills/evaluation/evaluator/mcq.py (3)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/evaluation/math_grader.py (1)
  • extract_answer (95-139)
nemo_skills/utils.py (1)
  • get_logger_name (37-41)
nemo_skills/evaluation/evaluator/bfcl.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/utils.py (1)
  • nested_dataclass (63-96)
nemo_skills/inference/eval/swebench.py (2)
nemo_skills/utils.py (4)
  • get_help_message (335-355)
  • get_logger_name (37-41)
  • nested_dataclass (63-96)
  • setup_logging (99-134)
nemo_skills/inference/generate.py (1)
  • evaluate_single_datapoint (589-595)
nemo_skills/evaluation/evaluator/__init__.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • eval_full (42-72)
nemo_skills/evaluation/evaluator/ioi.py (1)
  • eval_full (377-393)
nemo_skills/evaluation/evaluator/base.py (2)
nemo_skills/utils.py (1)
  • nested_dataclass (63-96)
nemo_skills/evaluation/evaluator/ioi.py (1)
  • eval_full (377-393)
nemo_skills/evaluation/evaluator/ojbench.py (4)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/evaluation/evaluator/code.py (1)
  • preprocess_code (37-93)
nemo_skills/evaluation/evaluator/livecodebench.py (3)
  • execute_in_sandbox_with_retries (63-91)
  • is_sandbox_available (94-115)
  • sandbox_context (53-60)
nemo_skills/utils.py (1)
  • nested_dataclass (63-96)
nemo_skills/pipeline/utils/eval.py (1)
nemo_skills/inference/generate.py (1)
  • get_generation_default_args (247-255)
nemo_skills/evaluation/evaluator/math.py (4)
nemo_skills/code_execution/proof_utils.py (3)
  • ProofBuildConfig (26-32)
  • build_lean4_proof (93-133)
  • determine_proof_status (136-162)
nemo_skills/code_execution/sandbox.py (1)
  • get_sandbox (339-342)
nemo_skills/evaluation/evaluator/base.py (2)
  • BaseEvaluator (34-91)
  • BaseEvaluatorConfig (27-31)
nemo_skills/evaluation/math_grader.py (2)
  • extract_answer (95-139)
  • math_equal (37-92)
nemo_skills/evaluation/evaluator/ioi.py (3)
nemo_skills/evaluation/evaluator/base.py (3)
  • BaseEvaluator (34-91)
  • BaseEvaluatorConfig (27-31)
  • eval_full (42-72)
nemo_skills/file_utils.py (1)
  • jdump (68-100)
nemo_skills/utils.py (1)
  • nested_dataclass (63-96)
nemo_skills/evaluation/evaluator/mrcr.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/utils.py (1)
  • get_logger_name (37-41)
nemo_skills/inference/generate.py (3)
nemo_skills/evaluation/evaluator/__init__.py (3)
  • evaluate (112-125)
  • get_evaluator_class (90-100)
  • supports_single_eval (103-109)
nemo_skills/evaluation/evaluator/base.py (2)
  • supports_single_eval (89-91)
  • eval_single (74-87)
nemo_skills/utils.py (1)
  • remove_thinking (47-60)
nemo_skills/evaluation/evaluator/livecodebench.py (3)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/evaluation/evaluator/code.py (1)
  • preprocess_code (37-93)
nemo_skills/utils.py (1)
  • nested_dataclass (63-96)
🪛 LanguageTool
docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md

[grammar] ~214-~214: There might be a mistake here.
Context: ... not specify ++remove_thinking=True as RULER has predefined setup for those par...

(QB_NEW_EN)


[grammar] ~215-~215: There might be a mistake here.
Context: ...ecify ++remove_thinking=True as RULER has predefined setup for those parameters. ...

(QB_NEW_EN)

docs/evaluation/index.md

[grammar] ~54-~54: There might be a mistake here.
Context: ... evaluation of reasoning models, either provide reasoning parser in server args (e.g. `...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/index.md

53-53: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)

🪛 Ruff (0.14.0)
nemo_skills/evaluation/evaluator/ifbench.py

39-39: subprocess call with shell=True identified, security issue

(S602)


46-46: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


51-51: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

nemo_skills/evaluation/evaluator/code.py

264-264: Avoid specifying long messages outside the exception class

(TRY003)


276-276: Possible hardcoded password assigned to argument: "pass_k"

(S106)


338-338: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/evaluation/evaluator/ifeval.py

39-39: subprocess call with shell=True identified, security issue

(S602)


46-46: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


51-51: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

nemo_skills/evaluation/evaluator/bfcl.py

63-63: subprocess call with shell=True identified, security issue

(S602)


69-69: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

nemo_skills/evaluation/evaluator/ojbench.py

103-103: Use explicit conversion flag

Replace with conversion flag

(RUF010)


105-105: Use explicit conversion flag

Replace with conversion flag

(RUF010)


106-106: Use explicit conversion flag

Replace with conversion flag

(RUF010)


122-122: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/evaluation/evaluator/math.py

70-72: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/evaluation/evaluator/ioi.py

385-385: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

nemo_skills/evaluation/evaluator/livecodebench.py

217-217: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


262-262: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests

Comment on lines +33 to +39
```python
cmd = (
    "cd /opt/benchmarks/IFBench && python -m run_eval "
    f"--input_data={jsonl_file} "
    f"--input_response_data={jsonl_file} "
    f"--output_dir={output_dir} "
)
subprocess.run(cmd, shell=True, check=True)
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Security: Review subprocess call with shell=True.

The use of subprocess.run with shell=True at Line 39 poses a security risk if jsonl_file or other variables contain untrusted input. While jsonl_file comes from cfg.input_file (likely controlled), ensure this path is validated or consider using shell=False with proper command construction.

If shell=True is necessary, ensure jsonl_file is properly validated and does not contain special shell characters. Alternatively, refactor to use shell=False:

```python
cmd_list = [
    "python", "-m", "run_eval",
    f"--input_data={jsonl_file}",
    f"--input_response_data={jsonl_file}",
    f"--output_dir={output_dir}",
]
subprocess.run(cmd_list, cwd="/opt/benchmarks/IFBench", check=True)
```

I'll investigate the security concern by examining the actual code context and how the input parameters are handled.

```bash
# First, examine the actual code in the file around lines 33-39
fd -t f "ifbench.py" --exec cat {} \;

# Look at the function context more carefully
rg -nB10 -A10 "subprocess.run.*shell=True" nemo_skills/evaluation/evaluator/ifbench.py

# Find where cfg.input_file is defined or used
rg -n "input_file" nemo_skills/evaluation/evaluator/ifbench.py -B5 -A5

# Check for any validation functions in the module
rg -n "def.*valid|validate|sanitize" nemo_skills/evaluation/evaluator/ifbench.py

# Look at imports and overall module structure
head -50 nemo_skills/evaluation/evaluator/ifbench.py

# Check if there are any validation patterns for file paths in the codebase
rg -n "cfg\.input" nemo_skills/ -C3 | head -50

# Check what happens with the output_dir parameter
rg -n "output_dir" nemo_skills/evaluation/evaluator/ifbench.py -B3 -A3

# Look at the complete function signature and full body
ast-grep --pattern $'def $$$($_) {
  $$$
}'

# Simpler approach: get the whole file to understand context
wc -l nemo_skills/evaluation/evaluator/ifbench.py && echo "---" && cat nemo_skills/evaluation/evaluator/ifbench.py

# Check for shell metacharacters that could break the command
# Look at how cfg is typically used in this project
rg -n "cfg\." nemo_skills/evaluation/evaluator/ifbench.py | head -20
```

---

I need to examine the actual code to verify this security concern.
🏁 Script executed:

```bash
cat nemo_skills/evaluation/evaluator/ifbench.py
```

Length of output: 2251

🏁 Script executed:

```bash
rg -n "input_file|output_dir|cfg\." nemo_skills/evaluation/evaluator/ifbench.py -B2 -A2
```

Length of output: 1178

🏁 Script executed:

```bash
rg -n "subprocess" nemo_skills/evaluation/evaluator/ifbench.py -B5 -A5
```

Length of output: 790

🏁 Script executed:

```bash
# Check for any validation of paths in this module or parent modules
fd -t f "*.py" nemo_skills/evaluation/evaluator/ --exec rg -l "validate|sanitize|Path|pathlib" {} \;
```

Length of output: 283

🏁 Script executed:

```bash
# Check the config/parameter structure
rg -n "class.*Config|@dataclass" nemo_skills/evaluation/evaluator/ -B2 -A10 | head -80
```

Length of output: 6010


```bash
# Find and inspect BaseEvaluatorConfig to see if input_file is validated
fd "base.py" nemo_skills/evaluation/evaluator/ --exec cat {} \;
# Check if input_file appears to have any validation in the base config or docs
rg -n "input_file" nemo_skills/evaluation/evaluator/base.py -B5 -A5
```

---
</details>

**Use `shell=False` with `cwd` parameter to eliminate shell injection risk.**

The current subprocess call at line 39 interpolates `jsonl_file` and `output_dir` directly into a shell command string. If these paths contain spaces or shell metacharacters (e.g., `;`, `$`, backticks), the command will parse incorrectly or execute unintended code. This is a recognized security vulnerability (CWE-78).

Refactor to avoid shell parsing entirely:

```python
cmd_list = [
    "python", "-m", "run_eval",
    f"--input_data={jsonl_file}",
    f"--input_response_data={jsonl_file}",
    f"--output_dir={output_dir}",
]
subprocess.run(cmd_list, cwd="/opt/benchmarks/IFBench", check=True)
```

This approach uses the `cwd` parameter instead of `cd` and avoids shell interpretation of path variables.

🧰 Tools
🪛 Ruff (0.14.0)

39-39: subprocess call with shell=True identified, security issue

(S602)

🤖 Prompt for AI Agents
nemo_skills/evaluation/evaluator/ifbench.py around lines 33 to 39: the
subprocess call builds a shell command string with interpolated paths and uses
shell=True, which risks shell injection and misparsing; change to pass an
argument list (e.g., ["python", "-m", "run_eval", f"--input_data={jsonl_file}",
f"--input_response_data={jsonl_file}", f"--output_dir={output_dir}"]) to
subprocess.run, remove the inline "cd" and instead set
cwd="/opt/benchmarks/IFBench", and call subprocess.run with check=True
(shell=False by default) so the command is executed securely without shell
interpretation.
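To make the difference concrete, here is a small self-contained sketch (the file path is invented for illustration) showing that the argument-list form passes awkward paths through unchanged, with no shell quoting needed:

```python
import subprocess
import sys

# Hypothetical path containing a space; a shell-interpolated string would
# split this into two arguments, but an argument list passes it verbatim.
jsonl_file = "/tmp/my results.jsonl"

# The child process echoes back the argument it actually received.
cmd_list = [sys.executable, "-c", "import sys; print(sys.argv[1])", jsonl_file]
result = subprocess.run(cmd_list, check=True, capture_output=True, text=True)
print(result.stdout.strip())  # the full path, space included
```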

Comment on lines +46 to +47
```python
for sample, eval_result in zip(samples, eval_results):
    sample["loose_eval"] = eval_result
```

⚠️ Potential issue | 🟠 Major

Add strict=True to zip() for safer pairing.

The zip() call should use strict=True (Python 3.10+) to ensure samples and eval_results have matching lengths and catch potential data misalignment.

Apply this diff:

```diff
-    for sample, eval_result in zip(samples, eval_results):
+    for sample, eval_result in zip(samples, eval_results, strict=True):
         sample["loose_eval"] = eval_result
```

Apply the same fix on line 51:

```diff
-    for sample, eval_result in zip(samples, eval_results):
+    for sample, eval_result in zip(samples, eval_results, strict=True):
         sample["strict_eval"] = eval_result
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
-for sample, eval_result in zip(samples, eval_results):
-    sample["loose_eval"] = eval_result
+for sample, eval_result in zip(samples, eval_results, strict=True):
+    sample["loose_eval"] = eval_result
```
🧰 Tools
🪛 Ruff (0.14.0)

46-46: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/ifeval.py around lines 46-47 (and similarly
at line 51), the zip(samples, eval_results) calls must be made strict to detect
mismatched lengths; update both calls to zip(samples, eval_results, strict=True)
so Python raises an error on length mismatch, ensuring safe pairing and avoiding
silent data misalignment.

Comment on lines +385 to +387
```python
for s, o in zip(all_samples, outputs):
    s["test_case_results"] = o["test_case_results"]
    s["eval_status"] = o["eval_status"]
```

⚠️ Potential issue | 🔴 Critical

Add strict=True to zip() and fix missing eval_status field.

Two issues:

  1. The zip() call should use strict=True to ensure all_samples and outputs have matching lengths.
  2. Line 387 references o["eval_status"], but _evaluate_entry (lines 371-375) returns a dict with only name, subtask, and test_case_results fields—eval_status is missing.

Apply this diff:

```diff
-    for s, o in zip(all_samples, outputs):
+    for s, o in zip(all_samples, outputs, strict=True):
         s["test_case_results"] = o["test_case_results"]
-        s["eval_status"] = o["eval_status"]
```

If eval_status is required, update _evaluate_entry to include it in the returned dict at line 371.


Suggested change

```diff
-for s, o in zip(all_samples, outputs):
-    s["test_case_results"] = o["test_case_results"]
-    s["eval_status"] = o["eval_status"]
+for s, o in zip(all_samples, outputs, strict=True):
+    s["test_case_results"] = o["test_case_results"]
```
🧰 Tools
🪛 Ruff (0.14.0)

385-385: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/ioi.py around lines 385-387, the
zip(all_samples, outputs) call must use zip(..., strict=True) to ensure lengths
match, and the code currently assigns s["eval_status"] = o["eval_status"] but
_evaluate_entry (around line 371) does not return eval_status; update
_evaluate_entry to compute and include an "eval_status" field in its returned
dict (e.g., derive status from test_case_results or existing logic) so the
caller can safely read o["eval_status"].
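If `eval_status` is added, a minimal sketch (the helper name and status strings are assumptions for illustration, not the repository's actual schema) of deriving it from the test case results could look like:

```python
# Hypothetical helper: summarize per-test results into a single status string.
def derive_eval_status(test_case_results):
    if not test_case_results:
        return "no_tests"
    if all(r.get("passed") for r in test_case_results):
        return "passed"
    return "failed"

print(derive_eval_status([{"passed": True}, {"passed": True}]))
print(derive_eval_status([{"passed": True}, {"passed": False}]))
print(derive_eval_status([]))
```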

Kipok added 2 commits October 16, 2025 22:23
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
nemo_skills/evaluation/evaluator/ruler.py (1)

68-80: Non-atomic write still risks data loss.

This is the same issue flagged in the previous review. The file is still overwritten in place without atomic guarantees. If the process crashes, runs out of disk space, or encounters a write error at line 80, the original data will be lost.

As suggested in the previous review, use a temporary file and os.replace() for atomic writes:

```diff
 jsonl_file = cfg.input_file
 with open(jsonl_file, "rt", encoding="utf-8") as fin:
     data = [json.loads(line) for line in fin]
     for sample in tqdm(data):
         parse_result = parse_funcs[eval_config.parse_func](sample["generation"])
         sample["is_correct"] = match_type_funcs[eval_config.match_type](
             sample["generation"], sample["expected_answer"]
         )
         sample["predicted_answer"] = parse_result

+import os
+temp_file = jsonl_file + ".tmp"
-with open(jsonl_file, "wt", encoding="utf-8") as fout:
+with open(temp_file, "wt", encoding="utf-8") as fout:
     for sample in data:
         fout.write(json.dumps(sample) + "\n")
+
+os.replace(temp_file, jsonl_file)
```
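The same pattern in a self-contained sketch (the function and file names are illustrative, not the repository's API); creating the temp file in the destination directory keeps `os.replace` on a single filesystem, where it is atomic:

```python
import json
import os
import tempfile

def atomic_write_jsonl(path, records):
    # Write to a temp file in the same directory, then atomically swap it in.
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wt", encoding="utf-8") as fout:
            for record in records:
                fout.write(json.dumps(record) + "\n")
        os.replace(tmp_path, path)  # original file survives any earlier crash
    except BaseException:
        os.unlink(tmp_path)
        raise

target = os.path.join(tempfile.mkdtemp(), "output.jsonl")
atomic_write_jsonl(target, [{"is_correct": True}, {"is_correct": False}])
with open(target, encoding="utf-8") as fin:
    lines = fin.read().splitlines()
print(lines[0])
```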
🧹 Nitpick comments (5)
nemo_skills/evaluation/evaluator/ruler.py (3)

24-24: Unused logger variable.

The LOG variable is defined but never used in this file.

Consider removing it:

```diff
-LOG = logging.getLogger(get_logger_name(__file__))
```

Or use it to add informative logging, e.g., before processing:

```python
LOG.info(f"Evaluating {jsonl_file} with parse_func={eval_config.parse_func}, match_type={eval_config.match_type}")
```

27-30: Consider adding field validation for parse_func and match_type.

The config allows any string for parse_func and match_type, but only specific values ("default" for parse_func; "all" or "part" for match_type) are valid. If an invalid value is provided, the evaluation will fail with a KeyError at runtime.

Consider adding validation in the config or early in eval_ruler:

```python
@nested_dataclass(kw_only=True)
class RulerEvaluatorConfig(BaseEvaluatorConfig):
    parse_func: str = "default"
    match_type: str

    def __post_init__(self):
        valid_parse_funcs = {"default"}
        valid_match_types = {"all", "part"}
        if self.parse_func not in valid_parse_funcs:
            raise ValueError(f"parse_func must be one of {valid_parse_funcs}, got {self.parse_func}")
        if self.match_type not in valid_match_types:
            raise ValueError(f"match_type must be one of {valid_match_types}, got {self.match_type}")
```

72-76: Consider adding defensive validation for generation and expected_answer fields.

The test data confirms both fields are present in ruler-specific JSONL files (e.g., ./tests/data/score_rm_inputs/output.jsonl). However, there is no upstream validation at the caller level, and livecodebench.py demonstrates that schema validation is used elsewhere in the codebase. Adding a defensive check—similar to how livecodebench.py uses _preprocess_and_validate_file—would prevent crashes if incorrect data is passed to the evaluator:

```diff
 for sample in tqdm(data):
+    if "generation" not in sample or "expected_answer" not in sample:
+        LOG.warning(f"Skipping sample missing required fields: {sample.get('id', 'unknown')}")
+        continue
     parse_result = parse_funcs[eval_config.parse_func](sample["generation"])
```
nemo_skills/utils.py (2)

37-41: Consider edge cases in path handling.

The current implementation has a few potential edge cases:

  1. If "/nemo_skills/" appears multiple times in the path, split("nemo_skills")[1] uses the first occurrence, which might not be the desired behavior.
  2. Windows paths use backslashes, but the function only checks for forward slashes.
  3. The .replace(".py", "") only removes .py extensions but not other Python-related extensions like .pyx or .pyi.

Consider using Path operations for more robust path handling:

```diff
 def get_logger_name(file):
+    file_path = Path(file)
+    file_str = file_path.as_posix()  # Convert to forward slashes
-    if "/nemo_skills/" in file:
-        return "nemo_skills" + file.split("nemo_skills")[1].replace("/", ".").replace(".py", "")
+    if "/nemo_skills/" in file_str:
+        parts = file_str.split("/nemo_skills/", 1)  # Split only once
+        return "nemo_skills." + parts[1].replace("/", ".").removesuffix(file_path.suffix)
     else:
-        return f"[external] {Path(file).stem}"
+        return f"[external] {file_path.stem}"
```

47-64: Consider rate-limiting the warning to avoid log spam.

The warning on lines 59-64 is logged for every sample that doesn't have the thinking end tag. If many samples are missing the end tag (e.g., when accidentally using a non-reasoning model with remove_thinking=True), this could spam the logs significantly.

Consider one of these approaches:

  1. Log the warning once per execution with a counter
  2. Use a rate-limited logger
  3. Add a summary at the end showing how many samples were affected

Example approach 1:

```diff
+_THINKING_END_WARNING_LOGGED = False
+
 def remove_thinking(sample: dict, generation_key: str = "generation", thinking_end: str = "</think>"):
+    global _THINKING_END_WARNING_LOGGED
     # not doing anything if generation isn't a string
     # TODO: should we be more explicit about this?
     if not isinstance(sample[generation_key], str):
         return
     sample[f"_{generation_key}_finished_thinking"] = thinking_end in sample[generation_key]
     if thinking_end in sample[generation_key]:
         sample[f"_full_{generation_key}"] = sample[generation_key]
         sample[generation_key] = sample[generation_key].split(thinking_end)[-1].strip()
     else:
         sample[f"_full_{generation_key}"] = sample[generation_key]
         sample[generation_key] = ""  # no end tag, so setting the generation to empty
-        LOG.warning(
-            "Thinking end tag `%s` not found in generation; setting generation to empty. "
-            "If this happens for every generation, you might have accidentally set ++remove_thinking=True for a "
-            "non-reasoning model or have incorrect end tag.",
-            thinking_end,
-        )
+        if not _THINKING_END_WARNING_LOGGED:
+            LOG.warning(
+                "Thinking end tag `%s` not found in generation; setting generation to empty. "
+                "If this happens for every generation, you might have accidentally set ++remove_thinking=True for a "
+                "non-reasoning model or have incorrect end tag. (Further warnings suppressed)",
+                thinking_end,
+            )
+            _THINKING_END_WARNING_LOGGED = True
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9bc6920 and 1dd43cb.

📒 Files selected for processing (5)
  • nemo_skills/dataset/mmlu-prox/__init__.py (1 hunks)
  • nemo_skills/evaluation/evaluator/ruler.py (3 hunks)
  • nemo_skills/inference/generate.py (11 hunks)
  • nemo_skills/inference/model/parallel_thinking.py (2 hunks)
  • nemo_skills/utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/dataset/mmlu-prox/__init__.py
🧰 Additional context used
🧬 Code graph analysis (3)
nemo_skills/inference/model/parallel_thinking.py (1)
nemo_skills/utils.py (1)
  • remove_thinking (47-64)
nemo_skills/evaluation/evaluator/ruler.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/utils.py (2)
  • get_logger_name (37-41)
  • nested_dataclass (67-100)
nemo_skills/inference/generate.py (3)
nemo_skills/evaluation/evaluator/__init__.py (3)
  • evaluate (112-125)
  • get_evaluator_class (90-100)
  • supports_single_eval (103-109)
nemo_skills/evaluation/evaluator/base.py (2)
  • supports_single_eval (89-91)
  • eval_single (74-87)
nemo_skills/utils.py (1)
  • remove_thinking (47-64)
🪛 Ruff (0.14.0)
nemo_skills/inference/generate.py

644-644: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (11)
nemo_skills/evaluation/evaluator/ruler.py (1)

58-58: LGTM!

The config instantiation correctly uses the new RulerEvaluatorConfig with dictionary unpacking.

nemo_skills/utils.py (1)

44-44: LGTM!

Module-level logger initialization follows standard Python logging patterns.

nemo_skills/inference/generate.py (8)

35-39: LGTM!

Evaluation imports are now centralized and unconditional, which aligns with the new evaluation-into-generation integration approach described in the PR.


180-189: LGTM!

The addition of eval_type and eval_config fields to GenerateSolutionsConfig properly consolidates evaluation configuration into the generation parameters, as described in the PR objectives. The warning comment about remove_thinking is helpful for preventing misuse.


337-343: Evaluator setup looks correct.

The conditional initialization based on supports_single_eval() properly distinguishes between per-datapoint (interleaved) and batch evaluation flows. Note that line 340 mutates self.cfg.eval_config to ensure it's a plain dict—this is likely intentional for compatibility with evaluator classes, but be aware that the original type information is lost.


452-455: LGTM!

The run_batch_evaluation method correctly configures the evaluator to consume the generated output file. Overriding any existing input_file in eval_config is the correct behavior here.


525-545: LGTM with a minor observation.

The postprocess_single_output method correctly handles renaming, stat removal, data merging, and thinking removal. The logic at lines 537-539 properly merges the original data point with the generated output, preserving output values when keys overlap.

Note: This method is declared async but contains no await statements. This is acceptable if it's for future extensibility or to maintain a consistent async API, but consider adding a comment explaining this if it's intentional.


589-595: LGTM!

The evaluate_single_datapoint method correctly times the evaluation and merges results into the data point. The metric name interleaved_eval_single_time_s clearly distinguishes it from batch evaluation timing.


609-613: LGTM!

The pipeline correctly sequences postprocessing before evaluation, which is important since the evaluator expects the postprocessed output format. The conditional check if self.should_run_evaluation and self.evaluator properly handles cases where evaluation is requested but only batch evaluation is supported.


717-737: Verify batch evaluation behavior when no data is processed.

When len(data) == 0 (line 717), generation is skipped but the batch evaluation check on lines 735-736 is still executed. This could attempt to evaluate an empty or non-existent output file.

Please verify this is the intended behavior. If not, consider moving the batch evaluation inside the else block:

```diff
         if len(data) == 0:
             LOG.info("No data to process, skipping generation")
         else:
             data = self.preprocess_data(data)

             self.log_example_prompt(data)

             if self.cfg.dry_run:
                 LOG.info("Exiting without running generation as dry_run flag is set.")
                 return

             if not self.cfg.skip_filled:
                 for output_path in [Path(self.cfg.output_file), Path(self.cfg.output_file + "-async")]:
                     if output_path.exists():
                         output_path.unlink()

             self.wait_for_server()
             asyncio.run(self.async_loop(data))
-
-        if self.should_run_evaluation and self.evaluator is None:
-            self.run_batch_evaluation()
+
+            if self.should_run_evaluation and self.evaluator is None:
+                self.run_batch_evaluation()
+
         self.postprocess()
```
nemo_skills/inference/model/parallel_thinking.py (1)

152-157: LGTM! Correctly updated remove_thinking calls.

The thinking_begin parameter has been properly removed from both remove_thinking calls, aligning with the updated function signature that now relies solely on thinking_end for processing.

Also applies to: 187-192

Comment on lines +644 to +649
```python
for output, data_point in zip(prefilled_outputs, prefilled_data_points):
    await self.postprocess_single_output(output, data_point)

    # evaluate single-data point if requested and evaluator supports that
    if self.should_run_evaluation and self.evaluator:
        output = await self.evaluate_single_datapoint({**data_point, **output})
```

🛠️ Refactor suggestion | 🟠 Major

Add strict=True to zip() for safety.

The static analysis tool correctly flags that zip() on line 644 should use an explicit strict= parameter. Without it, if prefilled_outputs and prefilled_data_points have different lengths, the zip will silently truncate to the shorter length, potentially causing data loss.

```diff
-                for output, data_point in zip(prefilled_outputs, prefilled_data_points):
+                for output, data_point in zip(prefilled_outputs, prefilled_data_points, strict=True):
                     await self.postprocess_single_output(output, data_point)
```

Note: strict=True requires Python 3.10+. If supporting earlier versions, add a length check before the loop.


Suggested change

```diff
-for output, data_point in zip(prefilled_outputs, prefilled_data_points):
-    await self.postprocess_single_output(output, data_point)
-
-    # evaluate single-data point if requested and evaluator supports that
-    if self.should_run_evaluation and self.evaluator:
-        output = await self.evaluate_single_datapoint({**data_point, **output})
+for output, data_point in zip(prefilled_outputs, prefilled_data_points, strict=True):
+    await self.postprocess_single_output(output, data_point)
+
+    # evaluate single-data point if requested and evaluator supports that
+    if self.should_run_evaluation and self.evaluator:
+        output = await self.evaluate_single_datapoint({**data_point, **output})
```
🧰 Tools
🪛 Ruff (0.14.0)

644-644: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
In nemo_skills/inference/generate.py around lines 644 to 649, the loop zipping
prefilled_outputs and prefilled_data_points should not silently truncate if
lengths differ; either add zip(..., strict=True) to enforce equal lengths
(requires Python 3.10+) or, if older Python must be supported, check
len(prefilled_outputs) == len(prefilled_data_points) before the loop and raise
or handle the mismatch; update the code to use one of these approaches so
mismatched lengths are detected rather than silently dropped.


```diff
-remove_thinking: bool = True  # Remove thinking tokens from the solution key
-thinking_begin: str = "<think>"
+remove_thinking: bool = False
```
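A standalone sketch (mirroring the semantics described in the PR, not the actual `nemo_skills.utils` implementation) of why the new `False` default matters for non-reasoning models:

```python
# With the new semantics there is no check for a begin tag: if the end tag is
# absent, the whole generation is treated as unfinished thinking and emptied.
def strip_thinking(generation, thinking_end="</think>"):
    if thinking_end in generation:
        return generation.split(thinking_end)[-1].strip()
    return ""

# Reasoning model: the answer after </think> survives.
print(strip_thinking("<think>some reasoning</think>The answer is 4."))
# Non-reasoning model: no end tag, so the generation is wiped out entirely.
print(repr(strip_thinking("The answer is 4.")))
```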

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify that the breaking default change is documented.

The default value for remove_thinking has changed from True to False. This is a breaking change that may affect existing configurations that relied on the previous default behavior.

Ensure that:

  1. This change is clearly documented in the PR description, migration guide, or release notes
  2. Existing configurations that depend on thinking removal explicitly set remove_thinking=True

Run the following script to find configurations that might be affected:


🏁 Script executed:

```bash
# Description: Find ParallelThinkingConfig usage without explicit remove_thinking setting

# Search for ParallelThinkingConfig instantiations
rg -n -A 10 -B 2 'ParallelThinkingConfig\(' --type=py

# Search for YAML/config files that might configure parallel thinking
fd -e yaml -e yml -e json --exec rg -l 'parallel.*thinking|genselect|gensynthesis' {}
```

Length of output: 1043


I need to examine the configuration instantiation more closely and inspect the configuration files that might be affected.


🏁 Script executed:

#!/bin/bash
# Understand ParallelThinkingConfig instantiation and parameter filtering

# Get full context around line 91 in __init__.py
echo "=== Context around ParallelThinkingConfig instantiation ==="
rg -n -B 15 -A 5 'ParallelThinkingConfig\(' nemo_skills/inference/model/__init__.py | head -40

# Check the qwq.yaml config file
echo -e "\n=== Content of qwq.yaml ==="
cat ./recipes/openmathreasoning/configs/genselect_sdg/qwq.yaml 2>/dev/null || echo "File not found"

# Search for all config files in recipes that might use parallel_thinking or remove_thinking
echo -e "\n=== All YAML config files in recipes ==="
fd -e yaml -e yml ./recipes --exec rg -l 'remove_thinking|parallel' {} \; 2>/dev/null | head -20

Length of output: 5255


🏁 Script executed:

#!/bin/bash
# Understand ParallelThinkingConfig instantiation and parameter filtering

# Get full context around line 91 in __init__.py
echo "=== Context around ParallelThinkingConfig instantiation ==="
rg -n -B 15 -A 5 'ParallelThinkingConfig\(' nemo_skills/inference/model/__init__.py | head -50

# Check the qwq.yaml config file
echo -e "\n=== Content of qwq.yaml ==="
cat ./recipes/openmathreasoning/configs/genselect_sdg/qwq.yaml 2>/dev/null || echo "File not found"

# Search for valid_params definition
echo -e "\n=== Searching for valid_params definition ==="
rg -n -B 5 -A 5 'valid_params' nemo_skills/inference/model/__init__.py

Length of output: 5873



🏁 Script executed:

#!/bin/bash
# Search for explicit remove_thinking settings

echo "=== Searching for remove_thinking settings in code ==="
rg -n 'remove_thinking\s*=' --type=py -B 2 -A 2

echo -e "\n=== Searching for remove_thinking in YAML configs ==="
fd -e yaml -e yml -exec rg -l 'remove_thinking' {} \;

echo -e "\n=== Check ParallelThinkingConfig dataclass definition ==="
rg -n -B 2 -A 15 'class ParallelThinkingConfig' --type=py

echo -e "\n=== Search for where parallel_thinking_config is passed in code ==="
rg -n 'remove_thinking' nemo_skills/inference/model/parallel_thinking.py | head -20

Length of output: 7074


🏁 Script executed:

#!/bin/bash
# Search for explicit remove_thinking settings

echo "=== Searching for remove_thinking settings in code ==="
rg -n 'remove_thinking\s*=' --type=py -B 2 -A 2

echo -e "\n=== Searching for remove_thinking in YAML configs ==="
fd -e yaml -e yml -exec rg -l 'remove_thinking' {} \;

echo -e "\n=== Check ParallelThinkingConfig dataclass definition ==="
rg -n -B 2 -A 20 'class ParallelThinkingConfig' --type=py

echo -e "\n=== Look at actual usage of remove_thinking in parallel_thinking.py ==="
rg -n 'self.cfg.remove_thinking' nemo_skills/inference/model/parallel_thinking.py

Length of output: 7356


🏁 Script executed:

#!/bin/bash
# Verify the remove_thinking function behavior to confirm logic bug

echo "=== Remove thinking function implementation ==="
rg -n -A 25 'def remove_thinking' nemo_skills/utils.py

echo -e "\n=== Context around line 226 in parallel_thinking.py ==="
sed -n '220,235p' nemo_skills/inference/model/parallel_thinking.py

Length of output: 2153


Fix critical logic error on line 226 that filters out all solutions when remove_thinking=True.

After remove_thinking() is called, the thinking_end tag should never remain in the generation. The function either splits it out (line 55 in utils.py) or sets generation to empty (line 58). Therefore, the condition on line 226 will always be true, causing all solutions to be filtered when remove_thinking=True.

The intent appears to be filtering unfinished solutions. Use the _{solution_key}_finished_thinking flag set by remove_thinking() instead:

# filtering unfinished solutions  
if self.cfg.remove_thinking and not solution.get(f"_{self.cfg.solution_key}_finished_thinking", True):
    continue

Regarding the default value change on line 54: No breaking change impact. All 9 existing uses explicitly override with ++remove_thinking=True via Hydra configuration.
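The flag-based filter relies on how `remove_thinking` rewrites each sample. A hypothetical sketch of the behavior the review describes (field names mirror the discussion, not necessarily the actual nemo_skills implementation):

```python
# Hypothetical sketch of remove_thinking's two code paths; the real function
# lives in nemo_skills/utils.py and may differ in details.
def remove_thinking(sample: dict, key: str = "generation", thinking_end: str = "</think>") -> None:
    text = sample[key]
    sample[f"_full_{key}"] = text  # preserve the original generation
    if thinking_end in text:
        # thinking finished: keep only the text after the end tag
        sample[key] = text.split(thinking_end, 1)[1]
        sample[f"_{key}_finished_thinking"] = True
    else:
        # no end tag: the whole generation is unfinished thinking, so drop it
        sample[key] = ""
        sample[f"_{key}_finished_thinking"] = False

done = {"generation": "<think>steps</think> answer: 42"}
remove_thinking(done)
unfinished = {"generation": "<think>still going"}
remove_thinking(unfinished)

# In both cases thinking_end is gone afterwards, so `thinking_end not in generation`
# is always True -- the finished-thinking flag is the only reliable filter.
assert "</think>" not in done["generation"]
assert "</think>" not in unfinished["generation"]
assert done["_generation_finished_thinking"] is True
assert unfinished["_generation_finished_thinking"] is False
```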

Comment on lines 225 to 227
    # filtering unfinished solutions
    if self.cfg.remove_thinking and self.cfg.thinking_end not in solution[self.cfg.solution_key]:
        continue
Contributor

⚠️ Potential issue | 🔴 Critical

Critical: Filtering logic will incorrectly remove all solutions.

After remove_thinking is called (lines 152-157 in generate_solutions and lines 187-192 in _load_solutions), the thinking_end marker is never present in the processed generation because:

  • If thinking_end was present: remove_thinking splits on it and takes the text after
  • If thinking_end was NOT present: remove_thinking sets generation to an empty string

Therefore, thinking_end not in solution[self.cfg.solution_key] will always be True after remove_thinking is applied, causing all solutions to be filtered out when remove_thinking=True.

To filter incomplete solutions correctly, check the flag set by remove_thinking:

         # Remove unfinished solutions
         filtered_solutions = []
         for solution in solutions:
-            # filtering unfinished solutions
-            if self.cfg.remove_thinking and self.cfg.thinking_end not in solution[self.cfg.solution_key]:
+            # Filter out solutions where thinking was incomplete (thinking_end not found)
+            if self.cfg.remove_thinking and not solution["output_dict"].get(f"_{self.cfg.solution_key}_finished_thinking", True):
                 continue
             else:
                 filtered_solutions.append(solution)

Alternatively, check if the generation is empty:

-            if self.cfg.remove_thinking and self.cfg.thinking_end not in solution[self.cfg.solution_key]:
+            if self.cfg.remove_thinking and not solution[self.cfg.solution_key]:
                 continue

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In nemo_skills/inference/model/parallel_thinking.py around lines 225 to 227, the
filter condition uses "thinking_end not in solution[self.cfg.solution_key]"
which will always be true after remove_thinking has run (it either strips the
marker or leaves an empty string), so all solutions are dropped; instead, change
the logic to detect incomplete solutions using the flag set by remove_thinking
(e.g., a boolean returned or stored when stripping thinking) or simply check for
an empty generation string (e.g., skip if solution[self.cfg.solution_key] == ""
or length == 0), or, if you need to inspect the original text, use the
pre-processed/original generation field to test for the presence of thinking_end
before remove_thinking was applied.

Kipok added 6 commits October 17, 2025 10:00
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/inference/model/parallel_thinking.py (1)

370-398: Critical: Missing early return causes crash when all solutions are filtered out.

When solutions is empty (e.g., all filtered by solution_length_cap), lines 371-386 create a sensible default output_dict. However, execution continues to Step 2 (lines 388-398), which overwrites output_dict and attempts to run genselect/gensynthesis on an empty solutions list, causing crashes:

  • GenSelect (line 326): local_random.randint(0, -1) raises ValueError
  • GenSynthesis (line 349): local_random.choice([]) raises IndexError

Add an early return after creating the default output:

         if not solutions:
             output_dict = {
                 self.cfg.solution_key: "",
                 "solution_list": [],
                 f"{self.cfg.mode}_comparison": "",
                 f"{self.cfg.mode}_num_generated_tokens": 0,
                 "total_solution_generated_tokens": total_num_generated_tokens,
                 "num_generated_tokens": total_num_generated_tokens,  # No additional tokens for genselect
                 "num_best_solution_generated_tokens": 0,
             }
 
             # Required by inference/generate.py
             output_dict["generation"] = ""
             if self.cfg.count_prompt_tokens:
                 # The input doesn't make sense for such cases where there are no solutions
                 output_dict["num_input_tokens"] = None
+            return output_dict
 
         # Step 2: Run GenSelect/GenSynthesis
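The two crashes predicted above are standard-library behavior and can be confirmed in isolation:

```python
import random

local_random = random.Random(0)
solutions = []  # everything was filtered out upstream

# GenSelect path: randint over an empty index range
try:
    local_random.randint(0, len(solutions) - 1)  # randint(0, -1)
except ValueError as exc:
    print(f"genselect crash: {exc}")

# GenSynthesis path: choosing from an empty sequence
try:
    local_random.choice(solutions)
except IndexError as exc:
    print(f"gensynthesis crash: {exc}")
```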
♻️ Duplicate comments (1)
nemo_skills/inference/generate.py (1)

647-652: Add strict=True to zip() for safety.

This issue has already been raised in a previous review. The zip() call should use strict=True to prevent silent data loss if the lists have mismatched lengths.

🧹 Nitpick comments (6)
tests/slurm-tests/run_all.sh (1)

3-14: Consider adding a usage comment for script arguments.

While the code is correct and aligns well with the PR's parameter consolidation, adding a brief comment at the top explaining the positional arguments would aid future maintainers. For example:

 #!/bin/bash
 
+# Usage: ./run_all.sh <CLUSTER> [RUN_NAME]
+# CLUSTER: target cluster (required)
+# RUN_NAME: test run identifier; defaults to YYYY-MM-DD (optional)
+
 CLUSTER=$1
 RUN_NAME=${2:-$(date +%Y-%m-%d)}
nemo_skills/inference/eval/swebench.py (1)

150-152: Consider the design of rejected configuration fields.

These fields are added to the config but immediately rejected in __init__ (lines 176-179). This creates a potentially confusing API where users might attempt to set eval_type only to receive an error.

If these fields are present solely for interface consistency with the base GenerationTask class, consider adding a comment explaining this design decision. Alternatively, if they serve no purpose, they could be omitted entirely from this configuration class.

nemo_skills/evaluation/evaluator/ruler.py (1)

69-83: Atomic write implementation correctly addresses the previous data loss concern.

The new implementation using a temporary file and os.replace() prevents data corruption if the process is interrupted. This resolves the critical issue flagged in the previous review.

However, there are two optional improvements to consider:

  1. Processing loop placement: Lines 72-77 process data inside the file reading context manager, but the file is no longer needed after line 71. Moving the loop outside would be clearer.

  2. Temp file cleanup: If an exception occurs during write or replace (lines 79-83), the temp file will be left behind. Adding a try-finally block would ensure cleanup.

Optional refactor to address both points:

 jsonl_file = cfg.input_file
 with open(jsonl_file, "rt", encoding="utf-8") as fin:
     data = [json.loads(line) for line in fin]
-    for sample in tqdm(data):
-        parse_result = parse_funcs[eval_config.parse_func](sample["generation"])
-        sample["is_correct"] = match_type_funcs[eval_config.match_type](
-            sample["generation"], sample["expected_answer"]
-        )
-        sample["predicted_answer"] = parse_result
 
+for sample in tqdm(data):
+    parse_result = parse_funcs[eval_config.parse_func](sample["generation"])
+    sample["is_correct"] = match_type_funcs[eval_config.match_type](
+        sample["generation"], sample["expected_answer"]
+    )
+    sample["predicted_answer"] = parse_result
+
+temp_file = jsonl_file + "-tmp"
+try:
-    with open(jsonl_file + "-tmp", "wt", encoding="utf-8") as fout:
+    with open(temp_file, "wt", encoding="utf-8") as fout:
         for sample in data:
             fout.write(json.dumps(sample) + "\n")
-
-os.replace(jsonl_file + "-tmp", jsonl_file)
+    os.replace(temp_file, jsonl_file)
+except Exception:
+    if os.path.exists(temp_file):
+        os.remove(temp_file)
+    raise
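Putting both optional improvements together, a self-contained sketch of the temp-file pattern (the function name and cleanup strategy are illustrative, not the actual ruler.py code):

```python
import json
import os
import tempfile

def atomic_rewrite_jsonl(path: str, samples: list[dict]) -> None:
    """Rewrite a JSONL file via a temp file so a crash never truncates the original."""
    temp_path = path + "-tmp"
    try:
        with open(temp_path, "wt", encoding="utf-8") as fout:
            for sample in samples:
                fout.write(json.dumps(sample) + "\n")
        # os.replace is atomic when source and destination are on the same filesystem
        os.replace(temp_path, path)
    except Exception:
        # ensure the temp file is not left behind on failure
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise

# usage on a scratch file
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "data.jsonl")
    atomic_rewrite_jsonl(path, [{"is_correct": True}, {"is_correct": False}])
    with open(path, encoding="utf-8") as fin:
        rows = [json.loads(line) for line in fin]
    assert [row["is_correct"] for row in rows] == [True, False]
```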
nemo_skills/pipeline/utils/eval.py (1)

168-171: Environment pollution TODO is a valid concern.

Setting os.environ["NEMO_SKILLS_PRIVILEGED_DOCKER"] at line 170 persists across function calls and could affect other benchmarks processed in the same run. Consider scoping this to the specific swe-bench job or resetting it after use.

Would you like me to open an issue to track proper isolation of this environment variable, or suggest a pattern using context managers for temporary environment changes?
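One possible context-manager pattern for scoping the change (the variable name is taken from the review; the helper itself is hypothetical, not existing nemo_skills code):

```python
import os
from contextlib import contextmanager

@contextmanager
def temporary_env(name: str, value: str):
    """Set an environment variable for the duration of the block, then restore it."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous

os.environ.pop("NEMO_SKILLS_PRIVILEGED_DOCKER", None)  # clean slate for the demo
with temporary_env("NEMO_SKILLS_PRIVILEGED_DOCKER", "1"):
    assert os.environ["NEMO_SKILLS_PRIVILEGED_DOCKER"] == "1"
# outside the block the variable is restored to its prior (unset) state
assert "NEMO_SKILLS_PRIVILEGED_DOCKER" not in os.environ
```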

nemo_skills/inference/generate.py (1)

528-548: Consider making postprocess_single_output synchronous.

The method is declared async but performs no await operations. Unless async is needed for future extensibility, consider making it a regular synchronous method to simplify the call sites.

Apply this diff if you want to make it synchronous:

-    async def postprocess_single_output(self, output, original_data_point):
+    def postprocess_single_output(self, output, original_data_point):

Then update call sites at lines 612, 648 to remove await:

-        await self.postprocess_single_output(output, data_point)
+        self.postprocess_single_output(output, data_point)
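The cost of keeping the method async shows up purely at the call sites; a minimal comparison (toy functions, not the real postprocess_single_output):

```python
import asyncio

# An async def with no awaits still works, but every caller must await it,
# pulling event-loop machinery into pure CPU-bound code.
async def postprocess_async(output: dict, data_point: dict) -> None:
    output.setdefault("generation", "")

def postprocess_sync(output: dict, data_point: dict) -> None:
    output.setdefault("generation", "")

out = {}
asyncio.run(postprocess_async(out, {}))  # needs an event loop for no real reason
postprocess_sync(out, {})                # plain call, same effect
assert out == {"generation": ""}
```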
nemo_skills/inference/model/parallel_thinking.py (1)

74-74: Remove unused configuration field.

The filter_incomplete_solutions field is no longer used anywhere in the code. According to the enriched summary, the filtering logic that used this field was intentionally removed. The config field should be cleaned up.

-    filter_incomplete_solutions: bool = True  # Filter out incomplete solutions
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1dd43cb and 166a392.

📒 Files selected for processing (8)
  • nemo_skills/evaluation/evaluator/ruler.py (3 hunks)
  • nemo_skills/inference/eval/swebench.py (4 hunks)
  • nemo_skills/inference/generate.py (11 hunks)
  • nemo_skills/inference/model/parallel_thinking.py (1 hunks)
  • nemo_skills/pipeline/generate.py (0 hunks)
  • nemo_skills/pipeline/utils/eval.py (5 hunks)
  • nemo_skills/utils.py (1 hunks)
  • tests/slurm-tests/run_all.sh (1 hunks)
💤 Files with no reviewable changes (1)
  • nemo_skills/pipeline/generate.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/utils.py
🧰 Additional context used
🧬 Code graph analysis (5)
nemo_skills/inference/model/parallel_thinking.py (1)
nemo_skills/utils.py (1)
  • remove_thinking (48-65)
nemo_skills/inference/generate.py (3)
nemo_skills/evaluation/evaluator/__init__.py (3)
  • evaluate (112-125)
  • get_evaluator_class (90-100)
  • supports_single_eval (103-109)
nemo_skills/evaluation/evaluator/base.py (2)
  • supports_single_eval (89-91)
  • eval_single (74-87)
nemo_skills/utils.py (1)
  • remove_thinking (48-65)
nemo_skills/inference/eval/swebench.py (2)
nemo_skills/utils.py (4)
  • get_help_message (340-360)
  • get_logger_name (38-42)
  • nested_dataclass (68-101)
  • setup_logging (104-139)
nemo_skills/inference/generate.py (1)
  • evaluate_single_datapoint (592-598)
nemo_skills/evaluation/evaluator/ruler.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/utils.py (2)
  • get_logger_name (38-42)
  • nested_dataclass (68-101)
nemo_skills/pipeline/utils/eval.py (1)
nemo_skills/inference/generate.py (1)
  • get_generation_default_args (247-255)
🪛 Ruff (0.14.0)
nemo_skills/inference/generate.py

647-647: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

nemo_skills/inference/eval/swebench.py

177-179: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (24)
tests/slurm-tests/run_all.sh (2)

3-4: Good parameter consolidation with sensible default.

The refactor to make the date/run identifier a configurable second argument is clean and backward compatible. Callers can now override the date if needed (e.g., for re-running failed jobs with the same date), while defaulting to today's date for common use.


6-14: Consistent application of RUN_NAME across all test invocations.

All five test jobs properly use $RUN_NAME in both --workspace and --expname_prefix paths, ensuring a coherent naming scheme across parallel jobs.

nemo_skills/inference/eval/swebench.py (3)

33-38: LGTM: Clean import organization.

The multi-line import format improves readability with no functional change.


176-179: LGTM: Clear validation prevents misconfiguration.

The guard correctly prevents users from attempting to configure evaluation via eval_type, since SWE-bench handles evaluation automatically through its harness. The error message clearly guides users to the correct behavior.


196-198: LGTM: Method rename aligns with standardized evaluation interface.

The rename from apply_evaluation_hook to evaluate_single_datapoint matches the base class pattern (as seen in generate.py). The pass-through implementation is appropriate here since SWE-bench performs evaluation inline during process_single_datapoint (lines 495-564).

nemo_skills/evaluation/evaluator/ruler.py (3)

17-17: LGTM! Imports align with the refactoring.

The new imports (os, BaseEvaluatorConfig) support the atomic write implementation and the updated class hierarchy.

Also applies to: 22-23


29-29: LGTM! Class hierarchy updated correctly.

Inheriting from BaseEvaluatorConfig provides access to input_file, data_dir, and split fields, which aligns with the PR's goal of fusing top-level evaluation config with inner eval_config.


59-59: LGTM! Config construction updated per PR design.

The change from **cfg.eval_config to **cfg reflects the flattened configuration structure where evaluation parameters are now at the top level.

nemo_skills/pipeline/utils/eval.py (4)

129-133: Backward compatibility for deprecated EVAL_ARGS looks good.

The logic correctly prepends legacy EVAL_ARGS to generation_args while injecting the new eval_config.split format. This provides a clean migration path.


223-223: eval_config.data_dir injection is correct.

The conditional addition of eval_config.data_dir to generation_args aligns with the PR's goal to consolidate evaluation configuration into generation arguments.


243-243: eval_config.data_dir injection is correct.

Consistent with line 223, this properly injects eval_config.data_dir into generation_args for single-benchmark paths.


424-428: Argument concatenation order is correct.

The ordering (generation defaults → benchmark args → job args) properly supports override semantics where job-specific arguments can override benchmark and default settings. This aligns well with Hydra's configuration override behavior.
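Last-wins override semantics can be modeled with a simple dict merge (a toy stand-in for Hydra's actual override resolution, not the real pipeline code):

```python
# Toy model of the defaults -> benchmark -> job override chain: each group is a
# list of "++key=value" strings, and later groups override earlier ones.
def resolve_overrides(*arg_groups: list[str]) -> dict[str, str]:
    resolved: dict[str, str] = {}
    for group in arg_groups:
        for arg in group:
            key, _, value = arg.lstrip("+").partition("=")
            resolved[key] = value  # later groups win
    return resolved

generation_defaults = ["++temperature=0.0", "++remove_thinking=False"]
benchmark_args = ["++remove_thinking=True"]
job_args = ["++temperature=0.6"]

final = resolve_overrides(generation_defaults, benchmark_args, job_args)
assert final == {"temperature": "0.6", "remove_thinking": "True"}
```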

nemo_skills/inference/generate.py (11)

35-39: LGTM!

The evaluation imports are correctly added to support the new integrated evaluation flow.


87-87: LGTM!

Updated docstring appropriately describes the config class.


180-182: LGTM!

The comment appropriately warns users about the remove_thinking behavior change, which is critical for preventing empty generations on non-reasoning models.


187-189: LGTM!

The evaluation configuration fields are properly integrated into the generation config, enabling opt-in evaluation.


337-343: LGTM!

The evaluator initialization logic correctly distinguishes between per-datapoint and batch evaluation modes. The code properly checks if the evaluator supports single evaluation before instantiating it.


401-402: LGTM!

The comment clarifies the intent of the inference override config pattern.


455-458: LGTM!

The batch evaluation method is straightforward and correctly configured. It properly uses the output file as input for evaluation.


524-526: LGTM!

The simplified dump_outputs method now focuses solely on writing, with preprocessing moved to postprocess_single_output.


592-598: LGTM!

The per-datapoint evaluation method correctly times and executes single evaluations, providing performance insights.


612-616: LGTM!

The integration of postprocessing and per-datapoint evaluation is correct. Postprocessing runs before evaluation as intended, and the conditional gating ensures evaluation only runs when appropriate.


720-740: LGTM!

The refactored control flow correctly handles the two evaluation modes:

  • Per-datapoint evaluation when evaluator is available (interleaved with generation)
  • Batch evaluation when eval_type is set but evaluator is None (after generation completes)

The logic is clear and aligns with the PR objectives.
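The two modes can be sketched as a small dispatch (illustrative names only, not the actual GenerationTask code):

```python
# Toy dispatch mirroring the two evaluation modes described above.
class Task:
    def __init__(self, eval_type=None, evaluator=None):
        self.eval_type = eval_type
        self.evaluator = evaluator
        self.log = []

    def run(self, data_points):
        for dp in data_points:
            self.log.append(("generate", dp))
            if self.eval_type and self.evaluator:
                self.log.append(("eval_single", dp))  # interleaved with generation
        if self.eval_type and self.evaluator is None:
            self.log.append(("eval_batch", len(data_points)))  # after generation completes

interleaved = Task(eval_type="math", evaluator=object())
interleaved.run([1, 2])
batch = Task(eval_type="math", evaluator=None)
batch.run([1, 2])
assert ("eval_single", 1) in interleaved.log
assert ("eval_batch", 2) in batch.log
```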

nemo_skills/inference/model/parallel_thinking.py (1)

153-158: LGTM: remove_thinking integration is correct.

The remove_thinking calls properly pass generation_key and thinking_end parameters, and correctly omit the removed thinking_begin parameter. The function modifies the dict in-place to set the _{generation_key}_finished_thinking flag and preserve the full generation in _full_{generation_key}, which aligns with the PR's changes to thinking-token handling.

Also applies to: 195-200

Kipok added 2 commits October 21, 2025 10:51
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
nemo_skills/inference/eval/swebench.py (2)

150-152: Clarify or remove unused eval_type field.

The eval_type field definition suggests it's configurable, but the validation in __init__ rejects any non-None value. This creates API confusion for users who might attempt to configure evaluation.

Consider one of the following approaches:

  1. Remove the eval_type and eval_config fields entirely from SweBenchGenerationConfig if they're not applicable to SWE-bench.
  2. Add a comment in the field definition clarifying that these fields exist for interface consistency but are not used for SWE-bench.
  3. Move these fields to a base config class and have SWE-bench override/validate them appropriately.

Apply this diff to remove the unused fields:

-    # Evaluation setup if requested. If eval_type is set to None, evaluation is skipped
-    eval_type: str | None = None  # "lean4-proof", "math", etc.
-    eval_config: dict = field(default_factory=dict)  # Config for the evaluator
-

And remove the corresponding validation:

-        if self.cfg.eval_type is not None:
-            raise ValueError(
-                "SWE-bench generation task does not support eval_type parameter. Evaluation is done automatically."
-            )
-

Also applies to: 176-179


181-182: Remove unused evaluation flags.

The should_run_evaluation and evaluator attributes are initialized but never used in the code. Since SWE-bench performs evaluation inline during generation (in process_single_datapoint), these flags are unnecessary.

Apply this diff to remove the unused attributes:

-        self.should_run_evaluation = False
-        self.evaluator = None
-
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 166a392 and 3676d7f.

📒 Files selected for processing (4)
  • docs/basics/index.md (1 hunks)
  • nemo_skills/dataset/simpleqa/__init__.py (1 hunks)
  • nemo_skills/evaluation/evaluator/mcq.py (2 hunks)
  • nemo_skills/inference/eval/swebench.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • nemo_skills/dataset/simpleqa/__init__.py
  • nemo_skills/evaluation/evaluator/mcq.py
  • docs/basics/index.md
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/inference/eval/swebench.py (2)
nemo_skills/utils.py (4)
  • get_help_message (340-360)
  • get_logger_name (38-42)
  • nested_dataclass (68-101)
  • setup_logging (104-139)
nemo_skills/inference/generate.py (1)
  • evaluate_single_datapoint (592-598)
🪛 Ruff (0.14.1)
nemo_skills/inference/eval/swebench.py

177-179: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (2)
nemo_skills/inference/eval/swebench.py (2)

33-38: LGTM!

The import consolidation improves readability and follows Python conventions.


199-201: LGTM!

The method rename to evaluate_single_datapoint aligns with the base class interface. The no-op implementation is appropriate since SWE-bench performs evaluation inline during process_single_datapoint.

Kipok added 15 commits October 23, 2025 13:36
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
@Kipok Kipok merged commit 7f4687a into main Oct 24, 2025
5 checks passed
@Kipok Kipok deleted the igitman/reasoning-on-arg branch October 24, 2025 05:18
@fayejf fayejf mentioned this pull request Oct 24, 2025
dgtm777 pushed a commit that referenced this pull request Oct 29, 2025
…ions when remove_thinking=True (#958)

Signed-off-by: Igor Gitman <igitman@nvidia.com>
@coderabbitai coderabbitai bot mentioned this pull request Jan 15, 2026
This was referenced Feb 26, 2026
@coderabbitai coderabbitai bot mentioned this pull request Mar 5, 2026
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
…ions when remove_thinking=True (#958)

Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>