
Moving evaluation inside generation class and enforcing empty generations when remove_thinking=True#958

Merged
Kipok merged 50 commits into main from igitman/reasoning-on-arg
Oct 24, 2025
Conversation

@Kipok (Collaborator) commented Oct 17, 2025

  1. evaluate_results.py is removed; its logic is moved inside the generation class
  2. remove_thinking=True no longer checks for the beginning tag (many models don't set it, so it's not reliable). This means that if the end-of-thinking tag isn't present, the generation is set to empty. Setting remove_thinking=True for non-reasoning models would therefore be disastrous, so the default is changed to False. It now has to be set explicitly (or a reasoning parser used on the server side) for all reasoning models whose thinking shouldn't be considered in evaluation
  3. EVAL_ARGS are removed from all inits and moved into generation args
  4. extra_eval_args / eval_args parameters are removed (now specified as normal generation args)
  5. remove unused batch_evaluate_results and the Lean statement evaluator (addresses Remove the Lean4 Statement Check Logic and Lean Proof logic from batch_evaluate_results #840)
  6. remove ++eval_type=no-op (the same behavior is now achieved by simply not specifying eval_type)
  7. the top-level evaluation config is fused with the inner "eval_config", so all parameters are available inside evaluator classes, including input_files, data_dir and split (fyi @SeanNaren)
  8. input_files is replaced with a single input_file everywhere in evaluation. Evaluation is now assumed to run on a single input only (which had effectively been the case for a while anyway)
  9. rename apply_evaluation_hook -> evaluate_single_datapoint
  10. add a new postprocess_single_output that holds the logic previously in dump_outputs for renaming/removing certain fields. It is now called before interleaved evaluation (resolves eval_single / eval_full should support a few more features like pre/postprocessing and remove-thinking #841)
  11. parse_eval_args logic is fully removed as it's no longer needed (batch and interleaved evals are handled in the same place and with the same parameters)
  12. remove_thinking now saves the full generation into _full_{generation_key}, which avoids overriding _full_generation in judge jobs
  13. remove_thinking is renamed to parse_reasoning and thinking_end is renamed to end_reasoning_string to make the usage clearer
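Taken together, items 2, 12, and 13 describe roughly the following behavior for the renamed utility. This is a minimal sketch with simplified names and signatures; the actual implementation in nemo_skills/utils.py differs (e.g., it also logs a warning when the end tag is missing):

```python
def parse_reasoning(data_point, generation_key="generation", end_reasoning_string="</think>"):
    """Strip the reasoning prefix from a generation.

    The full text is preserved under _full_{generation_key} (item 12) and,
    since there is no begin-tag check anymore (item 2), a missing end marker
    is treated as unfinished reasoning and the generation is set to empty.
    """
    full = data_point[generation_key]
    # Preserve the original generation so judge jobs can still access it
    data_point[f"_full_{generation_key}"] = full
    marker_pos = full.find(end_reasoning_string)
    if marker_pos == -1:
        # No end-of-reasoning marker: enforce an empty generation
        data_point[generation_key] = ""
    else:
        data_point[generation_key] = full[marker_pos + len(end_reasoning_string):].lstrip()
    return data_point

print(parse_reasoning({"generation": "<think>steps</think> final answer"})["generation"])  # → final answer
```

This is also why the default changed: a non-reasoning model never emits the end marker, so every generation would be blanked out.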

Summary by CodeRabbit

  • New Features

    • Per-datapoint evaluation can run during generation.
  • Documentation

    • CLI examples modernized to use unified ++-prefixed args and many examples add ++remove_thinking=True; guidance warns about reasoning-enabled evaluation.
  • Refactor

    • Evaluation/config flags consolidated into generation arguments (e.g., ++eval_type moved into generation args); command syntax simplified.
  • Behavior change

    • Thinking-token handling standardized and defaults adjusted (remove_thinking behavior changed).

Kipok added 14 commits October 16, 2025 15:11
Signed-off-by: Igor Gitman <igitman@nvidia.com>
coderabbitai bot (Contributor) commented Oct 17, 2025

Walkthrough

Consolidates evaluation/config flags into generation arguments, removes the standalone evaluate_results module, introduces a BaseEvaluatorConfig and single-file evaluator workflows, centralizes remove_thinking behavior (dropping thinking_begin), updates generation to support per-datapoint evaluation hooks, and updates many docs and examples to use ++remove_thinking and new ++eval_config/++eval_type CLI syntax.

Changes

Cohort / File(s) / Summary

  • Docs: examples & flags
    Files: docs/... (multiple files, e.g., docs/evaluation/*, docs/basics/index.md, docs/agentic_inference/parallel_thinking.md, docs/pipelines/generation.md, docs/releases/*, docs/tutorials/posts/*, recipes/*)
    Added ++remove_thinking=True in many examples; replaced --eval_args="..." / --extra_eval_args="..." wrappers with direct ++eval_type / ++eval_config.* flags; adjusted CLI examples and explanatory text.
  • Datasets: merge EVAL_ARGS → GENERATION_ARGS
    Files: nemo_skills/dataset/*/__init__.py, .../prepare.py (many modules)
    Removed module-level EVAL_ARGS (and some PROMPT_CONFIG) and appended equivalent ++eval_type=... / eval-config fragments into GENERATION_ARGS; special-cased swe-bench changed to GENERATION_MODULE.
  • Evaluation: remove module & refactor evaluators
    Files: nemo_skills/evaluation/evaluate_results.py, nemo_skills/evaluation/evaluator/*, nemo_skills/evaluation/math_grader.py
    Deleted evaluate_results.py; removed helpers (unroll_files, evaluate_result); added BaseEvaluatorConfig; refactored many evaluators to single-input-file processing and new function/class signatures; changed the top-level evaluate API to evaluate(eval_type, eval_config).
  • Evaluator configs & APIs
    Files: nemo_skills/evaluation/evaluator/*.py (arena, base, bfcl, code, ifbench, ifeval, ioi, livecodebench, math, mcq, mrcr, ojbench, ruler, scicode, ...)
    Introduced/used BaseEvaluatorConfig; evaluator config classes now inherit from it; eval_full / async signatures simplified to operate on input_file/single-file workflows; small config fields added/changed (e.g., num_retries, overwrite).
  • Sandbox / code-exec
    Files: nemo_skills/code_execution/sandbox.py
    Removed batch_evaluate_results, unroll_files usage, and related proof-eval helpers; dropped tqdm/progress and some imports.
  • Generation & inference: thinking & eval hooks
    Files: nemo_skills/inference/*, nemo_skills/inference/generate.py, nemo_skills/inference/model/parallel_thinking.py
    Removed thinking_begin from configs and call sites; changed the ParallelThinkingConfig default for remove_thinking; added per-datapoint hooks (postprocess_single_output, evaluate_single_datapoint) and a batch fallback (run_batch_evaluation); added wiring to initialize and call evaluators.
  • Pipeline & CLI: arg consolidation
    Files: nemo_skills/pipeline/*.py, nemo_skills/pipeline/utils/*
    Removed eval_args / extra_eval_args options and parsing helpers; consolidated evaluation config into generation_args; simplified get_generation_cmd to drop the eval-specific branch and append extra arguments only when present.
  • Utils: remove_thinking & logging
    Files: nemo_skills/utils.py
    Rewrote remove_thinking to drop thinking_begin, use a dynamic _full_{generation_key}, and warn when the end tag is missing; added get_logger_name() and a module logger.
  • Inference eval modules: config changes
    Files: nemo_skills/inference/eval/{bfcl,scicode,swebench}.py
    Removed dataclass remove_thinking/thinking_begin fields in places; adjusted calls to remove_thinking; added eval-related config fields and renamed hooks (e.g., evaluate_single_datapoint).
  • Tests & scripts
    Files: tests/*, tests/slurm-tests/*, tests/gpu-tests/*, tests/test_configs.py, tests/slurm-tests/run_all.sh, recipes/*
    Updated tests and slurm scripts to the new CLI syntax (++eval_type, ++remove_thinking=True); removed a test referencing evaluate_results; added RUN_NAME and sequential execution in run_all.sh; minor test adjustments.
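The evaluator-side consolidation described above can be sketched as follows. Class, field, and registry names here are assumptions inferred from the summary, not the actual nemo_skills implementation:

```python
from dataclasses import dataclass


@dataclass
class BaseEvaluatorConfig:
    # A single input_file replaces the old input_files list everywhere
    input_file: str = ""


class BaseEvaluator:
    CONFIG_CLASS = BaseEvaluatorConfig

    def __init__(self, eval_config: dict):
        # The top-level evaluation config is fused with the inner eval_config,
        # so fields like input_file are available directly on the evaluator
        self.config = self.CONFIG_CLASS(**eval_config)

    def supports_single_eval(self) -> bool:
        # Overridden by evaluators that can score one datapoint at a time,
        # enabling evaluation interleaved with generation
        return False

    def eval_full(self):
        raise NotImplementedError


class MathEvaluator(BaseEvaluator):
    def eval_full(self):
        return f"evaluated {self.config.input_file}"


EVALUATOR_MAP = {"math": MathEvaluator}  # eval_type -> evaluator class


def evaluate(eval_type, eval_config):
    # ++eval_type=no-op is gone: omitting eval_type simply skips evaluation
    if eval_type is None:
        return None
    return EVALUATOR_MAP[eval_type](eval_config).eval_full()


print(evaluate("math", {"input_file": "output.jsonl"}))  # → evaluated output.jsonl
```

The point of the shared base config is that every evaluator sees the same fused parameters, so per-evaluator option plumbing disappears.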

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant Pipeline
    participant Generator
    participant Evaluator
    participant Sandbox

    User->>CLI: run generate with ++eval_type / ++eval_config / ++remove_thinking
    CLI->>Pipeline: parse generation_args (includes eval config)
    Pipeline->>Generator: start generation task (per-datapoint)
    alt per-datapoint evaluation enabled
        Generator->>Sandbox: generate datapoint
        Generator->>Generator: postprocess_single_output(datapoint)
        Generator->>Evaluator: evaluate_single_datapoint(datapoint)
        Evaluator->>Sandbox: run evaluation (single-file or per-datapoint)
        Evaluator-->>Generator: metrics
        Generator->>Pipeline: write augmented output
    else batch-only evaluation
        loop generate datapoints
            Generator->>Sandbox: generate
            Generator->>Generator: postprocess_single_output()
            Generator->>Pipeline: append output
        end
        Pipeline->>Evaluator: run_batch_evaluation(output_file)
        Evaluator->>Sandbox: run evaluation on file
        Evaluator-->>Pipeline: results
    end
    Pipeline-->>User: finished
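The branching in the diagram above can be simulated with stub components. Everything here, including StubEvaluator, is hypothetical and only illustrates the control flow:

```python
import asyncio


class StubEvaluator:
    # Flip to False to exercise the batch-only branch of the diagram
    single_eval = True

    async def eval_single(self, dp):
        return {**dp, "is_correct": dp["generation"] == dp["expected"]}


async def generate_with_eval(data, evaluator):
    # Per-datapoint path: generate, postprocess, then evaluate immediately;
    # otherwise fall back to one batch evaluation over the finished outputs.
    outputs = []
    for dp in data:
        dp = {**dp, "generation": dp["prompt"].upper()}  # stand-in for the LLM call
        if evaluator.single_eval:
            dp = await evaluator.eval_single(dp)
        outputs.append(dp)
    if not evaluator.single_eval:
        outputs = [await evaluator.eval_single(dp) for dp in outputs]
    return outputs


data = [{"prompt": "ok", "expected": "OK"}, {"prompt": "no", "expected": "NO!"}]
results = asyncio.run(generate_with_eval(data, StubEvaluator()))
print([dp["is_correct"] for dp in results])  # → [True, False]
```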
sequenceDiagram
    participant Old as "Old evaluate_results"
    participant New as "New evaluator path"
    participant Eval as "Evaluator"

    Old->>Eval: evaluate(cfg)  -- (removed module)
    Note right of New: New flow embeds eval into generation args\nand uses BaseEvaluatorConfig per-file
    New->>Eval: evaluate(eval_type, eval_config)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I hopped through flags and stitched them well,
Merged evals to gen where quiet configs dwell,
Trimmed thinking-tags, one-file leaps ahead,
Docs and scripts refreshed, the pathways led,
A tidy warren of flags — I humbly tread.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title Check: ✅ Passed. The pull request title "Moving evaluation inside generation class and enforcing empty generations when remove_thinking=True" directly addresses the two primary changes in this changeset. The first part, moving evaluation inside the generation class, is clearly demonstrated by the removal of evaluate_results.py and the integration of evaluation logic into the generation class with new methods like evaluate_single_datapoint and run_batch_evaluation. The second part, enforcing empty generations when remove_thinking=True, is reflected in the updated remove_thinking behavior in utils.py that sets the generation to empty when the thinking_end marker is not found. The title is concise, specific, and clearly conveys the main architectural and behavioral changes that a teammate scanning the PR history would understand.
  • Description Check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.

Signed-off-by: Igor Gitman <igitman@nvidia.com>
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
nemo_skills/inference/model/parallel_thinking.py (1)

221-237: Critical: Accessing removed thinking_begin field causes AttributeError.

Lines 227-228 reference self.cfg.thinking_begin, but this field was removed from ParallelThinkingConfig per the PR objectives. This will cause a runtime AttributeError when filter_incomplete_solutions is True (the default).

The filtering logic needs to be updated to align with the new approach that only checks for thinking_end. Based on the PR objectives and the remove_thinking utility behavior, incomplete solutions should be identified by the absence of thinking_end when reasoning is expected.

Apply this diff to fix the logic:

         # Filter out incomplete solutions if specified
         if self.cfg.filter_incomplete_solutions:
             # Remove unfinished solutions
             filtered_solutions = []
             for solution in solutions:
-                # Check if thinking_begin is in the solution and thinking_end is not in the solution
-                if (
-                    self.cfg.thinking_begin in solution[self.cfg.solution_key]
-                    and self.cfg.thinking_end not in solution[self.cfg.solution_key]
-                ):
-                    continue
-                else:
-                    filtered_solutions.append(solution)
+                # Check if thinking_end is not in the solution when remove_thinking is enabled
+                if self.cfg.remove_thinking and self.cfg.thinking_end not in solution[self.cfg.solution_key]:
+                    # Incomplete solution - missing thinking_end marker
+                    continue
+                filtered_solutions.append(solution)
 
             if len(filtered_solutions) < len(solutions):
                 LOG.info(f"Filtered out {len(solutions) - len(filtered_solutions)} incomplete solutions")

Note: This fix assumes incomplete solutions should only be filtered when remove_thinking=True, since that's when the thinking_end marker is expected. You may want to verify this logic aligns with your intended behavior.

nemo_skills/inference/eval/bfcl.py (1)

386-391: Critical: Reference to removed field will cause runtime error.

According to the AI summary, the remove_thinking field was removed from BFCLGenerationConfig, but Line 386 still references self.cfg.remove_thinking. This will raise an AttributeError at runtime when this code path is executed.

The TODO comment on Line 387 acknowledges this needs refactoring. Please either:

  1. Keep the remove_thinking field in BFCLGenerationConfig (revert its removal), or
  2. Remove this entire conditional block and the _remove_thinking_from_message_content call, or
  3. Access remove_thinking from a different source (e.g., parent class or generation config)

Apply one of these fixes before merging:

Option 1: Keep the field (if needed for BFCL)

 @nested_dataclass(kw_only=True)
 class BFCLGenerationConfig(GenerateSolutionsConfig):
     """BFCL benchmark generation."""
+    remove_thinking: bool = False

Option 2: Remove the conditional (if not needed)

-                if self.cfg.remove_thinking:
-                    # TODO: replace with main remove_thinking method
-                    trimmed_response_text = self._remove_thinking_from_message_content(
-                        self.message_parser.get_response_text(model_response["message"])
-                    )
-                    self.message_parser.set_response_text(model_response["message"], trimmed_response_text)
nemo_skills/inference/generate.py (1)

642-646: Prefilled outputs bypass postprocess/single-eval and lose _async_position → restore_async_order will break

You dump prefilled outputs directly without running postprocess_single_output. Since dump_outputs no longer merges original fields, lines written to -async won’t include the async position key and schema normalization. This will cause a KeyError in restore_async_order and also skip interleaved eval for prefilled rows.

Apply postprocess (and single-eval when enabled) before writing prefilled outputs:

-            # Dump prefilled data first
-            if len(prefilled_data_points) > 0:
-                async with self.output_lock:
-                    self.dump_outputs(prefilled_outputs, prefilled_data_points, fout)
+            # Dump prefilled data first (normalize schema and optionally interleave eval)
+            if len(prefilled_data_points) > 0:
+                prepared_outputs = []
+                for dp, out in zip(prefilled_data_points, prefilled_outputs):
+                    await self.postprocess_single_output(out, dp)
+                    if self.should_run_evaluation and self.evaluator:
+                        out = await self.evaluate_single_datapoint({**dp, **out})
+                    prepared_outputs.append(out)
+                async with self.output_lock:
+                    self.dump_outputs(prepared_outputs, prefilled_data_points, fout)
🧹 Nitpick comments (9)
nemo_skills/evaluation/evaluator/base.py (1)

55-60: Consider streaming task creation for large files.

All tasks are created upfront (lines 57-60) before any awaiting, which could consume significant memory for large evaluation files. While the semaphore limits concurrent execution, all task objects are instantiated immediately. A streaming approach (creating tasks in batches) could reduce memory pressure.

However, the comment on Line 46 explicitly acknowledges this tradeoff, and the current approach is simpler and likely sufficient for typical evaluation file sizes.

If needed, here's a batch-based approach:

# In the reading section:
lines = []
with open(self.config.input_file, "rt", encoding="utf-8") as fin:
    lines = [json.loads(line) for line in fin]

# Create and process in batches
temp_file = self.config.input_file + "-tmp"
batch_size = 1000  # Adjust based on needs
with open(temp_file, "wt", encoding="utf-8") as f:
    for i in tqdm.tqdm(range(0, len(lines), batch_size), desc=f"Completed Evaluation for {os.path.basename(self.config.input_file)}"):
        batch = lines[i:i+batch_size]
        tasks = [asyncio.create_task(process_line(line)) for line in batch]
        for task in await asyncio.gather(*tasks):
            f.write(json.dumps(task) + "\n")
nemo_skills/evaluation/evaluator/scicode.py (1)

132-139: Consider using atomic write pattern for data integrity.

The current implementation overwrites the input file directly during evaluation. If the process crashes or is interrupted while writing, the original data could be corrupted or partially lost.

Consider adopting the atomic write pattern used in base.py (lines 60-67 in the relevant snippets):

-    jsonl_file = eval_config.input_file
-    with open(jsonl_file, "rt", encoding="utf-8") as fin:
-        data = [json.loads(line) for line in fin]
-    status_lists = test_code(eval_config, data)
-    with open(jsonl_file, "wt", encoding="utf-8") as fout:
-        for idx, elem in enumerate(data):
-            elem["eval_status"] = status_lists[idx]
-            fout.write(json.dumps(elem) + "\n")
+    jsonl_file = eval_config.input_file
+    with open(jsonl_file, "rt", encoding="utf-8") as fin:
+        data = [json.loads(line) for line in fin]
+    status_lists = test_code(eval_config, data)
+    
+    temp_file = jsonl_file + "-tmp"
+    with open(temp_file, "wt", encoding="utf-8") as fout:
+        for idx, elem in enumerate(data):
+            elem["eval_status"] = status_lists[idx]
+            fout.write(json.dumps(elem) + "\n")
+    
+    os.replace(temp_file, jsonl_file)

Don't forget to add import os at the top of the file if not already present.

nemo_skills/dataset/swe-bench/__init__.py (1)

15-20: Clarify the intent of removing GENERATION_ARGS from swe-bench.

The concern about breaking changes is overstated: EVAL_ARGS is unused throughout the codebase (zero references), and GENERATION_ARGS is fetched at line 126 in nemo_skills/pipeline/utils/eval.py with a default value of "", so missing the constant won't cause errors—it will simply default to an empty string.

However, removing GENERATION_ARGS creates an inconsistency: all 60+ other datasets define it in their __init__.py files. Either add GENERATION_ARGS to swe-bench to maintain consistency or document why swe-bench is an exception.

nemo_skills/evaluation/evaluator/ifbench.py (1)

21-59: Verify that sample lengths match before zipping.

The refactor to single-file workflow using BaseEvaluatorConfig is well-structured. However, the zip() calls at Lines 46 and 51 should use strict=True (Python 3.10+) to catch length mismatches between samples and eval_results.

Apply this diff to add strict validation:

-    for sample, eval_result in zip(samples, eval_results):
+    for sample, eval_result in zip(samples, eval_results, strict=True):
         sample["loose_eval"] = eval_result

     with open(output_dir / "eval_results_strict.jsonl", "rt", encoding="utf-8") as f:
         eval_results = [json.loads(line) for line in f]
-    for sample, eval_result in zip(samples, eval_results):
+    for sample, eval_result in zip(samples, eval_results, strict=True):
         sample["strict_eval"] = eval_result
nemo_skills/inference/generate.py (5)

336-344: Avoid double instantiation of evaluator (perf/side effects)

supports_single_eval() constructs an evaluator instance, and then you construct it again with get_evaluator_class(). Instantiate once and query supports_single_eval() on that instance.

@@
-from nemo_skills.evaluation.evaluator import (
-    evaluate,
-    get_evaluator_class,
-    supports_single_eval,
-)
+from nemo_skills.evaluation.evaluator import (
+    evaluate,
+    get_evaluator_class,
+)
@@
-        if self.should_run_evaluation:
-            self.cfg.eval_config = dict(self.cfg.eval_config)
-            if supports_single_eval(self.cfg.eval_type, self.cfg.eval_config):
-                LOG.info("Evaluator supports per-datapoint evals, will interleave evaluation with generation.")
-                self.evaluator = get_evaluator_class(self.cfg.eval_type, self.cfg.eval_config)
+        if self.should_run_evaluation:
+            self.cfg.eval_config = dict(self.cfg.eval_config)
+            eval_inst = get_evaluator_class(self.cfg.eval_type, self.cfg.eval_config)
+            if eval_inst.supports_single_eval():
+                LOG.info("Evaluator supports per-datapoint evals, will interleave evaluation with generation.")
+                self.evaluator = eval_inst

Also applies to: 35-39


537-540: Don’t mutate original_data_point during postprocess

Popping keys from the original data point introduces surprising side effects. Compute a filtered copy instead.

-        for key in output:
-            original_data_point.pop(key, None)
-        output.update(original_data_point)
+        # Avoid mutating the input datapoint; only merge non-conflicting keys
+        filtered_original = {k: v for k, v in original_data_point.items() if k not in output}
+        output.update(filtered_original)

589-595: Throttle per-datapoint evaluation to avoid unbounded concurrency

Interleaved eval runs concurrently across all tasks, independent of LLM request throttling. If evaluators are heavy (e.g., spawn subprocesses), this can exhaust resources.

Option A (simple): reuse the existing semaphore.

-        eval_results = await self.evaluator.eval_single(data_point)
+        async with self.semaphore:
+            eval_results = await self.evaluator.eval_single(data_point)

Option B (preferred): add a dedicated eval semaphore (e.g., cfg.eval_max_concurrent_requests) and guard with it here.


522-524: dump_outputs no longer uses data_points parameter

Now that merging is handled in postprocess_single_output, the data_points arg is unused. Consider removing it from the signature and call sites to reduce confusion.


729-731: Guard batch evaluation when output file is missing

If there’s no data and no existing output file, batch evaluation will likely fail. Add a simple existence check.

-        if self.should_run_evaluation and self.evaluator is None:
-            self.run_batch_evaluation()
+        if self.should_run_evaluation and self.evaluator is None:
+            if Path(self.cfg.output_file).exists():
+                self.run_batch_evaluation()
+            else:
+                LOG.info("Skipping batch evaluation: output file not found: %s", self.cfg.output_file)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cd862d0 and 9bc6920.

📒 Files selected for processing (107)
  • docs/agentic_inference/parallel_thinking.md (3 hunks)
  • docs/basics/index.md (1 hunks)
  • docs/evaluation/code.md (2 hunks)
  • docs/evaluation/index.md (3 hunks)
  • docs/evaluation/multilingual.md (1 hunks)
  • docs/evaluation/natural-math.md (1 hunks)
  • docs/evaluation/robustness.md (1 hunks)
  • docs/evaluation/tool-calling.md (3 hunks)
  • docs/pipelines/generation.md (2 hunks)
  • docs/releases/opencodereasoning/evaluation.md (1 hunks)
  • docs/releases/openmathinstruct2/dataset.md (2 hunks)
  • docs/releases/openmathreasoning/evaluation.md (4 hunks)
  • docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md (8 hunks)
  • docs/tutorials/posts/nemotron-nano-v2-evals.md (6 hunks)
  • docs/tutorials/posts/omr-simple-recipe.md (1 hunks)
  • nemo_skills/code_execution/sandbox.py (1 hunks)
  • nemo_skills/dataset/aai/__init__.py (1 hunks)
  • nemo_skills/dataset/aalcr/__init__.py (1 hunks)
  • nemo_skills/dataset/aime24/__init__.py (1 hunks)
  • nemo_skills/dataset/aime25/__init__.py (1 hunks)
  • nemo_skills/dataset/algebra222/__init__.py (1 hunks)
  • nemo_skills/dataset/amc23/__init__.py (1 hunks)
  • nemo_skills/dataset/answer-judge/__init__.py (0 hunks)
  • nemo_skills/dataset/arena-hard/__init__.py (1 hunks)
  • nemo_skills/dataset/asdiv/__init__.py (1 hunks)
  • nemo_skills/dataset/beyond-aime/__init__.py (1 hunks)
  • nemo_skills/dataset/bfcl_v3/prepare.py (1 hunks)
  • nemo_skills/dataset/bigcodebench/__init__.py (1 hunks)
  • nemo_skills/dataset/brumo25/__init__.py (1 hunks)
  • nemo_skills/dataset/college_math/__init__.py (1 hunks)
  • nemo_skills/dataset/comp-math-24-25/__init__.py (1 hunks)
  • nemo_skills/dataset/flores200/__init__.py (1 hunks)
  • nemo_skills/dataset/gaokao2023en/__init__.py (1 hunks)
  • nemo_skills/dataset/gpqa/__init__.py (1 hunks)
  • nemo_skills/dataset/gsm-plus/__init__.py (1 hunks)
  • nemo_skills/dataset/gsm8k/__init__.py (1 hunks)
  • nemo_skills/dataset/hle/__init__.py (1 hunks)
  • nemo_skills/dataset/hmmt_feb25/__init__.py (1 hunks)
  • nemo_skills/dataset/human-eval-infilling/__init__.py (1 hunks)
  • nemo_skills/dataset/human-eval/__init__.py (1 hunks)
  • nemo_skills/dataset/ifbench/__init__.py (1 hunks)
  • nemo_skills/dataset/ifeval/__init__.py (1 hunks)
  • nemo_skills/dataset/ioi24/__init__.py (1 hunks)
  • nemo_skills/dataset/ioi25/__init__.py (1 hunks)
  • nemo_skills/dataset/livebench-coding/__init__.py (1 hunks)
  • nemo_skills/dataset/livecodebench-cpp/__init__.py (1 hunks)
  • nemo_skills/dataset/livecodebench-pro/__init__.py (1 hunks)
  • nemo_skills/dataset/livecodebench/__init__.py (1 hunks)
  • nemo_skills/dataset/math-500/__init__.py (1 hunks)
  • nemo_skills/dataset/math-odyssey/__init__.py (1 hunks)
  • nemo_skills/dataset/math/__init__.py (1 hunks)
  • nemo_skills/dataset/mawps/__init__.py (1 hunks)
  • nemo_skills/dataset/mbpp/__init__.py (1 hunks)
  • nemo_skills/dataset/minerva_math/__init__.py (1 hunks)
  • nemo_skills/dataset/minif2f/__init__.py (1 hunks)
  • nemo_skills/dataset/mmlu-pro/__init__.py (1 hunks)
  • nemo_skills/dataset/mmlu-prox/__init__.py (1 hunks)
  • nemo_skills/dataset/mmlu-redux/__init__.py (1 hunks)
  • nemo_skills/dataset/mmlu/__init__.py (1 hunks)
  • nemo_skills/dataset/mobench/__init__.py (1 hunks)
  • nemo_skills/dataset/mrcr/__init__.py (1 hunks)
  • nemo_skills/dataset/ojbench/__init__.py (1 hunks)
  • nemo_skills/dataset/olympiadbench/__init__.py (1 hunks)
  • nemo_skills/dataset/omni-math/__init__.py (1 hunks)
  • nemo_skills/dataset/proofnet/__init__.py (1 hunks)
  • nemo_skills/dataset/putnam-bench/__init__.py (1 hunks)
  • nemo_skills/dataset/ruler/prepare.py (1 hunks)
  • nemo_skills/dataset/scicode/__init__.py (1 hunks)
  • nemo_skills/dataset/simpleqa/__init__.py (1 hunks)
  • nemo_skills/dataset/supergpqa/__init__.py (1 hunks)
  • nemo_skills/dataset/svamp/__init__.py (1 hunks)
  • nemo_skills/dataset/swe-bench/__init__.py (1 hunks)
  • nemo_skills/dataset/wmt24pp/__init__.py (1 hunks)
  • nemo_skills/evaluation/evaluate_results.py (0 hunks)
  • nemo_skills/evaluation/evaluator/__init__.py (1 hunks)
  • nemo_skills/evaluation/evaluator/arena.py (1 hunks)
  • nemo_skills/evaluation/evaluator/base.py (2 hunks)
  • nemo_skills/evaluation/evaluator/bfcl.py (2 hunks)
  • nemo_skills/evaluation/evaluator/code.py (7 hunks)
  • nemo_skills/evaluation/evaluator/ifbench.py (1 hunks)
  • nemo_skills/evaluation/evaluator/ifeval.py (1 hunks)
  • nemo_skills/evaluation/evaluator/ioi.py (2 hunks)
  • nemo_skills/evaluation/evaluator/livecodebench.py (4 hunks)
  • nemo_skills/evaluation/evaluator/math.py (4 hunks)
  • nemo_skills/evaluation/evaluator/mcq.py (2 hunks)
  • nemo_skills/evaluation/evaluator/mrcr.py (2 hunks)
  • nemo_skills/evaluation/evaluator/ojbench.py (2 hunks)
  • nemo_skills/evaluation/evaluator/ruler.py (3 hunks)
  • nemo_skills/evaluation/evaluator/scicode.py (3 hunks)
  • nemo_skills/evaluation/math_grader.py (0 hunks)
  • nemo_skills/inference/eval/bfcl.py (1 hunks)
  • nemo_skills/inference/eval/scicode.py (1 hunks)
  • nemo_skills/inference/eval/swebench.py (2 hunks)
  • nemo_skills/inference/generate.py (10 hunks)
  • nemo_skills/inference/model/parallel_thinking.py (1 hunks)
  • nemo_skills/pipeline/eval.py (0 hunks)
  • nemo_skills/pipeline/generate.py (0 hunks)
  • nemo_skills/pipeline/utils/eval.py (5 hunks)
  • nemo_skills/pipeline/utils/generation.py (1 hunks)
  • nemo_skills/utils.py (1 hunks)
  • recipes/openmathreasoning/scripts/simplified_recipe.py (1 hunks)
  • recipes/openreasoning/eval.py (7 hunks)
  • tests/gpu-tests/test_generate.py (1 hunks)
  • tests/slurm-tests/qwen3_4b_evals/run_test.py (1 hunks)
  • tests/slurm-tests/super_49b_evals/run_test.py (1 hunks)
  • tests/test_configs.py (0 hunks)
  • tests/test_generation.py (1 hunks)
💤 Files with no reviewable changes (6)
  • nemo_skills/pipeline/eval.py
  • nemo_skills/dataset/answer-judge/__init__.py
  • nemo_skills/pipeline/generate.py
  • nemo_skills/evaluation/math_grader.py
  • tests/test_configs.py
  • nemo_skills/evaluation/evaluate_results.py
🧰 Additional context used
🧬 Code graph analysis (22)
recipes/openmathreasoning/scripts/simplified_recipe.py (1)
nemo_skills/pipeline/cli.py (1)
  • wrap_arguments (42-51)
nemo_skills/evaluation/evaluator/scicode.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/utils.py (1)
  • nested_dataclass (63-96)
recipes/openreasoning/eval.py (1)
nemo_skills/pipeline/cli.py (1)
  • wrap_arguments (42-51)
nemo_skills/evaluation/evaluator/arena.py (1)
nemo_skills/utils.py (1)
  • get_logger_name (37-41)
nemo_skills/evaluation/evaluator/ifbench.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/utils.py (1)
  • get_logger_name (37-41)
nemo_skills/inference/eval/scicode.py (1)
nemo_skills/utils.py (1)
  • remove_thinking (47-60)
nemo_skills/inference/model/parallel_thinking.py (1)
nemo_skills/utils.py (1)
  • remove_thinking (47-60)
nemo_skills/evaluation/evaluator/ruler.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/utils.py (2)
  • get_logger_name (37-41)
  • nested_dataclass (63-96)
nemo_skills/evaluation/evaluator/code.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/utils.py (2)
  • get_logger_name (37-41)
  • nested_dataclass (63-96)
nemo_skills/evaluation/evaluator/ifeval.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/utils.py (1)
  • get_logger_name (37-41)
nemo_skills/evaluation/evaluator/mcq.py (3)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/evaluation/math_grader.py (1)
  • extract_answer (95-139)
nemo_skills/utils.py (1)
  • get_logger_name (37-41)
nemo_skills/evaluation/evaluator/bfcl.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/utils.py (1)
  • nested_dataclass (63-96)
nemo_skills/inference/eval/swebench.py (2)
nemo_skills/utils.py (4)
  • get_help_message (335-355)
  • get_logger_name (37-41)
  • nested_dataclass (63-96)
  • setup_logging (99-134)
nemo_skills/inference/generate.py (1)
  • evaluate_single_datapoint (589-595)
nemo_skills/evaluation/evaluator/__init__.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • eval_full (42-72)
nemo_skills/evaluation/evaluator/ioi.py (1)
  • eval_full (377-393)
nemo_skills/evaluation/evaluator/base.py (2)
nemo_skills/utils.py (1)
  • nested_dataclass (63-96)
nemo_skills/evaluation/evaluator/ioi.py (1)
  • eval_full (377-393)
nemo_skills/evaluation/evaluator/ojbench.py (4)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/evaluation/evaluator/code.py (1)
  • preprocess_code (37-93)
nemo_skills/evaluation/evaluator/livecodebench.py (3)
  • execute_in_sandbox_with_retries (63-91)
  • is_sandbox_available (94-115)
  • sandbox_context (53-60)
nemo_skills/utils.py (1)
  • nested_dataclass (63-96)
nemo_skills/pipeline/utils/eval.py (1)
nemo_skills/inference/generate.py (1)
  • get_generation_default_args (247-255)
nemo_skills/evaluation/evaluator/math.py (4)
nemo_skills/code_execution/proof_utils.py (3)
  • ProofBuildConfig (26-32)
  • build_lean4_proof (93-133)
  • determine_proof_status (136-162)
nemo_skills/code_execution/sandbox.py (1)
  • get_sandbox (339-342)
nemo_skills/evaluation/evaluator/base.py (2)
  • BaseEvaluator (34-91)
  • BaseEvaluatorConfig (27-31)
nemo_skills/evaluation/math_grader.py (2)
  • extract_answer (95-139)
  • math_equal (37-92)
nemo_skills/evaluation/evaluator/ioi.py (3)
nemo_skills/evaluation/evaluator/base.py (3)
  • BaseEvaluator (34-91)
  • BaseEvaluatorConfig (27-31)
  • eval_full (42-72)
nemo_skills/file_utils.py (1)
  • jdump (68-100)
nemo_skills/utils.py (1)
  • nested_dataclass (63-96)
nemo_skills/evaluation/evaluator/mrcr.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/utils.py (1)
  • get_logger_name (37-41)
nemo_skills/inference/generate.py (3)
nemo_skills/evaluation/evaluator/__init__.py (3)
  • evaluate (112-125)
  • get_evaluator_class (90-100)
  • supports_single_eval (103-109)
nemo_skills/evaluation/evaluator/base.py (2)
  • supports_single_eval (89-91)
  • eval_single (74-87)
nemo_skills/utils.py (1)
  • remove_thinking (47-60)
nemo_skills/evaluation/evaluator/livecodebench.py (3)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/evaluation/evaluator/code.py (1)
  • preprocess_code (37-93)
nemo_skills/utils.py (1)
  • nested_dataclass (63-96)
🪛 LanguageTool
docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md

[grammar] ~214-~214: There might be a mistake here.
Context: ... not specify ++remove_thinking=True as RULER has predefined setup for those par...

(QB_NEW_EN)


[grammar] ~215-~215: There might be a mistake here.
Context: ...ecify ++remove_thinking=True as RULER has predefined setup for those parameters. ...

(QB_NEW_EN)

docs/evaluation/index.md

[grammar] ~54-~54: There might be a mistake here.
Context: ... evaluation of reasoning models, either provide reasoning parser in server args (e.g. `...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/index.md

53-53: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)

🪛 Ruff (0.14.0)
nemo_skills/evaluation/evaluator/ifbench.py

39-39: subprocess call with shell=True identified, security issue

(S602)


46-46: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


51-51: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

nemo_skills/evaluation/evaluator/code.py

264-264: Avoid specifying long messages outside the exception class

(TRY003)


276-276: Possible hardcoded password assigned to argument: "pass_k"

(S106)


338-338: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/evaluation/evaluator/ifeval.py

39-39: subprocess call with shell=True identified, security issue

(S602)


46-46: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


51-51: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

nemo_skills/evaluation/evaluator/bfcl.py

63-63: subprocess call with shell=True identified, security issue

(S602)


69-69: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

nemo_skills/evaluation/evaluator/ojbench.py

103-103: Use explicit conversion flag

Replace with conversion flag

(RUF010)


105-105: Use explicit conversion flag

Replace with conversion flag

(RUF010)


106-106: Use explicit conversion flag

Replace with conversion flag

(RUF010)


122-122: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/evaluation/evaluator/math.py

70-72: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/evaluation/evaluator/ioi.py

385-385: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

nemo_skills/evaluation/evaluator/livecodebench.py

217-217: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


262-262: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests

Comment on lines +33 to +39
```python
cmd = (
    "cd /opt/benchmarks/IFBench && python -m run_eval "
    f"--input_data={jsonl_file} "
    f"--input_response_data={jsonl_file} "
    f"--output_dir={output_dir} "
)
subprocess.run(cmd, shell=True, check=True)
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Security: Review subprocess call with shell=True.

The use of subprocess.run with shell=True at Line 39 poses a security risk if jsonl_file or other variables contain untrusted input. While jsonl_file comes from cfg.input_file (likely controlled), ensure this path is validated or consider using shell=False with proper command construction.

If shell=True is necessary, ensure jsonl_file is properly validated and does not contain special shell characters. Alternatively, refactor to use shell=False:

```python
cmd_list = [
    "python", "-m", "run_eval",
    f"--input_data={jsonl_file}",
    f"--input_response_data={jsonl_file}",
    f"--output_dir={output_dir}",
]
subprocess.run(cmd_list, cwd="/opt/benchmarks/IFBench", check=True)
```

I'll investigate the security concern by examining the actual code context and how the input parameters are handled.

```bash
# First, examine the actual code in the file around lines 33-39
fd -t f "ifbench.py" --exec cat {} \;

# Look at the function context more carefully
rg -nB10 -A10 "subprocess.run.*shell=True" nemo_skills/evaluation/evaluator/ifbench.py

# Find where cfg.input_file is defined or used
rg -n "input_file" nemo_skills/evaluation/evaluator/ifbench.py -B5 -A5

# Check for any validation functions in the module
rg -n "def.*valid|validate|sanitize" nemo_skills/evaluation/evaluator/ifbench.py

# Look at imports and overall module structure
head -50 nemo_skills/evaluation/evaluator/ifbench.py

# Check if there are any validation patterns for file paths in the codebase
rg -n "cfg\.input" nemo_skills/ -C3 | head -50

# Check what happens with the output_dir parameter
rg -n "output_dir" nemo_skills/evaluation/evaluator/ifbench.py -B3 -A3

# Look at the complete function signature and full body
ast-grep --pattern $'def $$$($_) {
  $$$
}'

# Simpler approach: get the whole file to understand context
wc -l nemo_skills/evaluation/evaluator/ifbench.py && echo "---" && cat nemo_skills/evaluation/evaluator/ifbench.py

# Check for shell metacharacters that could break the command
# Look at how cfg is typically used in this project
rg -n "cfg\." nemo_skills/evaluation/evaluator/ifbench.py | head -20
```

---

I need to examine the actual code to verify this security concern.
🏁 Script executed:

```bash
cat nemo_skills/evaluation/evaluator/ifbench.py
```

Length of output: 2251

🏁 Script executed:

```bash
rg -n "input_file|output_dir|cfg\." nemo_skills/evaluation/evaluator/ifbench.py -B2 -A2
```

Length of output: 1178

🏁 Script executed:

```bash
rg -n "subprocess" nemo_skills/evaluation/evaluator/ifbench.py -B5 -A5
```

Length of output: 790

🏁 Script executed:

```bash
# Check for any validation of paths in this module or parent modules
fd -t f "*.py" nemo_skills/evaluation/evaluator/ --exec rg -l "validate|sanitize|Path|pathlib" {} \;
```

Length of output: 283

🏁 Script executed:

```bash
# Check the config/parameter structure
rg -n "class.*Config|@dataclass" nemo_skills/evaluation/evaluator/ -B2 -A10 | head -80
```

Length of output: 6010


```bash
# Find and inspect BaseEvaluatorConfig to see if input_file is validated
fd "base.py" nemo_skills/evaluation/evaluator/ --exec cat {} \;
# Check if input_file appears to have any validation in the base config or docs
rg -n "input_file" nemo_skills/evaluation/evaluator/base.py -B5 -A5
```

---
</details>

**Use `shell=False` with `cwd` parameter to eliminate shell injection risk.**

The current subprocess call at line 39 interpolates `jsonl_file` and `output_dir` directly into a shell command string. If these paths contain spaces or shell metacharacters (e.g., `;`, `$`, backticks), the command will parse incorrectly or execute unintended code. This is a recognized security vulnerability (CWE-78).

Refactor to avoid shell parsing entirely:

```python
cmd_list = [
    "python", "-m", "run_eval",
    f"--input_data={jsonl_file}",
    f"--input_response_data={jsonl_file}",
    f"--output_dir={output_dir}",
]
subprocess.run(cmd_list, cwd="/opt/benchmarks/IFBench", check=True)
```

This approach uses the `cwd` parameter instead of `cd` and avoids shell interpretation of path variables.

🧰 Tools
🪛 Ruff (0.14.0)

39-39: subprocess call with shell=True identified, security issue

(S602)

🤖 Prompt for AI Agents
nemo_skills/evaluation/evaluator/ifbench.py around lines 33 to 39: the
subprocess call builds a shell command string with interpolated paths and uses
shell=True, which risks shell injection and misparsing; change to pass an
argument list (e.g., ["python", "-m", "run_eval", f"--input_data={jsonl_file}",
f"--input_response_data={jsonl_file}", f"--output_dir={output_dir}"]) to
subprocess.run, remove the inline "cd" and instead set
cwd="/opt/benchmarks/IFBench", and call subprocess.run with check=True
(shell=False by default) so the command is executed securely without shell
interpretation.
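To make the difference concrete, here is a small self-contained sketch (the file path is invented for illustration) showing that the argument-list form passes awkward paths through unchanged, with no shell quoting needed:

```python
import subprocess
import sys

# Hypothetical path containing a space; a shell-interpolated string would
# split this into two arguments, but an argument list passes it verbatim.
jsonl_file = "/tmp/my results.jsonl"

# The child process echoes back the argument it actually received.
cmd_list = [sys.executable, "-c", "import sys; print(sys.argv[1])", jsonl_file]
result = subprocess.run(cmd_list, check=True, capture_output=True, text=True)
print(result.stdout.strip())  # the full path, space included
```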

Comment on lines +46 to +47
```python
for sample, eval_result in zip(samples, eval_results):
    sample["loose_eval"] = eval_result
```

⚠️ Potential issue | 🟠 Major

Add strict=True to zip() for safer pairing.

The zip() call should use strict=True (Python 3.10+) to ensure samples and eval_results have matching lengths and catch potential data misalignment.

Apply this diff:

```diff
-    for sample, eval_result in zip(samples, eval_results):
+    for sample, eval_result in zip(samples, eval_results, strict=True):
         sample["loose_eval"] = eval_result
```

Apply the same fix on line 51:

```diff
-    for sample, eval_result in zip(samples, eval_results):
+    for sample, eval_result in zip(samples, eval_results, strict=True):
         sample["strict_eval"] = eval_result
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
-for sample, eval_result in zip(samples, eval_results):
-    sample["loose_eval"] = eval_result
+for sample, eval_result in zip(samples, eval_results, strict=True):
+    sample["loose_eval"] = eval_result
```
🧰 Tools
🪛 Ruff (0.14.0)

46-46: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/ifeval.py around lines 46-47 (and similarly
at line 51), the zip(samples, eval_results) calls must be made strict to detect
mismatched lengths; update both calls to zip(samples, eval_results, strict=True)
so Python raises an error on length mismatch, ensuring safe pairing and avoiding
silent data misalignment.

Comment on lines +385 to +387
```python
for s, o in zip(all_samples, outputs):
    s["test_case_results"] = o["test_case_results"]
    s["eval_status"] = o["eval_status"]
```

⚠️ Potential issue | 🔴 Critical

Add strict=True to zip() and fix missing eval_status field.

Two issues:

  1. The zip() call should use strict=True to ensure all_samples and outputs have matching lengths.
  2. Line 387 references o["eval_status"], but _evaluate_entry (lines 371-375) returns a dict with only name, subtask, and test_case_results fields—eval_status is missing.

Apply this diff:

```diff
-    for s, o in zip(all_samples, outputs):
+    for s, o in zip(all_samples, outputs, strict=True):
         s["test_case_results"] = o["test_case_results"]
-        s["eval_status"] = o["eval_status"]
```

If eval_status is required, update _evaluate_entry to include it in the returned dict at line 371.


Suggested change

```diff
-for s, o in zip(all_samples, outputs):
-    s["test_case_results"] = o["test_case_results"]
-    s["eval_status"] = o["eval_status"]
+for s, o in zip(all_samples, outputs, strict=True):
+    s["test_case_results"] = o["test_case_results"]
```
🧰 Tools
🪛 Ruff (0.14.0)

385-385: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/ioi.py around lines 385-387, the
zip(all_samples, outputs) call must use zip(..., strict=True) to ensure lengths
match, and the code currently assigns s["eval_status"] = o["eval_status"] but
_evaluate_entry (around line 371) does not return eval_status; update
_evaluate_entry to compute and include an "eval_status" field in its returned
dict (e.g., derive status from test_case_results or existing logic) so the
caller can safely read o["eval_status"].
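If `eval_status` is added, a minimal sketch (the helper name and status strings are assumptions for illustration, not the repository's actual schema) of deriving it from the test case results could look like:

```python
# Hypothetical helper: summarize per-test results into a single status string.
def derive_eval_status(test_case_results):
    if not test_case_results:
        return "no_tests"
    if all(r.get("passed") for r in test_case_results):
        return "passed"
    return "failed"

print(derive_eval_status([{"passed": True}, {"passed": True}]))
print(derive_eval_status([{"passed": True}, {"passed": False}]))
print(derive_eval_status([]))
```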

Kipok added 2 commits October 16, 2025 22:23
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
nemo_skills/evaluation/evaluator/ruler.py (1)

68-80: Non-atomic write still risks data loss.

This is the same issue flagged in the previous review. The file is still overwritten in place without atomic guarantees. If the process crashes, runs out of disk space, or encounters a write error at line 80, the original data will be lost.

As suggested in the previous review, use a temporary file and os.replace() for atomic writes:

```diff
 jsonl_file = cfg.input_file
 with open(jsonl_file, "rt", encoding="utf-8") as fin:
     data = [json.loads(line) for line in fin]
     for sample in tqdm(data):
         parse_result = parse_funcs[eval_config.parse_func](sample["generation"])
         sample["is_correct"] = match_type_funcs[eval_config.match_type](
             sample["generation"], sample["expected_answer"]
         )
         sample["predicted_answer"] = parse_result

+import os
+temp_file = jsonl_file + ".tmp"
-with open(jsonl_file, "wt", encoding="utf-8") as fout:
+with open(temp_file, "wt", encoding="utf-8") as fout:
     for sample in data:
         fout.write(json.dumps(sample) + "\n")
+
+os.replace(temp_file, jsonl_file)
```
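The same pattern in a self-contained sketch (the function and file names are illustrative, not the repository's API); creating the temp file in the destination directory keeps `os.replace` on a single filesystem, where it is atomic:

```python
import json
import os
import tempfile

def atomic_write_jsonl(path, records):
    # Write to a temp file in the same directory, then atomically swap it in.
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wt", encoding="utf-8") as fout:
            for record in records:
                fout.write(json.dumps(record) + "\n")
        os.replace(tmp_path, path)  # original file survives any earlier crash
    except BaseException:
        os.unlink(tmp_path)
        raise

target = os.path.join(tempfile.mkdtemp(), "output.jsonl")
atomic_write_jsonl(target, [{"is_correct": True}, {"is_correct": False}])
with open(target, encoding="utf-8") as fin:
    lines = fin.read().splitlines()
print(lines[0])
```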
🧹 Nitpick comments (5)
nemo_skills/evaluation/evaluator/ruler.py (3)

24-24: Unused logger variable.

The LOG variable is defined but never used in this file.

Consider removing it:

```diff
-LOG = logging.getLogger(get_logger_name(__file__))
```

Or use it to add informative logging, e.g., before processing:

```python
LOG.info(f"Evaluating {jsonl_file} with parse_func={eval_config.parse_func}, match_type={eval_config.match_type}")
```

27-30: Consider adding field validation for parse_func and match_type.

The config allows any string for parse_func and match_type, but only specific values ("default" for parse_func; "all" or "part" for match_type) are valid. If an invalid value is provided, the evaluation will fail with a KeyError at runtime.

Consider adding validation in the config or early in eval_ruler:

```python
@nested_dataclass(kw_only=True)
class RulerEvaluatorConfig(BaseEvaluatorConfig):
    parse_func: str = "default"
    match_type: str

    def __post_init__(self):
        valid_parse_funcs = {"default"}
        valid_match_types = {"all", "part"}
        if self.parse_func not in valid_parse_funcs:
            raise ValueError(f"parse_func must be one of {valid_parse_funcs}, got {self.parse_func}")
        if self.match_type not in valid_match_types:
            raise ValueError(f"match_type must be one of {valid_match_types}, got {self.match_type}")
```

72-76: Consider adding defensive validation for generation and expected_answer fields.

The test data confirms both fields are present in ruler-specific JSONL files (e.g., ./tests/data/score_rm_inputs/output.jsonl). However, there is no upstream validation at the caller level, and livecodebench.py demonstrates that schema validation is used elsewhere in the codebase. Adding a defensive check—similar to how livecodebench.py uses _preprocess_and_validate_file—would prevent crashes if incorrect data is passed to the evaluator:

```diff
 for sample in tqdm(data):
+    if "generation" not in sample or "expected_answer" not in sample:
+        LOG.warning(f"Skipping sample missing required fields: {sample.get('id', 'unknown')}")
+        continue
     parse_result = parse_funcs[eval_config.parse_func](sample["generation"])
```
nemo_skills/utils.py (2)

37-41: Consider edge cases in path handling.

The current implementation has a few potential edge cases:

  1. If "/nemo_skills/" appears multiple times in the path, split("nemo_skills")[1] uses the first occurrence, which might not be the desired behavior.
  2. Windows paths use backslashes, but the function only checks for forward slashes.
  3. The .replace(".py", "") only removes .py extensions but not other Python-related extensions like .pyx or .pyi.

Consider using Path operations for more robust path handling:

```diff
 def get_logger_name(file):
+    file_path = Path(file)
+    file_str = file_path.as_posix()  # Convert to forward slashes
-    if "/nemo_skills/" in file:
-        return "nemo_skills" + file.split("nemo_skills")[1].replace("/", ".").replace(".py", "")
+    if "/nemo_skills/" in file_str:
+        parts = file_str.split("/nemo_skills/", 1)  # Split only once
+        return "nemo_skills." + parts[1].replace("/", ".").removesuffix(file_path.suffix)
     else:
-        return f"[external] {Path(file).stem}"
+        return f"[external] {file_path.stem}"
```

47-64: Consider rate-limiting the warning to avoid log spam.

The warning on lines 59-64 is logged for every sample that doesn't have the thinking end tag. If many samples are missing the end tag (e.g., when accidentally using a non-reasoning model with remove_thinking=True), this could spam the logs significantly.

Consider one of these approaches:

  1. Log the warning once per execution with a counter
  2. Use a rate-limited logger
  3. Add a summary at the end showing how many samples were affected

Example approach 1:

```diff
+_THINKING_END_WARNING_LOGGED = False
+
 def remove_thinking(sample: dict, generation_key: str = "generation", thinking_end: str = "</think>"):
+    global _THINKING_END_WARNING_LOGGED
     # not doing anything if generation isn't a string
     # TODO: should we be more explicit about this?
     if not isinstance(sample[generation_key], str):
         return
     sample[f"_{generation_key}_finished_thinking"] = thinking_end in sample[generation_key]
     if thinking_end in sample[generation_key]:
         sample[f"_full_{generation_key}"] = sample[generation_key]
         sample[generation_key] = sample[generation_key].split(thinking_end)[-1].strip()
     else:
         sample[f"_full_{generation_key}"] = sample[generation_key]
         sample[generation_key] = ""  # no end tag, so setting the generation to empty
-        LOG.warning(
-            "Thinking end tag `%s` not found in generation; setting generation to empty. "
-            "If this happens for every generation, you might have accidentally set ++remove_thinking=True for a "
-            "non-reasoning model or have incorrect end tag.",
-            thinking_end,
-        )
+        if not _THINKING_END_WARNING_LOGGED:
+            LOG.warning(
+                "Thinking end tag `%s` not found in generation; setting generation to empty. "
+                "If this happens for every generation, you might have accidentally set ++remove_thinking=True for a "
+                "non-reasoning model or have incorrect end tag. (Further warnings suppressed)",
+                thinking_end,
+            )
+            _THINKING_END_WARNING_LOGGED = True
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9bc6920 and 1dd43cb.

📒 Files selected for processing (5)
  • nemo_skills/dataset/mmlu-prox/__init__.py (1 hunks)
  • nemo_skills/evaluation/evaluator/ruler.py (3 hunks)
  • nemo_skills/inference/generate.py (11 hunks)
  • nemo_skills/inference/model/parallel_thinking.py (2 hunks)
  • nemo_skills/utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/dataset/mmlu-prox/__init__.py
🧰 Additional context used
🧬 Code graph analysis (3)
nemo_skills/inference/model/parallel_thinking.py (1)
nemo_skills/utils.py (1)
  • remove_thinking (47-64)
nemo_skills/evaluation/evaluator/ruler.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/utils.py (2)
  • get_logger_name (37-41)
  • nested_dataclass (67-100)
nemo_skills/inference/generate.py (3)
nemo_skills/evaluation/evaluator/__init__.py (3)
  • evaluate (112-125)
  • get_evaluator_class (90-100)
  • supports_single_eval (103-109)
nemo_skills/evaluation/evaluator/base.py (2)
  • supports_single_eval (89-91)
  • eval_single (74-87)
nemo_skills/utils.py (1)
  • remove_thinking (47-64)
🪛 Ruff (0.14.0)
nemo_skills/inference/generate.py

644-644: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (11)
nemo_skills/evaluation/evaluator/ruler.py (1)

58-58: LGTM!

The config instantiation correctly uses the new RulerEvaluatorConfig with dictionary unpacking.

nemo_skills/utils.py (1)

44-44: LGTM!

Module-level logger initialization follows standard Python logging patterns.

nemo_skills/inference/generate.py (8)

35-39: LGTM!

Evaluation imports are now centralized and unconditional, which aligns with the new evaluation-into-generation integration approach described in the PR.


180-189: LGTM!

The addition of eval_type and eval_config fields to GenerateSolutionsConfig properly consolidates evaluation configuration into the generation parameters, as described in the PR objectives. The warning comment about remove_thinking is helpful for preventing misuse.


337-343: Evaluator setup looks correct.

The conditional initialization based on supports_single_eval() properly distinguishes between per-datapoint (interleaved) and batch evaluation flows. Note that line 340 mutates self.cfg.eval_config to ensure it's a plain dict—this is likely intentional for compatibility with evaluator classes, but be aware that the original type information is lost.


452-455: LGTM!

The run_batch_evaluation method correctly configures the evaluator to consume the generated output file. Overriding any existing input_file in eval_config is the correct behavior here.


525-545: LGTM with a minor observation.

The postprocess_single_output method correctly handles renaming, stat removal, data merging, and thinking removal. The logic at lines 537-539 properly merges the original data point with the generated output, preserving output values when keys overlap.

Note: This method is declared async but contains no await statements. This is acceptable if it's for future extensibility or to maintain a consistent async API, but consider adding a comment explaining this if it's intentional.


589-595: LGTM!

The evaluate_single_datapoint method correctly times the evaluation and merges results into the data point. The metric name interleaved_eval_single_time_s clearly distinguishes it from batch evaluation timing.


609-613: LGTM!

The pipeline correctly sequences postprocessing before evaluation, which is important since the evaluator expects the postprocessed output format. The conditional check if self.should_run_evaluation and self.evaluator properly handles cases where evaluation is requested but only batch evaluation is supported.


717-737: Verify batch evaluation behavior when no data is processed.

When len(data) == 0 (line 717), generation is skipped but the batch evaluation check on lines 735-736 is still executed. This could attempt to evaluate an empty or non-existent output file.

Please verify this is the intended behavior. If not, consider moving the batch evaluation inside the else block:

```diff
         if len(data) == 0:
             LOG.info("No data to process, skipping generation")
         else:
             data = self.preprocess_data(data)

             self.log_example_prompt(data)

             if self.cfg.dry_run:
                 LOG.info("Exiting without running generation as dry_run flag is set.")
                 return

             if not self.cfg.skip_filled:
                 for output_path in [Path(self.cfg.output_file), Path(self.cfg.output_file + "-async")]:
                     if output_path.exists():
                         output_path.unlink()

             self.wait_for_server()
             asyncio.run(self.async_loop(data))
-
-        if self.should_run_evaluation and self.evaluator is None:
-            self.run_batch_evaluation()
+
+            if self.should_run_evaluation and self.evaluator is None:
+                self.run_batch_evaluation()
+
         self.postprocess()
```
nemo_skills/inference/model/parallel_thinking.py (1)

152-157: LGTM! Correctly updated remove_thinking calls.

The thinking_begin parameter has been properly removed from both remove_thinking calls, aligning with the updated function signature that now relies solely on thinking_end for processing.

Also applies to: 187-192

Comment on lines +644 to +649
```python
for output, data_point in zip(prefilled_outputs, prefilled_data_points):
    await self.postprocess_single_output(output, data_point)

    # evaluate single-data point if requested and evaluator supports that
    if self.should_run_evaluation and self.evaluator:
        output = await self.evaluate_single_datapoint({**data_point, **output})
```

🛠️ Refactor suggestion | 🟠 Major

Add strict=True to zip() for safety.

The static analysis tool correctly flags that zip() on line 644 should use an explicit strict= parameter. Without it, if prefilled_outputs and prefilled_data_points have different lengths, the zip will silently truncate to the shorter length, potentially causing data loss.

```diff
-                for output, data_point in zip(prefilled_outputs, prefilled_data_points):
+                for output, data_point in zip(prefilled_outputs, prefilled_data_points, strict=True):
                     await self.postprocess_single_output(output, data_point)
```

Note: strict=True requires Python 3.10+. If supporting earlier versions, add a length check before the loop.


Suggested change

```diff
-for output, data_point in zip(prefilled_outputs, prefilled_data_points):
-    await self.postprocess_single_output(output, data_point)
-
-    # evaluate single-data point if requested and evaluator supports that
-    if self.should_run_evaluation and self.evaluator:
-        output = await self.evaluate_single_datapoint({**data_point, **output})
+for output, data_point in zip(prefilled_outputs, prefilled_data_points, strict=True):
+    await self.postprocess_single_output(output, data_point)
+
+    # evaluate single-data point if requested and evaluator supports that
+    if self.should_run_evaluation and self.evaluator:
+        output = await self.evaluate_single_datapoint({**data_point, **output})
```
🧰 Tools
🪛 Ruff (0.14.0)

644-644: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
In nemo_skills/inference/generate.py around lines 644 to 649, the loop zipping
prefilled_outputs and prefilled_data_points should not silently truncate if
lengths differ; either add zip(..., strict=True) to enforce equal lengths
(requires Python 3.10+) or, if older Python must be supported, check
len(prefilled_outputs) == len(prefilled_data_points) before the loop and raise
or handle the mismatch; update the code to use one of these approaches so
mismatched lengths are detected rather than silently dropped.


```diff
-remove_thinking: bool = True  # Remove thinking tokens from the solution key
-thinking_begin: str = "<think>"
+remove_thinking: bool = False
```
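A standalone sketch (mirroring the semantics described in the PR, not the actual `nemo_skills.utils` implementation) of why the new `False` default matters for non-reasoning models:

```python
# With the new semantics there is no check for a begin tag: if the end tag is
# absent, the whole generation is treated as unfinished thinking and emptied.
def strip_thinking(generation, thinking_end="</think>"):
    if thinking_end in generation:
        return generation.split(thinking_end)[-1].strip()
    return ""

# Reasoning model: the answer after </think> survives.
print(strip_thinking("<think>some reasoning</think>The answer is 4."))
# Non-reasoning model: no end tag, so the generation is wiped out entirely.
print(repr(strip_thinking("The answer is 4.")))
```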

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify that the breaking default change is documented.

The default value for remove_thinking has changed from True to False. This is a breaking change that may affect existing configurations that relied on the previous default behavior.

Ensure that:

  1. This change is clearly documented in the PR description, migration guide, or release notes
  2. Existing configurations that depend on thinking removal explicitly set remove_thinking=True

Run the following script to find configurations that might be affected:


🏁 Script executed:

```bash
# Description: Find ParallelThinkingConfig usage without explicit remove_thinking setting

# Search for ParallelThinkingConfig instantiations
rg -n -A 10 -B 2 'ParallelThinkingConfig\(' --type=py

# Search for YAML/config files that might configure parallel thinking
fd -e yaml -e yml -e json --exec rg -l 'parallel.*thinking|genselect|gensynthesis' {}
```

Length of output: 1043


I need to examine the configuration instantiation more closely and inspect the configuration files that might be affected.


🏁 Script executed:

#!/bin/bash
# Understand ParallelThinkingConfig instantiation and parameter filtering

# Get full context around line 91 in __init__.py
echo "=== Context around ParallelThinkingConfig instantiation ==="
rg -n -B 15 -A 5 'ParallelThinkingConfig\(' nemo_skills/inference/model/__init__.py | head -40

# Check the qwq.yaml config file
echo -e "\n=== Content of qwq.yaml ==="
cat ./recipes/openmathreasoning/configs/genselect_sdg/qwq.yaml 2>/dev/null || echo "File not found"

# Search for all config files in recipes that might use parallel_thinking or remove_thinking
echo -e "\n=== All YAML config files in recipes ==="
fd -e yaml -e yml ./recipes --exec rg -l 'remove_thinking|parallel' {} \; 2>/dev/null | head -20

Length of output: 5255


🏁 Script executed:

#!/bin/bash
# Understand ParallelThinkingConfig instantiation and parameter filtering

# Get full context around line 91 in __init__.py
echo "=== Context around ParallelThinkingConfig instantiation ==="
rg -n -B 15 -A 5 'ParallelThinkingConfig\(' nemo_skills/inference/model/__init__.py | head -50

# Check the qwq.yaml config file
echo -e "\n=== Content of qwq.yaml ==="
cat ./recipes/openmathreasoning/configs/genselect_sdg/qwq.yaml 2>/dev/null || echo "File not found"

# Search for valid_params definition
echo -e "\n=== Searching for valid_params definition ==="
rg -n -B 5 -A 5 'valid_params' nemo_skills/inference/model/__init__.py

Length of output: 5873



🏁 Script executed:

#!/bin/bash
# Search for explicit remove_thinking settings

echo "=== Searching for remove_thinking settings in code ==="
rg -n 'remove_thinking\s*=' --type=py -B 2 -A 2

echo -e "\n=== Searching for remove_thinking in YAML configs ==="
fd -e yaml -e yml -exec rg -l 'remove_thinking' {} \;

echo -e "\n=== Check ParallelThinkingConfig dataclass definition ==="
rg -n -B 2 -A 15 'class ParallelThinkingConfig' --type=py

echo -e "\n=== Search for where parallel_thinking_config is passed in code ==="
rg -n 'remove_thinking' nemo_skills/inference/model/parallel_thinking.py | head -20

Length of output: 7074


🏁 Script executed:

#!/bin/bash
# Search for explicit remove_thinking settings

echo "=== Searching for remove_thinking settings in code ==="
rg -n 'remove_thinking\s*=' --type=py -B 2 -A 2

echo -e "\n=== Searching for remove_thinking in YAML configs ==="
fd -e yaml -e yml -exec rg -l 'remove_thinking' {} \;

echo -e "\n=== Check ParallelThinkingConfig dataclass definition ==="
rg -n -B 2 -A 20 'class ParallelThinkingConfig' --type=py

echo -e "\n=== Look at actual usage of remove_thinking in parallel_thinking.py ==="
rg -n 'self.cfg.remove_thinking' nemo_skills/inference/model/parallel_thinking.py

Length of output: 7356


🏁 Script executed:

#!/bin/bash
# Verify the remove_thinking function behavior to confirm logic bug

echo "=== Remove thinking function implementation ==="
rg -n -A 25 'def remove_thinking' nemo_skills/utils.py

echo -e "\n=== Context around line 226 in parallel_thinking.py ==="
sed -n '220,235p' nemo_skills/inference/model/parallel_thinking.py

Length of output: 2153


Fix critical logic error on line 226 that filters out all solutions when remove_thinking=True.

After remove_thinking() is called, the thinking_end tag should never remain in the generation. The function either splits it out (line 55 in utils.py) or sets generation to empty (line 58). Therefore, the condition on line 226 will always be true, causing all solutions to be filtered when remove_thinking=True.

The intent appears to be filtering unfinished solutions. Use the _{solution_key}_finished_thinking flag set by remove_thinking() instead:

# filtering unfinished solutions  
if self.cfg.remove_thinking and not solution.get(f"_{self.cfg.solution_key}_finished_thinking", True):
    continue

Regarding the default value change on line 54: No breaking change impact. All 9 existing uses explicitly override with ++remove_thinking=True via Hydra configuration.
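The flag-based filter relies on how `remove_thinking` rewrites each sample. A hypothetical sketch of the behavior the review describes (field names mirror the discussion, not necessarily the actual nemo_skills implementation):

```python
# Hypothetical sketch of remove_thinking's two code paths; the real function
# lives in nemo_skills/utils.py and may differ in details.
def remove_thinking(sample: dict, key: str = "generation", thinking_end: str = "</think>") -> None:
    text = sample[key]
    sample[f"_full_{key}"] = text  # preserve the original generation
    if thinking_end in text:
        # thinking finished: keep only the text after the end tag
        sample[key] = text.split(thinking_end, 1)[1]
        sample[f"_{key}_finished_thinking"] = True
    else:
        # no end tag: the whole generation is unfinished thinking, so drop it
        sample[key] = ""
        sample[f"_{key}_finished_thinking"] = False

done = {"generation": "<think>steps</think> answer: 42"}
remove_thinking(done)
unfinished = {"generation": "<think>still going"}
remove_thinking(unfinished)

# In both cases thinking_end is gone afterwards, so `thinking_end not in generation`
# is always True -- the finished-thinking flag is the only reliable filter.
assert "</think>" not in done["generation"]
assert "</think>" not in unfinished["generation"]
assert done["_generation_finished_thinking"] is True
assert unfinished["_generation_finished_thinking"] is False
```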

Comment on lines 225 to 227
    # filtering unfinished solutions
    if self.cfg.remove_thinking and self.cfg.thinking_end not in solution[self.cfg.solution_key]:
        continue
Contributor

⚠️ Potential issue | 🔴 Critical

Critical: Filtering logic will incorrectly remove all solutions.

After remove_thinking is called (lines 152-157 in generate_solutions and lines 187-192 in _load_solutions), the thinking_end marker is never present in the processed generation because:

  • If thinking_end was present: remove_thinking splits on it and takes the text after
  • If thinking_end was NOT present: remove_thinking sets generation to an empty string

Therefore, thinking_end not in solution[self.cfg.solution_key] will always be True after remove_thinking is applied, causing all solutions to be filtered out when remove_thinking=True.

To filter incomplete solutions correctly, check the flag set by remove_thinking:

         # Remove unfinished solutions
         filtered_solutions = []
         for solution in solutions:
-            # filtering unfinished solutions
-            if self.cfg.remove_thinking and self.cfg.thinking_end not in solution[self.cfg.solution_key]:
+            # Filter out solutions where thinking was incomplete (thinking_end not found)
+            if self.cfg.remove_thinking and not solution["output_dict"].get(f"_{self.cfg.solution_key}_finished_thinking", True):
                 continue
             else:
                 filtered_solutions.append(solution)

Alternatively, check if the generation is empty:

-            if self.cfg.remove_thinking and self.cfg.thinking_end not in solution[self.cfg.solution_key]:
+            if self.cfg.remove_thinking and not solution[self.cfg.solution_key]:
                 continue

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In nemo_skills/inference/model/parallel_thinking.py around lines 225 to 227, the
filter condition uses "thinking_end not in solution[self.cfg.solution_key]"
which will always be true after remove_thinking has run (it either strips the
marker or leaves an empty string), so all solutions are dropped; instead, change
the logic to detect incomplete solutions using the flag set by remove_thinking
(e.g., a boolean returned or stored when stripping thinking) or simply check for
an empty generation string (e.g., skip if solution[self.cfg.solution_key] == ""
or length == 0), or, if you need to inspect the original text, use the
pre-processed/original generation field to test for the presence of thinking_end
before remove_thinking was applied.

Kipok added 6 commits October 17, 2025 10:00
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/inference/model/parallel_thinking.py (1)

370-398: Critical: Missing early return causes crash when all solutions are filtered out.

When solutions is empty (e.g., all filtered by solution_length_cap), lines 371-386 create a sensible default output_dict. However, execution continues to Step 2 (lines 388-398), which overwrites output_dict and attempts to run genselect/gensynthesis on an empty solutions list, causing crashes:

  • GenSelect (line 326): local_random.randint(0, -1) raises ValueError
  • GenSynthesis (line 349): local_random.choice([]) raises IndexError

Add an early return after creating the default output:

         if not solutions:
             output_dict = {
                 self.cfg.solution_key: "",
                 "solution_list": [],
                 f"{self.cfg.mode}_comparison": "",
                 f"{self.cfg.mode}_num_generated_tokens": 0,
                 "total_solution_generated_tokens": total_num_generated_tokens,
                 "num_generated_tokens": total_num_generated_tokens,  # No additional tokens for genselect
                 "num_best_solution_generated_tokens": 0,
             }
 
             # Required by inference/generate.py
             output_dict["generation"] = ""
             if self.cfg.count_prompt_tokens:
                 # The input doesn't make sense for such cases where there are no solutions
                 output_dict["num_input_tokens"] = None
+            return output_dict
 
         # Step 2: Run GenSelect/GenSynthesis
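The two crashes predicted above are standard-library behavior and can be confirmed in isolation:

```python
import random

local_random = random.Random(0)
solutions = []  # everything was filtered out upstream

# GenSelect path: randint over an empty index range
try:
    local_random.randint(0, len(solutions) - 1)  # randint(0, -1)
except ValueError as exc:
    print(f"genselect crash: {exc}")

# GenSynthesis path: choosing from an empty sequence
try:
    local_random.choice(solutions)
except IndexError as exc:
    print(f"gensynthesis crash: {exc}")
```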
♻️ Duplicate comments (1)
nemo_skills/inference/generate.py (1)

647-652: Add strict=True to zip() for safety.

This issue has already been raised in a previous review. The zip() call should use strict=True to prevent silent data loss if the lists have mismatched lengths.

🧹 Nitpick comments (6)
tests/slurm-tests/run_all.sh (1)

3-14: Consider adding a usage comment for script arguments.

While the code is correct and aligns well with the PR's parameter consolidation, adding a brief comment at the top explaining the positional arguments would aid future maintainers. For example:

 #!/bin/bash
 
+# Usage: ./run_all.sh <CLUSTER> [RUN_NAME]
+# CLUSTER: target cluster (required)
+# RUN_NAME: test run identifier; defaults to YYYY-MM-DD (optional)
+
 CLUSTER=$1
 RUN_NAME=${2:-$(date +%Y-%m-%d)}
nemo_skills/inference/eval/swebench.py (1)

150-152: Consider the design of rejected configuration fields.

These fields are added to the config but immediately rejected in __init__ (lines 176-179). This creates a potentially confusing API where users might attempt to set eval_type only to receive an error.

If these fields are present solely for interface consistency with the base GenerationTask class, consider adding a comment explaining this design decision. Alternatively, if they serve no purpose, they could be omitted entirely from this configuration class.

nemo_skills/evaluation/evaluator/ruler.py (1)

69-83: Atomic write implementation correctly addresses the previous data loss concern.

The new implementation using a temporary file and os.replace() prevents data corruption if the process is interrupted. This resolves the critical issue flagged in the previous review.

However, there are two optional improvements to consider:

  1. Processing loop placement: Lines 72-77 process data inside the file reading context manager, but the file is no longer needed after line 71. Moving the loop outside would be clearer.

  2. Temp file cleanup: If an exception occurs during write or replace (lines 79-83), the temp file will be left behind. Adding a try-finally block would ensure cleanup.

Optional refactor to address both points:

 jsonl_file = cfg.input_file
 with open(jsonl_file, "rt", encoding="utf-8") as fin:
     data = [json.loads(line) for line in fin]
-    for sample in tqdm(data):
-        parse_result = parse_funcs[eval_config.parse_func](sample["generation"])
-        sample["is_correct"] = match_type_funcs[eval_config.match_type](
-            sample["generation"], sample["expected_answer"]
-        )
-        sample["predicted_answer"] = parse_result
 
+for sample in tqdm(data):
+    parse_result = parse_funcs[eval_config.parse_func](sample["generation"])
+    sample["is_correct"] = match_type_funcs[eval_config.match_type](
+        sample["generation"], sample["expected_answer"]
+    )
+    sample["predicted_answer"] = parse_result
+
+temp_file = jsonl_file + "-tmp"
+try:
-    with open(jsonl_file + "-tmp", "wt", encoding="utf-8") as fout:
+    with open(temp_file, "wt", encoding="utf-8") as fout:
         for sample in data:
             fout.write(json.dumps(sample) + "\n")
-
-os.replace(jsonl_file + "-tmp", jsonl_file)
+    os.replace(temp_file, jsonl_file)
+except Exception:
+    if os.path.exists(temp_file):
+        os.remove(temp_file)
+    raise
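Putting both optional improvements together, a self-contained sketch of the temp-file pattern (the function name and cleanup strategy are illustrative, not the actual ruler.py code):

```python
import json
import os
import tempfile

def atomic_rewrite_jsonl(path: str, samples: list[dict]) -> None:
    """Rewrite a JSONL file via a temp file so a crash never truncates the original."""
    temp_path = path + "-tmp"
    try:
        with open(temp_path, "wt", encoding="utf-8") as fout:
            for sample in samples:
                fout.write(json.dumps(sample) + "\n")
        # os.replace is atomic when source and destination are on the same filesystem
        os.replace(temp_path, path)
    except Exception:
        # ensure the temp file is not left behind on failure
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise

# usage on a scratch file
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "data.jsonl")
    atomic_rewrite_jsonl(path, [{"is_correct": True}, {"is_correct": False}])
    with open(path, encoding="utf-8") as fin:
        rows = [json.loads(line) for line in fin]
    assert [row["is_correct"] for row in rows] == [True, False]
```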
nemo_skills/pipeline/utils/eval.py (1)

168-171: Environment pollution TODO is a valid concern.

Setting os.environ["NEMO_SKILLS_PRIVILEGED_DOCKER"] at line 170 persists across function calls and could affect other benchmarks processed in the same run. Consider scoping this to the specific swe-bench job or resetting it after use.

Would you like me to open an issue to track proper isolation of this environment variable, or suggest a pattern using context managers for temporary environment changes?
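One possible context-manager pattern for scoping the change (the variable name is taken from the review; the helper itself is hypothetical, not existing nemo_skills code):

```python
import os
from contextlib import contextmanager

@contextmanager
def temporary_env(name: str, value: str):
    """Set an environment variable for the duration of the block, then restore it."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous

os.environ.pop("NEMO_SKILLS_PRIVILEGED_DOCKER", None)  # clean slate for the demo
with temporary_env("NEMO_SKILLS_PRIVILEGED_DOCKER", "1"):
    assert os.environ["NEMO_SKILLS_PRIVILEGED_DOCKER"] == "1"
# outside the block the variable is restored to its prior (unset) state
assert "NEMO_SKILLS_PRIVILEGED_DOCKER" not in os.environ
```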

nemo_skills/inference/generate.py (1)

528-548: Consider making postprocess_single_output synchronous.

The method is declared async but performs no await operations. Unless async is needed for future extensibility, consider making it a regular synchronous method to simplify the call sites.

Apply this diff if you want to make it synchronous:

-    async def postprocess_single_output(self, output, original_data_point):
+    def postprocess_single_output(self, output, original_data_point):

Then update call sites at lines 612, 648 to remove await:

-        await self.postprocess_single_output(output, data_point)
+        self.postprocess_single_output(output, data_point)
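The cost of keeping the method async shows up purely at the call sites; a minimal comparison (toy functions, not the real postprocess_single_output):

```python
import asyncio

# An async def with no awaits still works, but every caller must await it,
# pulling event-loop machinery into pure CPU-bound code.
async def postprocess_async(output: dict, data_point: dict) -> None:
    output.setdefault("generation", "")

def postprocess_sync(output: dict, data_point: dict) -> None:
    output.setdefault("generation", "")

out = {}
asyncio.run(postprocess_async(out, {}))  # needs an event loop for no real reason
postprocess_sync(out, {})                # plain call, same effect
assert out == {"generation": ""}
```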
nemo_skills/inference/model/parallel_thinking.py (1)

74-74: Remove unused configuration field.

The filter_incomplete_solutions field is no longer used anywhere in the code. According to the enriched summary, the filtering logic that used this field was intentionally removed. The config field should be cleaned up.

-    filter_incomplete_solutions: bool = True  # Filter out incomplete solutions
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1dd43cb and 166a392.

📒 Files selected for processing (8)
  • nemo_skills/evaluation/evaluator/ruler.py (3 hunks)
  • nemo_skills/inference/eval/swebench.py (4 hunks)
  • nemo_skills/inference/generate.py (11 hunks)
  • nemo_skills/inference/model/parallel_thinking.py (1 hunks)
  • nemo_skills/pipeline/generate.py (0 hunks)
  • nemo_skills/pipeline/utils/eval.py (5 hunks)
  • nemo_skills/utils.py (1 hunks)
  • tests/slurm-tests/run_all.sh (1 hunks)
💤 Files with no reviewable changes (1)
  • nemo_skills/pipeline/generate.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/utils.py
🧰 Additional context used
🧬 Code graph analysis (5)
nemo_skills/inference/model/parallel_thinking.py (1)
nemo_skills/utils.py (1)
  • remove_thinking (48-65)
nemo_skills/inference/generate.py (3)
nemo_skills/evaluation/evaluator/__init__.py (3)
  • evaluate (112-125)
  • get_evaluator_class (90-100)
  • supports_single_eval (103-109)
nemo_skills/evaluation/evaluator/base.py (2)
  • supports_single_eval (89-91)
  • eval_single (74-87)
nemo_skills/utils.py (1)
  • remove_thinking (48-65)
nemo_skills/inference/eval/swebench.py (2)
nemo_skills/utils.py (4)
  • get_help_message (340-360)
  • get_logger_name (38-42)
  • nested_dataclass (68-101)
  • setup_logging (104-139)
nemo_skills/inference/generate.py (1)
  • evaluate_single_datapoint (592-598)
nemo_skills/evaluation/evaluator/ruler.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/utils.py (2)
  • get_logger_name (38-42)
  • nested_dataclass (68-101)
nemo_skills/pipeline/utils/eval.py (1)
nemo_skills/inference/generate.py (1)
  • get_generation_default_args (247-255)
🪛 Ruff (0.14.0)
nemo_skills/inference/generate.py

647-647: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

nemo_skills/inference/eval/swebench.py

177-179: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (24)
tests/slurm-tests/run_all.sh (2)

3-4: Good parameter consolidation with sensible default.

The refactor to make the date/run identifier a configurable second argument is clean and backward compatible. Callers can now override the date if needed (e.g., for re-running failed jobs with the same date), while defaulting to today's date for common use.


6-14: Consistent application of RUN_NAME across all test invocations.

All five test jobs properly use $RUN_NAME in both --workspace and --expname_prefix paths, ensuring a coherent naming scheme across parallel jobs.

nemo_skills/inference/eval/swebench.py (3)

33-38: LGTM: Clean import organization.

The multi-line import format improves readability with no functional change.


176-179: LGTM: Clear validation prevents misconfiguration.

The guard correctly prevents users from attempting to configure evaluation via eval_type, since SWE-bench handles evaluation automatically through its harness. The error message clearly guides users to the correct behavior.


196-198: LGTM: Method rename aligns with standardized evaluation interface.

The rename from apply_evaluation_hook to evaluate_single_datapoint matches the base class pattern (as seen in generate.py). The pass-through implementation is appropriate here since SWE-bench performs evaluation inline during process_single_datapoint (lines 495-564).

nemo_skills/evaluation/evaluator/ruler.py (3)

17-17: LGTM! Imports align with the refactoring.

The new imports (os, BaseEvaluatorConfig) support the atomic write implementation and the updated class hierarchy.

Also applies to: 22-23


29-29: LGTM! Class hierarchy updated correctly.

Inheriting from BaseEvaluatorConfig provides access to input_file, data_dir, and split fields, which aligns with the PR's goal of fusing top-level evaluation config with inner eval_config.


59-59: LGTM! Config construction updated per PR design.

The change from **cfg.eval_config to **cfg reflects the flattened configuration structure where evaluation parameters are now at the top level.

nemo_skills/pipeline/utils/eval.py (4)

129-133: Backward compatibility for deprecated EVAL_ARGS looks good.

The logic correctly prepends legacy EVAL_ARGS to generation_args while injecting the new eval_config.split format. This provides a clean migration path.


223-223: eval_config.data_dir injection is correct.

The conditional addition of eval_config.data_dir to generation_args aligns with the PR's goal to consolidate evaluation configuration into generation arguments.


243-243: eval_config.data_dir injection is correct.

Consistent with line 223, this properly injects eval_config.data_dir into generation_args for single-benchmark paths.


424-428: Argument concatenation order is correct.

The ordering (generation defaults → benchmark args → job args) properly supports override semantics where job-specific arguments can override benchmark and default settings. This aligns well with Hydra's configuration override behavior.
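Last-wins override semantics can be modeled with a simple dict merge (a toy stand-in for Hydra's actual override resolution, not the real pipeline code):

```python
# Toy model of the defaults -> benchmark -> job override chain: each group is a
# list of "++key=value" strings, and later groups override earlier ones.
def resolve_overrides(*arg_groups: list[str]) -> dict[str, str]:
    resolved: dict[str, str] = {}
    for group in arg_groups:
        for arg in group:
            key, _, value = arg.lstrip("+").partition("=")
            resolved[key] = value  # later groups win
    return resolved

generation_defaults = ["++temperature=0.0", "++remove_thinking=False"]
benchmark_args = ["++remove_thinking=True"]
job_args = ["++temperature=0.6"]

final = resolve_overrides(generation_defaults, benchmark_args, job_args)
assert final == {"temperature": "0.6", "remove_thinking": "True"}
```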

nemo_skills/inference/generate.py (11)

35-39: LGTM!

The evaluation imports are correctly added to support the new integrated evaluation flow.


87-87: LGTM!

Updated docstring appropriately describes the config class.


180-182: LGTM!

The comment appropriately warns users about the remove_thinking behavior change, which is critical for preventing empty generations on non-reasoning models.


187-189: LGTM!

The evaluation configuration fields are properly integrated into the generation config, enabling opt-in evaluation.


337-343: LGTM!

The evaluator initialization logic correctly distinguishes between per-datapoint and batch evaluation modes. The code properly checks if the evaluator supports single evaluation before instantiating it.


401-402: LGTM!

The comment clarifies the intent of the inference override config pattern.


455-458: LGTM!

The batch evaluation method is straightforward and correctly configured. It properly uses the output file as input for evaluation.


524-526: LGTM!

The simplified dump_outputs method now focuses solely on writing, with preprocessing moved to postprocess_single_output.


592-598: LGTM!

The per-datapoint evaluation method correctly times and executes single evaluations, providing performance insights.


612-616: LGTM!

The integration of postprocessing and per-datapoint evaluation is correct. Postprocessing runs before evaluation as intended, and the conditional gating ensures evaluation only runs when appropriate.


720-740: LGTM!

The refactored control flow correctly handles the two evaluation modes:

  • Per-datapoint evaluation when evaluator is available (interleaved with generation)
  • Batch evaluation when eval_type is set but evaluator is None (after generation completes)

The logic is clear and aligns with the PR objectives.
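The two modes can be sketched as a small dispatch (illustrative names only, not the actual GenerationTask code):

```python
# Toy dispatch mirroring the two evaluation modes described above.
class Task:
    def __init__(self, eval_type=None, evaluator=None):
        self.eval_type = eval_type
        self.evaluator = evaluator
        self.log = []

    def run(self, data_points):
        for dp in data_points:
            self.log.append(("generate", dp))
            if self.eval_type and self.evaluator:
                self.log.append(("eval_single", dp))  # interleaved with generation
        if self.eval_type and self.evaluator is None:
            self.log.append(("eval_batch", len(data_points)))  # after generation completes

interleaved = Task(eval_type="math", evaluator=object())
interleaved.run([1, 2])
batch = Task(eval_type="math", evaluator=None)
batch.run([1, 2])
assert ("eval_single", 1) in interleaved.log
assert ("eval_batch", 2) in batch.log
```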

nemo_skills/inference/model/parallel_thinking.py (1)

153-158: LGTM: remove_thinking integration is correct.

The remove_thinking calls properly pass generation_key and thinking_end parameters, and correctly omit the removed thinking_begin parameter. The function modifies the dict in-place to set the _{generation_key}_finished_thinking flag and preserve the full generation in _full_{generation_key}, which aligns with the PR's changes to thinking-token handling.

Also applies to: 195-200

Kipok added 2 commits October 21, 2025 10:51
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
nemo_skills/inference/eval/swebench.py (2)

150-152: Clarify or remove unused eval_type field.

The eval_type field definition suggests it's configurable, but the validation in __init__ rejects any non-None value. This creates API confusion for users who might attempt to configure evaluation.

Consider one of the following approaches:

  1. Remove the eval_type and eval_config fields entirely from SweBenchGenerationConfig if they're not applicable to SWE-bench.
  2. Add a comment in the field definition clarifying that these fields exist for interface consistency but are not used for SWE-bench.
  3. Move these fields to a base config class and have SWE-bench override/validate them appropriately.

Apply this diff to remove the unused fields:

-    # Evaluation setup if requested. If eval_type is set to None, evaluation is skipped
-    eval_type: str | None = None  # "lean4-proof", "math", etc.
-    eval_config: dict = field(default_factory=dict)  # Config for the evaluator
-

And remove the corresponding validation:

-        if self.cfg.eval_type is not None:
-            raise ValueError(
-                "SWE-bench generation task does not support eval_type parameter. Evaluation is done automatically."
-            )
-

Also applies to: 176-179


181-182: Remove unused evaluation flags.

The should_run_evaluation and evaluator attributes are initialized but never used in the code. Since SWE-bench performs evaluation inline during generation (in process_single_datapoint), these flags are unnecessary.

Apply this diff to remove the unused attributes:

-        self.should_run_evaluation = False
-        self.evaluator = None
-
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 166a392 and 3676d7f.

📒 Files selected for processing (4)
  • docs/basics/index.md (1 hunks)
  • nemo_skills/dataset/simpleqa/__init__.py (1 hunks)
  • nemo_skills/evaluation/evaluator/mcq.py (2 hunks)
  • nemo_skills/inference/eval/swebench.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • nemo_skills/dataset/simpleqa/__init__.py
  • nemo_skills/evaluation/evaluator/mcq.py
  • docs/basics/index.md
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/inference/eval/swebench.py (2)
nemo_skills/utils.py (4)
  • get_help_message (340-360)
  • get_logger_name (38-42)
  • nested_dataclass (68-101)
  • setup_logging (104-139)
nemo_skills/inference/generate.py (1)
  • evaluate_single_datapoint (592-598)
🪛 Ruff (0.14.1)
nemo_skills/inference/eval/swebench.py

177-179: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (2)
nemo_skills/inference/eval/swebench.py (2)

33-38: LGTM!

The import consolidation improves readability and follows Python conventions.


199-201: LGTM!

The method rename to evaluate_single_datapoint aligns with the base class interface. The no-op implementation is appropriate since SWE-bench performs evaluation inline during process_single_datapoint.

Kipok added 15 commits October 23, 2025 13:36
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
@Kipok Kipok merged commit 7f4687a into main Oct 24, 2025
5 checks passed
@Kipok Kipok deleted the igitman/reasoning-on-arg branch October 24, 2025 05:18
@fayejf fayejf mentioned this pull request Oct 24, 2025
dgtm777 pushed a commit that referenced this pull request Oct 29, 2025
…ions when remove_thinking=True (#958)

Signed-off-by: Igor Gitman <igitman@nvidia.com>
@coderabbitai coderabbitai bot mentioned this pull request Jan 15, 2026
This was referenced Feb 26, 2026
@coderabbitai coderabbitai bot mentioned this pull request Mar 5, 2026
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
…ions when remove_thinking=True (#958)

Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>