Conversation
Signed-off-by: Igor Gitman <igitman@nvidia.com>
**Walkthrough**

Consolidates evaluation/config flags into generation arguments, removes the standalone `evaluate_results` module, introduces a `BaseEvaluatorConfig` and single-file evaluator workflows, centralizes `remove_thinking` behavior (dropping `thinking_begin`), updates generation to support per-datapoint evaluation hooks, and updates many docs and examples to use `++remove_thinking` and the new `++eval_config`/`++eval_type` CLI syntax.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant Pipeline
    participant Generator
    participant Evaluator
    participant Sandbox
    User->>CLI: run generate with ++eval_type / ++eval_config / ++remove_thinking
    CLI->>Pipeline: parse generation_args (includes eval config)
    Pipeline->>Generator: start generation task (per-datapoint)
    alt per-datapoint evaluation enabled
        Generator->>Sandbox: generate datapoint
        Generator->>Generator: postprocess_single_output(datapoint)
        Generator->>Evaluator: evaluate_single_datapoint(datapoint)
        Evaluator->>Sandbox: run evaluation (single-file or per-datapoint)
        Evaluator-->>Generator: metrics
        Generator->>Pipeline: write augmented output
    else batch-only evaluation
        loop generate datapoints
            Generator->>Sandbox: generate
            Generator->>Generator: postprocess_single_output()
            Generator->>Pipeline: append output
        end
        Pipeline->>Evaluator: run_batch_evaluation(output_file)
        Evaluator->>Sandbox: run evaluation on file
        Evaluator-->>Pipeline: results
    end
    Pipeline-->>User: finished
```
```mermaid
sequenceDiagram
    participant Old as "Old evaluate_results"
    participant New as "New evaluator path"
    participant Eval as "Evaluator"
    Old->>Eval: evaluate(cfg) -- (removed module)
    Note right of New: New flow embeds eval into generation args<br/>and uses BaseEvaluatorConfig per-file
    New->>Eval: evaluate(eval_type, eval_config)
```
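In code, the per-datapoint branch of the first diagram amounts to the generator awaiting an evaluator hook between datapoints. The sketch below is illustrative only: `SketchEvaluator`, its methods, and the toy metric are stand-ins, not the actual nemo_skills classes.

```python
import asyncio


class SketchEvaluator:
    """Toy evaluator with a per-datapoint hook, mirroring the diagram above."""

    def supports_single_eval(self) -> bool:
        return True

    async def eval_single(self, data_point: dict) -> dict:
        # Hypothetical metric: judge the generation non-empty.
        return {**data_point, "is_correct": bool(data_point.get("generation"))}


async def generate_and_eval(data_points: list[dict]) -> list[dict]:
    evaluator = SketchEvaluator()
    results = []
    for dp in data_points:
        dp["generation"] = dp["problem"].upper()  # stand-in for LLM generation
        if evaluator.supports_single_eval():
            dp = await evaluator.eval_single(dp)  # interleaved evaluation
        results.append(dp)
    return results


outputs = asyncio.run(generate_and_eval([{"problem": "2+2"}, {"problem": ""}]))
print([o["is_correct"] for o in outputs])
```

When `supports_single_eval()` is false, the loop would instead only append outputs and a batch evaluator would run once on the written file, as in the `else` branch of the diagram.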
**Estimated code review effort:** 🎯 4 (Complex) | ⏱️ ~60 minutes
**Pre-merge checks**

- ❌ Failed checks (1 warning)
- ✅ Passed checks (2 passed)
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
nemo_skills/inference/model/parallel_thinking.py (1)
**221-237: Critical: Accessing removed `thinking_begin` field causes `AttributeError`.**

Lines 227-228 reference `self.cfg.thinking_begin`, but this field was removed from `ParallelThinkingConfig` per the PR objectives. This will cause a runtime `AttributeError` when `filter_incomplete_solutions` is `True` (the default).

The filtering logic needs to be updated to align with the new approach that only checks for `thinking_end`. Based on the PR objectives and the `remove_thinking` utility behavior, incomplete solutions should be identified by the absence of `thinking_end` when reasoning is expected. Apply this diff to fix the logic:

```diff
 # Filter out incomplete solutions if specified
 if self.cfg.filter_incomplete_solutions:
     # Remove unfinished solutions
     filtered_solutions = []
     for solution in solutions:
-        # Check if thinking_begin is in the solution and thinking_end is not in the solution
-        if (
-            self.cfg.thinking_begin in solution[self.cfg.solution_key]
-            and self.cfg.thinking_end not in solution[self.cfg.solution_key]
-        ):
-            continue
-        else:
-            filtered_solutions.append(solution)
+        # Check if thinking_end is not in the solution when remove_thinking is enabled
+        if self.cfg.remove_thinking and self.cfg.thinking_end not in solution[self.cfg.solution_key]:
+            # Incomplete solution - missing thinking_end marker
+            continue
+        filtered_solutions.append(solution)
     if len(filtered_solutions) < len(solutions):
         LOG.info(f"Filtered out {len(solutions) - len(filtered_solutions)} incomplete solutions")
```

Note: This fix assumes incomplete solutions should only be filtered when `remove_thinking=True`, since that's when the `thinking_end` marker is expected. You may want to verify this logic aligns with your intended behavior.

nemo_skills/inference/eval/bfcl.py (1)
**386-391: Critical: Reference to removed field will cause runtime error.**

According to the AI summary, the `remove_thinking` field was removed from `BFCLGenerationConfig`, but Line 386 still references `self.cfg.remove_thinking`. This will raise an `AttributeError` at runtime when this code path is executed.

The TODO comment on Line 387 acknowledges this needs refactoring. Please either:

- Keep the `remove_thinking` field in `BFCLGenerationConfig` (revert its removal), or
- Remove this entire conditional block and the `_remove_thinking_from_message_content` call, or
- Access `remove_thinking` from a different source (e.g., parent class or generation config)

Apply one of these fixes before merging.

Option 1: Keep the field (if needed for BFCL)

```diff
 @nested_dataclass(kw_only=True)
 class BFCLGenerationConfig(GenerateSolutionsConfig):
     """BFCL benchmark generation."""

+    remove_thinking: bool = False
```

Option 2: Remove the conditional (if not needed)

```diff
-        if self.cfg.remove_thinking:
-            # TODO: replace with main remove_thinking method
-            trimmed_response_text = self._remove_thinking_from_message_content(
-                self.message_parser.get_response_text(model_response["message"])
-            )
-            self.message_parser.set_response_text(model_response["message"], trimmed_response_text)
```

nemo_skills/inference/generate.py (1)
**642-646: Prefilled outputs bypass postprocess/single-eval and lose `_async_position`, so `restore_async_order` will break.**

You dump prefilled outputs directly without running `postprocess_single_output`. Since `dump_outputs` no longer merges original fields, lines written to `-async` won't include the async position key and schema normalization. This will cause a `KeyError` in `restore_async_order` and also skip interleaved eval for prefilled rows.

Apply postprocess (and single-eval when enabled) before writing prefilled outputs:

```diff
-        # Dump prefilled data first
-        if len(prefilled_data_points) > 0:
-            async with self.output_lock:
-                self.dump_outputs(prefilled_outputs, prefilled_data_points, fout)
+        # Dump prefilled data first (normalize schema and optionally interleave eval)
+        if len(prefilled_data_points) > 0:
+            prepared_outputs = []
+            for dp, out in zip(prefilled_data_points, prefilled_outputs):
+                await self.postprocess_single_output(out, dp)
+                if self.should_run_evaluation and self.evaluator:
+                    out = await self.evaluate_single_datapoint({**dp, **out})
+                prepared_outputs.append(out)
+            async with self.output_lock:
+                self.dump_outputs(prepared_outputs, prefilled_data_points, fout)
```
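The ordering half of this concern is easy to see in isolation. The snippet below is a toy reconstruction (the key name `_async_position` is taken from the comment above; the real `restore_async_order` may differ): sorting by the position key restores input order, and a line written without that key would hit the `KeyError` described.

```python
import json

# Simulate out-of-order completion of three datapoints written as they finish.
written_lines = [
    json.dumps({"_async_position": 2, "generation": "c"}),
    json.dumps({"_async_position": 0, "generation": "a"}),
    json.dumps({"_async_position": 1, "generation": "b"}),
]


def restore_async_order(lines: list[str]) -> list[dict]:
    records = [json.loads(line) for line in lines]
    # A KeyError here is exactly the failure mode if a line was written
    # without going through the postprocess step that adds the key.
    records.sort(key=lambda r: r["_async_position"])
    return records


ordered = restore_async_order(written_lines)
print([r["generation"] for r in ordered])
```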
🧹 Nitpick comments (9)
nemo_skills/evaluation/evaluator/base.py (1)
**55-60: Consider streaming task creation for large files.**

All tasks are created upfront (lines 57-60) before any awaiting, which could consume significant memory for large evaluation files. While the semaphore limits concurrent execution, all task objects are instantiated immediately. A streaming approach (creating tasks in batches) could reduce memory pressure.

However, the comment on Line 46 explicitly acknowledges this tradeoff, and the current approach is simpler and likely sufficient for typical evaluation file sizes.
If needed, here's a batch-based approach:

```python
# In the reading section:
lines = []
with open(self.config.input_file, "rt", encoding="utf-8") as fin:
    lines = [json.loads(line) for line in fin]

# Create and process in batches
temp_file = self.config.input_file + "-tmp"
batch_size = 1000  # Adjust based on needs
with open(temp_file, "wt", encoding="utf-8") as f:
    for i in tqdm.tqdm(
        range(0, len(lines), batch_size),
        desc=f"Completed Evaluation for {os.path.basename(self.config.input_file)}",
    ):
        batch = lines[i : i + batch_size]
        tasks = [asyncio.create_task(process_line(line)) for line in batch]
        for result in await asyncio.gather(*tasks):
            f.write(json.dumps(result) + "\n")
```

nemo_skills/evaluation/evaluator/scicode.py (1)
**132-139: Consider using an atomic write pattern for data integrity.**

The current implementation overwrites the input file directly during evaluation. If the process crashes or is interrupted while writing, the original data could be corrupted or partially lost.

Consider adopting the atomic write pattern used in `base.py` (lines 60-67 in the relevant snippets):

```diff
-    jsonl_file = eval_config.input_file
-    with open(jsonl_file, "rt", encoding="utf-8") as fin:
-        data = [json.loads(line) for line in fin]
-    status_lists = test_code(eval_config, data)
-    with open(jsonl_file, "wt", encoding="utf-8") as fout:
-        for idx, elem in enumerate(data):
-            elem["eval_status"] = status_lists[idx]
-            fout.write(json.dumps(elem) + "\n")
+    jsonl_file = eval_config.input_file
+    with open(jsonl_file, "rt", encoding="utf-8") as fin:
+        data = [json.loads(line) for line in fin]
+    status_lists = test_code(eval_config, data)
+
+    temp_file = jsonl_file + "-tmp"
+    with open(temp_file, "wt", encoding="utf-8") as fout:
+        for idx, elem in enumerate(data):
+            elem["eval_status"] = status_lists[idx]
+            fout.write(json.dumps(elem) + "\n")
+
+    os.replace(temp_file, jsonl_file)
```

Don't forget to add `import os` at the top of the file if not already present.

nemo_skills/dataset/swe-bench/__init__.py (1)
**15-20: Clarify the intent of removing `GENERATION_ARGS` from swe-bench.**

The concern about breaking changes is overstated: `EVAL_ARGS` is unused throughout the codebase (zero references), and `GENERATION_ARGS` is fetched at line 126 in `nemo_skills/pipeline/utils/eval.py` with a default value of `""`, so a missing constant won't cause errors; it will simply default to an empty string.

However, removing `GENERATION_ARGS` creates an inconsistency: all 60+ other datasets define it in their `__init__.py` files. Either add `GENERATION_ARGS` to swe-bench to maintain consistency or document why swe-bench is an exception.

nemo_skills/evaluation/evaluator/ifbench.py (1)
**21-59: Verify that sample lengths match before zipping.**

The refactor to a single-file workflow using `BaseEvaluatorConfig` is well-structured. However, the `zip()` calls at Lines 46 and 51 should use `strict=True` (Python 3.10+) to catch length mismatches between samples and eval_results.

Apply this diff to add strict validation:

```diff
-    for sample, eval_result in zip(samples, eval_results):
+    for sample, eval_result in zip(samples, eval_results, strict=True):
         sample["loose_eval"] = eval_result
     with open(output_dir / "eval_results_strict.jsonl", "rt", encoding="utf-8") as f:
         eval_results = [json.loads(line) for line in f]
-    for sample, eval_result in zip(samples, eval_results):
+    for sample, eval_result in zip(samples, eval_results, strict=True):
         sample["strict_eval"] = eval_result
```

nemo_skills/inference/generate.py (5)
**336-344: Avoid double instantiation of the evaluator (perf/side effects).**

`supports_single_eval()` constructs an evaluator instance, and then you construct it again with `get_evaluator_class()`. Instantiate once and query `supports_single_eval()` on that instance.

```diff
@@
-from nemo_skills.evaluation.evaluator import (
-    evaluate,
-    get_evaluator_class,
-    supports_single_eval,
-)
+from nemo_skills.evaluation.evaluator import (
+    evaluate,
+    get_evaluator_class,
+)
@@
-        if self.should_run_evaluation:
-            self.cfg.eval_config = dict(self.cfg.eval_config)
-            if supports_single_eval(self.cfg.eval_type, self.cfg.eval_config):
-                LOG.info("Evaluator supports per-datapoint evals, will interleave evaluation with generation.")
-                self.evaluator = get_evaluator_class(self.cfg.eval_type, self.cfg.eval_config)
+        if self.should_run_evaluation:
+            self.cfg.eval_config = dict(self.cfg.eval_config)
+            eval_inst = get_evaluator_class(self.cfg.eval_type, self.cfg.eval_config)
+            if eval_inst.supports_single_eval():
+                LOG.info("Evaluator supports per-datapoint evals, will interleave evaluation with generation.")
+                self.evaluator = eval_inst
```

Also applies to: 642-646, 35-39
**537-540: Don't mutate `original_data_point` during postprocess.**

Popping keys from the original data point introduces surprising side effects. Compute a filtered copy instead.

```diff
-        for key in output:
-            original_data_point.pop(key, None)
-        output.update(original_data_point)
+        # Avoid mutating the input datapoint; only merge non-conflicting keys
+        filtered_original = {k: v for k, v in original_data_point.items() if k not in output}
+        output.update(filtered_original)
```
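The non-mutating merge can be exercised standalone. This is a minimal sketch of the same dict-comprehension pattern, with hypothetical field names:

```python
def merge_without_mutation(output: dict, original: dict) -> dict:
    """Merge original fields into output; output keys win, original is untouched."""
    filtered = {k: v for k, v in original.items() if k not in output}
    return {**output, **filtered}


original = {"problem": "2+2", "generation": "stale"}
output = {"generation": "4"}
merged = merge_without_mutation(output, original)
print(merged)    # output's "generation" takes precedence
print(original)  # original still has both of its keys
```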
**589-595: Throttle per-datapoint evaluation to avoid unbounded concurrency.**

Interleaved eval runs concurrently across all tasks, independent of LLM request throttling. If evaluators are heavy (e.g., spawn subprocesses), this can exhaust resources.

Option A (simple): reuse the existing semaphore.

```diff
-        eval_results = await self.evaluator.eval_single(data_point)
+        async with self.semaphore:
+            eval_results = await self.evaluator.eval_single(data_point)
```

Option B (preferred): add a dedicated eval semaphore (e.g., `cfg.eval_max_concurrent_requests`) and guard with it here.
**522-524: `dump_outputs` no longer uses the `data_points` parameter.**

Now that merging is handled in `postprocess_single_output`, the `data_points` arg is unused. Consider removing it from the signature and call sites to reduce confusion.
**729-731: Guard batch evaluation when the output file is missing.**

If there's no data and no existing output file, batch evaluation will likely fail. Add a simple existence check.

```diff
-        if self.should_run_evaluation and self.evaluator is None:
-            self.run_batch_evaluation()
+        if self.should_run_evaluation and self.evaluator is None:
+            if Path(self.cfg.output_file).exists():
+                self.run_batch_evaluation()
+            else:
+                LOG.info("Skipping batch evaluation: output file not found: %s", self.cfg.output_file)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (107)

docs/agentic_inference/parallel_thinking.md (3 hunks), docs/basics/index.md (1 hunks), docs/evaluation/code.md (2 hunks), docs/evaluation/index.md (3 hunks), docs/evaluation/multilingual.md (1 hunks), docs/evaluation/natural-math.md (1 hunks), docs/evaluation/robustness.md (1 hunks), docs/evaluation/tool-calling.md (3 hunks), docs/pipelines/generation.md (2 hunks), docs/releases/opencodereasoning/evaluation.md (1 hunks), docs/releases/openmathinstruct2/dataset.md (2 hunks), docs/releases/openmathreasoning/evaluation.md (4 hunks), docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md (8 hunks), docs/tutorials/posts/nemotron-nano-v2-evals.md (6 hunks), docs/tutorials/posts/omr-simple-recipe.md (1 hunks), nemo_skills/code_execution/sandbox.py (1 hunks), nemo_skills/dataset/aai/__init__.py (1 hunks), nemo_skills/dataset/aalcr/__init__.py (1 hunks), nemo_skills/dataset/aime24/__init__.py (1 hunks), nemo_skills/dataset/aime25/__init__.py (1 hunks), nemo_skills/dataset/algebra222/__init__.py (1 hunks), nemo_skills/dataset/amc23/__init__.py (1 hunks), nemo_skills/dataset/answer-judge/__init__.py (0 hunks), nemo_skills/dataset/arena-hard/__init__.py (1 hunks), nemo_skills/dataset/asdiv/__init__.py (1 hunks), nemo_skills/dataset/beyond-aime/__init__.py (1 hunks), nemo_skills/dataset/bfcl_v3/prepare.py (1 hunks), nemo_skills/dataset/bigcodebench/__init__.py (1 hunks), nemo_skills/dataset/brumo25/__init__.py (1 hunks), nemo_skills/dataset/college_math/__init__.py (1 hunks), nemo_skills/dataset/comp-math-24-25/__init__.py (1 hunks), nemo_skills/dataset/flores200/__init__.py (1 hunks), nemo_skills/dataset/gaokao2023en/__init__.py (1 hunks), nemo_skills/dataset/gpqa/__init__.py (1 hunks), nemo_skills/dataset/gsm-plus/__init__.py (1 hunks), nemo_skills/dataset/gsm8k/__init__.py (1 hunks), nemo_skills/dataset/hle/__init__.py (1 hunks), nemo_skills/dataset/hmmt_feb25/__init__.py (1 hunks), nemo_skills/dataset/human-eval-infilling/__init__.py (1 hunks), nemo_skills/dataset/human-eval/__init__.py (1 hunks), nemo_skills/dataset/ifbench/__init__.py (1 hunks), nemo_skills/dataset/ifeval/__init__.py (1 hunks), nemo_skills/dataset/ioi24/__init__.py (1 hunks), nemo_skills/dataset/ioi25/__init__.py (1 hunks), nemo_skills/dataset/livebench-coding/__init__.py (1 hunks), nemo_skills/dataset/livecodebench-cpp/__init__.py (1 hunks), nemo_skills/dataset/livecodebench-pro/__init__.py (1 hunks), nemo_skills/dataset/livecodebench/__init__.py (1 hunks), nemo_skills/dataset/math-500/__init__.py (1 hunks), nemo_skills/dataset/math-odyssey/__init__.py (1 hunks), nemo_skills/dataset/math/__init__.py (1 hunks), nemo_skills/dataset/mawps/__init__.py (1 hunks), nemo_skills/dataset/mbpp/__init__.py (1 hunks), nemo_skills/dataset/minerva_math/__init__.py (1 hunks), nemo_skills/dataset/minif2f/__init__.py (1 hunks), nemo_skills/dataset/mmlu-pro/__init__.py (1 hunks), nemo_skills/dataset/mmlu-prox/__init__.py (1 hunks), nemo_skills/dataset/mmlu-redux/__init__.py (1 hunks), nemo_skills/dataset/mmlu/__init__.py (1 hunks), nemo_skills/dataset/mobench/__init__.py (1 hunks), nemo_skills/dataset/mrcr/__init__.py (1 hunks), nemo_skills/dataset/ojbench/__init__.py (1 hunks), nemo_skills/dataset/olympiadbench/__init__.py (1 hunks), nemo_skills/dataset/omni-math/__init__.py (1 hunks), nemo_skills/dataset/proofnet/__init__.py (1 hunks), nemo_skills/dataset/putnam-bench/__init__.py (1 hunks), nemo_skills/dataset/ruler/prepare.py (1 hunks), nemo_skills/dataset/scicode/__init__.py (1 hunks), nemo_skills/dataset/simpleqa/__init__.py (1 hunks), nemo_skills/dataset/supergpqa/__init__.py (1 hunks), nemo_skills/dataset/svamp/__init__.py (1 hunks), nemo_skills/dataset/swe-bench/__init__.py (1 hunks), nemo_skills/dataset/wmt24pp/__init__.py (1 hunks), nemo_skills/evaluation/evaluate_results.py (0 hunks), nemo_skills/evaluation/evaluator/__init__.py (1 hunks), nemo_skills/evaluation/evaluator/arena.py (1 hunks), nemo_skills/evaluation/evaluator/base.py (2 hunks), nemo_skills/evaluation/evaluator/bfcl.py (2 hunks), nemo_skills/evaluation/evaluator/code.py (7 hunks), nemo_skills/evaluation/evaluator/ifbench.py (1 hunks), nemo_skills/evaluation/evaluator/ifeval.py (1 hunks), nemo_skills/evaluation/evaluator/ioi.py (2 hunks), nemo_skills/evaluation/evaluator/livecodebench.py (4 hunks), nemo_skills/evaluation/evaluator/math.py (4 hunks), nemo_skills/evaluation/evaluator/mcq.py (2 hunks), nemo_skills/evaluation/evaluator/mrcr.py (2 hunks), nemo_skills/evaluation/evaluator/ojbench.py (2 hunks), nemo_skills/evaluation/evaluator/ruler.py (3 hunks), nemo_skills/evaluation/evaluator/scicode.py (3 hunks), nemo_skills/evaluation/math_grader.py (0 hunks), nemo_skills/inference/eval/bfcl.py (1 hunks), nemo_skills/inference/eval/scicode.py (1 hunks), nemo_skills/inference/eval/swebench.py (2 hunks), nemo_skills/inference/generate.py (10 hunks), nemo_skills/inference/model/parallel_thinking.py (1 hunks), nemo_skills/pipeline/eval.py (0 hunks), nemo_skills/pipeline/generate.py (0 hunks), nemo_skills/pipeline/utils/eval.py (5 hunks), nemo_skills/pipeline/utils/generation.py (1 hunks), nemo_skills/utils.py (1 hunks), recipes/openmathreasoning/scripts/simplified_recipe.py (1 hunks), recipes/openreasoning/eval.py (7 hunks), tests/gpu-tests/test_generate.py (1 hunks), tests/slurm-tests/qwen3_4b_evals/run_test.py (1 hunks), tests/slurm-tests/super_49b_evals/run_test.py (1 hunks), tests/test_configs.py (0 hunks), tests/test_generation.py (1 hunks)
💤 Files with no reviewable changes (6)
- nemo_skills/pipeline/eval.py
- nemo_skills/dataset/answer-judge/__init__.py
- nemo_skills/pipeline/generate.py
- nemo_skills/evaluation/math_grader.py
- tests/test_configs.py
- nemo_skills/evaluation/evaluate_results.py
🧰 Additional context used
🧬 Code graph analysis (22)
recipes/openmathreasoning/scripts/simplified_recipe.py (1)
nemo_skills/pipeline/cli.py (1)
wrap_arguments(42-51)
nemo_skills/evaluation/evaluator/scicode.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
BaseEvaluatorConfig(27-31)nemo_skills/utils.py (1)
nested_dataclass(63-96)
recipes/openreasoning/eval.py (1)
nemo_skills/pipeline/cli.py (1)
wrap_arguments(42-51)
nemo_skills/evaluation/evaluator/arena.py (1)
nemo_skills/utils.py (1)
get_logger_name(37-41)
nemo_skills/evaluation/evaluator/ifbench.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
BaseEvaluatorConfig(27-31)nemo_skills/utils.py (1)
get_logger_name(37-41)
nemo_skills/inference/eval/scicode.py (1)
nemo_skills/utils.py (1)
remove_thinking(47-60)
nemo_skills/inference/model/parallel_thinking.py (1)
nemo_skills/utils.py (1)
remove_thinking(47-60)
nemo_skills/evaluation/evaluator/ruler.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
BaseEvaluatorConfig(27-31)nemo_skills/utils.py (2)
get_logger_name(37-41)nested_dataclass(63-96)
nemo_skills/evaluation/evaluator/code.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
BaseEvaluatorConfig(27-31)nemo_skills/utils.py (2)
get_logger_name(37-41)nested_dataclass(63-96)
nemo_skills/evaluation/evaluator/ifeval.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
BaseEvaluatorConfig(27-31)nemo_skills/utils.py (1)
get_logger_name(37-41)
nemo_skills/evaluation/evaluator/mcq.py (3)
nemo_skills/evaluation/evaluator/base.py (1)
BaseEvaluatorConfig(27-31)nemo_skills/evaluation/math_grader.py (1)
extract_answer(95-139)nemo_skills/utils.py (1)
get_logger_name(37-41)
nemo_skills/evaluation/evaluator/bfcl.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
BaseEvaluatorConfig(27-31)nemo_skills/utils.py (1)
nested_dataclass(63-96)
nemo_skills/inference/eval/swebench.py (2)
nemo_skills/utils.py (4)
get_help_message(335-355)get_logger_name(37-41)nested_dataclass(63-96)setup_logging(99-134)nemo_skills/inference/generate.py (1)
evaluate_single_datapoint(589-595)
nemo_skills/evaluation/evaluator/__init__.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
eval_full(42-72)nemo_skills/evaluation/evaluator/ioi.py (1)
eval_full(377-393)
nemo_skills/evaluation/evaluator/base.py (2)
nemo_skills/utils.py (1)
nested_dataclass(63-96)nemo_skills/evaluation/evaluator/ioi.py (1)
eval_full(377-393)
nemo_skills/evaluation/evaluator/ojbench.py (4)
nemo_skills/evaluation/evaluator/base.py (1)
BaseEvaluatorConfig(27-31)nemo_skills/evaluation/evaluator/code.py (1)
preprocess_code(37-93)nemo_skills/evaluation/evaluator/livecodebench.py (3)
execute_in_sandbox_with_retries(63-91)is_sandbox_available(94-115)sandbox_context(53-60)nemo_skills/utils.py (1)
nested_dataclass(63-96)
nemo_skills/pipeline/utils/eval.py (1)
nemo_skills/inference/generate.py (1)
get_generation_default_args(247-255)
nemo_skills/evaluation/evaluator/math.py (4)
nemo_skills/code_execution/proof_utils.py (3)
ProofBuildConfig(26-32)build_lean4_proof(93-133)determine_proof_status(136-162)nemo_skills/code_execution/sandbox.py (1)
get_sandbox(339-342)nemo_skills/evaluation/evaluator/base.py (2)
BaseEvaluator(34-91)BaseEvaluatorConfig(27-31)nemo_skills/evaluation/math_grader.py (2)
extract_answer(95-139)math_equal(37-92)
nemo_skills/evaluation/evaluator/ioi.py (3)
nemo_skills/evaluation/evaluator/base.py (3)
BaseEvaluator(34-91)BaseEvaluatorConfig(27-31)eval_full(42-72)nemo_skills/file_utils.py (1)
jdump(68-100)nemo_skills/utils.py (1)
nested_dataclass(63-96)
nemo_skills/evaluation/evaluator/mrcr.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
BaseEvaluatorConfig(27-31)nemo_skills/utils.py (1)
get_logger_name(37-41)
nemo_skills/inference/generate.py (3)
nemo_skills/evaluation/evaluator/__init__.py (3)
evaluate(112-125)get_evaluator_class(90-100)supports_single_eval(103-109)nemo_skills/evaluation/evaluator/base.py (2)
supports_single_eval(89-91)eval_single(74-87)nemo_skills/utils.py (1)
remove_thinking(47-60)
nemo_skills/evaluation/evaluator/livecodebench.py (3)
nemo_skills/evaluation/evaluator/base.py (1)
BaseEvaluatorConfig(27-31)nemo_skills/evaluation/evaluator/code.py (1)
preprocess_code(37-93)nemo_skills/utils.py (1)
nested_dataclass(63-96)
🪛 LanguageTool
docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md
[grammar] ~214-~214: There might be a mistake here.
Context: ... not specify ++remove_thinking=True as RULER has predefined setup for those par...
(QB_NEW_EN)
[grammar] ~215-~215: There might be a mistake here.
Context: ...ecify ++remove_thinking=True as RULER has predefined setup for those parameters. ...
(QB_NEW_EN)
docs/evaluation/index.md
[grammar] ~54-~54: There might be a mistake here.
Context: ... evaluation of reasoning models, either provide reasoning parser in server args (e.g. `...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/index.md
53-53: Code block style
Expected: indented; Actual: fenced
(MD046, code-block-style)
🪛 Ruff (0.14.0)
nemo_skills/evaluation/evaluator/ifbench.py
39-39: subprocess call with shell=True identified, security issue
(S602)
46-46: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
51-51: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
nemo_skills/evaluation/evaluator/code.py
264-264: Avoid specifying long messages outside the exception class
(TRY003)
276-276: Possible hardcoded password assigned to argument: "pass_k"
(S106)
338-338: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/evaluation/evaluator/ifeval.py
39-39: subprocess call with shell=True identified, security issue
(S602)
46-46: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
51-51: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
nemo_skills/evaluation/evaluator/bfcl.py
63-63: subprocess call with shell=True identified, security issue
(S602)
69-69: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
nemo_skills/evaluation/evaluator/ojbench.py
103-103: Use explicit conversion flag
Replace with conversion flag
(RUF010)
105-105: Use explicit conversion flag
Replace with conversion flag
(RUF010)
106-106: Use explicit conversion flag
Replace with conversion flag
(RUF010)
122-122: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/evaluation/evaluator/math.py
70-72: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/evaluation/evaluator/ioi.py
385-385: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
nemo_skills/evaluation/evaluator/livecodebench.py
217-217: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
262-262: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
```python
cmd = (
    "cd /opt/benchmarks/IFBench && python -m run_eval "
    f"--input_data={jsonl_file} "
    f"--input_response_data={jsonl_file} "
    f"--output_dir={output_dir} "
)
subprocess.run(cmd, shell=True, check=True)
```
🧩 Analysis chain
Security: Review subprocess call with shell=True.
The use of subprocess.run with shell=True at Line 39 poses a security risk if jsonl_file or other variables contain untrusted input. While jsonl_file comes from cfg.input_file (likely controlled), ensure this path is validated or consider using shell=False with proper command construction.
If shell=True is necessary, ensure jsonl_file is properly validated and does not contain special shell characters. Alternatively, refactor to use shell=False:
```python
cmd_list = [
    "python", "-m", "run_eval",
    f"--input_data={jsonl_file}",
    f"--input_response_data={jsonl_file}",
    f"--output_dir={output_dir}",
]
subprocess.run(cmd_list, cwd="/opt/benchmarks/IFBench", check=True)
```
**Use `shell=False` with `cwd` parameter to eliminate shell injection risk.**
The current subprocess call at line 39 interpolates `jsonl_file` and `output_dir` directly into a shell command string. If these paths contain spaces or shell metacharacters (e.g., `;`, `$`, backticks), the command will parse incorrectly or execute unintended code. This is a recognized security vulnerability (CWE-78).
Refactor to avoid shell parsing entirely:
```python
cmd_list = [
    "python", "-m", "run_eval",
    f"--input_data={jsonl_file}",
    f"--input_response_data={jsonl_file}",
    f"--output_dir={output_dir}",
]
subprocess.run(cmd_list, cwd="/opt/benchmarks/IFBench", check=True)
```

This approach uses the `cwd` parameter instead of `cd` and avoids shell interpretation of path variables.
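To see why the list form is safer, the toy run below passes a path containing spaces and a `;` through an argument list. `sys.executable` with `-c` stands in for the real `run_eval` module, which isn't available outside the container; the point is that the child receives the tricky path as one intact argv entry, with no shell interpretation.

```python
import subprocess
import sys

# A path that would misparse (or worse) under shell=True string interpolation.
tricky_path = "results dir/file; rm -rf no.jsonl"

# The child just echoes its argv so we can see what actually arrived.
child = "import sys; print(sys.argv[1])"
proc = subprocess.run(
    [sys.executable, "-c", child, f"--input_data={tricky_path}"],
    capture_output=True,
    text=True,
    check=True,
)
print(proc.stdout.strip())
```

With `shell=True` and string interpolation, the same path would be split at the space and the `; rm -rf` fragment would be executed as a separate command.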
🧰 Tools
🪛 Ruff (0.14.0)
39-39: subprocess call with shell=True identified, security issue
(S602)
🤖 Prompt for AI Agents
nemo_skills/evaluation/evaluator/ifbench.py around lines 33 to 39: the
subprocess call builds a shell command string with interpolated paths and uses
shell=True, which risks shell injection and misparsing; change to pass an
argument list (e.g., ["python", "-m", "run_eval", f"--input_data={jsonl_file}",
f"--input_response_data={jsonl_file}", f"--output_dir={output_dir}"]) to
subprocess.run, remove the inline "cd" and instead set
cwd="/opt/benchmarks/IFBench", and call subprocess.run with check=True
(shell=False by default) so the command is executed securely without shell
interpretation.
```python
for sample, eval_result in zip(samples, eval_results):
    sample["loose_eval"] = eval_result
```
Add `strict=True` to `zip()` for safer pairing.
The `zip()` call should use `strict=True` (Python 3.10+) to ensure `samples` and `eval_results` have matching lengths and catch potential data misalignment.
Apply this diff:

```diff
-    for sample, eval_result in zip(samples, eval_results):
+    for sample, eval_result in zip(samples, eval_results, strict=True):
         sample["loose_eval"] = eval_result
```

Apply the same fix on line 51:

```diff
-    for sample, eval_result in zip(samples, eval_results):
+    for sample, eval_result in zip(samples, eval_results, strict=True):
         sample["strict_eval"] = eval_result
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
for sample, eval_result in zip(samples, eval_results, strict=True):
    sample["loose_eval"] = eval_result
```
🧰 Tools
🪛 Ruff (0.14.0)
46-46: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/ifeval.py around lines 46-47 (and similarly
at line 51), the zip(samples, eval_results) calls must be made strict to detect
mismatched lengths; update both calls to zip(samples, eval_results, strict=True)
so Python raises an error on length mismatch, ensuring safe pairing and avoiding
silent data misalignment.
```python
for s, o in zip(all_samples, outputs):
    s["test_case_results"] = o["test_case_results"]
    s["eval_status"] = o["eval_status"]
```
Add `strict=True` to `zip()` and fix missing `eval_status` field.
Two issues:
- The `zip()` call should use `strict=True` to ensure `all_samples` and `outputs` have matching lengths.
- Line 387 references `o["eval_status"]`, but `_evaluate_entry` (lines 371-375) returns a dict with only `name`, `subtask`, and `test_case_results` fields; `eval_status` is missing.
Apply this diff:

```diff
-    for s, o in zip(all_samples, outputs):
+    for s, o in zip(all_samples, outputs, strict=True):
         s["test_case_results"] = o["test_case_results"]
-        s["eval_status"] = o["eval_status"]
```

If `eval_status` is required, update `_evaluate_entry` to include it in the returned dict at line 371.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
for s, o in zip(all_samples, outputs, strict=True):
    s["test_case_results"] = o["test_case_results"]
```
🧰 Tools
🪛 Ruff (0.14.0)
385-385: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/ioi.py around lines 385-387, the
zip(all_samples, outputs) call must use zip(..., strict=True) to ensure lengths
match, and the code currently assigns s["eval_status"] = o["eval_status"] but
_evaluate_entry (around line 371) does not return eval_status; update
_evaluate_entry to compute and include an "eval_status" field in its returned
dict (e.g., derive status from test_case_results or existing logic) so the
caller can safely read o["eval_status"].
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Actionable comments posted: 3
♻️ Duplicate comments (1)
nemo_skills/evaluation/evaluator/ruler.py (1)
68-80: Non-atomic write still risks data loss.
This is the same issue flagged in the previous review. The file is still overwritten in place without atomic guarantees. If the process crashes, runs out of disk space, or encounters a write error at line 80, the original data will be lost.
As suggested in the previous review, use a temporary file and `os.replace()` for atomic writes:

```diff
 jsonl_file = cfg.input_file

 with open(jsonl_file, "rt", encoding="utf-8") as fin:
     data = [json.loads(line) for line in fin]

     for sample in tqdm(data):
         parse_result = parse_funcs[eval_config.parse_func](sample["generation"])
         sample["is_correct"] = match_type_funcs[eval_config.match_type](
             sample["generation"], sample["expected_answer"]
         )
         sample["predicted_answer"] = parse_result

+import os
+temp_file = jsonl_file + ".tmp"
-with open(jsonl_file, "wt", encoding="utf-8") as fout:
+with open(temp_file, "wt", encoding="utf-8") as fout:
     for sample in data:
         fout.write(json.dumps(sample) + "\n")
+
+os.replace(temp_file, jsonl_file)
```
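The write-to-temp-then-rename pattern generalizes beyond this file. A minimal self-contained sketch (the helper name and output path are invented for illustration, not taken from the repository):

```python
import json
import os
import tempfile

def atomic_write_jsonl(path, records):
    # Write to a temp file in the same directory (so os.replace stays on one
    # filesystem), then atomically swap it in. A crash mid-write leaves the
    # original file untouched; the temp file is cleaned up on failure.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fout:
            for record in records:
                fout.write(json.dumps(record) + "\n")
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

atomic_write_jsonl("eval_output.jsonl", [{"is_correct": True}, {"is_correct": False}])
with open("eval_output.jsonl", encoding="utf-8") as fin:
    print(sum(1 for _ in fin))  # 2
```

`os.replace()` is atomic on POSIX for same-filesystem paths, which is why the temp file must be created next to the target rather than in the system temp directory.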
🧹 Nitpick comments (5)
nemo_skills/evaluation/evaluator/ruler.py (3)
24-24: Unused logger variable.
The `LOG` variable is defined but never used in this file. Consider removing it:

```diff
-LOG = logging.getLogger(get_logger_name(__file__))
```

Or use it to add informative logging, e.g., before processing:

```python
LOG.info(f"Evaluating {jsonl_file} with parse_func={eval_config.parse_func}, match_type={eval_config.match_type}")
```
27-30: Consider adding field validation for `parse_func` and `match_type`.
The config allows any string for `parse_func` and `match_type`, but only specific values ("default" for `parse_func`; "all" or "part" for `match_type`) are valid. If an invalid value is provided, the evaluation will fail with a `KeyError` at runtime.
Consider adding validation in the config or early in `eval_ruler`:

```python
@nested_dataclass(kw_only=True)
class RulerEvaluatorConfig(BaseEvaluatorConfig):
    parse_func: str = "default"
    match_type: str

    def __post_init__(self):
        valid_parse_funcs = {"default"}
        valid_match_types = {"all", "part"}
        if self.parse_func not in valid_parse_funcs:
            raise ValueError(f"parse_func must be one of {valid_parse_funcs}, got {self.parse_func}")
        if self.match_type not in valid_match_types:
            raise ValueError(f"match_type must be one of {valid_match_types}, got {self.match_type}")
```
72-76: Consider adding defensive validation for `generation` and `expected_answer` fields.
The test data confirms both fields are present in ruler-specific JSONL files (e.g., `./tests/data/score_rm_inputs/output.jsonl`). However, there is no upstream validation at the caller level, and `livecodebench.py` demonstrates that schema validation is used elsewhere in the codebase. Adding a defensive check, similar to how `livecodebench.py` uses `_preprocess_and_validate_file`, would prevent crashes if incorrect data is passed to the evaluator:

```diff
 for sample in tqdm(data):
+    if "generation" not in sample or "expected_answer" not in sample:
+        LOG.warning(f"Skipping sample missing required fields: {sample.get('id', 'unknown')}")
+        continue
     parse_result = parse_funcs[eval_config.parse_func](sample["generation"])
```

nemo_skills/utils.py (2)
37-41: Consider edge cases in path handling.
The current implementation has a few potential edge cases:
- If "/nemo_skills/" appears multiple times in the path, `split("nemo_skills")[1]` uses the first occurrence, which might not be the desired behavior.
- Windows paths use backslashes, but the function only checks for forward slashes.
- The `.replace(".py", "")` only removes `.py` extensions but not other Python-related extensions like `.pyx` or `.pyi`.

Consider using `Path` operations for more robust path handling:

```diff
 def get_logger_name(file):
+    file_path = Path(file)
+    file_str = file_path.as_posix()  # Convert to forward slashes
-    if "/nemo_skills/" in file:
-        return "nemo_skills" + file.split("nemo_skills")[1].replace("/", ".").replace(".py", "")
+    if "/nemo_skills/" in file_str:
+        parts = file_str.split("/nemo_skills/", 1)  # Split only once
+        return "nemo_skills." + parts[1].replace("/", ".").removesuffix(file_path.suffix)
     else:
-        return f"[external] {Path(file).stem}"
+        return f"[external] {file_path.stem}"
```
47-64: Consider rate-limiting the warning to avoid log spam.
The warning on lines 59-64 is logged for every sample that doesn't have the thinking end tag. If many samples are missing the end tag (e.g., when accidentally using a non-reasoning model with `remove_thinking=True`), this could spam the logs significantly.
Consider one of these approaches:
- Log the warning once per execution with a counter
- Use a rate-limited logger
- Add a summary at the end showing how many samples were affected

Example approach 1:

```diff
+_THINKING_END_WARNING_LOGGED = False
+
 def remove_thinking(sample: dict, generation_key: str = "generation", thinking_end: str = "</think>"):
+    global _THINKING_END_WARNING_LOGGED
     # not doing anything if generation isn't a string
     # TODO: should we be more explicit about this?
     if not isinstance(sample[generation_key], str):
         return

     sample[f"_{generation_key}_finished_thinking"] = thinking_end in sample[generation_key]
     if thinking_end in sample[generation_key]:
         sample[f"_full_{generation_key}"] = sample[generation_key]
         sample[generation_key] = sample[generation_key].split(thinking_end)[-1].strip()
     else:
         sample[f"_full_{generation_key}"] = sample[generation_key]
         sample[generation_key] = ""  # no end tag, so setting the generation to empty
-        LOG.warning(
-            "Thinking end tag `%s` not found in generation; setting generation to empty. "
-            "If this happens for every generation, you might have accidentally set ++remove_thinking=True for a "
-            "non-reasoning model or have incorrect end tag.",
-            thinking_end,
-        )
+        if not _THINKING_END_WARNING_LOGGED:
+            LOG.warning(
+                "Thinking end tag `%s` not found in generation; setting generation to empty. "
+                "If this happens for every generation, you might have accidentally set ++remove_thinking=True for a "
+                "non-reasoning model or have incorrect end tag. (Further warnings suppressed)",
+                thinking_end,
+            )
+            _THINKING_END_WARNING_LOGGED = True
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- `nemo_skills/dataset/mmlu-prox/__init__.py` (1 hunks)
- `nemo_skills/evaluation/evaluator/ruler.py` (3 hunks)
- `nemo_skills/inference/generate.py` (11 hunks)
- `nemo_skills/inference/model/parallel_thinking.py` (2 hunks)
- `nemo_skills/utils.py` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- `nemo_skills/dataset/mmlu-prox/__init__.py`
🧰 Additional context used
🧬 Code graph analysis (3)
nemo_skills/inference/model/parallel_thinking.py (1)
nemo_skills/utils.py (1)
remove_thinking(47-64)
nemo_skills/evaluation/evaluator/ruler.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
- `BaseEvaluatorConfig` (27-31)

nemo_skills/utils.py (2)
- `get_logger_name` (37-41)
- `nested_dataclass` (67-100)
nemo_skills/inference/generate.py (3)
nemo_skills/evaluation/evaluator/__init__.py (3)
- `evaluate` (112-125)
- `get_evaluator_class` (90-100)
- `supports_single_eval` (103-109)

nemo_skills/evaluation/evaluator/base.py (2)
- `supports_single_eval` (89-91)
- `eval_single` (74-87)

nemo_skills/utils.py (1)
remove_thinking(47-64)
🪛 Ruff (0.14.0)
nemo_skills/inference/generate.py
644-644: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (11)
nemo_skills/evaluation/evaluator/ruler.py (1)
58-58: LGTM!
The config instantiation correctly uses the new `RulerEvaluatorConfig` with dictionary unpacking.

nemo_skills/utils.py (1)
44-44: LGTM!
Module-level logger initialization follows standard Python logging patterns.
nemo_skills/inference/generate.py (8)
35-39: LGTM!
Evaluation imports are now centralized and unconditional, which aligns with the new evaluation-into-generation integration approach described in the PR.
180-189: LGTM!
The addition of `eval_type` and `eval_config` fields to `GenerateSolutionsConfig` properly consolidates evaluation configuration into the generation parameters, as described in the PR objectives. The warning comment about `remove_thinking` is helpful for preventing misuse.
337-343: Evaluator setup looks correct.
The conditional initialization based on `supports_single_eval()` properly distinguishes between per-datapoint (interleaved) and batch evaluation flows. Note that line 340 mutates `self.cfg.eval_config` to ensure it's a plain dict; this is likely intentional for compatibility with evaluator classes, but be aware that the original type information is lost.
452-455: LGTM!
The `run_batch_evaluation` method correctly configures the evaluator to consume the generated output file. Overriding any existing `input_file` in `eval_config` is the correct behavior here.
525-545: LGTM with a minor observation.
The `postprocess_single_output` method correctly handles renaming, stat removal, data merging, and thinking removal. The logic at lines 537-539 properly merges the original data point with the generated output, preserving output values when keys overlap.
Note: This method is declared `async` but contains no `await` statements. This is acceptable if it's for future extensibility or to maintain a consistent async API, but consider adding a comment explaining this if it's intentional.
589-595: LGTM!
The `evaluate_single_datapoint` method correctly times the evaluation and merges results into the data point. The metric name `interleaved_eval_single_time_s` clearly distinguishes it from batch evaluation timing.
609-613: LGTM!
The pipeline correctly sequences postprocessing before evaluation, which is important since the evaluator expects the postprocessed output format. The conditional check `if self.should_run_evaluation and self.evaluator` properly handles cases where evaluation is requested but only batch evaluation is supported.
717-737: Verify batch evaluation behavior when no data is processed.
When `len(data) == 0` (line 717), generation is skipped but the batch evaluation check on lines 735-736 is still executed. This could attempt to evaluate an empty or non-existent output file.
Please verify this is the intended behavior. If not, consider moving the batch evaluation inside the else block:

```diff
 if len(data) == 0:
     LOG.info("No data to process, skipping generation")
 else:
     data = self.preprocess_data(data)
     self.log_example_prompt(data)

     if self.cfg.dry_run:
         LOG.info("Exiting without running generation as dry_run flag is set.")
         return

     if not self.cfg.skip_filled:
         for output_path in [Path(self.cfg.output_file), Path(self.cfg.output_file + "-async")]:
             if output_path.exists():
                 output_path.unlink()

     self.wait_for_server()
     asyncio.run(self.async_loop(data))
-
-if self.should_run_evaluation and self.evaluator is None:
-    self.run_batch_evaluation()
+
+    if self.should_run_evaluation and self.evaluator is None:
+        self.run_batch_evaluation()
+
 self.postprocess()
```

nemo_skills/inference/model/parallel_thinking.py (1)
152-157: LGTM! Correctly updated remove_thinking calls.
The `thinking_begin` parameter has been properly removed from both `remove_thinking` calls, aligning with the updated function signature that now relies solely on `thinking_end` for processing.
Also applies to: 187-192
```python
for output, data_point in zip(prefilled_outputs, prefilled_data_points):
    await self.postprocess_single_output(output, data_point)

# evaluate single-data point if requested and evaluator supports that
if self.should_run_evaluation and self.evaluator:
    output = await self.evaluate_single_datapoint({**data_point, **output})
```
🛠️ Refactor suggestion | 🟠 Major
Add strict=True to zip() for safety.
The static analysis tool correctly flags that zip() on line 644 should use an explicit strict= parameter. Without it, if prefilled_outputs and prefilled_data_points have different lengths, the zip will silently truncate to the shorter length, potentially causing data loss.
```diff
-    for output, data_point in zip(prefilled_outputs, prefilled_data_points):
+    for output, data_point in zip(prefilled_outputs, prefilled_data_points, strict=True):
         await self.postprocess_single_output(output, data_point)
```

Note: `strict=True` requires Python 3.10+. If supporting earlier versions, add a length check before the loop.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
for output, data_point in zip(prefilled_outputs, prefilled_data_points, strict=True):
    await self.postprocess_single_output(output, data_point)

# evaluate single-data point if requested and evaluator supports that
if self.should_run_evaluation and self.evaluator:
    output = await self.evaluate_single_datapoint({**data_point, **output})
```
🧰 Tools
🪛 Ruff (0.14.0)
644-644: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
In nemo_skills/inference/generate.py around lines 644 to 649, the loop zipping
prefilled_outputs and prefilled_data_points should not silently truncate if
lengths differ; either add zip(..., strict=True) to enforce equal lengths
(requires Python 3.10+) or, if older Python must be supported, check
len(prefilled_outputs) == len(prefilled_data_points) before the loop and raise
or handle the mismatch; update the code to use one of these approaches so
mismatched lengths are detected rather than silently dropped.
```diff
-    remove_thinking: bool = True  # Remove thinking tokens from the solution key
-    thinking_begin: str = "<think>"
+    remove_thinking: bool = False
```
🧩 Analysis chain
Verify that the breaking default change is documented.
The default value for remove_thinking has changed from True to False. This is a breaking change that may affect existing configurations that relied on the previous default behavior.
Ensure that:
- This change is clearly documented in the PR description, migration guide, or release notes
- Existing configurations that depend on thinking removal explicitly set `remove_thinking=True`
Run the following script to find configurations that might be affected:
🏁 Script executed:
```shell
#!/bin/bash
# Description: Find ParallelThinkingConfig usage without explicit remove_thinking setting

# Search for ParallelThinkingConfig instantiations
rg -n -A 10 -B 2 'ParallelThinkingConfig\(' --type=py

# Search for YAML/config files that might configure parallel thinking
fd -e yaml -e yml -e json --exec rg -l 'parallel.*thinking|genselect|gensynthesis' {}
```

Length of output: 1043
I need to examine the configuration instantiation more closely and inspect the configuration files that might be affected.
🏁 Script executed:
```shell
#!/bin/bash
# Understand ParallelThinkingConfig instantiation and parameter filtering

# Get full context around line 91 in __init__.py
echo "=== Context around ParallelThinkingConfig instantiation ==="
rg -n -B 15 -A 5 'ParallelThinkingConfig\(' nemo_skills/inference/model/__init__.py | head -40

# Check the qwq.yaml config file
echo -e "\n=== Content of qwq.yaml ==="
cat ./recipes/openmathreasoning/configs/genselect_sdg/qwq.yaml 2>/dev/null || echo "File not found"

# Search for all config files in recipes that might use parallel_thinking or remove_thinking
echo -e "\n=== All YAML config files in recipes ==="
fd -e yaml -e yml ./recipes --exec rg -l 'remove_thinking|parallel' {} \; 2>/dev/null | head -20
```

Length of output: 5255
🏁 Script executed:
```shell
#!/bin/bash
# Understand ParallelThinkingConfig instantiation and parameter filtering

# Get full context around line 91 in __init__.py
echo "=== Context around ParallelThinkingConfig instantiation ==="
rg -n -B 15 -A 5 'ParallelThinkingConfig\(' nemo_skills/inference/model/__init__.py | head -50

# Check the qwq.yaml config file
echo -e "\n=== Content of qwq.yaml ==="
cat ./recipes/openmathreasoning/configs/genselect_sdg/qwq.yaml 2>/dev/null || echo "File not found"

# Search for valid_params definition
echo -e "\n=== Searching for valid_params definition ==="
rg -n -B 5 -A 5 'valid_params' nemo_skills/inference/model/__init__.py
```

Length of output: 5873
🏁 Script executed:
```shell
#!/bin/bash
# Search for explicit remove_thinking settings
echo "=== Searching for remove_thinking settings in code ==="
rg -n 'remove_thinking\s*=' --type=py -B 2 -A 2

echo -e "\n=== Searching for remove_thinking in YAML configs ==="
fd -e yaml -e yml -exec rg -l 'remove_thinking' {} \;

echo -e "\n=== Check ParallelThinkingConfig dataclass definition ==="
rg -n -B 2 -A 15 'class ParallelThinkingConfig' --type=py

echo -e "\n=== Search for where parallel_thinking_config is passed in code ==="
rg -n 'remove_thinking' nemo_skills/inference/model/parallel_thinking.py | head -20
```

Length of output: 7074
🏁 Script executed:
```shell
#!/bin/bash
# Search for explicit remove_thinking settings
echo "=== Searching for remove_thinking settings in code ==="
rg -n 'remove_thinking\s*=' --type=py -B 2 -A 2

echo -e "\n=== Searching for remove_thinking in YAML configs ==="
fd -e yaml -e yml -exec rg -l 'remove_thinking' {} \;

echo -e "\n=== Check ParallelThinkingConfig dataclass definition ==="
rg -n -B 2 -A 20 'class ParallelThinkingConfig' --type=py

echo -e "\n=== Look at actual usage of remove_thinking in parallel_thinking.py ==="
rg -n 'self.cfg.remove_thinking' nemo_skills/inference/model/parallel_thinking.py
```

Length of output: 7356
🏁 Script executed:
```shell
#!/bin/bash
# Verify the remove_thinking function behavior to confirm logic bug
echo "=== Remove thinking function implementation ==="
rg -n -A 25 'def remove_thinking' nemo_skills/utils.py

echo -e "\n=== Context around line 226 in parallel_thinking.py ==="
sed -n '220,235p' nemo_skills/inference/model/parallel_thinking.py
```

Length of output: 2153
Fix critical logic error on line 226 that filters out all solutions when remove_thinking=True.
After remove_thinking() is called, the thinking_end tag should never remain in the generation. The function either splits it out (line 55 in utils.py) or sets generation to empty (line 58). Therefore, the condition on line 226 will always be true, causing all solutions to be filtered when remove_thinking=True.
The intent appears to be filtering unfinished solutions. Use the `_{solution_key}_finished_thinking` flag set by `remove_thinking()` instead:

```python
# filtering unfinished solutions
if self.cfg.remove_thinking and not solution.get(f"_{self.cfg.solution_key}_finished_thinking", True):
    continue
```

Regarding the default value change on line 54: no breaking change impact. All 9 existing uses explicitly override with `++remove_thinking=True` via Hydra configuration.
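A simplified stand-in for the described `remove_thinking` behavior (an illustrative sketch based on this review's description, not the actual `nemo_skills/utils.py` code) shows why the flag, not the tag, must be checked after processing:

```python
def remove_thinking(sample, generation_key="generation", thinking_end="</think>"):
    # Simplified stand-in: record whether the end tag was found,
    # then strip everything up to and including it.
    text = sample[generation_key]
    if not isinstance(text, str):
        return
    sample[f"_{generation_key}_finished_thinking"] = thinking_end in text
    sample[f"_full_{generation_key}"] = text
    if thinking_end in text:
        sample[generation_key] = text.split(thinking_end)[-1].strip()
    else:
        sample[generation_key] = ""  # no end tag: treat as unfinished

finished = {"generation": "<think>steps</think> final answer"}
unfinished = {"generation": "<think>still thinking..."}
remove_thinking(finished)
remove_thinking(unfinished)

# In both cases the tag is gone afterwards, so a check like
# `thinking_end not in generation` is always True; only the flag
# distinguishes finished from unfinished solutions.
print(finished["_generation_finished_thinking"], repr(finished["generation"]))
print(unfinished["_generation_finished_thinking"], repr(unfinished["generation"]))
```

Checking the `_*_finished_thinking` flag therefore filters exactly the unfinished solutions, while checking for the tag would filter everything.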
```python
# filtering unfinished solutions
if self.cfg.remove_thinking and self.cfg.thinking_end not in solution[self.cfg.solution_key]:
    continue
```
Critical: Filtering logic will incorrectly remove all solutions.
After remove_thinking is called (lines 152-157 in generate_solutions and lines 187-192 in _load_solutions), the thinking_end marker is never present in the processed generation because:
- If `thinking_end` was present: `remove_thinking` splits on it and takes the text after
- If `thinking_end` was NOT present: `remove_thinking` sets generation to an empty string

Therefore, `thinking_end not in solution[self.cfg.solution_key]` will always be `True` after `remove_thinking` is applied, causing all solutions to be filtered out when `remove_thinking=True`.
To filter incomplete solutions correctly, check the flag set by remove_thinking:
```diff
 # Remove unfinished solutions
 filtered_solutions = []
 for solution in solutions:
-    # filtering unfinished solutions
-    if self.cfg.remove_thinking and self.cfg.thinking_end not in solution[self.cfg.solution_key]:
+    # Filter out solutions where thinking was incomplete (thinking_end not found)
+    if self.cfg.remove_thinking and not solution["output_dict"].get(f"_{self.cfg.solution_key}_finished_thinking", True):
         continue
     else:
         filtered_solutions.append(solution)
```

Alternatively, check if the generation is empty:

```diff
-    if self.cfg.remove_thinking and self.cfg.thinking_end not in solution[self.cfg.solution_key]:
+    if self.cfg.remove_thinking and not solution[self.cfg.solution_key]:
         continue
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In nemo_skills/inference/model/parallel_thinking.py around lines 225 to 227, the
filter condition uses "thinking_end not in solution[self.cfg.solution_key]"
which will always be true after remove_thinking has run (it either strips the
marker or leaves an empty string), so all solutions are dropped; instead, change
the logic to detect incomplete solutions using the flag set by remove_thinking
(e.g., a boolean returned or stored when stripping thinking) or simply check for
an empty generation string (e.g., skip if solution[self.cfg.solution_key] == ""
or length == 0), or, if you need to inspect the original text, use the
pre-processed/original generation field to test for the presence of thinking_end
before remove_thinking was applied.
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/inference/model/parallel_thinking.py (1)
370-398: Critical: Missing early return causes crash when all solutions are filtered out.
When `solutions` is empty (e.g., all filtered by `solution_length_cap`), lines 371-386 create a sensible default `output_dict`. However, execution continues to Step 2 (lines 388-398), which overwrites `output_dict` and attempts to run genselect/gensynthesis on an empty solutions list, causing crashes:
- GenSelect (line 326): `local_random.randint(0, -1)` raises `ValueError`
- GenSynthesis (line 349): `local_random.choice([])` raises `IndexError`
```diff
 if not solutions:
     output_dict = {
         self.cfg.solution_key: "",
         "solution_list": [],
         f"{self.cfg.mode}_comparison": "",
         f"{self.cfg.mode}_num_generated_tokens": 0,
         "total_solution_generated_tokens": total_num_generated_tokens,
         "num_generated_tokens": total_num_generated_tokens,  # No additional tokens for genselect
         "num_best_solution_generated_tokens": 0,
     }
     # Required by inference/generate.py
     output_dict["generation"] = ""
     if self.cfg.count_prompt_tokens:
         # The input doesn't make sense for such cases where there are no solutions
         output_dict["num_input_tokens"] = None
+    return output_dict

 # Step 2: Run GenSelect/GenSynthesis
```
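The two failure modes are easy to reproduce in isolation (a sketch of the stdlib behavior, not the module's code):

```python
import random

local_random = random.Random(0)
solutions = []  # every solution filtered out

# GenSelect path: drawing an index into an empty list
try:
    local_random.randint(0, len(solutions) - 1)  # randint(0, -1)
except ValueError as exc:
    print(f"genselect crash: {exc}")

# GenSynthesis path: choosing from an empty list
try:
    local_random.choice(solutions)
except IndexError as exc:
    print(f"gensynthesis crash: {exc}")
```

Both exceptions come directly from the `random` module's documented behavior on empty ranges and empty sequences, which is why the early return is needed before Step 2 runs.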
♻️ Duplicate comments (1)
nemo_skills/inference/generate.py (1)
647-652: Add `strict=True` to `zip()` for safety.
This issue has already been raised in a previous review. The `zip()` call should use `strict=True` to prevent silent data loss if the lists have mismatched lengths.
🧹 Nitpick comments (6)
tests/slurm-tests/run_all.sh (1)
3-14: Consider adding a usage comment for script arguments.
While the code is correct and aligns well with the PR's parameter consolidation, adding a brief comment at the top explaining the positional arguments would aid future maintainers. For example:

```diff
 #!/bin/bash
+# Usage: ./run_all.sh <CLUSTER> [RUN_NAME]
+#   CLUSTER: target cluster (required)
+#   RUN_NAME: test run identifier; defaults to YYYY-MM-DD (optional)
+
 CLUSTER=$1
 RUN_NAME=${2:-$(date +%Y-%m-%d)}
```

nemo_skills/inference/eval/swebench.py (1)
150-152: Consider the design of rejected configuration fields.
These fields are added to the config but immediately rejected in `__init__` (lines 176-179). This creates a potentially confusing API where users might attempt to set `eval_type` only to receive an error.
If these fields are present solely for interface consistency with the base `GenerationTask` class, consider adding a comment explaining this design decision. Alternatively, if they serve no purpose, they could be omitted entirely from this configuration class.
69-83: Atomic write implementation correctly addresses the previous data loss concern.
The new implementation using a temporary file and `os.replace()` prevents data corruption if the process is interrupted. This resolves the critical issue flagged in the previous review.
However, there are two optional improvements to consider:
Processing loop placement: Lines 72-77 process data inside the file reading context manager, but the file is no longer needed after line 71. Moving the loop outside would be clearer.
Temp file cleanup: If an exception occurs during write or replace (lines 79-83), the temp file will be left behind. Adding a try-finally block would ensure cleanup.
Optional refactor to address both points:
```diff
 jsonl_file = cfg.input_file

 with open(jsonl_file, "rt", encoding="utf-8") as fin:
     data = [json.loads(line) for line in fin]

-    for sample in tqdm(data):
-        parse_result = parse_funcs[eval_config.parse_func](sample["generation"])
-        sample["is_correct"] = match_type_funcs[eval_config.match_type](
-            sample["generation"], sample["expected_answer"]
-        )
-        sample["predicted_answer"] = parse_result
+for sample in tqdm(data):
+    parse_result = parse_funcs[eval_config.parse_func](sample["generation"])
+    sample["is_correct"] = match_type_funcs[eval_config.match_type](
+        sample["generation"], sample["expected_answer"]
+    )
+    sample["predicted_answer"] = parse_result
+
+temp_file = jsonl_file + "-tmp"
+try:
-    with open(jsonl_file + "-tmp", "wt", encoding="utf-8") as fout:
+    with open(temp_file, "wt", encoding="utf-8") as fout:
         for sample in data:
             fout.write(json.dumps(sample) + "\n")
-
-os.replace(jsonl_file + "-tmp", jsonl_file)
+    os.replace(temp_file, jsonl_file)
+except Exception:
+    if os.path.exists(temp_file):
+        os.remove(temp_file)
+    raise
```
168-171: Environment pollution TODO is a valid concern.
Setting `os.environ["NEMO_SKILLS_PRIVILEGED_DOCKER"]` at line 170 persists across function calls and could affect other benchmarks processed in the same run. Consider scoping this to the specific swe-bench job or resetting it after use.
nemo_skills/inference/generate.py (1)
528-548: Consider making postprocess_single_output synchronous.
The method is declared async but performs no await operations. Unless async is needed for future extensibility, consider making it a regular synchronous method to simplify the call sites.
Apply this diff if you want to make it synchronous:
```diff
-    async def postprocess_single_output(self, output, original_data_point):
+    def postprocess_single_output(self, output, original_data_point):
```

Then update call sites at lines 612, 648 to remove `await`:

```diff
-                await self.postprocess_single_output(output, data_point)
+                self.postprocess_single_output(output, data_point)
```
74-74: Remove unused configuration field.
The `filter_incomplete_solutions` field is no longer used anywhere in the code. According to the enriched summary, the filtering logic that used this field was intentionally removed. The config field should be cleaned up.

```diff
-    filter_incomplete_solutions: bool = True  # Filter out incomplete solutions
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- `nemo_skills/evaluation/evaluator/ruler.py` (3 hunks)
- `nemo_skills/inference/eval/swebench.py` (4 hunks)
- `nemo_skills/inference/generate.py` (11 hunks)
- `nemo_skills/inference/model/parallel_thinking.py` (1 hunks)
- `nemo_skills/pipeline/generate.py` (0 hunks)
- `nemo_skills/pipeline/utils/eval.py` (5 hunks)
- `nemo_skills/utils.py` (1 hunks)
- `tests/slurm-tests/run_all.sh` (1 hunks)
💤 Files with no reviewable changes (1)
- nemo_skills/pipeline/generate.py
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/utils.py
🧰 Additional context used
🧬 Code graph analysis (5)
nemo_skills/inference/model/parallel_thinking.py (1)
nemo_skills/utils.py (1)
- remove_thinking (48-65)
nemo_skills/inference/generate.py (3)
nemo_skills/evaluation/evaluator/__init__.py (3)
- evaluate (112-125)
- get_evaluator_class (90-100)
- supports_single_eval (103-109)

nemo_skills/evaluation/evaluator/base.py (2)
- supports_single_eval (89-91)
- eval_single (74-87)

nemo_skills/utils.py (1)
- remove_thinking (48-65)
nemo_skills/inference/eval/swebench.py (2)
nemo_skills/utils.py (4)
- get_help_message (340-360)
- get_logger_name (38-42)
- nested_dataclass (68-101)
- setup_logging (104-139)

nemo_skills/inference/generate.py (1)
- evaluate_single_datapoint (592-598)
nemo_skills/evaluation/evaluator/ruler.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
- BaseEvaluatorConfig (27-31)

nemo_skills/utils.py (2)
- get_logger_name (38-42)
- nested_dataclass (68-101)
nemo_skills/pipeline/utils/eval.py (1)
nemo_skills/inference/generate.py (1)
- get_generation_default_args (247-255)
🪛 Ruff (0.14.0)
nemo_skills/inference/generate.py
647-647: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
nemo_skills/inference/eval/swebench.py
177-179: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (24)
tests/slurm-tests/run_all.sh (2)
3-4: Good parameter consolidation with sensible default. The refactor to make the date/run identifier a configurable second argument is clean and backward compatible. Callers can now override the date if needed (e.g., for re-running failed jobs with the same date), while defaulting to today's date for common use.
6-14: Consistent application of RUN_NAME across all test invocations. All five test jobs properly use `$RUN_NAME` in both `--workspace` and `--expname_prefix` paths, ensuring a coherent naming scheme across parallel jobs.

nemo_skills/inference/eval/swebench.py (3)
33-38: LGTM: Clean import organization. The multi-line import format improves readability with no functional change.
176-179: LGTM: Clear validation prevents misconfiguration. The guard correctly prevents users from attempting to configure evaluation via `eval_type`, since SWE-bench handles evaluation automatically through its harness. The error message clearly guides users to the correct behavior.
196-198: LGTM: Method rename aligns with standardized evaluation interface. The rename from `apply_evaluation_hook` to `evaluate_single_datapoint` matches the base class pattern (as seen in generate.py). The pass-through implementation is appropriate here since SWE-bench performs evaluation inline during `process_single_datapoint` (lines 495-564).

nemo_skills/evaluation/evaluator/ruler.py (3)
17-17: LGTM! Imports align with the refactoring. The new imports (`os`, `BaseEvaluatorConfig`) support the atomic write implementation and the updated class hierarchy.

Also applies to: 22-23
29-29: LGTM! Class hierarchy updated correctly. Inheriting from `BaseEvaluatorConfig` provides access to `input_file`, `data_dir`, and `split` fields, which aligns with the PR's goal of fusing top-level evaluation config with inner eval_config.
59-59: LGTM! Config construction updated per PR design. The change from `**cfg.eval_config` to `**cfg` reflects the flattened configuration structure where evaluation parameters are now at the top level.

nemo_skills/pipeline/utils/eval.py (4)
129-133: Backward compatibility for deprecated EVAL_ARGS looks good. The logic correctly prepends legacy EVAL_ARGS to generation_args while injecting the new eval_config.split format. This provides a clean migration path.

223-223: eval_config.data_dir injection is correct. The conditional addition of `eval_config.data_dir` to generation_args aligns with the PR's goal to consolidate evaluation configuration into generation arguments.

243-243: eval_config.data_dir injection is correct. Consistent with line 223, this properly injects `eval_config.data_dir` into generation_args for single-benchmark paths.
424-428: Argument concatenation order is correct. The ordering (generation defaults → benchmark args → job args) properly supports override semantics where job-specific arguments can override benchmark and default settings. This aligns well with Hydra's configuration override behavior.
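The last-wins semantics of that ordering can be shown with a toy merge function (this is an illustration of the override principle, not how Hydra actually parses its command line):

```python
def merge_override_args(*arg_groups):
    """Merge lists of 'key=value' arguments; later groups override earlier ones (last wins)."""
    merged = {}
    for group in arg_groups:
        for arg in group:
            key, _, value = arg.partition("=")
            merged[key] = value
    return [f"{k}={v}" for k, v in merged.items()]


defaults = ["++temperature=0.0", "++eval_type=math"]
benchmark_args = ["++temperature=0.6"]
job_args = ["++eval_type=lean4-proof"]

# job args win over benchmark args, which win over defaults
print(merge_override_args(defaults, benchmark_args, job_args))
# → ['++temperature=0.6', '++eval_type=lean4-proof']
```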
nemo_skills/inference/generate.py (11)
35-39: LGTM! The evaluation imports are correctly added to support the new integrated evaluation flow.

87-87: LGTM! Updated docstring appropriately describes the config class.

180-182: LGTM! The comment appropriately warns users about the remove_thinking behavior change, which is critical for preventing empty generations on non-reasoning models.

187-189: LGTM! The evaluation configuration fields are properly integrated into the generation config, enabling opt-in evaluation.

337-343: LGTM! The evaluator initialization logic correctly distinguishes between per-datapoint and batch evaluation modes. The code properly checks if the evaluator supports single evaluation before instantiating it.
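The gating can be sketched as follows. The class and method names mirror those referenced in the code graph above (`supports_single_eval`, `eval_single`), but the bodies and the `build_single_evaluator` helper are illustrative, not the real implementation:

```python
class BaseEvaluator:
    def eval_single(self, data_point):
        # Subclasses override this to opt into per-datapoint evaluation
        raise NotImplementedError

    @classmethod
    def supports_single_eval(cls):
        # An evaluator supports single eval iff it overrode eval_single
        return cls.eval_single is not BaseEvaluator.eval_single


class BatchOnlyEvaluator(BaseEvaluator):
    pass  # no eval_single override -> batch mode only


class MathEvaluator(BaseEvaluator):
    def eval_single(self, data_point):
        return {"is_correct": data_point["generation"] == data_point["expected_answer"]}


def build_single_evaluator(evaluator_cls, eval_type):
    """Instantiate an evaluator only when per-datapoint evaluation is possible."""
    if eval_type is None or not evaluator_cls.supports_single_eval():
        return None
    return evaluator_cls()
```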
401-402: LGTM! The comment clarifies the intent of the inference override config pattern.

455-458: LGTM! The batch evaluation method is straightforward and correctly configured. It properly uses the output file as input for evaluation.

524-526: LGTM! The simplified dump_outputs method now focuses solely on writing, with preprocessing moved to postprocess_single_output.

592-598: LGTM! The per-datapoint evaluation method correctly times and executes single evaluations, providing performance insights.

612-616: LGTM! The integration of postprocessing and per-datapoint evaluation is correct. Postprocessing runs before evaluation as intended, and the conditional gating ensures evaluation only runs when appropriate.
720-740: LGTM! The refactored control flow correctly handles the two evaluation modes:
- Per-datapoint evaluation when evaluator is available (interleaved with generation)
- Batch evaluation when eval_type is set but evaluator is None (after generation completes)
The logic is clear and aligns with the PR objectives.
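A toy version of these two modes, under the assumption that the real loop is async as in generate.py (the function and parameter names here are illustrative):

```python
import asyncio


async def generate_and_evaluate(data, single_evaluator, eval_type, run_batch_evaluation):
    """Per-datapoint eval when an evaluator exists; batch eval afterwards otherwise."""
    outputs = []
    for data_point in data:
        output = {"generation": f"answer-{data_point}"}
        if single_evaluator is not None:
            # mode 1: evaluation interleaved with generation
            output.update(single_evaluator(output))
        outputs.append(output)
    if eval_type is not None and single_evaluator is None:
        # mode 2: batch evaluation over the full output after generation completes
        run_batch_evaluation(outputs)
    return outputs
```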
nemo_skills/inference/model/parallel_thinking.py (1)
153-158: LGTM: remove_thinking integration is correct. The `remove_thinking` calls properly pass `generation_key` and `thinking_end` parameters, and correctly omit the removed `thinking_begin` parameter. The function modifies the dict in-place to set the `_{generation_key}_finished_thinking` flag and preserve the full generation in `_full_{generation_key}`, which aligns with the PR's changes to thinking-token handling.

Also applies to: 195-200
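An illustrative re-implementation of that in-place behavior (the real function lives in nemo_skills/utils.py; the default values and edge-case handling here are assumptions):

```python
def remove_thinking(data_point, generation_key="generation", thinking_end="</think>"):
    """Strip the thinking prefix in place, preserving the original text and a finished flag."""
    full = data_point[generation_key]
    # preserve the untruncated generation so downstream (e.g. judge) jobs can still see it
    data_point[f"_full_{generation_key}"] = full
    finished = thinking_end in full
    data_point[f"_{generation_key}_finished_thinking"] = finished
    if finished:
        # keep only the text after the closing thinking token
        data_point[generation_key] = full.split(thinking_end, 1)[1].lstrip()
```

On a non-reasoning model's output (no `thinking_end` present), the generation is left untouched rather than emptied, which matches the behavior-change warning noted at lines 180-182.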
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Actionable comments posted: 0
🧹 Nitpick comments (2)
nemo_skills/inference/eval/swebench.py (2)
150-152: Clarify or remove unused eval_type field. The `eval_type` field definition suggests it's configurable, but the validation in `__init__` rejects any non-None value. This creates API confusion for users who might attempt to configure evaluation.

Consider one of the following approaches:

- Remove the `eval_type` and `eval_config` fields entirely from `SweBenchGenerationConfig` if they're not applicable to SWE-bench.
- Add a comment in the field definition clarifying that these fields exist for interface consistency but are not used for SWE-bench.
- Move these fields to a base config class and have SWE-bench override/validate them appropriately.

Apply this diff to remove the unused fields:

```diff
- # Evaluation setup if requested. If eval_type is set to None, evaluation is skipped
- eval_type: str | None = None  # "lean4-proof", "math", etc.
- eval_config: dict = field(default_factory=dict)  # Config for the evaluator
```

And remove the corresponding validation:

```diff
- if self.cfg.eval_type is not None:
-     raise ValueError(
-         "SWE-bench generation task does not support eval_type parameter. Evaluation is done automatically."
-     )
```

Also applies to: 176-179
181-182: Remove unused evaluation flags. The `should_run_evaluation` and `evaluator` attributes are initialized but never used in the code. Since SWE-bench performs evaluation inline during generation (in `process_single_datapoint`), these flags are unnecessary.

Apply this diff to remove the unused attributes:

```diff
- self.should_run_evaluation = False
- self.evaluator = None
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- docs/basics/index.md (1 hunks)
- nemo_skills/dataset/simpleqa/__init__.py (1 hunks)
- nemo_skills/evaluation/evaluator/mcq.py (2 hunks)
- nemo_skills/inference/eval/swebench.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- nemo_skills/dataset/simpleqa/__init__.py
- nemo_skills/evaluation/evaluator/mcq.py
- docs/basics/index.md
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/inference/eval/swebench.py (2)
nemo_skills/utils.py (4)
- get_help_message (340-360)
- get_logger_name (38-42)
- nested_dataclass (68-101)
- setup_logging (104-139)

nemo_skills/inference/generate.py (1)
- evaluate_single_datapoint (592-598)
🪛 Ruff (0.14.1)
nemo_skills/inference/eval/swebench.py
177-179: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (2)
nemo_skills/inference/eval/swebench.py (2)
33-38: LGTM! The import consolidation improves readability and follows Python conventions.
199-201: LGTM! The method rename to `evaluate_single_datapoint` aligns with the base class interface. The no-op implementation is appropriate since SWE-bench performs evaluation inline during `process_single_datapoint`.
…ions when remove_thinking=True (#958) Signed-off-by: Igor Gitman <igitman@nvidia.com>
…ions when remove_thinking=True (#958) Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
- `batch_evaluate_results` (#840)
- `++eval_type=no-op` (now same behavior is achieved by simply not specifying eval_type)
- `postprocess_single_output` that has logic previously in dump_outputs that renames/removes certain fields. This is now called before interleaved evaluation, resolving "eval_single / eval_full should support a few more features like pre/postprocessing and remove-thinking" #841
- `_full_{generation_key}` which allows not overriding _full_generation in judge jobs

Summary by CodeRabbit
- New Features
- Documentation
- Refactor
- Behavior change