Signed-off-by: Igor Gitman <igitman@nvidia.com>
**Walkthrough**

Introduces a patch-based setup for IFBench in the Dockerfile, adds NLTK and spaCy resource downloads, and modifies the IFBench instructions to skip the model download. Updates the IFBench and IFEval evaluators to use per-file temporary output directories for metrics, read results from there, and clean up the directories after processing.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    autonumber
    participant Eval as Evaluator (IFBench/IFEval)
    participant FS as Filesystem
    participant Tool as IFBench/IFEval CLI
    Eval->>FS: Resolve jsonl_path
    Eval->>FS: Create {stem}_metrics_tmp (output_dir)
    Eval->>Tool: Run evaluation with --output_dir=output_dir
    Tool-->>FS: Write eval_results_loose.jsonl, eval_results_strict.jsonl
    Eval->>FS: Read eval results from output_dir
    Eval->>FS: Write augmented samples (loose/strict)
    Eval->>FS: Remove output_dir (shutil.rmtree)
    Note over Eval,Tool: Runtime model download skipped (spaCy preinstalled in image)
```
**Estimated code review effort**: 🎯 3 (Moderate) | ⏱️ ~25 minutes
**Pre-merge checks and finishing touches**

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 0
🧹 Nitpick comments (5)
nemo_skills/evaluation/evaluator/ifeval.py (2)
**37-37**: Consider future hardening of the subprocess call.

The `shell=True` parameter is flagged by static analysis (S602). While the risk is mitigated here since `jsonl_file` comes from controlled configuration rather than untrusted input, consider refactoring to use a list of arguments without shell expansion, for defense in depth. Example refactor:

```diff
-    cmd = (
-        "cd /opt/benchmarks/google-research && python -m instruction_following_eval.evaluation_main "
-        f"--input_data={jsonl_file} "
-        f"--input_response_data={jsonl_file} "
-        f"--output_dir={output_dir} "
-    )
-    subprocess.run(cmd, shell=True, check=True)
+    subprocess.run(
+        [
+            "python", "-m", "instruction_following_eval.evaluation_main",
+            f"--input_data={jsonl_file}",
+            f"--input_response_data={jsonl_file}",
+            f"--output_dir={output_dir}",
+        ],
+        cwd="/opt/benchmarks/google-research",
+        check=True,
+    )
```
**44-44**: Add `strict=True` to `zip()` calls for data integrity.

Both `zip()` calls on lines 44 and 49 lack an explicit `strict=` parameter. If the evaluation produces a different number of results than input samples (e.g., due to partial failure or bugs), the mismatch will be silently truncated rather than raising an error. Apply this diff:

```diff
-    for sample, eval_result in zip(samples, eval_results):
+    for sample, eval_result in zip(samples, eval_results, strict=True):
         sample["loose_eval"] = eval_result

     with open(output_dir / "eval_results_strict.jsonl", "rt", encoding="utf-8") as f:
         eval_results = [json.loads(line) for line in f]

-    for sample, eval_result in zip(samples, eval_results):
+    for sample, eval_result in zip(samples, eval_results, strict=True):
         sample["strict_eval"] = eval_result
```

Also applies to: 49-49
nemo_skills/evaluation/evaluator/ifbench.py (3)
**37-37**: Consider future hardening of the subprocess call.

Similar to the change in `ifeval.py`, the `shell=True` parameter is flagged by static analysis (S602). While acceptable here with controlled input, consider refactoring to use argument lists for consistency and defense in depth. Example refactor:

```diff
-    cmd = (
-        "cd /opt/benchmarks/IFBench && python -m run_eval "
-        f"--input_data={jsonl_file} "
-        f"--input_response_data={jsonl_file} "
-        f"--output_dir={output_dir} "
-    )
-    subprocess.run(cmd, shell=True, check=True)
+    subprocess.run(
+        [
+            "python", "-m", "run_eval",
+            f"--input_data={jsonl_file}",
+            f"--input_response_data={jsonl_file}",
+            f"--output_dir={output_dir}",
+        ],
+        cwd="/opt/benchmarks/IFBench",
+        check=True,
+    )
```
**44-44**: Add `strict=True` to `zip()` calls for data integrity.

Both `zip()` calls lack an explicit `strict=` parameter. If evaluation results and samples have mismatched lengths, the issue will be silently truncated. This matches the same concern in `ifeval.py`. Apply this diff:

```diff
-    for sample, eval_result in zip(samples, eval_results):
+    for sample, eval_result in zip(samples, eval_results, strict=True):
         sample["loose_eval"] = eval_result

     with open(output_dir / "eval_results_strict.jsonl", "rt", encoding="utf-8") as f:
         eval_results = [json.loads(line) for line in f]

-    for sample, eval_result in zip(samples, eval_results):
+    for sample, eval_result in zip(samples, eval_results, strict=True):
         sample["strict_eval"] = eval_result
```

Also applies to: 49-49
**26-57**: Consider extracting the common evaluation pattern.

The `eval_ifbench` function shares significant structure with `eval_if` in `ifeval.py` (temp directory creation, subprocess call, result fusion, cleanup). Consider extracting a common helper function to reduce duplication. Example structure:

```python
def _run_evaluation(
    jsonl_file: str,
    benchmark_dir: str,
    module_name: str,
) -> None:
    """Common evaluation pattern for IF-style benchmarks."""
    jsonl_path = Path(jsonl_file).resolve()
    output_dir = jsonl_path.parent / f"{jsonl_path.stem}_metrics_tmp"
    output_dir.mkdir(parents=True, exist_ok=True)
    # ... common logic ...
    shutil.rmtree(output_dir)
```

This is a minor maintainability improvement and can be deferred.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `dockerfiles/Dockerfile.nemo-skills` (1 hunks)
- `dockerfiles/ifbench.patch` (1 hunks)
- `nemo_skills/evaluation/evaluator/ifbench.py` (3 hunks)
- `nemo_skills/evaluation/evaluator/ifeval.py` (3 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
nemo_skills/evaluation/evaluator/ifeval.py
37-37: subprocess call with shell=True identified, security issue
(S602)
44-44: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
nemo_skills/evaluation/evaluator/ifbench.py
37-37: subprocess call with shell=True identified, security issue
(S602)
44-44: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (7)
nemo_skills/evaluation/evaluator/ifeval.py (2)
**28-30**: LGTM! Clean per-file output isolation.

The temporary directory approach properly isolates evaluation outputs for each input file, preventing cross-file contamination. The naming scheme `{stem}_metrics_tmp` is clear and deterministic.

Also applies to: 35-35
**56-57**: LGTM! Directory cleanup is robust.

Replacing individual file deletions with `shutil.rmtree(output_dir)` is cleaner and ensures all temporary outputs are removed, including any unexpected files the evaluation might generate.

dockerfiles/Dockerfile.nemo-skills (1)
**48-54**: LGTM! Patch-based setup with consolidated resource downloads.

The workflow is correct:

- Apply the patch to skip the problematic inline download
- Install all required dependencies
- Download NLTK and spaCy resources separately
This approach properly addresses the setup issues while ensuring all necessary resources are available.
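A rough sketch of what such a Dockerfile section might look like. This is an assumption-laden illustration, not the actual file contents: the paths, the patch name, the presence of a `requirements.txt`, and the exact NLTK/spaCy resource names (`punkt`, `en_core_web_sm`) are all hypothetical.

```dockerfile
# Sketch only: actual paths, patch name, and resource names may differ.
COPY dockerfiles/ifbench.patch /opt/benchmarks/
RUN cd /opt/benchmarks/IFBench && \
    git apply /opt/benchmarks/ifbench.patch && \
    pip install -r requirements.txt
# Download NLTK and spaCy resources at build time instead of at runtime.
RUN python -m nltk.downloader punkt && \
    python -m spacy download en_core_web_sm
```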
dockerfiles/ifbench.patch (2)
**9-12**: LGTM! Download skip aligns with the Dockerfile setup.

Skipping the inline model download and assuming it's pre-downloaded is correct. The Dockerfile handles the download separately at build time (line 54 in Dockerfile.nemo-skills), ensuring the resource is available.
**20-21**: LGTM! Defensive guards prevent IndexError on empty strings.

The added guards correctly handle edge cases where sentences become empty after punctuation removal. Without these checks, accessing `stripped[-1]` (line 22) or `stripped[0]` (line 31) would raise an IndexError. This improves the checker's robustness.

Also applies to: 28-30
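The guard pattern being praised can be illustrated with a minimal standalone sketch (hypothetical function, not the actual IFBench checker code):

```python
def first_and_last_chars(sentence: str) -> "tuple[str, str] | None":
    """Return the first and last characters of a sentence after stripping
    punctuation, guarding against strings that become empty."""
    stripped = sentence.strip(".,!?;: ")
    if not stripped:  # guard: avoids IndexError on stripped[0] / stripped[-1]
        return None
    return stripped[0], stripped[-1]

print(first_and_last_chars("Hello, world!"))  # ('H', 'd')
print(first_and_last_chars("!!!"))            # None
```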
nemo_skills/evaluation/evaluator/ifbench.py (2)
**28-30**: LGTM! Consistent per-file output isolation.

The temporary directory approach matches the pattern in `ifeval.py`, ensuring consistent behavior across evaluators. The naming scheme and directory creation are correct.

Also applies to: 35-35
**56-57**: LGTM! Directory cleanup is robust.

The cleanup approach is correct and consistent with `ifeval.py`.
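Why `shutil.rmtree` is the safer cleanup can be shown with a small standalone sketch (file names here are hypothetical):

```python
import shutil
import tempfile
from pathlib import Path

output_dir = Path(tempfile.mkdtemp()) / "metrics_tmp"
output_dir.mkdir()

# The expected outputs, plus an unexpected extra file the tool might emit.
(output_dir / "eval_results_loose.jsonl").write_text("{}\n")
(output_dir / "eval_results_strict.jsonl").write_text("{}\n")
(output_dir / "debug.log").write_text("leftover\n")  # hypothetical extra output

# Deleting only the known files would leave debug.log (and the directory)
# behind; rmtree removes the whole directory regardless of its contents.
shutil.rmtree(output_dir)
assert not output_dir.exists()
```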
Signed-off-by: dgitman <dgitman@nvidia.com>
**Summary by CodeRabbit**

- New Features
- Bug Fixes
- Refactor
- Chores