
Fixing ifbench#921

Merged
Kipok merged 3 commits into main from igitman/ifbench-fix
Oct 9, 2025

Conversation

Kipok (Collaborator) commented Oct 9, 2025

Summary by CodeRabbit

  • New Features

    • Automatic download of required NLP resources during build for smoother out-of-the-box evaluations.
  • Bug Fixes

    • Improved emoji and punctuation handling in instruction checks to avoid false positives/negatives.
    • More robust handling of empty sentences and end-of-file newlines.
  • Refactor

    • Evaluations now write to per-file temporary directories, simplifying result management and cleanup.
  • Chores

    • Streamlined build steps and applied patch workflow to ensure consistent setup.
    • Cleanup now removes entire temporary metric directories, preventing leftover files.

Kipok added 3 commits on October 9, 2025 at 10:02, each with Signed-off-by: Igor Gitman <igitman@nvidia.com>
coderabbitai bot (Contributor) commented Oct 9, 2025

Walkthrough

Introduces a patch-based setup for IFBench in the Dockerfile, adds NLTK and spaCy resource downloads, and modifies IFBench instructions to skip model download. Updates IFBench and IFEval evaluators to use per-file temporary output directories for metrics, read results from there, and clean up directories after processing.

Changes

  • IFBench setup and patching (dockerfiles/Dockerfile.nemo-skills, dockerfiles/ifbench.patch): Replaces the inline IFBench dependency install with a consolidated post-patch install, and downloads NLTK data and the spaCy model during the Docker build. Patches IFBench to skip the spaCy model download at runtime, hardens EmojiSentenceChecker, and adds minor newline handling in NoWhitespaceChecker.
  • Evaluators temporary metrics dir (nemo_skills/evaluation/evaluator/ifbench.py, nemo_skills/evaluation/evaluator/ifeval.py): For each input JSONL, creates a sibling {stem}_metrics_tmp directory for eval outputs, runs the evaluation targeting this directory, reads loose/strict results from it, augments the samples, and removes the temp directory via shutil.rmtree. Adds shutil imports and path-resolution adjustments.
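The per-file naming scheme described above can be sketched as a small helper (the function name is hypothetical, for illustration only):

```python
from pathlib import Path


def metrics_tmp_dir(jsonl_file: str) -> Path:
    """Sibling {stem}_metrics_tmp directory for a given input JSONL,
    e.g. results/preds.jsonl -> results/preds_metrics_tmp."""
    jsonl_path = Path(jsonl_file).resolve()
    return jsonl_path.parent / f"{jsonl_path.stem}_metrics_tmp"
```

Because the directory name is derived deterministically from the input file's stem, two evaluations over different JSONL files can never collide on the same metrics directory.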

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Eval as Evaluator (IFBench/IFEval)
  participant FS as Filesystem
  participant Tool as IFBench/IFEval CLI

  Eval->>FS: Resolve jsonl_path
  Eval->>FS: Create {stem}_metrics_tmp (output_dir)
  Eval->>Tool: Run evaluation with --output_dir=output_dir
  Tool-->>FS: Write eval_results_loose.jsonl, eval_results_strict.jsonl
  Eval->>FS: Read eval results from output_dir
  Eval->>FS: Write augmented samples (loose/strict)
  Eval->>FS: Remove output_dir (shutil.rmtree)
  Note over Eval,Tool: Runtime model download skipped (spaCy preinstalled in image)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I patched my burrow with careful hops,
Packed NLTK seeds and spaCy crops.
I stash my crumbs in temp-made nooks,
Read the scores, then clean the books.
Metrics vanish—poof!—no trace to keep;
A tidy warren makes reviews a leap. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, below the required 80.00% threshold. Run @coderabbitai generate docstrings to improve coverage.
  • Title Check (❓ Inconclusive): The title "Fixing ifbench" relates to the overall scope of the changeset but is too generic to convey the specific updates, such as the new patch workflow, consolidated dependency installation, and temporary output directory handling introduced for IFBench. Consider a more descriptive title, for example "Add patch workflow, consolidated installation, and temporary output directory for IFBench".
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
nemo_skills/evaluation/evaluator/ifeval.py (2)

37-37: Consider future hardening of subprocess call.

The shell=True parameter is flagged by static analysis (S602). While the risk is mitigated here since jsonl_file comes from controlled configuration rather than untrusted input, consider refactoring to use a list of arguments without shell expansion for defense in depth.

Example refactor:

-        cmd = (
-            "cd /opt/benchmarks/google-research && python -m instruction_following_eval.evaluation_main "
-            f"--input_data={jsonl_file} "
-            f"--input_response_data={jsonl_file} "
-            f"--output_dir={output_dir} "
-        )
-        subprocess.run(cmd, shell=True, check=True)
+        subprocess.run(
+            [
+                "python", "-m", "instruction_following_eval.evaluation_main",
+                f"--input_data={jsonl_file}",
+                f"--input_response_data={jsonl_file}",
+                f"--output_dir={output_dir}",
+            ],
+            cwd="/opt/benchmarks/google-research",
+            check=True,
+        )
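The list-argument form suggested in the diff avoids shell interpretation entirely; a minimal standalone demonstration, with a harmless command standing in for the evaluator and a hypothetical hostile-looking filename:

```python
import subprocess
import sys

# Arguments passed as a list go to the OS directly, so characters like
# ';' in a path are treated as literal text, never as shell syntax.
tricky_path = "preds;echo pwned.jsonl"  # hypothetical filename, shell metacharacter included
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", tricky_path],
    capture_output=True,
    text=True,
    check=True,
)
assert result.stdout.strip() == tricky_path
```

With shell=True and string interpolation, the same filename would have been split at the `;` and executed as two commands.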

44-44: Add strict=True to zip() calls for data integrity.

Both zip() calls on lines 44 and 49 lack explicit strict= parameters. If the evaluation produces a different number of results than input samples (e.g., due to partial failure or bugs), the mismatch will be silently truncated rather than raising an error.

Apply this diff:

-        for sample, eval_result in zip(samples, eval_results):
+        for sample, eval_result in zip(samples, eval_results, strict=True):
             sample["loose_eval"] = eval_result
 
         with open(output_dir / "eval_results_strict.jsonl", "rt", encoding="utf-8") as f:
             eval_results = [json.loads(line) for line in f]
-        for sample, eval_result in zip(samples, eval_results):
+        for sample, eval_result in zip(samples, eval_results, strict=True):
             sample["strict_eval"] = eval_result

Also applies to: 49-49

nemo_skills/evaluation/evaluator/ifbench.py (3)

37-37: Consider future hardening of subprocess call.

Similar to the change in ifeval.py, the shell=True parameter is flagged by static analysis (S602). While acceptable here with controlled input, consider refactoring to use argument lists for consistency and defense in depth.

Example refactor:

-        cmd = (
-            "cd /opt/benchmarks/IFBench && python -m run_eval "
-            f"--input_data={jsonl_file} "
-            f"--input_response_data={jsonl_file} "
-            f"--output_dir={output_dir} "
-        )
-        subprocess.run(cmd, shell=True, check=True)
+        subprocess.run(
+            [
+                "python", "-m", "run_eval",
+                f"--input_data={jsonl_file}",
+                f"--input_response_data={jsonl_file}",
+                f"--output_dir={output_dir}",
+            ],
+            cwd="/opt/benchmarks/IFBench",
+            check=True,
+        )

44-44: Add strict=True to zip() calls for data integrity.

Both zip() calls lack explicit strict= parameters. If evaluation results and samples have mismatched lengths, the issue will be silently truncated. This matches the same concern in ifeval.py.

Apply this diff:

-        for sample, eval_result in zip(samples, eval_results):
+        for sample, eval_result in zip(samples, eval_results, strict=True):
             sample["loose_eval"] = eval_result
 
         with open(output_dir / "eval_results_strict.jsonl", "rt", encoding="utf-8") as f:
             eval_results = [json.loads(line) for line in f]
-        for sample, eval_result in zip(samples, eval_results):
+        for sample, eval_result in zip(samples, eval_results, strict=True):
             sample["strict_eval"] = eval_result

Also applies to: 49-49


26-57: Consider extracting common evaluation pattern.

The eval_ifbench function shares significant structure with eval_if in ifeval.py (temp directory creation, subprocess call, result merging, cleanup). Consider extracting a common helper function to reduce duplication.

Example structure:

def _run_evaluation(
    jsonl_file: str,
    benchmark_dir: str,
    module_name: str,
) -> None:
    """Common evaluation pattern for IF-style benchmarks."""
    jsonl_path = Path(jsonl_file).resolve()
    output_dir = jsonl_path.parent / f"{jsonl_path.stem}_metrics_tmp"
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # ... common logic ...
    
    shutil.rmtree(output_dir)

This is a minor maintainability improvement and can be deferred.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 69faf84 and 948f9a8.

📒 Files selected for processing (4)
  • dockerfiles/Dockerfile.nemo-skills (1 hunks)
  • dockerfiles/ifbench.patch (1 hunks)
  • nemo_skills/evaluation/evaluator/ifbench.py (3 hunks)
  • nemo_skills/evaluation/evaluator/ifeval.py (3 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
nemo_skills/evaluation/evaluator/ifeval.py

37-37: subprocess call with shell=True identified, security issue

(S602)


44-44: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

nemo_skills/evaluation/evaluator/ifbench.py

37-37: subprocess call with shell=True identified, security issue

(S602)


44-44: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (7)
nemo_skills/evaluation/evaluator/ifeval.py (2)

28-30: LGTM! Clean per-file output isolation.

The temporary directory approach properly isolates evaluation outputs for each input file, preventing cross-file contamination. The naming scheme {stem}_metrics_tmp is clear and deterministic.

Also applies to: 35-35


56-57: LGTM! Directory cleanup is robust.

Replacing individual file deletions with shutil.rmtree(output_dir) is cleaner and ensures all temporary outputs are removed, including any unexpected files the evaluation might generate.

dockerfiles/Dockerfile.nemo-skills (1)

48-54: LGTM! Patch-based setup with consolidated resource downloads.

The workflow is correct:

  1. Apply patch to skip the problematic inline download
  2. Install all required dependencies
  3. Download NLTK and spaCy resources separately

This approach properly addresses the setup issues while ensuring all necessary resources are available.

dockerfiles/ifbench.patch (2)

9-12: LGTM! Download skip aligns with Dockerfile setup.

Skipping the inline model download and assuming it's pre-downloaded is correct. The Dockerfile handles the download separately at build time (line 54 in Dockerfile.nemo-skills), ensuring the resource is available.


20-21: LGTM! Defensive guards prevent IndexError on empty strings.

The added guards correctly handle edge cases where sentences become empty after punctuation removal. Without these checks, accessing stripped[-1] (line 22) or stripped[0] (line 31) would raise an IndexError. This improves the checker's robustness.

Also applies to: 28-30

nemo_skills/evaluation/evaluator/ifbench.py (2)

28-30: LGTM! Consistent per-file output isolation.

The temporary directory approach matches the pattern in ifeval.py, ensuring consistent behavior across evaluators. The naming scheme and directory creation are correct.

Also applies to: 35-35


56-57: LGTM! Directory cleanup is robust.

The cleanup approach is correct and consistent with ifeval.py.

Kipok merged commit ae9c8f5 into main on Oct 9, 2025
7 checks passed
Kipok deleted the igitman/ifbench-fix branch on October 9, 2025 at 19:06
Glorf pushed a commit that referenced this pull request Oct 10, 2025
Signed-off-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Oct 29, 2025
Signed-off-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>