
Fixing ifbench#921

Merged
Kipok merged 3 commits into main from igitman/ifbench-fix
Oct 9, 2025

Conversation

Kipok (Collaborator) commented Oct 9, 2025

Summary by CodeRabbit

  • New Features

    • Automatic download of required NLP resources during build for smoother out-of-the-box evaluations.
  • Bug Fixes

    • Improved emoji and punctuation handling in instruction checks to avoid false positives/negatives.
    • More robust handling of empty sentences and end-of-file newlines.
  • Refactor

    • Evaluations now write to per-file temporary directories, simplifying result management and cleanup.
  • Chores

    • Streamlined build steps and applied patch workflow to ensure consistent setup.
    • Cleanup now removes entire temporary metric directories, preventing leftover files.

Kipok added 3 commits on October 9, 2025 at 10:02, each with Signed-off-by: Igor Gitman <igitman@nvidia.com>
coderabbitai bot (Contributor) commented Oct 9, 2025

Walkthrough

Introduces a patch-based setup for IFBench in the Dockerfile, adds NLTK and spaCy resource downloads, and modifies IFBench instructions to skip model download. Updates IFBench and IFEval evaluators to use per-file temporary output directories for metrics, read results from there, and clean up directories after processing.

Changes

  • IFBench setup and patching (dockerfiles/Dockerfile.nemo-skills, dockerfiles/ifbench.patch): Replaces the inline IFBench dependency install with a consolidated post-patch install, and downloads NLTK data and the spaCy model during the Docker build. Patches IFBench to skip the spaCy model download at runtime, hardens EmojiSentenceChecker, and adds minor newline handling in NoWhitespaceChecker.
  • Evaluators temporary metrics dir (nemo_skills/evaluation/evaluator/ifbench.py, nemo_skills/evaluation/evaluator/ifeval.py): For each input JSONL, creates a sibling {stem}_metrics_tmp directory for eval outputs, runs the evaluation targeting this directory, reads loose/strict results from it, augments the samples, and removes the temp directory via shutil.rmtree. Adds shutil imports and path-resolution adjustments.
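The per-file naming scheme described above can be sketched as a small helper (the function name is hypothetical, for illustration only):

```python
from pathlib import Path


def metrics_tmp_dir(jsonl_file: str) -> Path:
    """Sibling {stem}_metrics_tmp directory for a given input JSONL,
    e.g. results/preds.jsonl -> results/preds_metrics_tmp."""
    jsonl_path = Path(jsonl_file).resolve()
    return jsonl_path.parent / f"{jsonl_path.stem}_metrics_tmp"
```

Because the directory name is derived deterministically from the input file's stem, two evaluations over different JSONL files can never collide on the same metrics directory.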

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Eval as Evaluator (IFBench/IFEval)
  participant FS as Filesystem
  participant Tool as IFBench/IFEval CLI

  Eval->>FS: Resolve jsonl_path
  Eval->>FS: Create {stem}_metrics_tmp (output_dir)
  Eval->>Tool: Run evaluation with --output_dir=output_dir
  Tool-->>FS: Write eval_results_loose.jsonl, eval_results_strict.jsonl
  Eval->>FS: Read eval results from output_dir
  Eval->>FS: Write augmented samples (loose/strict)
  Eval->>FS: Remove output_dir (shutil.rmtree)
  Note over Eval,Tool: Runtime model download skipped (spaCy preinstalled in image)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I patched my burrow with careful hops,
Packed NLTK seeds and spaCy crops.
I stash my crumbs in temp-made nooks,
Read the scores, then clean the books.
Metrics vanish—poof!—no trace to keep;
A tidy warren makes reviews a leap. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, below the required 80.00% threshold. Run @coderabbitai generate docstrings to improve coverage.
  • Title Check (❓ Inconclusive): The title "Fixing ifbench" relates to the overall scope of the changeset but is too generic to convey the specific updates, such as the new patch workflow, consolidated dependency installation, and temporary output directory handling introduced for IFBench. Consider a more descriptive title, for example "Add patch workflow, consolidated installation, and temporary output directory for IFBench".
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
nemo_skills/evaluation/evaluator/ifeval.py (2)

37-37: Consider future hardening of subprocess call.

The shell=True parameter is flagged by static analysis (S602). While the risk is mitigated here since jsonl_file comes from controlled configuration rather than untrusted input, consider refactoring to use a list of arguments without shell expansion for defense in depth.

Example refactor:

-        cmd = (
-            "cd /opt/benchmarks/google-research && python -m instruction_following_eval.evaluation_main "
-            f"--input_data={jsonl_file} "
-            f"--input_response_data={jsonl_file} "
-            f"--output_dir={output_dir} "
-        )
-        subprocess.run(cmd, shell=True, check=True)
+        subprocess.run(
+            [
+                "python", "-m", "instruction_following_eval.evaluation_main",
+                f"--input_data={jsonl_file}",
+                f"--input_response_data={jsonl_file}",
+                f"--output_dir={output_dir}",
+            ],
+            cwd="/opt/benchmarks/google-research",
+            check=True,
+        )
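The list-argument form suggested in the diff avoids shell interpretation entirely; a minimal standalone demonstration, with a harmless command standing in for the evaluator and a hypothetical hostile-looking filename:

```python
import subprocess
import sys

# Arguments passed as a list go to the OS directly, so characters like
# ';' in a path are treated as literal text, never as shell syntax.
tricky_path = "preds;echo pwned.jsonl"  # hypothetical filename, shell metacharacter included
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", tricky_path],
    capture_output=True,
    text=True,
    check=True,
)
assert result.stdout.strip() == tricky_path
```

With shell=True and string interpolation, the same filename would have been split at the `;` and executed as two commands.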

44-44: Add strict=True to zip() calls for data integrity.

Both zip() calls on lines 44 and 49 lack explicit strict= parameters. If the evaluation produces a different number of results than input samples (e.g., due to partial failure or bugs), the mismatch will be silently truncated rather than raising an error.

Apply this diff:

-        for sample, eval_result in zip(samples, eval_results):
+        for sample, eval_result in zip(samples, eval_results, strict=True):
             sample["loose_eval"] = eval_result
 
         with open(output_dir / "eval_results_strict.jsonl", "rt", encoding="utf-8") as f:
             eval_results = [json.loads(line) for line in f]
-        for sample, eval_result in zip(samples, eval_results):
+        for sample, eval_result in zip(samples, eval_results, strict=True):
             sample["strict_eval"] = eval_result

Also applies to: 49-49

nemo_skills/evaluation/evaluator/ifbench.py (3)

37-37: Consider future hardening of subprocess call.

Similar to the change in ifeval.py, the shell=True parameter is flagged by static analysis (S602). While acceptable here with controlled input, consider refactoring to use argument lists for consistency and defense in depth.

Example refactor:

-        cmd = (
-            "cd /opt/benchmarks/IFBench && python -m run_eval "
-            f"--input_data={jsonl_file} "
-            f"--input_response_data={jsonl_file} "
-            f"--output_dir={output_dir} "
-        )
-        subprocess.run(cmd, shell=True, check=True)
+        subprocess.run(
+            [
+                "python", "-m", "run_eval",
+                f"--input_data={jsonl_file}",
+                f"--input_response_data={jsonl_file}",
+                f"--output_dir={output_dir}",
+            ],
+            cwd="/opt/benchmarks/IFBench",
+            check=True,
+        )

44-44: Add strict=True to zip() calls for data integrity.

Both zip() calls lack explicit strict= parameters. If evaluation results and samples have mismatched lengths, the issue will be silently truncated. This matches the same concern in ifeval.py.

Apply this diff:

-        for sample, eval_result in zip(samples, eval_results):
+        for sample, eval_result in zip(samples, eval_results, strict=True):
             sample["loose_eval"] = eval_result
 
         with open(output_dir / "eval_results_strict.jsonl", "rt", encoding="utf-8") as f:
             eval_results = [json.loads(line) for line in f]
-        for sample, eval_result in zip(samples, eval_results):
+        for sample, eval_result in zip(samples, eval_results, strict=True):
             sample["strict_eval"] = eval_result

Also applies to: 49-49


26-57: Consider extracting common evaluation pattern.

The eval_ifbench function shares significant structure with eval_if in ifeval.py (temp directory creation, subprocess call, result merging, cleanup). Consider extracting a common helper function to reduce duplication.

Example structure:

def _run_evaluation(
    jsonl_file: str,
    benchmark_dir: str,
    module_name: str,
) -> None:
    """Common evaluation pattern for IF-style benchmarks."""
    jsonl_path = Path(jsonl_file).resolve()
    output_dir = jsonl_path.parent / f"{jsonl_path.stem}_metrics_tmp"
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # ... common logic ...
    
    shutil.rmtree(output_dir)

This is a minor maintainability improvement and can be deferred.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 69faf84 and 948f9a8.

📒 Files selected for processing (4)
  • dockerfiles/Dockerfile.nemo-skills (1 hunks)
  • dockerfiles/ifbench.patch (1 hunks)
  • nemo_skills/evaluation/evaluator/ifbench.py (3 hunks)
  • nemo_skills/evaluation/evaluator/ifeval.py (3 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
nemo_skills/evaluation/evaluator/ifeval.py

37-37: subprocess call with shell=True identified, security issue

(S602)


44-44: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

nemo_skills/evaluation/evaluator/ifbench.py

37-37: subprocess call with shell=True identified, security issue

(S602)


44-44: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (7)
nemo_skills/evaluation/evaluator/ifeval.py (2)

28-30: LGTM! Clean per-file output isolation.

The temporary directory approach properly isolates evaluation outputs for each input file, preventing cross-file contamination. The naming scheme {stem}_metrics_tmp is clear and deterministic.

Also applies to: 35-35


56-57: LGTM! Directory cleanup is robust.

Replacing individual file deletions with shutil.rmtree(output_dir) is cleaner and ensures all temporary outputs are removed, including any unexpected files the evaluation might generate.

dockerfiles/Dockerfile.nemo-skills (1)

48-54: LGTM! Patch-based setup with consolidated resource downloads.

The workflow is correct:

  1. Apply patch to skip the problematic inline download
  2. Install all required dependencies
  3. Download NLTK and spaCy resources separately

This approach properly addresses the setup issues while ensuring all necessary resources are available.

dockerfiles/ifbench.patch (2)

9-12: LGTM! Download skip aligns with Dockerfile setup.

Skipping the inline model download and assuming it's pre-downloaded is correct. The Dockerfile handles the download separately at build time (line 54 in Dockerfile.nemo-skills), ensuring the resource is available.


20-21: LGTM! Defensive guards prevent IndexError on empty strings.

The added guards correctly handle edge cases where sentences become empty after punctuation removal. Without these checks, accessing stripped[-1] (line 22) or stripped[0] (line 31) would raise an IndexError. This improves the checker's robustness.

Also applies to: 28-30

nemo_skills/evaluation/evaluator/ifbench.py (2)

28-30: LGTM! Consistent per-file output isolation.

The temporary directory approach matches the pattern in ifeval.py, ensuring consistent behavior across evaluators. The naming scheme and directory creation are correct.

Also applies to: 35-35


56-57: LGTM! Directory cleanup is robust.

The cleanup approach is correct and consistent with ifeval.py.

Kipok merged commit ae9c8f5 into main on Oct 9, 2025
7 checks passed
Kipok deleted the igitman/ifbench-fix branch on October 9, 2025 at 19:06
Glorf pushed a commit that referenced this pull request Oct 10, 2025
Signed-off-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Oct 29, 2025
Signed-off-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>