Fixes for slurm tests by Kipok · Pull Request #1192 · NVIDIA-NeMo/Skills

Kipok · 2026-01-27T22:09:47Z

Removed bfcl from llama-nemotron as its parser is no longer supported
Adjust bfcl install to be non-editable (was failing in cron job)
Improve stability for scicode download by switching to HF path (was failing with rate limits)
Adjust different constraints / fix typos after some api changes

Summary by CodeRabbit

Chores
- Updated dataset preparation to use HuggingFace dataset loading for SciCode.
- Improved virtual environment setup with deterministic seeding for test reproducibility.
- Removed editable installation flag from BFCL runtime setup.
Tests
- Updated evaluation metric ranges and validation thresholds.
- Refined tool-calling evaluation configuration and removed redundant checks.
- Adjusted model evaluation framework for improved testing workflows.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

Signed-off-by: Igor Gitman <igitman@nvidia.com>

greptile-apps · 2026-01-27T22:13:13Z

Important Files Changed

Filename	Overview
nemo_skills/dataset/bfcl_v3/prepare.py	Removed `-e` flag from pip install to make BFCL non-editable, fixing cron job failures
nemo_skills/dataset/gsm-plus/prepare.py	Switched from urllib to HuggingFace `load_dataset` API, cleaner and more maintainable
nemo_skills/dataset/scicode/prepare.py	Switched to HF API to avoid rate limits. File writing pattern keeps test_aai.jsonl open during nested writes, but works correctly
tests/slurm-tests/clone_and_run.sh	Added --seed flag for reproducible venv and PATH fix for better binary resolution
tests/slurm-tests/qwen3_4b_evals/check_results.py	Fixed metric paths after BFCL API changes, removed nested category structure
tests/slurm-tests/qwen3_4b_evals/run_test.py	Changed model name to Qwen3-8B-FC for parser compatibility, removed commented code
tests/slurm-tests/super_49b_evals/check_results.py	Removed BFCL tool-calling checks and adjusted livecodebench lower bound from 27.5 to 26.0
tests/slurm-tests/super_49b_evals/run_test.py	Removed BFCL evaluation tasks from both reasoning on/off modes

greptile-apps

_{No files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

coderabbitai · 2026-01-27T22:17:10Z

📝 Walkthrough

Walkthrough

The changes modify dataset preparation, testing infrastructure, and evaluation configurations. Key updates include: removing editable installs for BFCL, replacing local dataset downloads with HuggingFace dataset loading for SciCode, adding deterministic seed to virtual environment creation, updating metric ranges and structure for eval tests, and removing BFCL evaluation from the super_49b test suite.

Changes

Cohort / File(s)	Summary
Dataset Preparation `nemo_skills/dataset/bfcl_v3/prepare.py`, `nemo_skills/dataset/scicode/prepare.py`	Removed editable install flag (-e) from BFCL pip install command. SciCode preparation replaced with HuggingFace dataset load; added split_mapping to translate "validation" → "dev" and "test" → "test"; consolidated data gathering with single all_data accumulator for final concatenation.
Testing Infrastructure `tests/slurm-tests/clone_and_run.sh`	Added --seed flag to uv venv creation for deterministic environment seeding; derived VENV_BIN from python executable directory and prepended to PATH.
Evaluation Metric Configuration `tests/slurm-tests/qwen3_4b_evals/check_results.py`, `tests/slurm-tests/super_49b_evals/check_results.py`	Qwen3: flattened TOOLCALLING_METRIC_RANGES from nested structure to flat keys; updated accuracy ranges and broadened "overall_multi_turn" range to (20.0, 32.0). Super_49b: updated "reasoning_off" livecodebench range from (27.5, 32.5) to (26.0, 32.5); removed get_nested_value import and check_toolcalling function.
Evaluation Execution `tests/slurm-tests/qwen3_4b_evals/run_test.py`, `tests/slurm-tests/super_49b_evals/run_test.py`	Qwen3: hardcoded model name to "Qwen/Qwen3-8B-FC"; collapsed f-string formatting; removed commented skip_filled configuration. Super_49b: removed BFCL evaluation blocks from reasoning ON/OFF workflows and prepare_data context.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

run GPU tests

Suggested reviewers

ludwig-n
activatedgeek

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Fixes for slurm tests' directly relates to the primary changes in the pull request, which involve multiple fixes to SLURM test infrastructure and related preparation scripts.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Signed-off-by: Igor Gitman <igitman@nvidia.com>

greptile-apps

_{No files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

Signed-off-by: Igor Gitman <igitman@nvidia.com>

greptile-apps

_{No files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

gwarmstrong

just a couple comments/questions

nemo_skills/dataset/scicode/prepare.py

tests/slurm-tests/super_49b_evals/run_test.py

Signed-off-by: Igor Gitman <igitman@nvidia.com>

greptile-apps

_{1 file reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-01-29T05:41:03Z

nemo_skills/dataset/scicode/prepare.py

+    with open(test_aai_file, "w", encoding="utf-8") as test_aai_fout:
+        for hf_split, output_split in split_mapping.items():
+            output_file = data_dir / f"{output_split}.jsonl"
+            with open(output_file, "wt", encoding="utf-8") as fout:
+                for entry in dataset[hf_split]:
+                    line = json.dumps(entry) + "\n"
+                    fout.write(line)
+                    test_aai_fout.write(line)


File test_aai.jsonl is being opened and written to in outer scope, but also written within nested loop for each split. The outer with block keeps the file open during the entire loop, meaning both dev.jsonl and test.jsonl entries are written sequentially. However, there's no guarantee the file is properly synced since writes happen concurrently through both fout and test_aai_fout file handles. While this may work, it's cleaner to write to test_aai.jsonl after the split files are complete.

Additionally, per CONTRIBUTING.md guidelines (line 40-42), avoid data loss by completing computation before writing. Consider writing all files separately first, then concatenating to avoid partial writes if there's a failure.

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

gwarmstrong

LGTM

greptile-apps

_{8 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>

Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>

Kipok added 8 commits January 25, 2026 16:40

Remove bfcl from llama nemotron tests

6dc97b8

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fix qwen3 bfcl test

9379b25

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fix for clone and run

4fa081e

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fix bfcl and scicode prepare

ab6e055

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Tmp change

1cc7892

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fix bfcl check results

86d9c71

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Update constraint

9fedaf2

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fix bfcl test

b1c56fb

Signed-off-by: Igor Gitman <igitman@nvidia.com>

greptile-apps bot reviewed Jan 27, 2026

View reviewed changes

Update gsm-plus to hf api

8315af2

Signed-off-by: Igor Gitman <igitman@nvidia.com>

greptile-apps bot reviewed Jan 27, 2026

View reviewed changes

Fix bfcl test

487f0dd

Signed-off-by: Igor Gitman <igitman@nvidia.com>

greptile-apps bot reviewed Jan 28, 2026

View reviewed changes

gwarmstrong reviewed Jan 28, 2026

View reviewed changes

nemo_skills/dataset/scicode/prepare.py Outdated Show resolved Hide resolved

tests/slurm-tests/super_49b_evals/run_test.py Show resolved Hide resolved

Kipok added 2 commits January 28, 2026 21:34

Merge branch 'main' into igitman/slurm-test-fixes

092dbba

Optimize scicode data saving

2a55c27

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Kipok added the run GPU tests label Jan 29, 2026

greptile-apps bot reviewed Jan 29, 2026

View reviewed changes

Kipok mentioned this pull request Jan 29, 2026

Speeding up bfcl requirements installation #1191

Merged

gwarmstrong approved these changes Jan 29, 2026

View reviewed changes

Merge branch 'main' into igitman/slurm-test-fixes

4154099

gwarmstrong enabled auto-merge (squash) January 29, 2026 18:17

greptile-apps bot reviewed Jan 29, 2026

View reviewed changes

gwarmstrong merged commit 3290c2a into main Jan 29, 2026
5 checks passed

gwarmstrong deleted the igitman/slurm-test-fixes branch January 29, 2026 18:33

dgtm777 pushed a commit that referenced this pull request Mar 18, 2026

Fixes for slurm tests (#1192)

3a45494

Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com>

dgtm777 pushed a commit that referenced this pull request Mar 18, 2026

Fixes for slurm tests (#1192)

1c06705

Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>

Conversation

Kipok commented Jan 27, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

greptile-apps bot commented Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Important Files Changed

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot commented Jan 27, 2026

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

gwarmstrong left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

gwarmstrong left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Kipok commented Jan 27, 2026 •

edited by coderabbitai bot

Loading

greptile-apps bot commented Jan 27, 2026 •

edited

Loading