Skip to content

Fixes for slurm tests#1192

Merged
gwarmstrong merged 13 commits intomainfrom
igitman/slurm-test-fixes
Jan 29, 2026
Merged

Fixes for slurm tests#1192
gwarmstrong merged 13 commits intomainfrom
igitman/slurm-test-fixes

Conversation

@Kipok
Copy link
Collaborator

@Kipok Kipok commented Jan 27, 2026

  1. Removed bfcl from llama-nemotron as its parser is no longer supported
  2. Adjust bfcl install to be non-editable (was failing in cron job)
  3. Improve stability for scicode download by switching to HF path (was failing with rate limits)
  4. Adjust different constraints / fix typos after some api changes

Summary by CodeRabbit

  • Chores

    • Updated dataset preparation to use HuggingFace dataset loading for SciCode.
    • Improved virtual environment setup with deterministic seeding for test reproducibility.
    • Removed editable installation flag from BFCL runtime setup.
  • Tests

    • Updated evaluation metric ranges and validation thresholds.
    • Refined tool-calling evaluation configuration and removed redundant checks.
    • Adjusted model evaluation framework for improved testing workflows.

✏️ Tip: You can customize this high-level summary in your review settings.

Kipok added 8 commits January 25, 2026 16:40
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 27, 2026

Important Files Changed

Filename Overview
nemo_skills/dataset/bfcl_v3/prepare.py Removed -e flag from pip install to make BFCL non-editable, fixing cron job failures
nemo_skills/dataset/gsm-plus/prepare.py Switched from urllib to HuggingFace load_dataset API, cleaner and more maintainable
nemo_skills/dataset/scicode/prepare.py Switched to HF API to avoid rate limits. File writing pattern keeps test_aai.jsonl open during nested writes, but works correctly
tests/slurm-tests/clone_and_run.sh Added --seed flag for reproducible venv and PATH fix for better binary resolution
tests/slurm-tests/qwen3_4b_evals/check_results.py Fixed metric paths after BFCL API changes, removed nested category structure
tests/slurm-tests/qwen3_4b_evals/run_test.py Changed model name to Qwen3-8B-FC for parser compatibility, removed commented code
tests/slurm-tests/super_49b_evals/check_results.py Removed BFCL tool-calling checks and adjusted livecodebench lower bound from 27.5 to 26.0
tests/slurm-tests/super_49b_evals/run_test.py Removed BFCL evaluation tasks from both reasoning on/off modes

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 27, 2026

📝 Walkthrough

Walkthrough

The changes modify dataset preparation, testing infrastructure, and evaluation configurations. Key updates include: removing editable installs for BFCL, replacing local dataset downloads with HuggingFace dataset loading for SciCode, adding deterministic seed to virtual environment creation, updating metric ranges and structure for eval tests, and removing BFCL evaluation from the super_49b test suite.

Changes

Cohort / File(s) Summary
Dataset Preparation
nemo_skills/dataset/bfcl_v3/prepare.py, nemo_skills/dataset/scicode/prepare.py
Removed editable install flag (-e) from BFCL pip install command. SciCode preparation replaced with HuggingFace dataset load; added split_mapping to translate "validation" → "dev" and "test" → "test"; consolidated data gathering with single all_data accumulator for final concatenation.
Testing Infrastructure
tests/slurm-tests/clone_and_run.sh
Added --seed flag to uv venv creation for deterministic environment seeding; derived VENV_BIN from python executable directory and prepended to PATH.
Evaluation Metric Configuration
tests/slurm-tests/qwen3_4b_evals/check_results.py, tests/slurm-tests/super_49b_evals/check_results.py
Qwen3: flattened TOOLCALLING_METRIC_RANGES from nested structure to flat keys; updated accuracy ranges and broadened "overall_multi_turn" range to (20.0, 32.0). Super_49b: updated "reasoning_off" livecodebench range from (27.5, 32.5) to (26.0, 32.5); removed get_nested_value import and check_toolcalling function.
Evaluation Execution
tests/slurm-tests/qwen3_4b_evals/run_test.py, tests/slurm-tests/super_49b_evals/run_test.py
Qwen3: hardcoded model name to "Qwen/Qwen3-8B-FC"; collapsed f-string formatting; removed commented skip_filled configuration. Super_49b: removed BFCL evaluation blocks from reasoning ON/OFF workflows and prepare_data context.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

run GPU tests

Suggested reviewers

  • ludwig-n
  • activatedgeek
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Fixes for slurm tests' directly relates to the primary changes in the pull request, which involve multiple fixes to SLURM test infrastructure and related preparation scripts.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: Igor Gitman <igitman@nvidia.com>
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Signed-off-by: Igor Gitman <igitman@nvidia.com>
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Copy link
Collaborator

@gwarmstrong gwarmstrong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a couple comments/questions

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines +28 to +35
with open(test_aai_file, "w", encoding="utf-8") as test_aai_fout:
for hf_split, output_split in split_mapping.items():
output_file = data_dir / f"{output_split}.jsonl"
with open(output_file, "wt", encoding="utf-8") as fout:
for entry in dataset[hf_split]:
line = json.dumps(entry) + "\n"
fout.write(line)
test_aai_fout.write(line)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File test_aai.jsonl is being opened and written to in outer scope, but also written within nested loop for each split. The outer with block keeps the file open during the entire loop, meaning both dev.jsonl and test.jsonl entries are written sequentially. However, there's no guarantee the file is properly synced since writes happen concurrently through both fout and test_aai_fout file handles. While this may work, it's cleaner to write to test_aai.jsonl after the split files are complete.

Additionally, per CONTRIBUTING.md guidelines (line 40-42), avoid data loss by completing computation before writing. Consider writing all files separately first, then concatenating to avoid partial writes if there's a failure.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Copy link
Collaborator

@gwarmstrong gwarmstrong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gwarmstrong gwarmstrong enabled auto-merge (squash) January 29, 2026 18:17
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

8 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@gwarmstrong gwarmstrong merged commit 3290c2a into main Jan 29, 2026
5 checks passed
@gwarmstrong gwarmstrong deleted the igitman/slurm-test-fixes branch January 29, 2026 18:33
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants