Skip to content

Conversation

@ahmadki
Copy link
Member

@ahmadki ahmadki commented Nov 20, 2025

What does this PR do ?

sequence parallel + tp_size >1 is currently broken in torch==2.8.0, no point in running the test as it will raise an exception

Summary by CodeRabbit

  • Tests
    • Disabled a test pending resolution of a known compatibility issue.

✏️ Tip: You can customize this high-level summary in your review settings.

@ahmadki ahmadki requested a review from a team as a code owner November 20, 2025 21:55
@ahmadki ahmadki added the CI:L0 Run doctests and unit tests label Nov 20, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 20, 2025

📝 Walkthrough

Walkthrough

A single test case for sft-qwen2.5-32b in the nightly test suite is disabled with a TODO note referencing a torch==2.8.0 compatibility issue. The actual test invocation line is removed and replaced with a commented explanation.

Changes

Cohort / File(s) Summary
Test suite maintenance
tests/test_suites/nightly.txt
Disabled functional 32-bit test for sft-qwen2.5-32b; removed test invocation and added TODO comment noting torch==2.8.0 blocker

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

  • Single-file change with minimal scope
  • Simple documentation/disabling of a test case
  • Clear explanation provided via TODO comment

Possibly related PRs

Suggested labels

CI

Suggested reviewers

  • parthchadha
  • guyueh1
  • yfw

Pre-merge checks and finishing touches

✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: remove sft-qwen2.5-fsdp2tp8sp from nighlies' accurately describes the main change: disabling/removing a nightly test due to a torch 2.8.0 compatibility issue.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Test Results For Major Changes ✅ Passed PR removes a broken test in nightly suite due to known torch==2.8.0 issue. Minor maintenance change, not a major code change requiring test results documentation.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch ahmadki/disable_llm_sft_qwen2_5_32b_4n8g_fsdp2tp8sp_actckpt_v3

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c371a9 and 3e8526c.

📒 Files selected for processing (1)
  • tests/test_suites/nightly.txt (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • tests/test_suites/nightly.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: sphinx-build / Build docs
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (1)
tests/test_suites/nightly.txt (1)

68-69: Clear documentation and correct pattern for test disablement.

The change correctly disables the sft-qwen2.5-32b test with a well-formatted TODO comment that explains the root cause and references the tracking issue. This follows the established pattern in the file for disabled tests (see lines 26–27, 48–49, 92–94).

The GitHub issue #652 exists and is open with the title "Qwen parallelizer with sequence parallelism," confirming the reference is valid and accessible.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ahmadki ahmadki added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Nov 20, 2025
…lies

sequence parallel + tp_size >1 is currently broken in torch==2.8.0, no point in running the test as it will raise an exception

Signed-off-by: Ahmad Kiswani <[email protected]>
@ahmadki ahmadki force-pushed the ahmadki/disable_llm_sft_qwen2_5_32b_4n8g_fsdp2tp8sp_actckpt_v3 branch from 3e8526c to a08788a Compare November 26, 2025 13:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L0 Run doctests and unit tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants