
[rl] Fix CI loss=0 and logprob=NaN #3232

Merged: wwwjn merged 1 commit into pytorch:main from wwwjn:rl-ci-fix, May 6, 2026

Conversation

@wwwjn (Contributor) commented May 5, 2026

Three fixes:

  1. Update `uv pip install torchmonarch==0.4.1` to match the README.
  2. Export `VLLM_USE_FLASHINFER_SAMPLER=0`. This is needed because vllm-project/vllm#40376 ([Perf] Enable FlashInfer top-k/top-p sampler by default) landed Apr. 29, and our CI environment has no nvcc installed, so FlashInfer kernels cannot be JIT-compiled.
  3. Temporarily set `num_splits=1` for FAv2. If not set, it caused NaN results during vLLM generation.
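A minimal Python sketch of fixes 2 and 3 (illustrative only: the real changes live in the workflow YAML and attention.py, and `flash_attn_extra_kwargs` is a hypothetical helper name):

```python
import os

# Fix 2: disable vLLM's FlashInfer top-k/top-p sampler. The CI image has no
# nvcc, so FlashInfer kernels cannot be JIT-compiled there.
os.environ["VLLM_USE_FLASHINFER_SAMPLER"] = "0"

# Fix 3: force num_splits=1 whenever the backend is not FA3 (i.e. FAv2 with
# paged KV), since auto-selected num_splits > 1 produced NaN results.
def flash_attn_extra_kwargs(impl: str, batch_invariant: bool) -> dict:
    extra = {}
    if batch_invariant or impl != "FA3":
        extra["num_splits"] = 1  # temporary workaround
    return extra

print(flash_attn_extra_kwargs("FA2", batch_invariant=False))  # → {'num_splits': 1}
```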

@wwwjn wwwjn requested review from fegin, tianyu-l and wconstab as code owners May 5, 2026 21:47
@meta-cla meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) May 5, 2026
# After pytorch/pytorch#179760, FA2 also accepts num_splits and
# auto-selects num_splits>1 for paged KV, which can produce NaN.
# Always set num_splits=1 for FA2 with paged KV.
if is_in_batch_invariant_mode() or current_flash_attention_impl() != "FA3":
Contributor:
Always using num_splits=1 doesn't sound like a fix.
Do we know the root cause? Are we tracking it / creating an issue? At least add a TODO here?

Contributor (Author):

> Do we know what's the root cause?

Not yet; I flagged this to @liangel-02. Let me file an issue to track this regression.

Contributor (Author):

I created #3235 to track this.

Contributor:

Can we use something like == "FA2" instead of != "FA3"? I don't know what the full set of options is.
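A sketch of the suggested comparison. The helper names mirror the diff above, but the stubs here are illustrative, not the real torchtitan implementations:

```python
# Illustrative stubs; the real helpers inspect the installed attention kernels.
def current_flash_attention_impl() -> str:
    return "FA2"

def is_in_batch_invariant_mode() -> bool:
    return False

def needs_num_splits_one() -> bool:
    # Matching "FA2" explicitly keeps the workaround from silently applying
    # to any future backend, which '!= "FA3"' would do.
    return is_in_batch_invariant_mode() or current_flash_attention_impl() == "FA2"

print(needs_num_splits_one())  # → True with the stubs above
```

Note that the two conditions are only equivalent if FA2 and FA3 are the only possible backends; that is exactly the uncertainty the comment raises.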

Comment thread .github/workflows/integration_test_4gpu_rl.yaml Outdated
@wwwjn wwwjn force-pushed the rl-ci-fix branch 2 times, most recently from 0d00d62 to 25b35f9 Compare May 6, 2026 01:42
@wwwjn wwwjn requested a review from tianyu-l May 6, 2026 02:18
@tianyu-l (Contributor) left a comment:

please address comments

Comment thread torchtitan/experiments/rl/models/attention.py
@wwwjn wwwjn merged commit e62ab09 into pytorch:main May 6, 2026
17 of 18 checks passed

Labels: ciflow/8gpu, CLA Signed

2 participants