
[rl] Fix CI loss=0 and logprob=NaN #3232

Merged: wwwjn merged 1 commit into pytorch:main from wwwjn:rl-ci-fix, May 6, 2026

Conversation

@wwwjn (Contributor) commented May 5, 2026

Three fixes:

  1. Update `uv pip install torchmonarch==0.4.1` to match the README.
  2. Export `VLLM_USE_FLASHINFER_SAMPLER=0`. This is needed because vllm-project/vllm#40376 ([Perf] Enable FlashInfer top-k/top-p sampler by default) landed Apr. 29, and our CI environment has no nvcc installed, so FlashInfer kernels cannot be JIT-compiled.
  3. Temporarily set `num_splits=1` for FAv2. If not set, it caused NaN results during vLLM generation.
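A minimal Python sketch of fixes 2 and 3 (illustrative only: the real changes live in the workflow YAML and attention.py, and `flash_attn_extra_kwargs` is a hypothetical helper name):

```python
import os

# Fix 2: disable vLLM's FlashInfer top-k/top-p sampler. The CI image has no
# nvcc, so FlashInfer kernels cannot be JIT-compiled there.
os.environ["VLLM_USE_FLASHINFER_SAMPLER"] = "0"

# Fix 3: force num_splits=1 whenever the backend is not FA3 (i.e. FAv2 with
# paged KV), since auto-selected num_splits > 1 produced NaN results.
def flash_attn_extra_kwargs(impl: str, batch_invariant: bool) -> dict:
    extra = {}
    if batch_invariant or impl != "FA3":
        extra["num_splits"] = 1  # temporary workaround
    return extra

print(flash_attn_extra_kwargs("FA2", batch_invariant=False))  # → {'num_splits': 1}
```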

@wwwjn wwwjn requested review from fegin, tianyu-l and wconstab as code owners May 5, 2026 21:47
@meta-cla meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) May 5, 2026
# After pytorch/pytorch#179760, FA2 also accepts num_splits and
# auto-selects num_splits>1 for paged KV, which can produce NaN.
# Always set num_splits=1 for FA2 with paged KV.
if is_in_batch_invariant_mode() or current_flash_attention_impl() != "FA3":
Contributor:
Always using num_splits=1 doesn't sound like a fix.
Do we know the root cause? Are we tracking it / creating an issue? At least add a TODO here?

Contributor (Author):

> Do we know what's the root cause?

Not yet; I flagged this to @liangel-02. Let me file an issue to track this regression.

Contributor (Author):

I created #3235 to track this.

Contributor:

Can we use something like == "FA2" instead of != "FA3"? I don't know what the full set of options is.
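A sketch of the suggested comparison. The helper names mirror the diff above, but the stubs here are illustrative, not the real torchtitan implementations:

```python
# Illustrative stubs; the real helpers inspect the installed attention kernels.
def current_flash_attention_impl() -> str:
    return "FA2"

def is_in_batch_invariant_mode() -> bool:
    return False

def needs_num_splits_one() -> bool:
    # Matching "FA2" explicitly keeps the workaround from silently applying
    # to any future backend, which '!= "FA3"' would do.
    return is_in_batch_invariant_mode() or current_flash_attention_impl() == "FA2"

print(needs_num_splits_one())  # → True with the stubs above
```

Note that the two conditions are only equivalent if FA2 and FA3 are the only possible backends; that is exactly the uncertainty the comment raises.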

Comment thread .github/workflows/integration_test_4gpu_rl.yaml Outdated
@wwwjn wwwjn force-pushed the rl-ci-fix branch 2 times, most recently from 0d00d62 to 25b35f9 Compare May 6, 2026 01:42
@wwwjn wwwjn requested a review from tianyu-l May 6, 2026 02:18
@tianyu-l (Contributor) left a comment:

please address comments

Comment thread torchtitan/experiments/rl/models/attention.py
@wwwjn wwwjn merged commit e62ab09 into pytorch:main May 6, 2026
17 of 18 checks passed

Labels: ciflow/8gpu, CLA Signed

2 participants