Move RL batch-invariant tests to A10G + Fix RL CI #3041

Closed
wwwjn wants to merge 6 commits into main from rl-ci-tests

Conversation

@wwwjn
Contributor

@wwwjn wwwjn commented Apr 21, 2026

Move batch-invariant tests from H100 to A10G to relieve H100 CI starvation.
Depends on a PyTorch-side change: pytorch/pytorch#179760

We need to set VLLM_USE_FLASHINFER_SAMPLER=0 because vllm-project/vllm#40376 landed on Apr. 29. Our CI environment does not have nvcc installed, so FlashInfer cannot be JIT-compiled there.
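As a rough illustration of the workaround above, the sketch below builds an environment for a test subprocess and sets VLLM_USE_FLASHINFER_SAMPLER=0 only when nvcc is not on PATH. The helper name `flashinfer_sampler_env` is hypothetical and is not part of this PR, which may simply set the variable in the CI workflow.

```python
import os
import shutil


def flashinfer_sampler_env(env=None):
    """Return a copy of `env` suitable for launching RL tests.

    Hypothetical helper (not from the PR): if nvcc cannot be found on
    PATH, FlashInfer cannot be JIT-compiled, so the FlashInfer sampler
    is disabled via VLLM_USE_FLASHINFER_SAMPLER=0.
    """
    # Copy so the caller's mapping (or os.environ) is never mutated.
    env = dict(os.environ if env is None else env)
    # Look for nvcc on the PATH of the target environment.
    if shutil.which("nvcc", path=env.get("PATH")) is None:
        env["VLLM_USE_FLASHINFER_SAMPLER"] = "0"
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run` when launching the tests, so the setting stays local to the test process.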

@meta-cla meta-cla Bot added the CLA Signed label Apr 21, 2026
@wwwjn wwwjn marked this pull request as ready for review May 4, 2026 14:50
@wwwjn wwwjn requested review from fegin, tianyu-l and wconstab as code owners May 4, 2026 14:50
@wwwjn wwwjn changed the title Move RL batch-invariant tests to A10G Move RL batch-invariant tests to A10G + Fix RL CI May 4, 2026
@wwwjn
Contributor Author

wwwjn commented May 5, 2026

This PR will be split into two:

  1. Fix CI only: [rl] Fix CI loss=0 and logprob=NaN #3232
  2. Move bit-wise parity test to A10G. TBA

Labels

ciflow/8gpu, CLA Signed

3 participants