
[rl] Move RL H100 CI to A10G #3259

Open
wwwjn wants to merge 3 commits into main from rl-batch-invariant-a10g

@wwwjn (Contributor) commented May 7, 2026

Follow-up to the RL CI fix PR. This change was originally part of #3041.

  • Move the H100 RL test to A10G, since FAv2 can now also achieve batch invariance with a paged KV cache.
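
For context, batch invariance means a sequence's output is bit-identical no matter which other sequences share its batch. The sketch below is plain Python, not the attention kernels touched by this PR, and the function names are hypothetical. It only illustrates why the property is not automatic: floating-point addition is not associative, so a reduction whose order depends on batch shape can change low-order bits.

```python
def sum_left(values):
    """Accumulate left to right: ((v0 + v1) + v2) + ..."""
    total = values[0]
    for v in values[1:]:
        total = total + v
    return total


def sum_right(values):
    """Accumulate right to left: v0 + (v1 + (v2 + ...))"""
    total = values[-1]
    for v in reversed(values[:-1]):
        total = v + total
    return total


# Same inputs, different reduction order.
values = [0.1, 0.2, 0.3]
left = sum_left(values)
right = sum_right(values)

# The two results differ in the last bit even though they are
# mathematically the same sum.
orders_disagree = left != right
```

A batch-invariant kernel avoids this by fixing the reduction order per sequence regardless of batch composition.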

@wwwjn requested review from fegin, tianyu-l and wconstab as code owners May 7, 2026 14:18
@meta-cla bot added the CLA Signed label (this label is managed by the Meta Open Source bot) May 7, 2026
@wwwjn changed the title from "[rl] Move" to "[rl] Move RL H100 CI to A10G" May 7, 2026
wwwjn added 3 commits May 7, 2026 07:19
- Consolidate RL CI into a single 8-GPU A10G workflow (delete separate
  H100 yaml). All tests now run on A10G.
- Use TP=4 for default tests (up from TP=2).
- Add rl_grpo_qwen3_debug_batch_invariant config: tiny debugmodel
  (dim=256, 8 layers, vocab=2048) with random init. No HF checkpoint
  needed, fits on A10G (22 GB) without OOM.
- Add bitwise parity test to the CI pipeline.
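
The parity test itself is not shown in this thread. As a rough sketch of what a bitwise check involves, assuming plain Python floats and hypothetical helper names rather than the actual test code, the comparison is made on raw bit patterns instead of using a tolerance:

```python
import struct


def float_bits(x: float) -> int:
    """Return the IEEE-754 double encoding of x as a 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bitwise_equal(a, b) -> bool:
    """True only if both sequences have the same length and every
    pair of elements has an identical bit pattern."""
    if len(a) != len(b):
        return False
    return all(float_bits(x) == float_bits(y) for x, y in zip(a, b))


# Identical values pass.
same = bitwise_equal([1.5, 2.0], [1.5, 2.0])

# 0.0 == -0.0 is True under ordinary comparison, but the sign bit
# differs, so a bitwise check rejects it.
signed_zero = bitwise_equal([0.0], [-0.0])
```

Here `same` is `True` and `signed_zero` is `False`. A tolerance-based comparison would accept both, which is why a bitwise check is the stricter criterion for batch invariance.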
@wwwjn force-pushed the rl-batch-invariant-a10g branch from 502b523 to 79d4063 May 7, 2026 14:34

Labels

ciflow/rl, ciflow/8gpu, CLA Signed