[rl] Move RL H100 CI to A10G by wwwjn · Pull Request #3259 · pytorch/torchtitan

wwwjn · 2026-05-07T14:18:54Z

Follow-up to the RL CI fix PR. This change was originally part of #3041.

Move the H100 RL test to A10G, since FAv2 can now also achieve batch invariance with a paged KV cache.

- Consolidate RL CI into a single 8-GPU A10G workflow (delete the separate H100 yaml). All tests now run on A10G.
- Use TP=4 for default tests (up from TP=2).
- Add rl_grpo_qwen3_debug_batch_invariant config: tiny debugmodel (dim=256, 8 layers, vocab=2048) with random init. No HF checkpoint needed; fits on A10G (22 GB) without OOM.
- Add bitwise parity test to the CI pipeline.
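For context on why the parity test is bitwise rather than tolerance-based, here is a minimal, standalone sketch. It is not the test added in this PR; the helper names `float_bits` and `bitwise_equal` are hypothetical and use only the Python standard library. It shows that summing the same values in a different order can produce results that are "close" but not bit-identical, which is the kind of drift a batch-invariant kernel has to avoid.

```python
import math
import struct
from typing import Sequence


def float_bits(x: float) -> int:
    """Return the raw IEEE 754 binary64 bit pattern of x as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bitwise_equal(a: Sequence[float], b: Sequence[float]) -> bool:
    """True only if both sequences have the same length and bit patterns."""
    return len(a) == len(b) and all(
        float_bits(x) == float_bits(y) for x, y in zip(a, b)
    )


# The same three values reduced in two different orders (illustrative only).
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# A tolerance-based check accepts the pair; a bitwise check rejects it.
close = math.isclose(left_to_right, right_to_left)
exact = bitwise_equal([left_to_right], [right_to_left])
print(f"isclose={close}, bitwise_equal={exact}")
```

A tolerance check would hide this kind of reduction-order difference, so a bitwise comparison is the stricter signal for batch invariance.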

wwwjn requested review from fegin, tianyu-l and wconstab as code owners May 7, 2026 14:18

pytorch-bot Bot added ciflow/8gpu ciflow/rl labels May 7, 2026

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 7, 2026

wwwjn changed the title ~~[rl] Move~~ [rl] Move RL H100 CI to A10G May 7, 2026

wwwjn added 3 commits May 7, 2026 07:19

fix CI unit test

f3f1b26

remove H100

79d4063

wwwjn force-pushed the rl-batch-invariant-a10g branch from 502b523 to 79d4063 Compare May 7, 2026 14:34

wwwjn wants to merge 3 commits into main from rl-batch-invariant-a10g
