Make scaling type configurable for MoE training #2642

danielvegamyhre · 2025-07-30T23:19:30Z

Stacked PRs:

->Make scaling type configurable for MoE training #2642


Summary

Update the user-facing MoE conversion API to make the scaling type configurable.
Note: once pytorch/torchtitan#1503 ("Make token group alignment size configurable") lands, I don't think we'll actually need "per token group" scaling for mxfp8 when using torchtitan, since the scaling groups will no longer cross token group boundaries. For now I am leaving it in, since it is still numerically equivalent and we may need this functionality for other pretraining frameworks or models. It's still early days.
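To illustrate the note above, here is a minimal sketch in plain Python (not torchao code) of why per-token-group scaling becomes equivalent to ordinary block scaling once token group sizes are aligned. The helper names `global_blocks` and `per_group_blocks` and the example group sizes are hypothetical, and the block size of 32 is assumed to match the mxfp8 default.

```python
def global_blocks(total_tokens: int, block_size: int) -> list[tuple[int, int]]:
    """Scaling blocks when the whole token (M) dimension is chunked at once."""
    return [
        (start, min(start + block_size, total_tokens))
        for start in range(0, total_tokens, block_size)
    ]


def per_group_blocks(group_sizes: list[int], block_size: int) -> list[tuple[int, int]]:
    """Scaling blocks when each token group is chunked separately."""
    blocks = []
    group_start = 0
    for size in group_sizes:
        group_end = group_start + size
        for start in range(group_start, group_end, block_size):
            # A block never extends past the end of its own token group.
            blocks.append((start, min(start + block_size, group_end)))
        group_start = group_end
    return blocks


BLOCK_SIZE = 32  # assumed mxfp8 default

aligned = [64, 32, 96]    # every group size is a multiple of BLOCK_SIZE
unaligned = [40, 56, 96]  # same total, but groups end mid-block

# With aligned groups, both schemes produce identical scaling blocks.
assert per_group_blocks(aligned, BLOCK_SIZE) == global_blocks(sum(aligned), BLOCK_SIZE)

# With unaligned groups, a global block would straddle two token groups,
# so the per-group scheme has to cut different (shorter) blocks.
assert per_group_blocks(unaligned, BLOCK_SIZE) != global_blocks(sum(unaligned), BLOCK_SIZE)
```

Under these assumptions, the per-group path only changes the result when group boundaries fall inside a block, which is the case the alignment change removes.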

Test plan

Added an integration test using torchtitan, with changes to make the token group alignment size configurable.

pytorch-bot · 2025-07-30T23:19:32Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2642

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

ghstack-mergeability-check and Check labels failing with 'Resource not accessible by integration'

This comment was automatically generated by Dr. CI and updates every 15 minutes.

stack-info: PR: #2642, branch: danielvegamyhre/stack/26

## Summary
- For mxfp8, token group sizes must be multiples of `block_size`. In the backward pass for `grad_weight = grad_output_t @ input`, the "M" (token) dimension is the contracting dimension, and each token group is a logically distinct subtensor, so we scale the groups separately. This means each token group's contracting dimension must be divisible by the mxfp8 `block_size` (default 32). Here is a diagram showing the problem: https://www.internalfb.com/excalidraw/EX521879
- To solve this, this PR makes the token group M alignment configurable.

## Test plan
- Integration test with torchao passes: pytorch/ao#2642
- Did manual test run with llama4 debug model using bf16
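The alignment requirement in the summary above can be sketched as follows. This is an illustrative example in plain Python, not the torchtitan or torchao implementation: the helpers `round_up`, `align_token_groups`, and `is_block_aligned` are hypothetical names, and padding by rounding each group size up to the alignment is an assumption about how the requirement could be met.

```python
def round_up(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple


def align_token_groups(group_sizes: list[int], alignment: int) -> list[int]:
    """Pad each token group size up to a multiple of `alignment` (hypothetical helper)."""
    return [round_up(size, alignment) for size in group_sizes]


def is_block_aligned(group_sizes: list[int], block_size: int = 32) -> bool:
    """Check that every group's contracting (M) dimension is divisible by block_size."""
    return all(size % block_size == 0 for size in group_sizes)


# Tokens routed to each expert; the values are made up for illustration.
raw_sizes = [5, 32, 33]
padded_sizes = align_token_groups(raw_sizes, alignment=32)

assert not is_block_aligned(raw_sizes)  # 5 and 33 are not multiples of 32
assert is_block_aligned(padded_sizes)   # padded sizes are [32, 32, 64]
```

Making the alignment a parameter rather than a constant lets the same padding step serve formats with a different divisibility requirement than the mxfp8 block size.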


vkuzo

lg for prototype, we might need to change this later


danielvegamyhre added a commit that referenced this pull request Jul 30, 2025

Make scaling type configurable for MoE training

4fbf578

stack-info: PR: #2642, branch: danielvegamyhre/stack/26

danielvegamyhre force-pushed the danielvegamyhre/stack/26 branch from 507b6cc to 4fbf578 Compare July 30, 2025 23:19

meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 30, 2025

This was referenced Jul 30, 2025

support for 2d-2d emulated mxfp8 grouped gemm #2632

Merged

backward pass for differentiable mxfp8 grouped gemm with dynamic quant #2639

Merged

danielvegamyhre added the topic: not user facing Use this tag if you don't want this PR to show up in release notes label Jul 30, 2025

danielvegamyhre changed the base branch from danielvegamyhre/stack/25 to main July 31, 2025 17:24

danielvegamyhre added a commit that referenced this pull request Jul 31, 2025

Make scaling type configurable for MoE training

1434e9b

stack-info: PR: #2642, branch: danielvegamyhre/stack/26

danielvegamyhre force-pushed the danielvegamyhre/stack/26 branch from 4fbf578 to 1434e9b Compare July 31, 2025 17:24

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/25 July 31, 2025 17:24

danielvegamyhre changed the base branch from danielvegamyhre/stack/25 to main July 31, 2025 17:33

danielvegamyhre force-pushed the danielvegamyhre/stack/26 branch from 1434e9b to a5403ac Compare July 31, 2025 17:33

danielvegamyhre added a commit that referenced this pull request Jul 31, 2025

Make scaling type configurable for MoE training

a5403ac

stack-info: PR: #2642, branch: danielvegamyhre/stack/26

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/25 July 31, 2025 17:33

danielvegamyhre mentioned this pull request Jul 31, 2025

Make token group alignment size configurable pytorch/torchtitan#1503

Merged

danielvegamyhre changed the base branch from danielvegamyhre/stack/25 to main July 31, 2025 22:31

danielvegamyhre added a commit that referenced this pull request Jul 31, 2025

Make scaling type configurable for MoE training

a828d09

stack-info: PR: #2642, branch: danielvegamyhre/stack/26

danielvegamyhre force-pushed the danielvegamyhre/stack/26 branch from a5403ac to a828d09 Compare July 31, 2025 22:31

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/25 July 31, 2025 22:31

vkuzo approved these changes Aug 1, 2025


danielvegamyhre changed the base branch from danielvegamyhre/stack/25 to main August 1, 2025 18:37

danielvegamyhre added a commit that referenced this pull request Aug 1, 2025

Make scaling type configurable for MoE training

82e707e

stack-info: PR: #2642, branch: danielvegamyhre/stack/26

danielvegamyhre force-pushed the danielvegamyhre/stack/26 branch from a828d09 to 82e707e Compare August 1, 2025 18:37

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/25 August 1, 2025 18:37

danielvegamyhre force-pushed the danielvegamyhre/stack/25 branch from e9ba18b to 2aabb15 Compare August 1, 2025 20:18

danielvegamyhre added a commit that referenced this pull request Aug 1, 2025

Make scaling type configurable for MoE training

1b362ee

stack-info: PR: #2642, branch: danielvegamyhre/stack/26

danielvegamyhre force-pushed the danielvegamyhre/stack/26 branch from 82e707e to 1b362ee Compare August 1, 2025 20:18

danielvegamyhre added a commit that referenced this pull request Aug 1, 2025

Make scaling type configurable for MoE training

bb05933

stack-info: PR: #2642, branch: danielvegamyhre/stack/26

danielvegamyhre force-pushed the danielvegamyhre/stack/26 branch from 1b362ee to bb05933 Compare August 1, 2025 20:18

danielvegamyhre changed the base branch from danielvegamyhre/stack/25 to main August 1, 2025 20:18

danielvegamyhre added a commit that referenced this pull request Aug 4, 2025

Make scaling type configurable for MoE training

df8adf3

stack-info: PR: #2642, branch: danielvegamyhre/stack/26

danielvegamyhre force-pushed the danielvegamyhre/stack/26 branch from bb05933 to df8adf3 Compare August 4, 2025 16:00

Make scaling type configurable for MoE training

a221a9e

stack-info: PR: #2642, branch: danielvegamyhre/stack/26

danielvegamyhre force-pushed the danielvegamyhre/stack/26 branch from df8adf3 to a221a9e Compare August 5, 2025 15:46

danielvegamyhre merged commit 418593c into main Aug 5, 2025
4 checks passed

liangel-02 pushed a commit that referenced this pull request Aug 25, 2025

Make scaling type configurable for MoE training (#2642)

da1e160

stack-info: PR: #2642, branch: danielvegamyhre/stack/26

danielvegamyhre added the moe label Sep 9, 2025


3 participants
