[cleanup][2/x] split float8 mm by delayed vs dynamic #1461

vkuzo · 2024-12-27T21:42:03Z

Summary:

Before this PR, the float8 mm logic was split by axiswise vs tensorwise scaling.
After this PR, it is split by dynamic vs non-dynamic scaling.

Motivation: there is growing evidence that dynamic scaling will
be common to the most important low-precision recipes. This PR is a step
toward making the dynamic scaling logic in torchao.float8 simpler and
easier to understand.

There are many other simplifications left to make, but this PR stops here
to stay small. It is a pure refactor with no logic changes.
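
For readers unfamiliar with the terms, here is a minimal sketch of dynamic scaling at both granularities. It is an illustration in plain Python, not the torchao.float8 implementation: the helper names are hypothetical, and it assumes the float8 e4m3fn format, whose largest finite value is 448. In both cases the scale is derived from the current tensor's maximum absolute value, which is why granularity (tensorwise vs axiswise) is orthogonal to the dynamic vs non-dynamic distinction.

```python
from typing import List

E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3fn
EPS = 1e-12       # avoids division by zero for all-zero inputs


def tensorwise_scale(x: List[List[float]]) -> float:
    """Dynamic tensorwise scaling: one scale from the whole tensor's current amax."""
    amax = max(abs(v) for row in x for v in row)
    return E4M3_MAX / max(amax, EPS)


def axiswise_scales(x: List[List[float]]) -> List[float]:
    """Dynamic axiswise scaling: one scale per row, from that row's current amax."""
    return [E4M3_MAX / max(max(abs(v) for v in row), EPS) for row in x]


x = [[0.5, -2.0], [0.25, 0.125]]
print(tensorwise_scale(x))  # 224.0
print(axiswise_scales(x))   # [224.0, 1792.0]
```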

Test Plan:

./test/float8/test_everything.sh

// torchtitan Llama 3 8B on 8 H100s

// baseline (bf16 + compile)
memory: 37.87GiB(39.86%)  tps: 5,787

// before this PR
// dynamic-tensorwise
memory: 37.98GiB(39.98%)  tps: 6,880
// dynamic-tensorwise with float8 all-gather
memory: 37.54GiB(39.51%)  tps: 7,126
// all-axiswise
memory: 57.03GiB(60.03%)  tps: 6,330
// lw_axiswise_with_gw_hp
memory: 57.03GiB(60.03%)  tps: 6,021

// after this PR
// dynamic-tensorwise
memory: 37.98GiB(39.98%)  tps: 6,891
// dynamic-tensorwise with float8 all-gather
memory: 37.54GiB(39.51%)  tps: 7,127
// all-axiswise
memory: 57.03GiB(60.03%)  tps: 6,345
// lw_axiswise_with_gw_hp
memory: 57.03GiB(60.03%)  tps: 6,042
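
To check that the refactor is throughput-neutral, the numbers above can be compared directly. The short script below is only arithmetic over the reported tokens/sec figures; the dictionary layout and the `pct_change` helper are illustrative, not part of the test plan.

```python
# tokens/sec reported above, keyed by recipe: (before this PR, after this PR)
TPS = {
    "dynamic-tensorwise": (6880, 6891),
    "dynamic-tensorwise with float8 all-gather": (7126, 7127),
    "all-axiswise": (6330, 6345),
    "lw_axiswise_with_gw_hp": (6021, 6042),
}
BASELINE_TPS = 5787  # bf16 + compile


def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


for name, (before, after) in TPS.items():
    print(
        f"{name}: {pct_change(before, after):+.2f}% vs before, "
        f"{after / BASELINE_TPS:.3f}x vs bf16 baseline"
    )
```

Every recipe changes by less than 0.5% between the two runs, which is consistent with a refactor that does not change logic. Memory usage is identical before and after for all four recipes.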

vkuzo · 2024-12-27T21:42:04Z

Stack from ghstack (oldest at bottom):

Summary:

Before this PR, the float8 mm logic was split by axiswise vs tensorwise. After this PR, the float8 mm logic is split by dynamic vs non-dynamic scaling.

Motivation: there is more and more evidence that dynamic scaling will be common to the most important lowp recipes. This PR is a step on the way to making the dynamic scaling logic be simpler and easier to understand in `torchao.float8`.

There are a lot of other simplifications to do, but stopping here to keep the PR small. This is a pure refactor without any logic changes.

Test Plan:

```
./test/float8/test_everything.sh
```

ghstack-source-id: 8a9792272e3aeea12d705eb5a466b9830dac6420
ghstack-comment-id: 2564049757
Pull Request resolved: #1461

drisspg · 2025-01-13T16:27:22Z

torchao/float8/float8_linear.py

+    config: Float8LinearConfig,
+) -> Optional[torch.Tensor]:
+    if tensor_already_casted_to_fp8(weight):
+        return None


I figured it would have just returned the scale on the fp8 tensor
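
The diff above returns `None` when the weight has already been cast to float8, while the comment suggests returning the scale stored on that tensor instead. A minimal sketch of the two behaviors follows. It uses a hypothetical `FakeFp8Weight` stand-in and plain Python floats rather than torchao types, and the function names are illustrative only; the 448 constant assumes the e4m3fn format.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class FakeFp8Weight:
    """Hypothetical stand-in for a weight that was already cast to float8."""
    data: List[float]
    scale: float


Weight = Union[List[float], FakeFp8Weight]


def _dynamic_scale(values: List[float]) -> float:
    # Scale derived from the current max abs value (e4m3fn max assumed to be 448).
    return 448.0 / max(max(abs(v) for v in values), 1e-12)


def scale_or_none(weight: Weight) -> Optional[float]:
    # Mirrors the diff: nothing to compute for a pre-cast weight, so return None.
    if isinstance(weight, FakeFp8Weight):
        return None
    return _dynamic_scale(weight)


def scale_or_existing(weight: Weight) -> float:
    # The alternative raised in review: hand back the scale already on the tensor.
    if isinstance(weight, FakeFp8Weight):
        return weight.scale
    return _dynamic_scale(weight)


precast = FakeFp8Weight(data=[1.0, -4.0], scale=112.0)
print(scale_or_none(precast), scale_or_existing(precast))
print(scale_or_none([1.0, -4.0]), scale_or_existing([1.0, -4.0]))
```

With the first variant, callers must handle `None` explicitly; with the second, the return type is never optional, at the cost of returning a scale the caller may not need.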

vkuzo added 3 commits December 23, 2024 13:32

Update

09821f0

[ghstack-poisoned]

Update

fb3b255

[ghstack-poisoned]

Update

e32188c

[ghstack-poisoned]

vkuzo mentioned this pull request Dec 27, 2024

[cleanup][1/x] make hp_tensor_to_float8_dynamic only work with hp inputs #1458

Merged

facebook-github-bot added the CLA Signed label Dec 27, 2024

vkuzo added the topic: not user facing label Dec 27, 2024

Update

3ea9803

[ghstack-poisoned]

This was referenced Jan 2, 2025

[cleanup][3/x] unify dynamic input and grad_output casting #1480

Merged

[cleanup][4/x] unify weight casting #1481

Open

vkuzo added 2 commits January 8, 2025 13:07

Update

e6bc640

[ghstack-poisoned]

Update

c9f3a3f

[ghstack-poisoned]

vkuzo requested review from jerryzh168 and drisspg January 13, 2025 15:59

vkuzo added 2 commits January 13, 2025 08:00

Update

c857ff7

[ghstack-poisoned]

Update

d6b1af5

[ghstack-poisoned]

drisspg reviewed Jan 13, 2025

drisspg approved these changes Jan 13, 2025

vkuzo changed the base branch from gh/vkuzo/13/head to main January 13, 2025 20:13

Update

855e974

[ghstack-poisoned]

vkuzo merged commit 2ec9bc1 into main Jan 13, 2025
28 of 37 checks passed
