fix tensor parallelism for float8 training with rowwise scaling #1718
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1718

Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
LGTM, left a couple minor comments
test/float8/test_dtensor.py (Outdated)
```diff
@@ -196,14 +213,25 @@ def _test_fp8_mlp_tensor_parallelism_base(
     sp_model = copy.deepcopy(toy_model)
     sp_model = convert_to_float8_training(sp_model, config=config)

+    # for tensorwise scaling, enable float8 all_gather
+    # for rowwise scaling, keep high precision all_gather
```
Can we expand this comment to explain the reasoning behind this (why fp8 all gather for tensorwise and HP all gather for rowwise)?
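For context while that comment is being expanded, here is the reasoning as I understand it, with a minimal illustration (my sketch, not code from the PR): tensorwise scaling uses a single scale for the whole tensor, so shards cast to float8 with that scale before the all-gather match the corresponding slices of the gathered tensor cast afterwards; rowwise scales are per-row amaxes, which in general change once shards are concatenated along a non-row dimension, so the gather has to happen first, in high precision.

```python
# Minimal sketch (not PR code): tensorwise scales are shard-invariant,
# rowwise scales generally are not when sharding cuts across rows.
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)

# tensorwise: one amax for the whole tensor; shards cast with the
# full-tensor scale are consistent, so float8 all_gather is safe
tensorwise_amax = x.abs().max()

# rowwise: one amax per row along the last dim
rowwise_amax = x.abs().amax(dim=-1, keepdim=True)

# shard along the last dim, as a sequence-parallel Shard(1) layout might;
# shard-local rowwise amaxes no longer match the full-tensor ones, so
# casting shards to float8 before the gather would bake in wrong scales
shard = x[:, :4]
shard_rowwise_amax = shard.abs().amax(dim=-1, keepdim=True)
print(torch.equal(shard_rowwise_amax, rowwise_amax))  # False in general
```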
```python
prepare_input = prepare_input_cls(
    input_layouts=Shard(1),
    desired_input_layouts=Replicate(),
    fwd_config_submodule_fqn="w2",
)
```
So this is saying we use the forward config from the FFN w2 linear layer to perform the fp8 conversion on the inputs? If so, why specifically w2?
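As background for this question, a hedged sketch of the pattern such an FQN argument addresses (module and attribute names below are my assumptions, not torchao internals): when one module input feeds several float8 linears, it can only be cast to float8 once, so the cast settings must be taken from a single named consumer.

```python
# Hypothetical sketch: one shared input feeding multiple float8 linears.
# The shared input is cast once, so exactly one consumer's config must
# be chosen; an FQN like "w2" names that consumer. Names are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyFFN(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(dim, hidden, bias=False)  # consumes the same x as w1
        self.out = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # x would be cast to float8 once and reused by both w1 and w2,
        # hence the need to pick one submodule's cast config up front
        return self.out(F.silu(self.w1(x)) * self.w2(x))

def pick_input_cast_config(module: nn.Module, fqn: str):
    # resolve the named submodule and read its (hypothetical) forward
    # input cast config to use for the shared input
    return getattr(module.get_submodule(fqn), "input_cast_config", None)
```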
```diff
@@ -169,7 +169,9 @@ def backward(ctx, grad_output):
     # workaround from https://github.com/pytorch/pytorch/issues/141881
     # to avoid saving float8 weight from forward to backward when
     # FSDP is on
-    weight_hp_t = weight_hp_t + (grad_output_reshaped[0, 0] * 0)
+    g_reshaped = grad_output.reshape(-1, grad_output.shape[-1]) * 0
```
nit: since this workaround is now different from the one in the GitHub issue referenced in the comment, it would be helpful to update the comment to explain how the modified workaround fixes the interaction with TP.
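For readers following along, a minimal sketch of the zero-valued fake-dependency trick being adjusted here (an illustration under assumed shapes, not the PR's exact code): the added term is computed from grad_output but always equals zero, so weight values are unchanged while the graph now sees the high-precision weight as depending on the gradient, which keeps the float8 weight from being saved from forward to backward.

```python
# Illustrative sketch of the fake-dependency workaround (assumed shapes,
# not the PR's exact code). The added term is exactly zero, so values
# are unchanged, but a data dependency on grad_output is created.
import torch

def add_fake_grad_dependency(weight_hp_t: torch.Tensor, grad_output: torch.Tensor) -> torch.Tensor:
    # zero the whole reshaped tensor first, then index; the diff reorders
    # the ops this way, apparently so the chain stays a plain
    # reshape/mul/index sequence that also remains valid when
    # grad_output is a sharded DTensor
    g_reshaped = grad_output.reshape(-1, grad_output.shape[-1]) * 0
    # adding a zero scalar broadcasts against any weight shape
    return weight_hp_t + g_reshaped[0, 0]
```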
Summary:

1. add a test for toy model + TP + float8 rowwise scaling training
2. fix underlying issues to make the test pass:
   a. add fast path for tensor view where the new shape is the same as old shape, for rowwise scaled float8 (this is needed for DTensor)
   b. modify the fake grad dependency workaround to work when grad is a DTensor

Test Plan:

1. ./test/float8/test_everything.sh (one transient failure: https://www.internalfb.com/phabricator/paste/view/P1733103301)
2. verified that float8 rowwise scaling behaves sanely in torchtitan on LLaMa 3 8B on 8 H100s, with tp 2:

```
// requires pytorch/torchtitan#808

// baseline - bfloat16 + compile + tp 2
> with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.tensor_parallel_degree 2 --training.compile
[rank0]:2025-02-14 13:41:16,175 - root - INFO - step: 40  loss: 7.4240  memory: 35.56GiB(37.43%)  tps: 1,669  mfu: 9.77%

// float8 baseline - float8 tensorwise + compile + tp 2
> with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --float8.enable_float8_linear --training.tensor_parallel_degree 2 --training.compile
[rank0]:2025-02-14 13:44:07,806 - root - INFO - step: 40  loss: 7.4993  memory: 35.57GiB(37.44%)  tps: 2,141  mfu: 12.54%

// float8 rowwise without zero fake dep (for sanity) + compile + tp 2
> with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --float8.enable_float8_linear --training.tensor_parallel_degree 2 --training.compile --float8.recipe_name all_axiswise
[rank0]:2025-02-14 13:47:51,400 - root - INFO - step: 40  loss: 7.3472  memory: 35.55GiB(37.42%)  tps: 1,858  mfu: 10.88%

// float8 rowwise + compile + tp 2
> with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --float8.enable_float8_linear --training.tensor_parallel_degree 2 --training.compile --float8.recipe_name all_axiswise
[rank0]:2025-02-14 13:51:20,864 - root - INFO - step: 40  loss: 9.4211  memory: 35.55GiB(37.42%)  tps: 1,820  mfu: 10.66%
```

Reviewers:

Subscribers:

Tasks:

Tags:
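Item 2a of the summary (the same-shape view fast path for rowwise-scaled float8) can be sketched roughly as follows; the function and argument names are mine, not torchao's:

```python
# Rough sketch of a same-shape view fast path (hypothetical names).
# Rowwise scales carry one entry per row, so an arbitrary view would
# need to re-derive how scales map onto the new shape; when the new
# shape equals the old one, data and scales can be reused directly,
# which is the case DTensor exercises.
import torch

def float8_rowwise_view(data: torch.Tensor, scale: torch.Tensor, new_shape):
    if torch.Size(new_shape) == data.shape:
        # fast path: shape is unchanged, existing rowwise scales stay valid
        return data.view(new_shape), scale
    # the general case must check that the view keeps rows intact before
    # reusing per-row scales; omitted in this sketch
    raise NotImplementedError("only the same-shape fast path is sketched")
```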
Force-pushed from 2ccf537 to f4adfb0 (Compare)