
[torchao float8tensor] #1415

Draft · wants to merge 43 commits into crpa/subclass-tensor-ops from crpa/subclass-torchao_float8tensor
Conversation

crcrpar (Collaborator) commented on Nov 8, 2024:

What does this PR do?

Improves the tensor subclass support of #1394 for torchao float8.

Note: pytorch/ao#1339 is needed.

My environment:

  • torch: 2.6.0a0+git62eea62
  • nvfuser: 0.2.23+gitbb05859
  • torchao: 0.7.0+gitb2e42ff6
  • CUDA device: RTX 6000 Ada Generation
  • Driver Version: 560.35.03
  • CUDA Version: 12.6


crcrpar force-pushed the crpa/subclass-torchao_float8tensor branch 2 times, most recently from 896b631 to 316327f on November 24, 2024 16:13
t-vi (Collaborator) commented on Nov 25, 2024:

@crcrpar if you merge main, the pt nightly distributed ci tests should be fixed.

crcrpar (Collaborator, Author):

This change should be in #1394

```
@@ -637,7 +637,7 @@ def _convert_pytorchfunc_to_thundertrace(
     trace = TraceCtx()
     trace.bound_symbols.extend(active_jit_ctx.computation_trace.pop_scope())
     func_result = unwrap(wrapped_func_result)
-    if shallow_copy_output:
+    if shallow_copy_output and not trace.bound_symbols:
```

Comment on lines +774 to +791

```python
added_bsym: BoundSymbol = get_jit_ctx().computation_trace.scopes[-1][-1]
import_ctx, call_ctx, object_ctx = {}, {}, {}
for bsym in trace_of_fwd.bound_symbols:
    cur_import_ctx, cur_call_ctx, cur_object_ctx = bsym.gather_ctxs()
    import_ctx.update(cur_import_ctx)
    call_ctx.update(cur_call_ctx)
    object_ctx.update(cur_object_ctx)

if import_ctx:
    added_bsym._import_ctx.update(import_ctx)
if call_ctx:
    if added_bsym._call_ctx is not None:
        added_bsym._call_ctx.update(call_ctx)
    else:
        added_bsym._call_ctx = call_ctx
if object_ctx:
    added_bsym._object_ctx.update(object_ctx)
```
crcrpar (Collaborator, Author):

should be in #1394

crcrpar (Collaborator, Author):

This change should also be in #1394

crcrpar (Collaborator, Author):

should be in #1394

Signed-off-by: Masaki Kozuki <[email protected]>
next, function with tensor creation in it


revert wrong patch


supply unpacks with traces generated within the lookaside

crcrpar and others added 28 commits November 28, 2024 21:31
as the torchao float8 ops table includes `transpose` but not `permute`.

```
E               RuntimeError: While trying to flatten the following BoundSymbol:
E               t165 = manual_float8_matmul_with_args_in_float8_127532775308928_2(input_fp8, t164)  # t165: "cuda:0 f32[16, 64]"
E                 # t102 = ltorch.reshape(input_fp8, -1, 32)  # t102: "cuda:0 f32[16, 32]"
E                   # t102 = prims.reshape(input_fp8, (16, 32))  # t102: "cuda:0 f32[16, 32]"
E                 # t103 = ltorch.spmm(t102, t164)  # t103: "cuda:0 f32[16, 64]"
E                 # t165 = prims.shallow_copy(t103)  # t165: "cuda:0 f32[16, 64]"
E               Unsupported op of torch._scaled_mm found from
E               class <lambda>(torch.nn.Module):
E                   def forward(self, arg0, arg1, arg2, arg3, arg4, arg5):
E                       arg0_1: "f8e4m3fn[16, 32]"; arg1_1: "f32[]"; arg3_1: "f8e4m3fn[32, 64]"; arg4_1: "f32[]";
E
E                       arg0_1, arg1_1, arg2_1, arg2_2, arg2_3, arg2_4, arg2_5, arg2_6, arg2_7, arg2_8, arg2_9, arg2_10, arg2_11, arg2_12, arg2_13, arg2_14, arg2_15, arg3_1, arg4_1, arg5_1, arg5_2, arg5_3, arg5_4, arg5_5, arg5_6, arg5_7, arg5_8, arg5_9, arg5_10, arg5_11, arg5_12, arg5_13, arg5_14, arg5_15, = fx_pytree.tree_flatten_spec([arg0, arg1, arg2, arg3, arg4, arg5], self._in_spec)
E                       # No stacktrace found for following nodes
E                       view: "f8e4m3fn[16, 32]" = torch.ops.aten.view.default(arg0_1, [-1, 32]);  arg0_1 = None
E                       t: "f8e4m3fn[64, 32]" = torch.ops.aten.t.default(arg3_1);  arg3_1 = None
E                       clone: "f8e4m3fn[64, 32]" = torch.ops.aten.clone.default(t, memory_format = torch.contiguous_format);  t = None
E                       t_1: "f8e4m3fn[32, 64]" = torch.ops.aten.t.default(clone);  clone = None
E                       reciprocal: "f32[]" = torch.ops.aten.reciprocal.default(arg1_1);  arg1_1 = None
E                       reciprocal_1: "f32[]" = torch.ops.aten.reciprocal.default(arg4_1);  arg4_1 = None
E                       _scaled_mm: "f32[16, 64]" = torch.ops.aten._scaled_mm.default(view, t_1, reciprocal, reciprocal_1, None, None, torch.float32, True);  view = t_1 = reciprocal = reciprocal_1 = None
E                       return pytree.tree_unflatten([_scaled_mm, None], self._out_spec)

thunder/transforms/tensor_subclasses.py:299: RuntimeError
```

still failing as `_scaled_mm` requires the second matrix to be column-major:

```
E               NotImplementedError: Failing to map `torch._scaled_mm` to `thunder.torch` op of [Symbol name=_scaled_mm] with args of [<TensorProxy(name="t166", dtype=thunder.dtypes.float8_e4m3fn, shape=(16, 32))>, <TensorProxy(name="t169", dtype=thunder.dtypes.float8_e4m3fn, shape=(32, 64))>, <TensorProxy(name="t170", dtype=thunder.dtypes.float32, shape=())>, <TensorProxy(name="t171", dtype=thunder.dtypes.float32, shape=())>, None, None, torch.float32, True]
E               BoundSymbol in question is
E               ```python
E               t165 = manual_float8_matmul_with_args_in_float8_127377658692416_2(input_fp8, t164)  # t165: "cuda:0 f32[16, 64]"
E                 # t102 = ltorch.reshape(input_fp8, -1, 32)  # t102: "cuda:0 f32[16, 32]"
E                   # t102 = prims.reshape(input_fp8, (16, 32))  # t102: "cuda:0 f32[16, 32]"
E                 # t103 = ltorch.spmm(t102, t164)  # t103: "cuda:0 f32[16, 64]"
E                 # t165 = prims.shallow_copy(t103)  # t165: "cuda:0 f32[16, 64]"
E               ```
E               Corresponding torch.fx Graph is
E               ```python
E               class <lambda>(torch.nn.Module):
E                   def forward(self, arg0, arg1, arg2, arg3, arg4, arg5):
E                       arg0_1: "f8e4m3fn[16, 32]"; arg1_1: "f32[]"; arg3_1: "f8e4m3fn[32, 64]"; arg4_1: "f32[]";
E
E                       arg0_1, arg1_1, arg2_1, arg2_2, arg2_3, arg2_4, arg2_5, arg2_6, arg2_7, arg2_8, arg2_9, arg2_10, arg2_11, arg2_12, arg2_13, arg2_14, arg2_15, arg3_1, arg4_1, arg5_1, arg5_2, arg5_3, arg5_4, arg5_5, arg5_6, arg5_7, arg5_8, arg5_9, arg5_10, arg5_11, arg5_12, arg5_13, arg5_14, arg5_15, = fx_pytree.tree_flatten_spec([arg0, arg1, arg2, arg3, arg4, arg5], self._in_spec)
E                       # No stacktrace found for following nodes
E                       view: "f8e4m3fn[16, 32]" = torch.ops.aten.view.default(arg0_1, [-1, 32]);  arg0_1 = None
E                       t: "f8e4m3fn[64, 32]" = torch.ops.aten.t.default(arg3_1);  arg3_1 = None
E                       clone: "f8e4m3fn[64, 32]" = torch.ops.aten.clone.default(t, memory_format = torch.contiguous_format);  t = None
E                       t_1: "f8e4m3fn[32, 64]" = torch.ops.aten.t.default(clone);  clone = None
E                       reciprocal: "f32[]" = torch.ops.aten.reciprocal.default(arg1_1);  arg1_1 = None
E                       reciprocal_1: "f32[]" = torch.ops.aten.reciprocal.default(arg4_1);  arg4_1 = None
E                       _scaled_mm: "f32[16, 64]" = torch.ops.aten._scaled_mm.default(view, t_1, reciprocal, reciprocal_1, None, None, torch.float32, True);  view = t_1 = reciprocal = reciprocal_1 = None
E                       return pytree.tree_unflatten([_scaled_mm, None], self._out_spec)
E
E               ```
E               Original error is Exception encountered when doing automatic registration for _scaled_mm, please use manual registration: RuntimeError('mat2 must be col_major')
```
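The `mat2 must be col_major` requirement is a memory-layout constraint that can be illustrated outside of torch. As a hedged sketch (using NumPy's Fortran order rather than actually calling `torch._scaled_mm`), a column-major matrix is one whose columns are contiguous in memory, i.e. the stride of its first axis equals the element size:

```python
import numpy as np

# Sketch only: illustrates the row-major vs column-major layouts that the
# "mat2 must be col_major" error refers to; torch._scaled_mm is not called here.
a = np.ones((32, 64), dtype=np.float32)  # row-major (C order): rows contiguous
col_major = np.asfortranarray(a)         # column-major (Fortran order) copy

print(a.strides)          # (256, 4): next element within a row is 4 bytes away
print(col_major.strides)  # (4, 128): next element within a column is 4 bytes away
```

In torch terms this is why the fx graph above does `clone` (to contiguous) followed by `t`: transposing a contiguous row-major matrix yields exactly a column-major view.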

crcrpar force-pushed the crpa/subclass-torchao_float8tensor branch from 04d528a to 804bc99 on November 28, 2024 12:32
Comment on lines +275 to +277
```python
if executor == DynamoThunderExecutor:
    with pytest.raises(AssertionError):
        torch.testing.assert_close(actual, expected)
```
crcrpar (Collaborator, Author):

This failure doesn't look easy to fix to me, so I extracted it into a script:

```python
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training
from thunder.dynamo import ThunderCompiler
from thunder.dynamo.splitter import SubgraphInfo
from thunder.tests.make_tensor import make_tensor


def main():
    batch_size, in_features, out_features = 16, 32, 64

    device = torch.device("cuda")
    dtype = torch.float32

    model = nn.Linear(in_features, out_features, bias=False, device=device, dtype=dtype)
    fp8_model = convert_to_float8_training(model)
    x = make_tensor((batch_size, in_features), device=device, dtype=dtype)
    expected = fp8_model(x)

    backend = ThunderCompiler()
    jitted = torch.compile(fp8_model, backend=backend)
    actual = jitted(x)

    backend.save_reproducer_to_folder("./debug_torchao_with_thunderfx", use_pytest_benchmark=True)
    print(f"{len(backend.subgraph_infos) = }")
    subgraph: SubgraphInfo
    for subgraph in backend.subgraph_infos:
        print(f"# {len(subgraph.thunder_compiled_fns) = }")

    torch.testing.assert_close(actual, expected)


if __name__ == "__main__":
    main()
```

note that pytorch/ao#1339 is needed at the moment.

Below is the console output of the script above:

```
% python debug_thunderfx_torchao_fp8.py
/home/mkozuki/ghq/github.com/Lightning-AI/lightning-thunder/thunder/dynamo/compiler.py:21: UserWarning: The ThunderCompiler is in active development and may not work as expected. Please report any issues you encounter to the Lightning Thunder team.
  warnings.warn(
len(backend.subgraph_infos) = 1
# len(subgraph.thunder_compiled_fns) = 0
Traceback (most recent call last):
  File "/home/mkozuki/ghq/github.com/Lightning-AI/lightning-thunder/debug_thunderfx_torchao_fp8.py", line 34, in <module>
    main()
  File "/home/mkozuki/ghq/github.com/Lightning-AI/lightning-thunder/debug_thunderfx_torchao_fp8.py", line 30, in main
    torch.testing.assert_close(actual, expected)
  File "/home/mkozuki/ghq/github.com/crcrpar/pytorch/torch/testing/_comparison.py", line 1530, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 388 / 1024 (37.9%)
Greatest absolute difference: 0.18639898300170898 at index (1, 61) (up to 1e-05 allowed)
Greatest relative difference: 1.9664803743362427 at index (10, 33) (up to 1.3e-06 allowed)
```

So it seems that thunder.jit isn't actually used for this program (`len(subgraph.thunder_compiled_fns) = 0`), yet the numerics still diverge.
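A mismatch of this magnitude is roughly what fp8 precision predicts against `assert_close`'s fp32 defaults. As a rough, hypothetical sketch (mantissa truncation only; it ignores e4m3's exponent range, saturation, and torchao's dynamic scaling), quantizing the matmul inputs to ~3 mantissa bits already produces absolute differences far beyond the 1e-05 default tolerance:

```python
import numpy as np


def quantize_3bit_mantissa(x):
    # Crude stand-in for e4m3 rounding: keep roughly 3 explicit mantissa bits.
    # Real float8 also clamps the exponent and saturates; this sketch does not.
    m, e = np.frexp(x)                    # x == m * 2**e with |m| in [0.5, 1)
    return np.ldexp(np.round(m * 16) / 16, e)


rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32)).astype(np.float32)   # same shapes as the repro
w = rng.standard_normal((32, 64)).astype(np.float32)

ref = x @ w                                            # fp32 reference
approx = quantize_3bit_mantissa(x) @ quantize_3bit_mantissa(w)

# Large compared to assert_close's fp32 defaults (atol=1e-05)
print(np.max(np.abs(ref - approx)))
```

This only argues the mismatch size is plausible for fp8; it does not explain why the jitted and eager paths disagree, which is what the comparison in the script was meant to catch.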

Collaborator:

Can you check whether the results stay the same between different invocations? (Maybe due to low precision, the results could differ.)

```python
expected = fp8_model(x)
actual = fp8_model(x)
torch.testing.assert_close(actual, expected)
```

Collaborator:

But please add a comment explaining why expected and actual both come from calling the same model, rather than from the model and a reference implementation.
