Port gfx12 native attention to ABI3 fork #92

Open
jammm wants to merge 7 commits into woct0rdho:abi3_stable from jammm:jam/gfx12-abi3

Conversation

@jammm jammm commented May 15, 2026

Main PR description can be seen at thu-ml#368

Summary

This ports the gfx12/RDNA4 SageAttention native extension work onto the abi3_stable fork branch.

Fork-specific changes:

  • Preserves the fork's stable ABI packaging model and builds a cp39-abi3 wheel.
  • Wires the gfx12 native ROCm extension into the ABI3 setup flow.
  • Adds TORCH_LIBRARY/torch.ops loading support for the gfx12 native kernels.
  • Adds fake/meta registrations needed by the fork's modern PyTorch extension style.
  • Keeps CUDA behavior intact while selecting the gfx12 path automatically on RDNA4/gfx12.

Verification

Tested on Windows with ROCm PyTorch / gfx1201:

  • pip install --no-build-isolation -v .
  • Built sageattention-2.2.0-cp39-abi3-win_amd64.whl
  • Smoke-tested gfx12 native fp8/fp16 paths
  • Checked causal and non-causal HND/NHD cases
  • Checked uneven Wan-style shape support

@0xDELUXA

Great to see this here too!

Comment thread sageattention/core.py
# inference step in distributed env for multi gpus inference. This small
# workaround also make sage attention work compatible with torch.compile
# through non-fullgraph compile mode.
torch.cuda.set_device(v.device)
Owner

@woct0rdho woct0rdho May 16, 2026

Are you sure we need set_device? What I do know is that it breaks torch.compile full graph compilation.

I've at least verified that SageAttention works on a single Nvidia GPU without set_device. I've heard reports that SageAttention does not work in multi-GPU setups, but I think the proper fix is to set the CUDA stream in the kernels, and I don't know whether set_device actually solves it.

@woct0rdho
Owner

woct0rdho commented May 16, 2026

Let's put back this change to support CUDA 13.2: 2131705

(Somehow I can't push to this PR, maybe because my repo is a fork?)

Also, can you confirm that it works with torch.compile full graph compilation on AMD GPU, such as using Kijai's compile node in ComfyUI? (Currently comfy-aimdo does not support full graph compilation and we need to disable it when testing this PR.)

Full graph compilation is worth supporting. For example, ComfyUI-INT8-Fast needs it to achieve the ideal 2x speedup.

Comment thread sageattention/core.py
Comment on lines 68 to +77
 def get_cuda_version():
-    version = torch.version.cuda
-    major, minor = version.split('.')
-    return int(major), int(minor)
+    try:
+        output = subprocess.check_output(['nvcc', '--version']).decode()
+        match = re.search(r'release (\d+)\.(\d+)', output)
+        if match:
+            major, minor = int(match.group(1)), int(match.group(2))
+            return major, minor
+    except Exception as e:
+        print("Failed to get CUDA version:", e)
+    return None, None
Owner

@woct0rdho woct0rdho May 16, 2026

Can we avoid calling nvcc and use something torch provides instead? Many Windows + Nvidia users may not have nvcc installed. They just use the CUDA DLLs bundled in torch and the ptxas.exe bundled in triton-windows.
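
One possible direction, sketched under assumptions: PyTorch exposes the CUDA version it was built against as the string torch.version.cuda (it is None on CPU-only and ROCm builds), so the version could be parsed without spawning nvcc. The helper name parse_cuda_version is hypothetical and is not part of this PR or of PyTorch.

```python
import re
from typing import Optional, Tuple


def parse_cuda_version(version: Optional[str]) -> Tuple[Optional[int], Optional[int]]:
    """Parse a 'major.minor[.patch]' string into (major, minor).

    Returns (None, None) when the string is missing or malformed, which
    mirrors the fallback of the nvcc-based code under review.
    """
    if not version:
        return None, None
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        return None, None
    return int(match.group(1)), int(match.group(2))


# Intended use (assumes torch is installed):
#     import torch
#     major, minor = parse_cuda_version(torch.version.cuda)
print(parse_cuda_version("13.2"))
print(parse_cuda_version(None))
```

Taking the string as an argument keeps the parsing testable without a GPU and leaves the caller free to decide what (None, None) means on a ROCm build.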

Comment thread sageattention/core.py
arch = _cuda_archs[q.device.index]
if arch == "sm75":

arch = get_cuda_arch_versions()[q.device.index]
Owner

Can we restore the logic in https://github.com/woct0rdho/SageAttention/blob/213170591394d2d96f086b490ba914cec614a140/sageattention/core.py: precompute _cuda_archs = get_cuda_arch_versions() as a global variable, then use it in sageattn? Otherwise it will break torch.compile full graph compilation.

@0xDELUXA

Also, can you confirm that it works with torch.compile full graph compilation on AMD GPU, such as using Kijai's compile node in ComfyUI? (Currently comfy-aimdo does not support full graph compilation and we need to disable it when testing this PR.)

Full graph compilation is worth supporting. For example, ComfyUI-INT8-Fast needs it to achieve the ideal 2x speedup.

I'm curious about this "torch.compile full graph compilation", which of Kijai's nodes can be used to try it in ComfyUI?

Aimdo doesn't support it? I'm not really sure what percentage of AMD users currently use the --enable-dynamic-vram startup flag to enable aimdo.

Is this ComfyUI-INT8-Fast node supposed to provide a 2x speedup on AMD as well? I haven't seen anything AMD-related in its docs.

@woct0rdho
Owner

woct0rdho commented May 16, 2026

Use the TorchCompileModelAdvanced node in ComfyUI-KJNodes, and enable 'fullgraph' in it. You can read more in https://github.com/woct0rdho/SageAttention/releases/tag/v2.2.0-windows.post4

Nowadays aimdo is enabled by default, and you need to pass --disable-dynamic-vram when starting ComfyUI, or tick 'disable_dynamic_vram' in TorchCompileModelAdvanced.

int8 matmul has a theoretical speedup of 2x over fp16/bf16 on RDNA4, but no speedup on RDNA3/3.5, just like SageAttention. (SageAttention optimizes the attention module, and ComfyUI-INT8-Fast optimizes the non-attention linear modules. Their effects can stack up.)

ComfyUI-INT8-Fast uses Triton kernels, and it's another question whether the Triton compiler is well optimized on RDNA4.
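
To make "stack up" concrete, here is a back-of-the-envelope sketch using Amdahl's law. The time fractions and the attention speedup below are invented for illustration and are not measurements from this thread; only the 2x theoretical int8 figure comes from the comment above.

```python
def overall_speedup(fractions_and_speedups):
    """Amdahl's law for several independently accelerated parts.

    fractions_and_speedups: iterable of (time_fraction, speedup) pairs whose
    fractions sum to 1. Returns the end-to-end speedup.
    """
    pairs = list(fractions_and_speedups)
    total = sum(fraction for fraction, _ in pairs)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return 1.0 / sum(fraction / speedup for fraction, speedup in pairs)


# Hypothetical profile: 40% attention, 50% linear layers, 10% everything else.
attention_only = overall_speedup([(0.4, 2.0), (0.5, 1.0), (0.1, 1.0)])
linear_only = overall_speedup([(0.4, 1.0), (0.5, 2.0), (0.1, 1.0)])
both = overall_speedup([(0.4, 2.0), (0.5, 2.0), (0.1, 1.0)])

print(f"attention only: {attention_only:.2f}x")
print(f"linear only:    {linear_only:.2f}x")
print(f"both:           {both:.2f}x")
```

With these made-up fractions the three cases come out at roughly 1.25x, 1.33x, and 1.82x, so accelerating both parts beats either one alone, while the untouched 10% keeps the total below 2x.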

@0xDELUXA

0xDELUXA commented May 16, 2026

Nowadays aimdo is enabled by default, and you need to pass --disable-dynamic-vram when starting ComfyUI, or tick 'disable_dynamic_vram' in TorchCompileModelAdvanced.

Aren't we in the "AMD support remains opt in with --enable-dynamic-vram" state? I'm referring to this.
Also: https://github.com/Comfy-Org/ComfyUI/blob/master/main.py#L218.
I'm quite sure it isn't enabled by default on AMD as of now.

@0xDELUXA

0xDELUXA commented May 16, 2026

Use the TorchCompileModelAdvanced node in ComfyUI-KJNodes, and enable 'fullgraph' in it. You can read more in https://github.com/woct0rdho/SageAttention/releases/tag/v2.2.0-windows.post4

Here are my local findings (gfx1200, torch 2.13.0a0+rocm7.13.0a20260416, triton-windows 3.6.0+gitae9d5a54.post27):

fullgraph=true error
!!! Exception during processing !!! torch.* op returned non-Tensor
  Explanation: torch.* ops that return a non-Tensor cannot be traced into the Dynamo FX graph output


  Developer debug context: example_value type: int; op: call_function; target: <function device_count at 0x0000018F57F05080>

 For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0208.html

from user code:
   File "C:\ComfyUI\comfy\ldm\flux\layers.py", line 235, in forward
    attn = attention(q, k, v, pe=pe, mask=attn_mask, transformer_options=transformer_options)
  File "C:\ComfyUI\comfy\ldm\flux\math.py", line 14, in attention
    x = optimized_attention(q, k, v, heads, skip_reshape=True, mask=mask, transformer_options=transformer_options)
  File "C:\ComfyUI\comfy\ldm\modules\attention.py", line 139, in wrapper
    return func(*args, **kwargs)
  File "C:\ComfyUI\comfy\ldm\modules\attention.py", line 569, in attention_sage
    out = sageattn(q, k, v, attn_mask=mask, is_causal=False, tensor_layout=tensor_layout)
  File "C:\ComfyUI\venv\Lib\site-packages\sageattention\core.py", line 713, in sageattn
    arch = get_cuda_arch_versions()[q.device.index]
  File "C:\ComfyUI\venv\Lib\site-packages\sageattention\core.py", line 84, in get_cuda_arch_versions
    for i in range(torch.cuda.device_count()):

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"

Traceback (most recent call last):
  File "C:\ComfyUI\execution.py", line 535, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\execution.py", line 335, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\execution.py", line 309, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "C:\ComfyUI\execution.py", line 297, in process_inputs
    result = f(**inputs)
             ^^^^^^^^^^^
  File "C:\ComfyUI\nodes.py", line 1576, in sample
    return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\nodes.py", line 1541, in common_ksampler
    samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\sample.py", line 66, in sample
    samples = sampler.sample(noise, positive, negative, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 1180, in sample
    return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 1070, in sample
    return cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 1052, in sample
    output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 995, in outer_sample
    output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 981, in inner_sample
    samples = executor.execute(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 751, in sample
    samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\utils\_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\k_diffusion\sampling.py", line 205, in sample_euler
    denoised = model(x, sigma_hat * s_in, **extra_args)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 400, in __call__
    out = self.inner_model(x, sigma, model_options=model_options, seed=seed)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 954, in __call__
    return self.outer_predict_noise(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 961, in outer_predict_noise
    ).execute(x, timestep, model_options, seed)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 964, in predict_noise
    return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 380, in sampling_function
    out = calc_cond_batch(model, conds, x, timestep, model_options)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 205, in calc_cond_batch
    return _calc_cond_batch_outer(model, conds, x_in, timestep, model_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 213, in _calc_cond_batch_outer
    return executor.execute(model, conds, x_in, timestep, model_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 325, in _calc_cond_batch
    output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\model_base.py", line 182, in apply_model
    return comfy.patcher_extension.WrapperExecutor.new_class_executor(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 113, in execute
    return self.wrappers[self.idx](self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy_api\torch_helpers\torch_compile.py", line 26, in apply_torch_compile_wrapper
    return executor(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 105, in __call__
    return new_executor.execute(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\model_base.py", line 226, in _apply_model
    model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1778, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1789, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\ldm\flux\model.py", line 345, in forward
    return comfy.patcher_extension.WrapperExecutor.new_class_executor(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\ldm\flux\model.py", line 406, in _forward
    out = self.forward_orig(img, img_ids, context, txt_ids, timestep, y, guidance, control, timestep_zero_index=timestep_zero_index, transformer_options=transformer_options, attn_mask=kwargs.get("attention_mask", None))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\ldm\flux\model.py", line 243, in forward_orig
    img, txt = block(img=img,
               ^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\_dynamo\eval_frame.py", line 473, in __call__
    return super().__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1778, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1789, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\_dynamo\eval_frame.py", line 1058, in compile_wrapper
    raise e.with_traceback(None) from e.__cause__  # User compiler error
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.Unsupported: torch.* op returned non-Tensor
  Explanation: torch.* ops that return a non-Tensor cannot be traced into the Dynamo FX graph output


  Developer debug context: example_value type: int; op: call_function; target: <function device_count at 0x0000018F57F05080>

 For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0208.html

from user code:
   File "C:\ComfyUI\comfy\ldm\flux\layers.py", line 235, in forward
    attn = attention(q, k, v, pe=pe, mask=attn_mask, transformer_options=transformer_options)
  File "C:\ComfyUI\comfy\ldm\flux\math.py", line 14, in attention
    x = optimized_attention(q, k, v, heads, skip_reshape=True, mask=mask, transformer_options=transformer_options)
  File "C:\ComfyUI\comfy\ldm\modules\attention.py", line 139, in wrapper
    return func(*args, **kwargs)
  File "C:\ComfyUI\comfy\ldm\modules\attention.py", line 569, in attention_sage
    out = sageattn(q, k, v, attn_mask=mask, is_causal=False, tensor_layout=tensor_layout)
  File "C:\ComfyUI\venv\Lib\site-packages\sageattention\core.py", line 713, in sageattn
    arch = get_cuda_arch_versions()[q.device.index]
  File "C:\ComfyUI\venv\Lib\site-packages\sageattention\core.py", line 84, in get_cuda_arch_versions
    for i in range(torch.cuda.device_count()):

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"


Prompt executed in 0.49 seconds

fullgraph=false error
!!! Exception during processing !!! ImportError: cannot import name 'GroupName' from 'torch.distributed' (unknown location)

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"

Traceback (most recent call last):
  File "C:\ComfyUI\execution.py", line 535, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\execution.py", line 335, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\execution.py", line 309, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "C:\ComfyUI\execution.py", line 297, in process_inputs
    result = f(**inputs)
             ^^^^^^^^^^^
  File "C:\ComfyUI\nodes.py", line 1576, in sample
    return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\nodes.py", line 1541, in common_ksampler
    samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\sample.py", line 66, in sample
    samples = sampler.sample(noise, positive, negative, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 1180, in sample
    return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 1070, in sample
    return cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 1052, in sample
    output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 995, in outer_sample
    output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 981, in inner_sample
    samples = executor.execute(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 751, in sample
    samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\utils\_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\k_diffusion\sampling.py", line 205, in sample_euler
    denoised = model(x, sigma_hat * s_in, **extra_args)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 400, in __call__
    out = self.inner_model(x, sigma, model_options=model_options, seed=seed)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 954, in __call__
    return self.outer_predict_noise(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 961, in outer_predict_noise
    ).execute(x, timestep, model_options, seed)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 964, in predict_noise
    return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 380, in sampling_function
    out = calc_cond_batch(model, conds, x, timestep, model_options)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 205, in calc_cond_batch
    return _calc_cond_batch_outer(model, conds, x_in, timestep, model_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 213, in _calc_cond_batch_outer
    return executor.execute(model, conds, x_in, timestep, model_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 325, in _calc_cond_batch
    output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\model_base.py", line 182, in apply_model
    return comfy.patcher_extension.WrapperExecutor.new_class_executor(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 113, in execute
    return self.wrappers[self.idx](self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy_api\torch_helpers\torch_compile.py", line 26, in apply_torch_compile_wrapper
    return executor(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 105, in __call__
    return new_executor.execute(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\model_base.py", line 226, in _apply_model
    model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1778, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1789, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\ldm\flux\model.py", line 345, in forward
    return comfy.patcher_extension.WrapperExecutor.new_class_executor(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\ldm\flux\model.py", line 406, in _forward
    out = self.forward_orig(img, img_ids, context, txt_ids, timestep, y, guidance, control, timestep_zero_index=timestep_zero_index, transformer_options=transformer_options, attn_mask=kwargs.get("attention_mask", None))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\ldm\flux\model.py", line 243, in forward_orig
    img, txt = block(img=img,
               ^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\_dynamo\eval_frame.py", line 473, in __call__
    return super().__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1778, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1789, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\_dynamo\eval_frame.py", line 1062, in compile_wrapper
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\_inductor\compile_fx.py", line 1069, in _compile_fx_inner
    raise InductorError(e, currentframe()).with_traceback(
  File "C:\ComfyUI\venv\Lib\site-packages\torch\_inductor\compile_fx.py", line 1049, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\_inductor\compile_fx.py", line 1836, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\_inductor\compile_fx.py", line 1299, in codegen_and_compile
    torch._dynamo.repro.after_aot.save_graph_repro(
  File "C:\ComfyUI\venv\Lib\site-packages\torch\_dynamo\repro\after_aot.py", line 805, in save_graph_repro
    distributed_info = _extract_distributed_info(gm)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\_dynamo\repro\after_aot.py", line 158, in _extract_distributed_info
    from torch.distributed import GroupName
torch._inductor.exc.InductorError: ImportError: cannot import name 'GroupName' from 'torch.distributed' (unknown location)

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"


Prompt executed in 0.99 seconds

I'm not sure about the first error. It could theoretically be fixed to avoid this issue, and I'm curious about @jammm's opinion. But I can't even express how annoying these unconditional torch.distributed imports are.

I've opened a few PRs to address this distributed import issue with Windows ROCm across some repositories, but there are still many similar imports remaining, even within PyTorch itself, it seems. This one may require a fix in pytorch/pytorch in the future.
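For readers unfamiliar with the pattern, here is a minimal sketch of what "making the import conditional" can look like. It is not the actual change from any of those PRs or from PyTorch. `optional_attr` and `extract_distributed_info` are hypothetical names used only for illustration. The idea is to resolve the name lazily and fall back to `None` when the module or attribute is missing, instead of raising `ImportError` as in the traceback above.

```python
import importlib


def optional_attr(module_name, attr_name):
    """Return getattr(module, attr_name), or None if either is unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers ModuleNotFoundError too (it subclasses ImportError).
        return None
    return getattr(module, attr_name, None)


def extract_distributed_info(graph_module):
    # Hypothetical stand-in for the failing call site: the lookup happens
    # only when this function runs, and distributed-only work is skipped
    # when GroupName cannot be resolved.
    group_name_type = optional_attr("torch.distributed", "GroupName")
    if group_name_type is None:
        return {}
    return {"group_name_type": group_name_type}
```

With this shape, a build without distributed support would take the early-return branch rather than failing during repro generation. Whether an empty result is acceptable at a given call site is something each project has to decide.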

@0xDELUXA

int8 matmul has a theoretical speedup of 2x compared to fp16/bf16 on RDNA4, but no speedup on RDNA3/3.5, just like SageAttention. (SageAttention optimizes the attention module, and ComfyUI-INT8-Fast optimizes the non-attention linear modules. Their effects can stack up.)

ComfyUI-INT8-Fast uses Triton kernels, and it's another question whether the Triton compiler is well optimized on RDNA4.

Only using Load Diffusion Model INT8 (W8A8) to load flux-2-klein-9b-int8.safetensors (without TorchCompileModelAdvanced, since it doesn't work at all right now regardless of the settings), I get roughly the same speed as when using the normal Load Diffusion Model to load flux-2-klein-9b-fp8.safetensors in my workflow. Not sure whether there should actually be any difference here or not.

@0xDELUXA

0xDELUXA commented May 16, 2026

Managed to prevent compilation with fullgraph=False from triggering distributed imports by making them conditional. Now I’m seeing these warnings:

C:\ComfyUI\venv\Lib\site-packages\torch\_dynamo\variables\functions.py:2311: UserWarning: Dynamo does not know how to trace the builtin `sageattention._fused.pybind11_detail_function_record_v1_msvc_md_mscver19.quant_per_block_int8_fuse_sub_mean_cuda.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
  torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
C:\ComfyUI\venv\Lib\site-packages\torch\_dynamo\variables\functions.py:2311: UserWarning: Dynamo does not know how to trace the builtin `sageattention._qattn_gfx12_native.pybind11_detail_function_record_v1_msvc_md_mscver19.transpose_value_fp8_scaled_hnd.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
  torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
C:\ComfyUI\venv\Lib\site-packages\torch\_dynamo\variables\functions.py:2311: UserWarning: Dynamo does not know how to trace the builtin `sageattention._qattn_gfx12_native.pybind11_detail_function_record_v1_msvc_md_mscver19.qk_rawq_int8_sv_f8_scaled_native_attn.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
  torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
