Port gfx12 native attention to ABI3 fork#92
Conversation
|
Great to see this here too! |
| # inference step in distributed env for multi gpus inference. This small | ||
| # workaround also make sage attention work compatible with torch.compile | ||
| # through non-fullgraph compile mode. | ||
| torch.cuda.set_device(v.device) |
There was a problem hiding this comment.
Are you sure we need set_device? What I'm sure is that it breaks torch.compile full graph compilation.
I've at least tested that without set_device, SageAttention works on a single Nvidia GPU. I heard some reports that SageAttention does not work on multi-GPU, but I think the proper fix is to set the CUDA stream in the kernels, and I don't know if set_device actually solves it.
|
Let's put back this change to support CUDA 13.2 2131705 (Somehow I can't push to this PR, maybe because my repo is a fork?) Also, can you confirm that it works with torch.compile full graph compilation on AMD GPU, such as using Kijai's compile node in ComfyUI? (Currently comfy-aimdo does not support full graph compilation and we need to disable it when testing this PR.) Full graph compilation is worth us supporting. For example, ComfyUI-INT8-Fast needs it to achieve the ideal 2x speedup. |
| def get_cuda_version(): | ||
| version = torch.version.cuda | ||
| major, minor = version.split('.') | ||
| return int(major), int(minor) | ||
| try: | ||
| output = subprocess.check_output(['nvcc', '--version']).decode() | ||
| match = re.search(r'release (\d+)\.(\d+)', output) | ||
| if match: | ||
| major, minor = int(match.group(1)), int(match.group(2)) | ||
| return major, minor | ||
| except Exception as e: | ||
| print("Failed to get CUDA version:", e) | ||
| return None, None |
There was a problem hiding this comment.
Can we avoid callling nvcc, but use something torch provides? Many Windows + Nvidia users may not have nvcc installed. They just use the CUDA DLLs bundled in torch and the ptxas.exe bundled in triton-windows.
| arch = _cuda_archs[q.device.index] | ||
| if arch == "sm75": | ||
|
|
||
| arch = get_cuda_arch_versions()[q.device.index] |
There was a problem hiding this comment.
Can we restore the logic in https://github.com/woct0rdho/SageAttention/blob/213170591394d2d96f086b490ba914cec614a140/sageattention/core.py : Precompute _cuda_archs = get_cuda_arch_versions() as a global variable, then use it in sageattn? Otherwise it will break torch.compile full graph compilation.
I'm curious about this "torch.compile full graph compilation", which of Kijai's nodes can be used to try it in ComfyUI? Aimdo doesn't support it? I'm not really sure what percentage of AMD users currently use the Is this ComfyUI-INT8-Fast node supposed to provide a 2x speedup on AMD as well? I haven't seen anything AMD-related in its docs. |
|
Use the TorchCompileModelAdvanced node in ComfyUI-KJNodes, and enable 'fullgraph' in it. You can read more in https://github.com/woct0rdho/SageAttention/releases/tag/v2.2.0-windows.post4 Nowadays aimdo is enabled by default, and you need to pass int8 matmul has a theoretical speed up of 2x compared to fp16/bf16 on RDNA4, but no speedup on RDNA3/3.5 , just like SageAttention. (SageAttention optimizes the attention module, and ComfyUI-INT8-Fast optimizes the non-attention linear modules. Their effects can stack up.) ComfyUI-INT8-Fast uses Triton kernels, and it's another question whether the Triton compiler is well optimized on RDNA4. |
Aren't we in the "AMD support remains opt in with --enable-dynamic-vram" state? I'm referring to this. |
Here are my local findings ( fullgraph=true error!!! Exception during processing !!! torch.* op returned non-Tensor
Explanation: torch.* ops that return a non-Tensor cannot be traced into the Dynamo FX graph output
Developer debug context: example_value type: int; op: call_function; target: <function device_count at 0x0000018F57F05080>
For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0208.html
from user code:
File "C:\ComfyUI\comfy\ldm\flux\layers.py", line 235, in forward
attn = attention(q, k, v, pe=pe, mask=attn_mask, transformer_options=transformer_options)
File "C:\ComfyUI\comfy\ldm\flux\math.py", line 14, in attention
x = optimized_attention(q, k, v, heads, skip_reshape=True, mask=mask, transformer_options=transformer_options)
File "C:\ComfyUI\comfy\ldm\modules\attention.py", line 139, in wrapper
return func(*args, **kwargs)
File "C:\ComfyUI\comfy\ldm\modules\attention.py", line 569, in attention_sage
out = sageattn(q, k, v, attn_mask=mask, is_causal=False, tensor_layout=tensor_layout)
File "C:\ComfyUI\venv\Lib\site-packages\sageattention\core.py", line 713, in sageattn
arch = get_cuda_arch_versions()[q.device.index]
File "C:\ComfyUI\venv\Lib\site-packages\sageattention\core.py", line 84, in get_cuda_arch_versions
for i in range(torch.cuda.device_count()):
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
Traceback (most recent call last):
File "C:\ComfyUI\execution.py", line 535, in execute
output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\execution.py", line 335, in get_output_data
return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\execution.py", line 309, in _async_map_node_over_list
await process_inputs(input_dict, i)
File "C:\ComfyUI\execution.py", line 297, in process_inputs
result = f(**inputs)
^^^^^^^^^^^
File "C:\ComfyUI\nodes.py", line 1576, in sample
return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\nodes.py", line 1541, in common_ksampler
samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\sample.py", line 66, in sample
samples = sampler.sample(noise, positive, negative, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 1180, in sample
return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 1070, in sample
return cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 1052, in sample
output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 995, in outer_sample
output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 981, in inner_sample
samples = executor.execute(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 751, in sample
samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\venv\Lib\site-packages\torch\utils\_contextlib.py", line 124, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\k_diffusion\sampling.py", line 205, in sample_euler
denoised = model(x, sigma_hat * s_in, **extra_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 400, in __call__
out = self.inner_model(x, sigma, model_options=model_options, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 954, in __call__
return self.outer_predict_noise(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 961, in outer_predict_noise
).execute(x, timestep, model_options, seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 964, in predict_noise
return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 380, in sampling_function
out = calc_cond_batch(model, conds, x, timestep, model_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 205, in calc_cond_batch
return _calc_cond_batch_outer(model, conds, x_in, timestep, model_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 213, in _calc_cond_batch_outer
return executor.execute(model, conds, x_in, timestep, model_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 325, in _calc_cond_batch
output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\model_base.py", line 182, in apply_model
return comfy.patcher_extension.WrapperExecutor.new_class_executor(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\patcher_extension.py", line 113, in execute
return self.wrappers[self.idx](self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy_api\torch_helpers\torch_compile.py", line 26, in apply_torch_compile_wrapper
return executor(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\patcher_extension.py", line 105, in __call__
return new_executor.execute(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\model_base.py", line 226, in _apply_model
model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1778, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1789, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\ldm\flux\model.py", line 345, in forward
return comfy.patcher_extension.WrapperExecutor.new_class_executor(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\ldm\flux\model.py", line 406, in _forward
out = self.forward_orig(img, img_ids, context, txt_ids, timestep, y, guidance, control, timestep_zero_index=timestep_zero_index, transformer_options=transformer_options, attn_mask=kwargs.get("attention_mask", None))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\ldm\flux\model.py", line 243, in forward_orig
img, txt = block(img=img,
^^^^^^^^^^^^^^
File "C:\ComfyUI\venv\Lib\site-packages\torch\_dynamo\eval_frame.py", line 473, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1778, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1789, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\venv\Lib\site-packages\torch\_dynamo\eval_frame.py", line 1058, in compile_wrapper
raise e.with_traceback(None) from e.__cause__ # User compiler error
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.Unsupported: torch.* op returned non-Tensor
Explanation: torch.* ops that return a non-Tensor cannot be traced into the Dynamo FX graph output
Developer debug context: example_value type: int; op: call_function; target: <function device_count at 0x0000018F57F05080>
For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0208.html
from user code:
File "C:\ComfyUI\comfy\ldm\flux\layers.py", line 235, in forward
attn = attention(q, k, v, pe=pe, mask=attn_mask, transformer_options=transformer_options)
File "C:\ComfyUI\comfy\ldm\flux\math.py", line 14, in attention
x = optimized_attention(q, k, v, heads, skip_reshape=True, mask=mask, transformer_options=transformer_options)
File "C:\ComfyUI\comfy\ldm\modules\attention.py", line 139, in wrapper
return func(*args, **kwargs)
File "C:\ComfyUI\comfy\ldm\modules\attention.py", line 569, in attention_sage
out = sageattn(q, k, v, attn_mask=mask, is_causal=False, tensor_layout=tensor_layout)
File "C:\ComfyUI\venv\Lib\site-packages\sageattention\core.py", line 713, in sageattn
arch = get_cuda_arch_versions()[q.device.index]
File "C:\ComfyUI\venv\Lib\site-packages\sageattention\core.py", line 84, in get_cuda_arch_versions
for i in range(torch.cuda.device_count()):
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
Prompt executed in 0.49 secondsfullgraph=false error!!! Exception during processing !!! ImportError: cannot import name 'GroupName' from 'torch.distributed' (unknown location)
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
Traceback (most recent call last):
File "C:\ComfyUI\execution.py", line 535, in execute
output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\execution.py", line 335, in get_output_data
return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\execution.py", line 309, in _async_map_node_over_list
await process_inputs(input_dict, i)
File "C:\ComfyUI\execution.py", line 297, in process_inputs
result = f(**inputs)
^^^^^^^^^^^
File "C:\ComfyUI\nodes.py", line 1576, in sample
return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\nodes.py", line 1541, in common_ksampler
samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\sample.py", line 66, in sample
samples = sampler.sample(noise, positive, negative, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 1180, in sample
return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 1070, in sample
return cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 1052, in sample
output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 995, in outer_sample
output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 981, in inner_sample
samples = executor.execute(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 751, in sample
samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\venv\Lib\site-packages\torch\utils\_contextlib.py", line 124, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\k_diffusion\sampling.py", line 205, in sample_euler
denoised = model(x, sigma_hat * s_in, **extra_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 400, in __call__
out = self.inner_model(x, sigma, model_options=model_options, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 954, in __call__
return self.outer_predict_noise(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 961, in outer_predict_noise
).execute(x, timestep, model_options, seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 964, in predict_noise
return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 380, in sampling_function
out = calc_cond_batch(model, conds, x, timestep, model_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 205, in calc_cond_batch
return _calc_cond_batch_outer(model, conds, x_in, timestep, model_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 213, in _calc_cond_batch_outer
return executor.execute(model, conds, x_in, timestep, model_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\samplers.py", line 325, in _calc_cond_batch
output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\model_base.py", line 182, in apply_model
return comfy.patcher_extension.WrapperExecutor.new_class_executor(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\patcher_extension.py", line 113, in execute
return self.wrappers[self.idx](self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy_api\torch_helpers\torch_compile.py", line 26, in apply_torch_compile_wrapper
return executor(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\patcher_extension.py", line 105, in __call__
return new_executor.execute(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\model_base.py", line 226, in _apply_model
model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1778, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1789, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\ldm\flux\model.py", line 345, in forward
return comfy.patcher_extension.WrapperExecutor.new_class_executor(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
return self.original(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\ldm\flux\model.py", line 406, in _forward
out = self.forward_orig(img, img_ids, context, txt_ids, timestep, y, guidance, control, timestep_zero_index=timestep_zero_index, transformer_options=transformer_options, attn_mask=kwargs.get("attention_mask", None))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\comfy\ldm\flux\model.py", line 243, in forward_orig
img, txt = block(img=img,
^^^^^^^^^^^^^^
File "C:\ComfyUI\venv\Lib\site-packages\torch\_dynamo\eval_frame.py", line 473, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1778, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1789, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\venv\Lib\site-packages\torch\_dynamo\eval_frame.py", line 1062, in compile_wrapper
raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\venv\Lib\site-packages\torch\_inductor\compile_fx.py", line 1069, in _compile_fx_inner
raise InductorError(e, currentframe()).with_traceback(
File "C:\ComfyUI\venv\Lib\site-packages\torch\_inductor\compile_fx.py", line 1049, in _compile_fx_inner
mb_compiled_graph = fx_codegen_and_compile(
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\venv\Lib\site-packages\torch\_inductor\compile_fx.py", line 1836, in fx_codegen_and_compile
return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\venv\Lib\site-packages\torch\_inductor\compile_fx.py", line 1299, in codegen_and_compile
torch._dynamo.repro.after_aot.save_graph_repro(
File "C:\ComfyUI\venv\Lib\site-packages\torch\_dynamo\repro\after_aot.py", line 805, in save_graph_repro
distributed_info = _extract_distributed_info(gm)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ComfyUI\venv\Lib\site-packages\torch\_dynamo\repro\after_aot.py", line 158, in _extract_distributed_info
from torch.distributed import GroupName
torch._inductor.exc.InductorError: ImportError: cannot import name 'GroupName' from 'torch.distributed' (unknown location)
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
Prompt executed in 0.99 secondsI'm not sure about the first error - it could theoretically be fixed to avoid this issue, I'm curious about @jammm's opinion. But I can't even express how annoying these unconditional I've opened a few PRs to address this distributed import issue with Windows ROCm across some repositories, but there are still many similar imports remaining, even within PyTorch itself, it seems. This one may require a fix in
|
Only using |
|
Managed to prevent compilation with C:\ComfyUI\venv\Lib\site-packages\torch\_dynamo\variables\functions.py:2311: UserWarning: Dynamo does not know how to trace the builtin `sageattention._fused.pybind11_detail_function_record_v1_msvc_md_mscver19.quant_per_block_int8_fuse_sub_mean_cuda.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
C:\ComfyUI\venv\Lib\site-packages\torch\_dynamo\variables\functions.py:2311: UserWarning: Dynamo does not know how to trace the builtin `sageattention._qattn_gfx12_native.pybind11_detail_function_record_v1_msvc_md_mscver19.transpose_value_fp8_scaled_hnd.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
C:\ComfyUI\venv\Lib\site-packages\torch\_dynamo\variables\functions.py:2311: UserWarning: Dynamo does not know how to trace the builtin `sageattention._qattn_gfx12_native.pybind11_detail_function_record_v1_msvc_md_mscver19.qk_rawq_int8_sv_f8_scaled_native_attn.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints)) |
Main PR description can be seen at thu-ml#368
Summary
This ports the gfx12/RDNA4 SageAttention native extension work onto the
abi3_stablefork branch.Fork-specific changes:
cp39-abi3wheel.TORCH_LIBRARY/torch.opsloading support for the gfx12 native kernels.gfx12.Verification
Tested on Windows with ROCm PyTorch /
gfx1201:pip install --no-build-isolation -v .sageattention-2.2.0-cp39-abi3-win_amd64.whl