[diffusion] kernel fusion: scale residual norm scale shift and add gate norm kernel fusion for Z-Image #19249

linfann wants to merge 4 commits into sgl-project:main
Conversation
Summary of Changes

Hello @linfann, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces significant performance optimizations for Z-Image diffusion models by implementing two new CUDA fused kernels. These kernels combine several common operations, such as residual connections, gating, and normalization (LayerNorm and RMSNorm), into single, more efficient GPU operations. The integration aims to reduce kernel launch overhead and has shown promising end-to-end latency improvements, particularly in the denoising phase. While overall performance has improved, a regression in the decoding stage's latency has been identified and is slated for future work.
Activity
Code Review
This pull request introduces two new fused CUDA kernels, fused_norm_residual_gate_add_norm_scale and fused_add_gate_norm, to optimize the Z-Image model by reducing kernel launch overhead. The changes look promising and show a good performance improvement. I've identified a few issues, including a bug in the tensor validation logic, incorrect docstrings for the new kernels, and a bug in the fallback logic for the CUDA implementation. Addressing these will improve the correctness and maintainability of the code.
```python
elif t.ndim == 4 and (t.shape[0] != B or t.shape[2] != 1 or t.shape[3] != D):
    F = t.shape[1]
    if S % F != 0:
        raise ValueError(f"Validate failed: S({S}) must be divisible by F({F}).")
    failed = True
```
The validation logic for 4D tensors is flawed. The check if S % F != 0: is only performed if the elif condition on line 234 is met. However, for a valid shape like (B, F, 1, D), this condition is false, so the divisibility of S by F is never checked. This could lead to runtime errors if an invalid tensor is passed. This same issue exists in the duplicated validate_scale_shift function in python/sglang/jit_kernel/diffusion/cutedsl/norm_residual_gate_add_norm_scale.py.
Suggested change:

```diff
-elif t.ndim == 4 and (t.shape[0] != B or t.shape[2] != 1 or t.shape[3] != D):
-    F = t.shape[1]
-    if S % F != 0:
-        raise ValueError(f"Validate failed: S({S}) must be divisible by F({F}).")
-    failed = True
+elif t.ndim == 4:
+    F = t.shape[1]
+    if S % F != 0:
+        raise ValueError(f"Validate failed: S({S}) must be divisible by F({F}).")
+    if t.shape[0] != B or t.shape[2] != 1 or t.shape[3] != D:
+        failed = True
```
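To see why the original `elif` silently skips the check, here is a minimal standalone sketch (hypothetical helper name, plain shape tuples instead of tensors) of the corrected branch: the `S % F` divisibility check runs for every 4D input, and the shape comparison only decides whether validation failed.

```python
def validate_4d(shape, B, S, D):
    # Hypothetical standalone sketch of the corrected 4D branch.
    # The S % F divisibility check must run for every 4D tensor,
    # not only when the shape already mismatches (B, F, 1, D).
    if len(shape) == 4:
        F = shape[1]
        if S % F != 0:
            raise ValueError(f"Validate failed: S({S}) must be divisible by F({F}).")
        if shape[0] != B or shape[2] != 1 or shape[3] != D:
            return False  # mirrors `failed = True` in the original
    return True

# With the original code, a tensor shaped exactly (B, F, 1, D) makes the
# elif condition false, so S % F is never checked; here, (2, 3, 1, 8)
# with S=5 raises because 5 is not divisible by F=3.
```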
```python
    gate: torch.Tensor | int,
    scale: torch.Tensor,
) -> torch.Tensor:
    if x.shape[-1] % 256 != 0 and x.shape[-1] <= 8192:
```
The condition to fall back to the native implementation is incorrect. It uses and where it should use or, and the upper bound check is x.shape[-1] <= 8192 instead of x.shape[-1] > 8192. This could cause the fused kernel to be called with unsupported dimensions (e.g., D > 8192), leading to a ValueError from the kernel instead of a graceful fallback. A similar bug exists in _AddGateNorm.forward_cuda on line 636.
Suggested change:

```diff
-if x.shape[-1] % 256 != 0 and x.shape[-1] <= 8192:
+if x.shape[-1] % 256 != 0 or x.shape[-1] > 8192:
```
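The corrected predicate can be checked in isolation. This hypothetical helper encodes the fallback rule (the fused path requires the last dimension to be a multiple of 256 and at most 8192):

```python
def should_fallback(d: int) -> bool:
    # Fall back to the native implementation unless the fused kernel's
    # constraints hold: d is a multiple of 256 AND d <= 8192.
    return d % 256 != 0 or d > 8192

# 8448 = 33 * 256 is a multiple of 256 but exceeds 8192; the buggy
# `and`/`<=` version evaluates to False there and would wrongly send
# the input to the fused kernel, which then raises a ValueError.
```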
```python
"""
Fuse: norm(x) * (1 + scale) + shift
where norm is either layernorm or rmsnorm.

Expects:
- x: [B, S, D]
- weight/bias: None, [D]
- scale/shift: [1], [D], [1/B, D], [1/B, 1/S, D] or [B, F, 1, D]
- norm_type: str, "layer" or "rms"
- eps: Optional[float], default: 1e-5
```
The docstring for fused_add_gate_norm appears to be incorrect. It describes a norm(x) * (1 + scale) + shift operation and expects scale/shift parameters, which doesn't match the function's implementation or signature. The actual operation is x + gate * norm(residual). Please update the docstring to accurately reflect the function's behavior and parameters.
Suggested change:

```diff
 """
-Fuse: norm(x) * (1 + scale) + shift
+Fuse: x + gate * norm(residual)
 where norm is either layernorm or rmsnorm.
 Expects:
 - x: [B, S, D]
-- weight/bias: None, [D]
-- scale/shift: [1], [D], [1/B, D], [1/B, 1/S, D] or [B, F, 1, D]
+- residual: [B, S, D]
+- gate: None, or broadcastable to [B, S, D]
+- weight/bias: None, [D]
 - norm_type: str, "layer" or "rms"
 - eps: Optional[float], default: 1e-5
 """
```
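For reference, the `x + gate * norm(residual)` semantics can be expressed as a small NumPy sketch (a reference implementation for testing only, not the actual kernel; RMSNorm shown, helper name hypothetical):

```python
import numpy as np

def add_gate_rmsnorm_ref(x, residual, gate, weight=None, eps=1e-5):
    # Reference semantics of the fused op: x + gate * rmsnorm(residual).
    rms = np.sqrt(np.mean(residual**2, axis=-1, keepdims=True) + eps)
    out = residual / rms
    if weight is not None:
        out = out * weight  # optional elementwise weight of shape [D]
    return x + gate * out
```

A sketch like this is handy as the ground truth in the accuracy tests comparing against the fused CUDA path.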
```python
"""
Fuse: norm(x) * (1 + scale) + shift
where norm is either layernorm or rmsnorm.

Expects:
- x: [B, S, D]
- weight/bias: None, [D]
- scale/shift: [1], [D], [1/B, D], [1/B, 1/S, D] or [B, F, 1, D]
- norm_type: str, "layer" or "rms"
- eps: Optional[float], default: 1e-5
```
The docstring for fused_norm_residual_gate_add_norm_scale seems incorrect. It describes the operation as norm(x) * (1 + scale) + shift, which doesn't match the implementation. The function actually computes norm_out = norm2(x + gate * norm1(residual)) * (scale + 1) and also returns the intermediate residual_out. The Expects section also incorrectly mentions shift. Please update the docstring for clarity.
Suggested docstring:

```python
"""
Fuse:
    residual_out = x + gate * norm1(residual)
    norm_out = norm2(residual_out) * (scale + 1.0)
where norm is either layernorm or rmsnorm.
Returns:
    A tuple of (norm_out, residual_out).
Expects:
- x: [B, S, D]
- residual: [B, S, D]
- gate: None, or broadcastable to [B, S, D]
- weight1/bias1: None, [D] for norm1
- weight2/bias2: None, [D] for norm2
- scale: broadcastable to [B, S, D]
- norm_type: str, "layer" or "rms"
- eps: Optional[float], default: 1e-5
"""
```

```python
        stacklevel=2,
    )
    return self.forward_native(residual, x, gate, scale)
    # TODO: use fused kernel
```
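The two-output semantics in the suggested docstring can likewise be sketched in NumPy (reference only; RMSNorm for both norms, helper names hypothetical):

```python
import numpy as np

def rmsnorm(t, eps=1e-5):
    # Plain RMSNorm along the last dimension, no learned weight.
    return t / np.sqrt(np.mean(t**2, axis=-1, keepdims=True) + eps)

def norm_residual_gate_add_norm_scale_ref(x, residual, gate, scale, eps=1e-5):
    # residual_out = x + gate * norm1(residual)
    # norm_out     = norm2(residual_out) * (scale + 1.0)
    residual_out = x + gate * rmsnorm(residual, eps)
    norm_out = rmsnorm(residual_out, eps) * (scale + 1.0)
    return norm_out, residual_out
```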
```python
    x: torch.Tensor,
    gate: torch.Tensor | int,
) -> torch.Tensor:
    logger.info("### use cuda fused_add_gate_norm")
```
This logger.info call appears to be for debugging. To avoid verbose logs in production, please consider removing it or changing it to logger.debug. A similar log message exists on line 676.
Suggested change:

```diff
-logger.info("### use cuda fused_add_gate_norm")
+logger.debug("### use cuda fused_add_gate_norm")
```
Force-pushed from efbe6d1 to 3ee3e75 (Compare)
```python
def __init__(self, D: int, norm_type: str):
    self.D = D
    self.norm_type = norm_type  # "layer" or "rms"
    self.num_warps = self.D // 256  # num of warps per cta
```
This means that each warp processes 256 elements, which means each CUDA thread processes 8 elements, is that correct?
@linfann I think there are two points worth discussing here:
- Shouldn't this be a ceil_div?
- Blackwell supports 256-bit ld/st, and a warp can process 512 elements.
This means that each warp processes 256 elements, which means each CUDA thread processes 8 elements, is that correct?
This kernel assumes that the reduction dimension (N) is a multiple of 256. Fixing the number of warps and forcing each warp to process exactly 256 elements is not the optimal strategy for performance. Instead, we only need to assume that N is a multiple of 8 and use predicates to avoid unnecessary load/store/compute operations like this.
Since these kernels share the same norm template, this change will also affect another kernel. @linfann You can try implementing it like this. If it is too complicated, we can leave it for me to handle in the next PR.
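A `ceil_div`, as suggested above, keeps the warp count correct when D is not an exact multiple of the per-warp tile (the numbers below just spell out the arithmetic from the discussion):

```python
def ceil_div(a: int, b: int) -> int:
    # Round-up integer division, the usual kernel-launch sizing idiom.
    return (a + b - 1) // b

# 256 elements per warp / 32 threads per warp = 8 elements per thread.
# D = 3072 gives 12 warps either way, but D = 3000 shows the difference:
# floor division yields 11 warps (covering only 2816 elements), while
# ceil_div yields 12, with predicates masking the tail.
```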
```python
@cute.jit
def copy_if(src, dst):
    if cutlass.const_expr(
        isinstance(src, cute.Tensor) and isinstance(src, cute.Tensor)
```
Yes, this typo also appears in the referenced PR.
In ZImage there are two execution paths: one with …

OK, I will implement the latter one.
```python
_COMPILE_CACHE = {}


def to_cute_arg(
```
can we make this function a common utility function?


Motivation
Optimize Z-Image via kernel fusion (refer to #14717).

This PR adds the `fused_norm_residual_gate_add_norm_scale` kernel. The kernel fusion reduces kernel launch overhead for Z-Image.

Modifications

- In `sgl_kernel`, add the `fused_norm_residual_gate_add_norm_scale` and `fused_add_gate_norm` CUDA kernels based on CUTLASS. Support both LayerNorm and RMSNorm.
Accuracy Tests
Benchmarking and Profiling
Benchmark
bench_fused_norm_residual_gate_add_norm_scale
bench_fused_add_gate_norm
Profiling
no compile
Command:
1. High-level Summary
2. Stage Breakdown
compile
Command:
1. High-level Summary
2. Stage Breakdown
Confusion:

`--enable-torch-compile` costs more time.

Checklist