[WIP] Support torch compile based pass manager framework #10987
yuan-luo wants to merge 1 commit into sgl-project:main
Conversation
Summary of Changes: Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request lays the groundwork for a robust and flexible compilation system in SGLang.
Code Review
This pull request introduces a foundational framework for torch.compile based pass management, which is a significant step towards performance optimization. The changes are extensive, adding new modules for compilation passes, configurations, and integrating them into the existing model execution flow. My review has identified several critical issues, primarily related to incorrect API usage, undefined variables, and incomplete implementations that will cause runtime errors. Additionally, test files have not been fully updated to reflect the new ModelRunner constructor, which will break the CI. Addressing these issues will be crucial for the stability and correctness of this new framework.
```python
# TODO
pass
```
The fallback path for when use_flashinfer is False is not implemented. This will result in a no-op for cases where FlashInfer cannot be used, potentially leading to incorrect results silently. A proper fallback that performs the operations separately should be implemented.
```python
from sglang.srt.distributed import tensor_model_parallel_all_reduce
from sgl_kernel import fused_add_rmsnorm

allreduce_output = tensor_model_parallel_all_reduce(allreduce_in)
fused_add_rmsnorm(allreduce_output, residual, rms_gamma, rms_eps)
```

```python
flashinfer_comm.trtllm_destroy_ipc_workspace_for_all_reduce(
    self.ipc_handles, self.group
)
```
There appears to be a potential mismatch in the FlashInfer API function names for creating and destroying the IPC workspace. The creation function is trtllm_create_ipc_workspace_for_all_reduce_fusion, while the destruction function is trtllm_destroy_ipc_workspace_for_all_reduce (missing _fusion). Please verify that this is the correct corresponding destroy function to avoid resource leaks.
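To make mismatched create/destroy pairs harder to write, the workspace lifetime could be wrapped in a context manager so the matching destroy function is always invoked, even on exceptions. The `trtllm_*` names below are taken from the diff above (the exact FlashInfer API should still be verified); the wrapper itself is only a hypothetical sketch:

```python
from contextlib import contextmanager


@contextmanager
def ipc_workspace(comm, group, **kwargs):
    """Pair IPC workspace creation with its matching destroy call.

    `comm` is assumed to expose the create/destroy functions referenced in
    the diff above; verify the real FlashInfer names before relying on this.
    """
    handles = comm.trtllm_create_ipc_workspace_for_all_reduce_fusion(group, **kwargs)
    try:
        yield handles
    finally:
        # Must be the destroy call that matches the *_fusion create call,
        # otherwise the workspace leaks.
        comm.trtllm_destroy_ipc_workspace_for_all_reduce_fusion(handles, group)
```

This keeps the pairing in one place instead of relying on `__init__`/teardown code in two different methods staying in sync.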
```python
import torch._dynamo.config
import torch._inductor.config

torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.triton.unique_kernel_names = True
torch._inductor.config.fx_graph_cache = True  # Experimental feature to reduce compilation times, will be on by default in future

# TODO: Add server_args enable_post_grad_pass
if True:
```
```python
return torch.compile(
    torch.no_grad()(forward),
    mode=os.environ.get("SGLANG_TORCH_COMPILE_MODE", "max-autotune-no-cudagraphs"),
    dynamic=False,
)
```
You have to use fullgraph=True. That's a merge blocker, isn't it?
One of this PR's intentions is to resolve this issue. Will fix it.
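One way to address the `fullgraph` concern while keeping the env-var override is to centralize the `torch.compile` arguments in a single helper. This is only a sketch: `SGLANG_TORCH_COMPILE_MODE` comes from the diff above, while the helper name is hypothetical.

```python
import os


def build_compile_kwargs() -> dict:
    """Collect torch.compile kwargs in one place so fullgraph=True is enforced."""
    return {
        "mode": os.environ.get("SGLANG_TORCH_COMPILE_MODE", "max-autotune-no-cudagraphs"),
        "dynamic": False,
        # Fail loudly on graph breaks instead of silently splitting the graph.
        "fullgraph": True,
    }


# Usage (requires torch):
# compiled = torch.compile(torch.no_grad()(forward), **build_compile_kwargs())
```

With `fullgraph=True`, any graph break raises at compile time, which is exactly what a pass-manager framework wants to catch early.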
Please note: I'm going to support NPU (#11104) based on this PR.
@eshoguli Thanks for sharing this info. @DevashishLal-CB and I will proceed with this task based on #10987; we will work out a unified framework for the pass manager, including fusion and torch compile, in the compilation folder.
The same comment as for #10549:
My suggestion is to make the folder structure more general. Can you please move:
What do you think?
Some duplication with #10062
Follow-up in #11830
Motivation
This PR aims to construct a torch compile based pass manager framework in SGLang. The idea and groundwork come from https://blog.vllm.ai/2025/08/20/torch-compile.html
In the end it will integrate the following functionalities:
Piecewise CUDA Graphs
Not all operations are compatible with CUDA Graphs; for example, cascade attention is not. SGLang works around this by breaking the captured graph into CUDA Graph-safe and CUDA Graph-unsafe parts and executing them separately. This gives us the performance benefits of CUDA Graphs without losing correctness.
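The piecewise idea can be illustrated with a toy partitioner: split a flat op sequence at CUDA Graph-unsafe ops, so each safe run could be captured as its own graph while unsafe ops run eagerly. This is a pure-Python sketch; the op names and the `UNSAFE` set are illustrative, not SGLang's real classification.

```python
# Illustrative: ops that cannot be captured inside a CUDA Graph.
UNSAFE = {"cascade_attention"}


def partition(ops):
    """Split a flat op list into alternating ("capture", [...]) / ("eager", [...]) pieces."""
    pieces, run = [], []
    for op in ops:
        if op in UNSAFE:
            if run:
                pieces.append(("capture", run))  # flush the graph-safe run so far
                run = []
            pieces.append(("eager", [op]))  # unsafe op executes outside any graph
        else:
            run.append(op)
    if run:
        pieces.append(("capture", run))  # trailing graph-safe run
    return pieces
```

Each `("capture", ...)` piece corresponds to a region that could be replayed as a CUDA Graph, while `("eager", ...)` pieces fall back to normal execution.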
Custom Compiler Passes in SGLang
While torch.compile includes many built-in optimizations, SGLang adds custom compiler passes that apply additional optimizations to further improve performance.
Fusion passes:
- RMSNorm + Quant (FP8) fusion
- SiLU-Mul + Quant (FP8) fusion
- Attention + Quant (FP8) fusion (up to 7% improvement)
- AllReduce + RMSNorm fusion (up to 15% improvement)
- AllReduce + RMSNorm + Quant (FP8) fusion (up to 8% improvement)
- AllReduce + RMSNorm + Quant (FP4) fusion (up to 10% improvement)
- Sequence Parallelism & Async TP (up to 10% improvement)

Other passes:
- No-op Elimination: eliminates or simplifies redundant reshape operations
- Fix Functionalization: manually re-inplaces auto_functionalized operations to avoid redundant copies and memory use
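The No-op Elimination pass can be sketched on a toy IR: a reshape whose target shape equals its input shape is redundant and can be dropped. This is illustrative only; the real pass operates on FX graph nodes, not tuples.

```python
def eliminate_noop_reshapes(nodes):
    """Drop redundant reshape nodes from a toy IR.

    `nodes` is a list of (op, in_shape, out_shape) tuples; a reshape whose
    input and output shapes match is a no-op and is removed.
    """
    out = []
    for op, in_shape, out_shape in nodes:
        if op == "reshape" and in_shape == out_shape:
            continue  # no-op: shapes already match, node contributes nothing
        out.append((op, in_shape, out_shape))
    return out
```

The same match-and-rewrite shape (recognize a pattern, drop or replace the node) is how the other fusion passes in the list above would be structured.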
Most code pieces are adapted from the vLLM torch compile framework, with significant SGLang-specific customization since the two frameworks differ substantially.
This is a WIP PR. Will refine and sharpen the code gradually. Will add vLLM credit in code references.
Modifications
Accuracy Tests
Benchmarking and Profiling
Since the PR is not ready yet, I benchmarked vLLM's PIECEWISE CUDA GRAPH in prefill.
DeepSeek-R1, TP8, 4k in / 1 out. It seems the PIECEWISE CUDA Graph feature does not perform very well on a large model like DeepSeek-R1.
With PIECEWISE CUDA GRAPH:
throughput 22151 tok/s
Without PIECEWISE CUDA GRAPH:
throughput 22113 tok/s
Checklist