[WIP] Support torch compile based pass manager framework #10987
yuan-luo wants to merge 1 commit into sgl-project:main
Conversation
Summary of Changes: Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request lays the groundwork for a robust and flexible compilation system in SGLang.
Code Review
This pull request introduces a foundational framework for torch.compile based pass management, which is a significant step towards performance optimization. The changes are extensive, adding new modules for compilation passes, configurations, and integrating them into the existing model execution flow. My review has identified several critical issues, primarily related to incorrect API usage, undefined variables, and incomplete implementations that will cause runtime errors. Additionally, test files have not been fully updated to reflect the new ModelRunner constructor, which will break the CI. Addressing these issues will be crucial for the stability and correctness of this new framework.
```python
# TODO
pass
```
The fallback path for when use_flashinfer is False is not implemented. This will result in a no-op for cases where FlashInfer cannot be used, potentially leading to incorrect results silently. A proper fallback that performs the operations separately should be implemented.
```python
from sglang.srt.distributed import tensor_model_parallel_all_reduce
from sgl_kernel import fused_add_rmsnorm

allreduce_output = tensor_model_parallel_all_reduce(allreduce_in)
fused_add_rmsnorm(allreduce_output, residual, rms_gamma, rms_eps)
```

```python
flashinfer_comm.trtllm_destroy_ipc_workspace_for_all_reduce(
    self.ipc_handles, self.group
)
```
There appears to be a potential mismatch in the FlashInfer API function names for creating and destroying the IPC workspace. The creation function is trtllm_create_ipc_workspace_for_all_reduce_fusion, while the destruction function is trtllm_destroy_ipc_workspace_for_all_reduce (missing _fusion). Please verify that this is the correct corresponding destroy function to avoid resource leaks.
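To make mismatched create/destroy pairs harder to write, the workspace lifetime could be wrapped in a context manager so the matching destroy function is always invoked, even on exceptions. The `trtllm_*` names below are taken from the diff above (the exact FlashInfer API should still be verified); the wrapper itself is only a hypothetical sketch:

```python
from contextlib import contextmanager


@contextmanager
def ipc_workspace(comm, group, **kwargs):
    """Pair IPC workspace creation with its matching destroy call.

    `comm` is assumed to expose the create/destroy functions referenced in
    the diff above; verify the real FlashInfer names before relying on this.
    """
    handles = comm.trtllm_create_ipc_workspace_for_all_reduce_fusion(group, **kwargs)
    try:
        yield handles
    finally:
        # Must be the destroy call that matches the *_fusion create call,
        # otherwise the workspace leaks.
        comm.trtllm_destroy_ipc_workspace_for_all_reduce_fusion(handles, group)
```

This keeps the pairing in one place instead of relying on `__init__`/teardown code in two different methods staying in sync.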
```python
import torch._dynamo.config
import torch._inductor.config

torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.triton.unique_kernel_names = True
torch._inductor.config.fx_graph_cache = True  # Experimental feature to reduce compilation times, will be on by default in future

# TODO: Add server_args enable_post_grad_pass
if True:
```
```python
return torch.compile(
    torch.no_grad()(forward),
    mode=os.environ.get("SGLANG_TORCH_COMPILE_MODE", "max-autotune-no-cudagraphs"),
    dynamic=False,
)
```
You have to use fullgraph=True. That's a merge blocker, isn't it?
One of this PR's intentions is to resolve this issue. Will fix it.
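One way to address the `fullgraph` concern while keeping the env-var override is to centralize the `torch.compile` arguments in a single helper. This is only a sketch: `SGLANG_TORCH_COMPILE_MODE` comes from the diff above, while the helper name is hypothetical.

```python
import os


def build_compile_kwargs() -> dict:
    """Collect torch.compile kwargs in one place so fullgraph=True is enforced."""
    return {
        "mode": os.environ.get("SGLANG_TORCH_COMPILE_MODE", "max-autotune-no-cudagraphs"),
        "dynamic": False,
        # Fail loudly on graph breaks instead of silently splitting the graph.
        "fullgraph": True,
    }


# Usage (requires torch):
# compiled = torch.compile(torch.no_grad()(forward), **build_compile_kwargs())
```

With `fullgraph=True`, any graph break raises at compile time, which is exactly what a pass-manager framework wants to catch early.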
Please note: I'm going to support NPU (#11104) based on this PR.
@eshoguli Thanks for sharing this info. @DevashishLal-CB and I will proceed with this task based on #10987; we will work out a unified framework for the pass manager, including fusion and torch compile, in the compilation folder.
The same comment as for #10549:
My suggestion is to make the folder structure more general. Can you please move:
What do you think?
Some duplication with #10062
Follow-up in #11830
Motivation
This PR aims to construct a torch compile based pass manager framework in SGLang. The idea and groundwork come from https://blog.vllm.ai/2025/08/20/torch-compile.html
In the end it will integrate the following functionalities:
Piecewise CUDA Graphs
Not all operations are compatible with CUDA Graphs; for example, cascade attention is not. SGLang works around this by breaking the captured graph into CUDA Graph-safe and CUDA Graph-unsafe parts and executing them separately. This gives us the performance benefits of CUDA Graphs without losing correctness.
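The piecewise idea can be illustrated with a toy partitioner: split a flat op sequence at CUDA Graph-unsafe ops, so each safe run could be captured as its own graph while unsafe ops run eagerly. This is a pure-Python sketch; the op names and the `UNSAFE` set are illustrative, not SGLang's real classification.

```python
# Illustrative: ops that cannot be captured inside a CUDA Graph.
UNSAFE = {"cascade_attention"}


def partition(ops):
    """Split a flat op list into alternating ("capture", [...]) / ("eager", [...]) pieces."""
    pieces, run = [], []
    for op in ops:
        if op in UNSAFE:
            if run:
                pieces.append(("capture", run))  # flush the graph-safe run so far
                run = []
            pieces.append(("eager", [op]))  # unsafe op executes outside any graph
        else:
            run.append(op)
    if run:
        pieces.append(("capture", run))  # trailing graph-safe run
    return pieces
```

Each `("capture", ...)` piece corresponds to a region that could be replayed as a CUDA Graph, while `("eager", ...)` pieces fall back to normal execution.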
Custom Compiler Passes in SGLang
While torch.compile includes many built-in optimizations, SGLang adds custom compiler passes that apply additional optimizations to further improve performance.
Fusion passes:
- RMSNorm + Quant (FP8) fusion
- SiLU-Mul + Quant (FP8) fusion
- Attention + Quant (FP8) fusion (up to 7% improvement)
- AllReduce + RMSNorm fusion (up to 15% improvement)
- AllReduce + RMSNorm + Quant (FP8) fusion (up to 8% improvement)
- AllReduce + RMSNorm + Quant (FP4) fusion (up to 10% improvement)
- Sequence Parallelism & Async TP (up to 10% improvement)

Other passes:
- No-op Elimination: eliminates or simplifies redundant reshape operations
- Fix Functionalization: manually re-inplaces auto_functionalized operations to avoid redundant copies and memory use
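The No-op Elimination pass can be sketched on a toy IR: a reshape whose target shape equals its input shape is redundant and can be dropped. This is illustrative only; the real pass operates on FX graph nodes, not tuples.

```python
def eliminate_noop_reshapes(nodes):
    """Drop redundant reshape nodes from a toy IR.

    `nodes` is a list of (op, in_shape, out_shape) tuples; a reshape whose
    input and output shapes match is a no-op and is removed.
    """
    out = []
    for op, in_shape, out_shape in nodes:
        if op == "reshape" and in_shape == out_shape:
            continue  # no-op: shapes already match, node contributes nothing
        out.append((op, in_shape, out_shape))
    return out
```

The same match-and-rewrite shape (recognize a pattern, drop or replace the node) is how the other fusion passes in the list above would be structured.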
Most code pieces are adapted from the vLLM torch compile framework, with significant SGLang-specific customization since the two frameworks differ substantially.
This is a WIP PR. Will refine and sharpen the code gradually. Will add vLLM credit in code references.
Modifications
Accuracy Tests
Benchmarking and Profiling
Since the PR is not ready yet, I benchmarked vLLM's PIECEWISE CUDA GRAPH in prefill.
DeepSeek-R1, TP8, 4k in / 1 out. It seems the PIECEWISE CUDA Graph feature does not perform very well on a large model like DeepSeek-R1.
With PIECEWISE CUDA GRAPH:
throughput 22151 tok/s
Without PIECEWISE CUDA GRAPH:
throughput 22113 tok/s
Checklist