
[WIP] Support torch compile based pass manager framework#10987

Open
yuan-luo wants to merge 1 commit into sgl-project:main from antgroup:torch_compile_pass_mgr

Conversation

@yuan-luo
Collaborator

@yuan-luo yuan-luo commented Sep 27, 2025

Motivation

This PR aims to construct a torch compile based pass manager framework in SGLang. The idea and groundwork come from https://blog.vllm.ai/2025/08/20/torch-compile.html

Ultimately it will integrate the following functionality:

Piecewise CUDA Graphs
Not all operations are compatible with CUDA Graphs; for example, cascade attention is not. SGLang works around this by breaking the captured graph into CUDA Graph-safe and CUDA Graph-unsafe parts and executing them separately. This gives us the performance benefits of CUDA Graphs without losing correctness.
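The split-and-execute idea can be sketched in plain Python (a toy model with a hypothetical cudagraph_safe flag per op; the real implementation partitions the compiled FX graph and replays the safe segments as CUDA Graphs):

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Op:
    fn: Callable          # the operation to run
    cudagraph_safe: bool  # False for ops like cascade attention


def partition(ops: List[Op]) -> List[List[Op]]:
    """Split an op sequence into maximal runs of CUDA Graph-safe ops;
    each unsafe op becomes its own single-op segment that runs eagerly."""
    segments, current = [], []
    for op in ops:
        if op.cudagraph_safe:
            current.append(op)
        else:
            if current:
                segments.append(current)
                current = []
            segments.append([op])  # eager segment
    if current:
        segments.append(current)
    return segments


# Example: safe, safe, unsafe, safe -> segments of sizes [2, 1, 1]
ops = [Op(lambda x: x + 1, True), Op(lambda x: x * 2, True),
       Op(lambda x: x - 3, False), Op(lambda x: x + 4, True)]
segs = partition(ops)
print([len(s) for s in segs])  # -> [2, 1, 1]
```

Each safe segment would then be captured once and replayed; the unsafe segments run outside any capture.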

Custom Compiler Passes in SGLang
While torch.compile includes many built-in optimizations, SGLang adds custom compiler passes that apply additional optimizations to further improve performance.

Fusion passes:
RMSNorm + Quant (FP8) fusion
SiLU-Mul + Quant (FP8) fusion
Attention + Quant (FP8) fusion (up to 7% improvement)
AllReduce + RMSNorm fusion (up to 15% improvement)
AllReduce + RMSNorm + Quant (FP8) fusion (up to 8% improvement)
AllReduce + RMSNorm + Quant (FP4) fusion (up to 10% improvement)
Sequence Parallelism & Async TP (up to 10% improvement)
Other passes:
No-op Elimination: eliminates or simplifies redundant reshape operations
Fix Functionalization: manually reinplaces auto_functionalized operations to avoid redundant copies and memory use
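As an illustration of the No-op Elimination idea, here is a minimal sketch of a reshape-elimination pass over a torch.fx graph (a simplified stand-in, not the PR's actual pass; it uses ShapeProp to detect reshapes whose input and output shapes already match):

```python
import torch
import torch.fx as fx
from torch.fx.passes.shape_prop import ShapeProp


def eliminate_noop_reshapes(gm: fx.GraphModule, example_input: torch.Tensor) -> fx.GraphModule:
    # Annotate every node with tensor_meta so shapes are known statically.
    ShapeProp(gm).propagate(example_input)
    for node in list(gm.graph.nodes):
        if node.op == "call_method" and node.target == "reshape":
            src = node.args[0]
            in_meta = src.meta.get("tensor_meta")
            out_meta = node.meta.get("tensor_meta")
            # A reshape whose output shape equals its input shape is a no-op.
            if in_meta is not None and out_meta is not None and in_meta.shape == out_meta.shape:
                node.replace_all_uses_with(src)
                gm.graph.erase_node(node)
    gm.graph.lint()
    gm.recompile()
    return gm


class M(torch.nn.Module):
    def forward(self, x):
        return x.reshape(4, 8) + 1.0  # reshaping a (4, 8) input to (4, 8): redundant


x = torch.randn(4, 8)
gm = eliminate_noop_reshapes(fx.symbolic_trace(M()), x)
print(any(n.op == "call_method" and n.target == "reshape" for n in gm.graph.nodes))  # -> False
```

The real pass would also handle view/reshape chains and run as part of the post-grad pass pipeline.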

Most code pieces are adapted from the vLLM torch compile framework, with significant SGLang-specific customization since the two frameworks differ substantially.

This is a WIP PR. I will refine and sharpen the code gradually and add vLLM credit in code references.

Modifications

Accuracy Tests

Benchmarking and Profiling

Since the PR is not yet ready, I benchmarked vLLM's PIECEWISE CUDA GRAPH in prefill instead.
DeepSeek-R1, TP8, 4k input / 1 output. It seems the PIECEWISE CUDA Graph feature does not perform very well on large models like DeepSeek-R1.

vllm serve --model="deepseek-ai/DeepSeek-R1" --max-num-seqs 512 --trust-remote-code --tensor-parallel-size 8 --gpu-memory-utilization 0.9 --port 9256 --disable-log-requests --no-enable-prefix-caching -O '{"full_cuda_graph": true}' --cuda-graph-sizes 16 32 48 64 96 128 160 192 256 512 -O.cudagraph_mode=FULL_AND_PIECEWISE

vllm bench serve --backend openai \
    --base-url http://127.0.0.1:9256 \
    --dataset-name=random \
    --random-input-len 4000 --random-output-len 1 \
    --model deepseek-ai/DeepSeek-R1 \
    --seed 12345

With PIECEWISE CUDA GRAPH:
throughput 22151 tok/s

Without PIECEWISE CUDA GRAPH:
throughput 22113 tok/s

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request lays the groundwork for a robust and flexible compilation system in SGLang by integrating torch.compile and a custom pass manager. The primary goal is to unlock significant performance improvements through advanced graph optimizations, including specialized fusion kernels and efficient handling of distributed operations. This initiative aims to enhance the overall efficiency and scalability of SGLang's model execution.

Highlights

  • New Torch Compile Pass Manager Framework: Introduced a new framework for managing compilation passes based on torch.compile, enabling advanced optimizations within SGLang. This framework is inspired by vLLM's compilation approach.
  • Integration of Custom Compiler Passes: The framework supports custom compiler passes, including various fusion passes (e.g., RMSNorm + Quant, SiLU-Mul + Quant, Attention + Quant, AllReduce + RMSNorm) and other optimizations like No-op Elimination and Fix Functionalization, aimed at improving performance.
  • FlashInfer AllReduce Fusion: Added a specific AllReduceFusionPass that leverages FlashInfer for fusing allreduce operations with RMSNorm, potentially offering significant performance gains for tensor-parallel models.
  • Compilation Configuration and Context: New configuration classes (CompilationConfig, PassConfig) and a PassContext have been added to control and manage the compilation process, including settings for Inductor passes and different compilation levels.
  • PyTorch Version Compatibility for Dynamo: A utility function supports_dynamo() was added to check for PyTorch version compatibility, specifically ensuring that torch.compile is used with versions that properly support FakeScalarType (>= 2.4.0).
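The supports_dynamo() check described above presumably boils down to a version comparison; a sketch, parameterized on the version string for illustration:

```python
def supports_dynamo(torch_version: str) -> bool:
    """Return True when torch.compile can be used safely
    (FakeScalarType support landed in PyTorch 2.4.0)."""
    base = torch_version.split("+")[0]  # drop local build tag, e.g. "+cu121"
    parts = base.split(".")
    major, minor = int(parts[0]), int(parts[1])
    return (major, minor) >= (2, 4)


print(supports_dynamo("2.3.1"), supports_dynamo("2.4.0"), supports_dynamo("2.5.0+cu121"))
# -> False True True
```

The actual helper would read torch.__version__ directly rather than take an argument.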

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a foundational framework for torch.compile based pass management, which is a significant step towards performance optimization. The changes are extensive, adding new modules for compilation passes, configurations, and integrating them into the existing model execution flow. My review has identified several critical issues, primarily related to incorrect API usage, undefined variables, and incomplete implementations that will cause runtime errors. Additionally, test files have not been fully updated to reflect the new ModelRunner constructor, which will break the CI. Addressing these issues will be crucial for the stability and correctness of this new framework.

Comment on lines +152 to +153
# TODO
pass
Contributor

medium

The fallback path for when use_flashinfer is False is not implemented. This will result in a no-op for cases where FlashInfer cannot be used, potentially leading to incorrect results silently. A proper fallback that performs the operations separately should be implemented.

            from sglang.srt.distributed import tensor_model_parallel_all_reduce
            from sgl_kernel import fused_add_rmsnorm

            # Fallback: perform the allreduce and the residual-add RMSNorm separately
            allreduce_output = tensor_model_parallel_all_reduce(allreduce_in)
            fused_add_rmsnorm(allreduce_output, residual, rms_gamma, rms_eps)

Comment on lines +383 to +385
flashinfer_comm.trtllm_destroy_ipc_workspace_for_all_reduce(
self.ipc_handles, self.group
)
Contributor

medium

There appears to be a potential mismatch in the FlashInfer API function names for creating and destroying the IPC workspace. The creation function is trtllm_create_ipc_workspace_for_all_reduce_fusion, while the destruction function is trtllm_destroy_ipc_workspace_for_all_reduce (missing _fusion). Please verify that this is the correct corresponding destroy function to avoid resource leaks.

import torch._dynamo.config
import torch._inductor.config

torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.triton.unique_kernel_names = True
torch._inductor.config.fx_graph_cache = True # Experimental feature to reduce compilation times, will be on by default in future

# TODO: Add server_args enable_post_grad_pass
if True:
Contributor

medium

Using if True: is a placeholder that should be replaced with a proper configuration check, as indicated by the TODO. This should likely be controlled by a setting in server_args or compilation_config.

Suggested change:
- if True:
+ if compilation_config.pass_config.enable_post_grad_pass:
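A hypothetical sketch of how such a flag could be threaded through the config objects mentioned in this PR (CompilationConfig, PassConfig); field names beyond enable_post_grad_pass are illustrative:

```python
from dataclasses import dataclass, field


@dataclass
class PassConfig:
    enable_post_grad_pass: bool = False  # gate for post-grad custom passes
    enable_fusion: bool = False          # illustrative additional flag


@dataclass
class CompilationConfig:
    pass_config: PassConfig = field(default_factory=PassConfig)


compilation_config = CompilationConfig(pass_config=PassConfig(enable_post_grad_pass=True))
if compilation_config.pass_config.enable_post_grad_pass:
    print("registering post-grad passes")
```

The flag would ultimately be populated from server_args rather than constructed inline like this.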

Comment on lines +118 to +122
return torch.compile(
torch.no_grad()(forward),
mode=os.environ.get("SGLANG_TORCH_COMPILE_MODE", "max-autotune-no-cudagraphs"),
dynamic=False,
)
Contributor

You have to use fullgraph=True. It's a merge stopper, isn't it?
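For illustration, fullgraph=True makes torch.compile raise on any graph break instead of silently falling back (this sketch uses the eager backend just to keep it cheap; the PR would compile with inductor):

```python
import torch
import torch._dynamo


def fn(x):
    return x * 2 + 1


compiled = torch.compile(fn, fullgraph=True, backend="eager")
out = compiled(torch.ones(2))  # compiles cleanly: no graph breaks


def with_break(x):
    y = x + 1
    torch._dynamo.graph_break()  # forces a graph break mid-function
    return y * 2


rejected = False
try:
    torch.compile(with_break, fullgraph=True, backend="eager")(torch.ones(2))
except Exception:
    rejected = True  # fullgraph=True refuses functions containing graph breaks
print(out, rejected)
```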

Collaborator Author

One of my PR's intentions is to resolve this issue. Will fix it.

@eshoguli
Contributor

eshoguli commented Oct 1, 2025

Please add an inductor folder under python/sglang/compilation and move collective_fusion.py, inductor_pass.py, pass_manager.py and sglang_inductor_pass.py into it.

I'm going to support NPU (#11104) using torch.fx.replace_pattern, but with the same configuration approach. I will add passes based on torch.fx.replace_pattern into a python/sglang/compilation/fx folder.

@yuan-luo
Collaborator Author

yuan-luo commented Oct 2, 2025

Please add an inductor folder under python/sglang/compilation and move collective_fusion.py, inductor_pass.py, pass_manager.py and sglang_inductor_pass.py into it.

I'm going to support NPU (#11104) using torch.fx.replace_pattern, but with the same configuration approach. I will add passes based on torch.fx.replace_pattern into a python/sglang/compilation/fx folder.

@eshoguli Thanks for sharing this info. @DevashishLal-CB and I will proceed with this task based on #10987; we will work out a unified framework that implements the pass manager, including fusion and torch compile, in the compilation folder.

@yuan-luo yuan-luo force-pushed the torch_compile_pass_mgr branch from a3eb801 to b0da694 on October 8, 2025 06:31
@eshoguli
Contributor

The same comment as for #10549:
@DevashishLal-CB and @yuan-luo, an additional device-specific change request:

  • devices: cuda, npu and cpu will add some transformations later,
  • approaches which require different pass interfaces: torch._inductor.pattern_matcher, torch.fx.replace_pattern, and manual fx.Graph traversal,
  • inference methods (they may not require separation, but think about it): CUDAGraph, NPUGraph, triton for CUDA and NPU.

My suggestion is to make the folder structure more general. Could you please move:

  • general transformation options and general interfaces into a common folder,
  • general passes into common/passes, and CUDA-specific ones into cuda/passes?

What do you think?

@yuan-luo
Collaborator Author

Some duplication with #10062; will continue based on that PR.

@yuan-luo
Collaborator Author

yuan-luo commented Nov 9, 2025

Followed in #11830
