[1/N] Support PassManager Framework and Fusion Pass #11830
yuan-luo wants to merge 2 commits into sgl-project:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request significantly refactors the SGLang compilation backend by introducing a new, unified configuration system (SGLangConfig).
Code Review
This pull request introduces a significant refactoring of the configuration management by adopting SGLangConfig, which improves code clarity and maintainability. It also adds a new fusion pass for all_reduce and RMSNorm operations, which is a promising optimization. The changes are extensive and touch core compilation logic. I've identified one instance of unreachable code and a risky use of a private PyTorch API that should be addressed. Overall, the direction of these changes is positive, especially for a draft PR.
# WARNING: This is a hack to clear the pattern matcher cache
# and allow multiple values of epsilon.
torch._inductor.pattern_matcher._seen_patterns.clear()
Using torch._inductor.pattern_matcher._seen_patterns.clear() is a risky hack as it relies on a private PyTorch API. This could break with future PyTorch updates. It would be safer to check for a public API to achieve this, or at least guard this call with version checks to prevent unexpected failures.
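One way to soften this hack is to touch the private attribute only defensively, so a future PyTorch release that removes or renames it degrades gracefully instead of crashing. The helper below is an illustrative sketch, not the PR's implementation; the function name is hypothetical.

```python
def clear_seen_patterns():
    """Best-effort clear of Inductor's private pattern-matcher cache.

    Returns True if the cache was cleared, False if the private API is
    unavailable (torch missing, module moved, or attribute renamed).
    """
    try:
        # Private module: its layout may change between PyTorch releases.
        from torch._inductor import pattern_matcher
    except ImportError:
        return False
    seen = getattr(pattern_matcher, "_seen_patterns", None)
    if seen is None or not hasattr(seen, "clear"):
        # API changed; skip the hack rather than fail the whole pass.
        return False
    seen.clear()
    return True
```

Guarding with `getattr` instead of a hard attribute access keeps the fusion pass functional even if the hack silently stops working, which is usually preferable to an AttributeError at compile time.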
Basic function passed; doing more verification in progress.
Regression passed. Ready for review.
Hi, thanks for your contribution! Could you please add some benchmark results/scripts so that we can do some verification? Thanks
@Oasis-Git ok, I'll add it.
Is this pass independent of Inductor? I noticed the piecewise CUDA graph can run directly with eager mode enabled. Additionally, should we add comments for the pass? For example, by analyzing the compiled text to verify whether the graph modification was successful.
Not sure if it's related, but checking the graph modification should be possible. I was able to write unit tests for pattern-matcher passes: https://github.com/sgl-project/sglang/pull/10549/files#diff-eb9a5b46a9d09bb4f2ba51eabe57a0b89b6ebd1106aeed0a55d42ea30b7a53e7R90 At runtime, for the pattern matcher we could just check the number of matches encountered; that should be cheaper than a string search (though probably not much of a concern).
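The match-count idea can be sketched as follows: the pass records how many times its pattern fired, and a test asserts that count instead of string-searching the compiled text. The class, op names, and flat-list graph representation are all hypothetical simplifications of a real pattern-matcher pass.

```python
class CountingFusionPass:
    """Toy fusion pass over a flat op list; illustrative only."""

    def __init__(self):
        self.matched_count = 0

    def __call__(self, ops):
        fused = []
        i = 0
        while i < len(ops):
            # Pattern: all_reduce immediately followed by rms_norm.
            if i + 1 < len(ops) and ops[i] == "all_reduce" and ops[i + 1] == "rms_norm":
                fused.append("fused_allreduce_rmsnorm")
                self.matched_count += 1
                i += 2
            else:
                fused.append(ops[i])
                i += 1
        return fused


fusion = CountingFusionPass()
out = fusion(["linear", "all_reduce", "rms_norm", "linear"])
# out == ["linear", "fused_allreduce_rmsnorm", "linear"]; matched_count == 1
```

A unit test then only needs `assert fusion.matched_count == expected`, which stays valid even if the printed graph text changes between PyTorch versions.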
I am considering the position of the config file and its file name. In SGLang, feature-specific config usually lives under the feature's own directory rather than a config/ directory (lora, for example). So I think it would be better to put it under the compilation config. Also, there is no global-config concept in SGLang, unlike vLLM. So I think it would be better to remove sglang_config and move all the logic under the compilation config.
Considering this output, is it right to say that the fusions are applied to prefill only? Decode doesn't need to use piecewise graphs, but we should still apply fusions to the full decode graphs; that is where we will see the maximum performance gain, and the kernel launches per token will be reduced.
Oasis-Git
left a comment
In general, the compilation side looks good to me except for the config settings. I still suggest that we put the compilation config in its own directory.
For the addition of pass manager ops, torch profiling and benchmark results are strongly recommended. This also applies to later pass manager fusion ops.
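A minimal profiling harness for comparing a step with and without a fusion pass could look like the sketch below. `step_fn` and the helper name are placeholders, not part of this PR; the `torch.profiler` calls are the standard public API.

```python
def profile_step(step_fn, steps=10):
    """Run step_fn `steps` times under torch.profiler and return a summary
    table string, or None if torch is unavailable in this environment."""
    try:
        import torch
        from torch.profiler import ProfilerActivity, profile
    except ImportError:
        return None
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        # Include CUDA kernel timings when a GPU is present.
        activities.append(ProfilerActivity.CUDA)
    with profile(activities=activities) as prof:
        for _ in range(steps):
            step_fn()
    return prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10)
```

Running this once with the fusion pass enabled and once disabled, then diffing the per-kernel rows, shows directly whether the fused kernel replaced the separate all_reduce and rms_norm launches.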
Sure, I'll move the config settings to the compilation folder.
ispobock
left a comment
I'm wondering whether this Pass Manager is intended only for piecewise CUDA graphs, or if it's also used in the decode phase?
Is there an option/server_arg to turn it on?
That would require changes in the CUDA graph runner (this is how I did it: https://github.com/sgl-project/sglang/pull/10549/files#diff-bd9ac3018495d7e6099bae6a48bfa242f997fc1d06b64fc37699d3f5d44efc86R191).
Once the pass is working correctly, we can dump the fused graph and analyze the compiled text to check the graph modification result.
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
Fixing some bugs in fusion.
Problem resolved. The Fusion Pass for AllReduceFusion now works as expected. Client:
/tag-and-rerun-ci
We had an in-depth conversation with @ispobock and @BBuf about whether to proceed with this PR. The decision is to retain this PR but not proceed in this direction.
Motivation
A pass is a key component for code transformation, optimization, and analysis in compilers such as LLVM and TVM.
In LLVM, for example, a Pass Manager is introduced as a component of the compiler infrastructure whose main goal is to schedule and orchestrate the execution of passes.
In SGLang, we introduce a similar Pass Manager framework to orchestrate the execution of passes. Furthermore, we adopt a fusion pass to fuse several complex operators; more passes, such as AsyncTPPass and SequenceParallelismPass, will be introduced in follow-up PRs.
With this PR, we no longer need to change model files to adopt certain fusion kernels; the compiler manages the underlying work as long as the pattern matches. We just need to add the pass once and benefit from the compiler optimization everywhere.
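As an illustration of the idea (not SGLang's actual API — the class and method names here are hypothetical), a minimal pass manager just registers passes and runs them in order over a graph-like object:

```python
class PassManager:
    """Minimal pass-manager sketch: passes are callables graph -> graph,
    applied in registration order."""

    def __init__(self):
        self._passes = []

    def add_pass(self, p):
        self._passes.append(p)
        return self  # allow chained registration

    def run(self, graph):
        for p in self._passes:
            graph = p(graph)
        return graph


# Usage: each pass transforms the (toy) graph and hands it to the next.
pm = PassManager()
pm.add_pass(lambda g: g + ["epilogue"]).add_pass(lambda g: [op.upper() for op in g])
result = pm.run(["matmul"])
# result == ["MATMUL", "EPILOGUE"]
```

The key property is that individual passes stay independent and reorderable; a real implementation would additionally gate each pass on configuration flags and verify the graph between passes.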
The Pass Manager framework and Fusion Pass are borrowed from vLLM with significant SGLang customization. We express our respect to the relevant vLLM developers for their work in this area.
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist